Wednesday 

Room 4 

13:40 - 14:40 

(UTC+01)

Talk (60 min)

Lost and Found in Translation: An intro to ML assisted decompilation and deobfuscation

In the past few years, few of us could have escaped hearing the names BERT, RoBERTa, and GPT-3 at least once - the names given to large language models that have been making immense strides in their ability to process human language.

AI/ML

In addition to being able to generate text, answer questions, classify sentiment expressed in text and so forth, these models are now making advances in understanding and generating source code - an area of special interest to security professionals.

As more and more developers use assisted code generation tools like GitHub’s Copilot, new security risks will emerge: how do we make sure the generated code is secure? How do we audit these models? Beyond code generation, large language models are also making advances in intelligent code search and in understanding large code bases - capabilities that will contribute both to security auditing of code bases and to finding exploits.

Finally, an interesting new avenue is emerging: what if techniques from neural machine translation could also be applied to creating new kinds of tools for reverse engineering code? In the final part of the talk, we will take a look at recent advances in applying neural machine translation techniques and large language models to the tasks of decompilation and deobfuscation - both of which are useful in analyzing malware.

This talk will serve two purposes: first, to briefly introduce the security community to natural language processing tools and various large language models - how they are constructed and which key techniques enabled the fast-paced advances we are seeing now; then, to build on this foundation and explore what natural language processing techniques, in particular sequence-to-sequence models, can do for reverse engineering applications. The audience will learn about the transformer architecture, the difference between encoder, decoder and encoder-decoder models, and what it means to fine-tune a model.
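To give a flavour of the transformer architecture mentioned above, its core operation - scaled dot-product attention - can be sketched in a few lines of plain Python. This is a toy illustration with hand-written matrices, not an excerpt from any real model:

```python
import math

def softmax(xs):
    # Subtract the max before exponentiating, for numerical stability.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.

    Q, K, V are lists of row vectors; d_k is the key dimension.
    """
    d_k = len(K[0])
    out = []
    for q in Q:
        # Similarity of this query against every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]
        weights = softmax(scores)
        # Each output row is a weighted average of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# Two queries attending over three key/value pairs of dimension 2.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(attention(Q, K, V))
```

In an encoder, every token attends to every other token this way; a decoder masks the scores so each position only attends to earlier ones.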

Finally, we will take a closer look at applying natural language processing models to the tasks of decompilation - recovering higher-level source code from lower-level compiled code - and deobfuscation. By the end of this talk, the audience should have a good understanding of natural language processing techniques: how large language models are trained and how they can be applied to the realm of source code in the form of code generation, semantic code search, decompilation and deobfuscation.
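To make the deobfuscation task concrete, here is a hand-written toy example of the kind of input/output pair such a translation model might learn from (purely illustrative, not drawn from any real training set): an obfuscated function and a readable equivalent with identical behaviour.

```python
# Obfuscated: opaque names, a redundant XOR round-trip, and a disguised
# arithmetic identity ((q + q) // 2 is just q).
def f0(a7, q):
    z = [(ord(c) ^ 42) for c in a7]
    return bytes((b ^ 42) for b in z).decode() * ((q + q) // 2)

# Deobfuscated: the same behaviour, written plainly.
def repeat_text(text, times):
    # XOR-ing each byte with 42 twice is the identity, so the string
    # survives unchanged; the function simply repeats it.
    return text * times

print(f0("ab", 3))           # -> 'ababab'
print(repeat_text("ab", 3))  # -> 'ababab'
```

A sequence-to-sequence model trained on many such pairs learns to map the left form to the right one, which is exactly the framing used for ML-assisted deobfuscation and decompilation.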

Camilla Montonen

Camilla spends a lot of her time building machine learning systems and thinking about the future that's already here, just not evenly distributed.