During my presentation “Unveiling Secrets in Binaries using Code Detection Strategies” at REcon 2023 (slides, recording, code), I showcased heuristics for navigating unknown binaries in various reverse engineering settings, such as malware analysis, vulnerability discovery and embedded firmware analysis. In this talk, I also presented a simple but powerful technique to identify common API functions in statically-linked executables and in embedded firmware. This blog post will delve deeper into this subject, exploring these API functions, the intuition behind the heuristic, and its additional use cases in the context of malware analysis.
Statistical analysis is a set of methods which analyze and organize data to discover its underlying structure. One of the most common use cases in computer science is machine learning, for which these methods form the mathematical foundation. Often, however, even the simplest analysis techniques are powerful enough to significantly simplify day-to-day tasks. In this blog post, I will show you how such a technique, n-gram analysis, can be used to identify uncommon instruction sequences in binary code. It is not only fun to see what statistics can reveal about assembly patterns, but it is also an effective technique to pinpoint obfuscated code or other obscure computations which might be worth a closer look during reverse engineering.
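To make the idea of n-gram analysis more concrete, here is a minimal sketch in plain Python. It is not the implementation from the post: the helper names (`ngrams`, `build_model`, `uncommon_score`), the choice of 3-grams over instruction mnemonics, and the toy mnemonic lists are all assumptions made for illustration. The sketch collects the most frequent n-grams from a reference corpus and then scores a function by the fraction of its n-grams that do not appear in that set.

```python
from collections import Counter


def ngrams(seq, n):
    """Return all contiguous n-grams of seq as tuples."""
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]


def build_model(functions, n=3, top_k=1000):
    """Collect the top_k most frequent mnemonic n-grams across a corpus.

    `functions` is an iterable of mnemonic lists, one list per function.
    """
    counts = Counter()
    for mnemonics in functions:
        counts.update(ngrams(mnemonics, n))
    return {gram for gram, _ in counts.most_common(top_k)}


def uncommon_score(mnemonics, common, n=3):
    """Fraction of a function's n-grams that are NOT in the common set."""
    grams = ngrams(mnemonics, n)
    if not grams:
        return 0.0
    return sum(1 for g in grams if g not in common) / len(grams)


# Hypothetical toy corpus; real input would come from a disassembler.
corpus = [
    ["push", "mov", "sub", "mov", "call", "add", "pop", "ret"],
    ["push", "mov", "sub", "mov", "call", "mov", "add", "pop", "ret"],
]
common = build_model(corpus, n=3, top_k=10)

plain = ["push", "mov", "sub", "mov", "call", "add", "pop", "ret"]
odd = ["xor", "not", "rol", "xor", "neg", "ror", "xor", "not"]

plain_score = uncommon_score(plain, common)
odd_score = uncommon_score(odd, common)
print(plain_score, odd_score)
```

Functions with a high score contain many instruction sequences that the reference corpus rarely or never exhibits, which makes them candidates for a closer manual look. In practice, the corpus size, the value of `n` and the cut-off `top_k` would need tuning.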
Let’s say we see the following arithmetic expression:
After I recently gave a workshop on the Analysis of Virtualization-based Obfuscation at r2con2021 (slides, code & samples and the recording are available online), I would like to use this blog post for a brief summary on how to write disassemblers for VM-based obfuscators based on symbolic execution.
In a previous blog post, we already discussed that it is valuable to know which code areas are obfuscated; those areas often guard sensitive code and are worth a closer look. Furthermore, we designed a heuristic to automatically detect control-flow flattening and state machines in binaries by identifying specific loop characteristics in the control-flow graph. However, other code obfuscation techniques such as opaque predicates, complex arithmetic encodings or virtualization are not necessarily covered by this heuristic, especially if the control-flow graph is loop-free. For these cases, we have to develop new heuristics to identify obfuscation.
Automation plays a crucial role in reverse engineering, no matter whether we search for vulnerabilities in software, analyze malware or remove obfuscated layers from code. Once we manually identify repeating patterns, we try to automate the process as far as possible. For automation, it often doesn’t matter if you use Binary Ninja, IDA Pro or Ghidra, as long as you know how to implement it in your tool of choice. As you will see, you don’t have to be an expert to automate tedious reverse engineering tasks; sometimes it just takes a few lines of code to improve your understanding a lot.
Following my last blog post, I got a lot of questions about additional material on control-flow analysis. While most compiler books (such as the Dragon Book) cover related topics in-depth, I decided to publish my own presentation that was initially built for (but never made it into) my training class on software deobfuscation. The slide deck illustrates the theory of control-flow graph construction, dominance relations and loop analysis. In the second part of this post, I would like to show you how to play around with these concepts using the reverse engineering framework Miasm.
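As a rough illustration of dominance relations and loop analysis, the following sketch uses plain Python on a dictionary-based graph and deliberately does not use the Miasm API. The function names, the graph representation (every node appears as a key, mapped to its successor list) and the toy graph are assumptions for illustration. It also assumes all nodes are reachable from the entry. It computes dominator sets with the classic iterative fixed-point algorithm and reports back edges, i.e. edges whose target dominates their source, which indicate natural loops.

```python
def dominators(cfg, entry):
    """Compute the dominator set of every node with an iterative fixed point.

    `cfg` maps each node to a list of successors; every node must be a key.
    """
    nodes = set(cfg)
    preds = {n: set() for n in nodes}
    for src, succs in cfg.items():
        for dst in succs:
            preds[dst].add(src)

    # Start from "everything dominates everything" and shrink the sets.
    dom = {n: set(nodes) for n in nodes}
    dom[entry] = {entry}

    changed = True
    while changed:
        changed = False
        for n in nodes - {entry}:
            incoming = [dom[p] for p in preds[n]]
            new = {n} | (set.intersection(*incoming) if incoming else set())
            if new != dom[n]:
                dom[n] = new
                changed = True
    return dom


def back_edges(cfg, dom):
    """Return edges (src, dst) where dst dominates src (loop back edges)."""
    return [(src, dst)
            for src, succs in cfg.items()
            for dst in succs
            if dst in dom[src]]


# Hypothetical toy graph: a single loop with header "head".
cfg = {
    "entry": ["head"],
    "head": ["body", "exit"],
    "body": ["head"],
    "exit": [],
}
dom = dominators(cfg, "entry")
loops = back_edges(cfg, dom)
print(dom["body"], loops)
```

The set-based formulation is easy to follow but slow on large graphs; production frameworks typically use faster algorithms based on immediate dominators.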
Commercial businesses and malware authors often use code obfuscation to protect specific code areas to impede reverse engineering. In my experience, knowing which code areas are obfuscated often pinpoints sensitive code parts that are worth a closer look. For example, the FinSpy samples that were discovered in September 2020 obfuscate their main modules with Obfuscator-LLVM, while the two-staged dropper isn’t obfuscated at all.