Statistical Analysis to Detect Uncommon Code

26 Jan 2023 - Tim Blazytko

Statistical analysis is a set of methods which analyze and organize data to discover its underlying structure. One of the most common use cases in computer science is machine learning, for which these methods form the mathematical foundation. However, even the simplest analysis techniques are often powerful enough to significantly simplify day-to-day tasks. In this blog post, I will show you how one such technique, n-gram analysis, can be used to identify uncommon instruction sequences in binary code. It is not only fun to see what statistics can reveal about assembly patterns; it is also an effective way to pinpoint obfuscated code or other obscure computations which might be worth a closer look during reverse engineering.

Similar to previous blog posts, in which we developed a heuristic to pinpoint control-flow flattening and discussed various complexity metrics to pinpoint obfuscation in general, the implementation of these techniques has been added to my Obfuscation Detection plugin for Binary Ninja. As for the other heuristics, the implementation is architecture-agnostic, although it contains some architecture-specific optimizations for x86, x86-64, ARM32 and AARCH64.

In the following, we first familiarize ourselves with the foundations of n-gram analysis and its use cases. Then, we discuss how to perform statistical analysis of assembly code and develop a heuristic to identify uncommon instruction sequences. Afterward, we explore some similarities and differences between the most common CPU architectures. Finally, we evaluate the heuristic by identifying and analyzing obfuscated functions in malware, a Windows kernel module, an anti-cheat software and a mobile DRM system.

n-gram Analysis

The term n-gram originates from the field of computational linguistics. n-grams are used to analyze the underlying structures of a language. For this, we partition a text into fixed-size chunks (so-called grams of size n) and count the number of occurrences of the individual chunks in the text. If the underlying text sample is large enough, we obtain a representative frequency distribution that gives us insights into common patterns of the language, such as its most common letters or most common words. Furthermore, it allows us to predict which word most likely follows a given word; a characteristic used to support users typing messages on mobile phones by making suggestions based on the current writing context.

To illustrate this, let us count all 1-grams in the following text:

This is a test.

First, we normalize the text by representing it as a continuous sequence of lower-case letters only:

thisisatest

For the English language, 1-grams are equivalent to individual letters. Therefore, we just count the number of occurrences of all individual letters:
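
t: 3
s: 3
i: 2
h: 1
a: 1
e: 1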

As we can see, the letters ‘t’ and ‘s’ are the most common, while ‘h’, ‘a’ and ‘e’ are the least common letters. If the sample text were larger—for example, 1,000 words—we would obtain a representative frequency distribution of the English language, in which ‘e’ is the most common letter with 13%, followed by ‘t’ with 9.1% and ‘a’ with 8.2%. Such frequency tables are used, for example, in cryptanalysis to break substitution ciphers: By performing a frequency analysis of the ciphertext, we can map the distribution of letters in the ciphertext to the distribution of letters of the English language. As a consequence, we can undo the substitution.

While such 1-grams can already be used to break substitution ciphers, statistical analysis becomes even more powerful if it also considers the context of the letters, for example which letters come before and after them in the text. To achieve this, we increase the size of the chunks (the number n) and count these chunks with a sliding-window approach. For example, if we want to compute all 2-grams in the text above, we count the chunks as follows:
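
is: 2
th: 1
hi: 1
si: 1
sa: 1
at: 1
te: 1
es: 1
st: 1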

As we can see, ‘is’ is the most common 2-gram, while all other 2-grams occur only once.

We could continue this approach and compute all 3-grams, 4-grams, 5-grams etc. Now the question is: Which n should be used for statistical analysis? While there is no definite answer to this, we can say that the larger the n, the more context we consider. While 1-grams don’t consider any context, 7-grams and above usually tend to overfit to the input text and lose universality. In my experience, 3-grams or 5-grams provide a good trade-off between considering context and keeping universality. In the following, we perform all analyses of assembly code using 3-grams.
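
To make the counting concrete, here is a minimal Python sketch (purely illustrative, not part of the plugin) that computes n-grams of a sequence with a sliding window; it works on strings as well as on lists of tokens:

from collections import Counter

def ngrams(sequence, n):
    # Slide a window of size n over the sequence and count each chunk.
    return Counter(tuple(sequence[i:i + n]) for i in range(len(sequence) - n + 1))

text = "thisisatest"
print(ngrams(text, 1).most_common(2))  # [(('t',), 3), (('s',), 3)]
print(ngrams(text, 2).most_common(1))  # [(('i', 's'), 2)]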

Statistical Analysis of Assembly Code

Similar to natural languages, we can perform a statistical analysis of machine code and assembly code to identify underlying patterns and make predictions based on these patterns. For example, based on the frequency of individual bytes in the binary, we can differentiate between code and data; furthermore, we can identify packed code by computing the entropy of individual sections. Closer to the n-gram analysis discussed above, statistical analysis also supports the reverse engineering of unknown instruction set architectures: By mapping byte distributions to common instruction sequences, it is possible to identify the byte representation of the most common instructions, such as the architectural equivalent of mov or call instructions.
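
As a side note, the section entropy mentioned above is cheap to compute; a minimal sketch of Shannon entropy over raw bytes (illustrative only):

import math
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    # Shannon entropy in bits per byte; values close to 8.0 typically
    # indicate packed or encrypted data, while compiled code stays lower.
    counts = Counter(data)
    return -sum((c / len(data)) * math.log2(c / len(data)) for c in counts.values())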

In our case, as stated above, the goal is to identify uncommon instruction sequences in binaries. For this, we rely on the disassembly output of one of the major disassemblers, such as IDA Pro, Ghidra or Binary Ninja. Now, we need to accomplish two things:

  1. We need to compute common 3-gram frequencies of assembly code as ground truth.
  2. We need a heuristic to identify code patterns which deviate from the most common 3-grams.

For the sake of simplicity, we rely on x86-64 assembly to illustrate how to achieve these goals. Later on, we discuss how such a heuristic can be implemented in an architecture-agnostic way.

Obtaining a Ground Truth

To obtain a ground truth, we need to compute common 3-grams of the assembly language. However, to avoid overfitting to specific binaries or architectures, we consider instruction opcodes only and omit instruction details such as concrete registers or memory addresses. For example, consider the following code sequence:

push    rbp
mov     rbp, rsp
push    r14
push    rbx
test    rdi, rdi

To perform a 3-gram analysis, we first remove all the instruction details and obtain the following sequence:

push
mov
push
push
test

Based on this, we can compute the following 3-grams:
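
push ; mov ; push: 1
mov ; push ; push: 1
push ; push ; test: 1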

If we perform this analysis over a large set of binaries from different categories (for example compilers, debugging tools, multimedia libraries etc.), we get a representative collection of 3-grams, from which we can take, for instance, the 1,000 most common 3-grams as a ground truth.
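
A minimal sketch of this step, assuming Binary Ninja's Python API (where function.instructions yields pairs of disassembly tokens and an address, with the mnemonic as the first token); the plugin itself ships with precomputed ground truths:

from collections import Counter

def function_opcode_trigrams(function):
    # Keep only the mnemonic of each instruction and count all 3-grams
    # of consecutive mnemonics within the function.
    opcodes = [tokens[0].text for tokens, _ in function.instructions]
    return Counter(tuple(opcodes[i:i + 3]) for i in range(len(opcodes) - 2))

def build_ground_truth(binary_views, most_common=1000):
    # Aggregate the 3-gram counts over many binaries and keep only the
    # most common 3-grams as the ground truth.
    counts = Counter()
    for bv in binary_views:
        for function in bv.functions:
            counts += function_opcode_trigrams(function)
    return {gram for gram, _ in counts.most_common(most_common)}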

Identification of Uncommon Instruction Sequences

To identify uncommon instruction sequences, we intuitively want to find code locations whose instruction sequences deviate strongly from the generated ground truth. However, for this, we need a well-defined window in which we perform the analysis. The most straightforward approach is to perform this analysis on the function level. As a consequence, we rephrase our goal as follows:

Our goal is to identify functions whose instruction sequences deviate the most from the generated ground truth. For this, we perform the following steps for each function:

  1. We count the number of 3-grams in the function which are not part of the ground truth.
  2. We divide this number by the total number of 3-grams in the function.

This way, we can assign a score to each function; this score represents the relative amount of 3-grams which are not contained in the ground truth. The higher the score, the more the function’s instruction sequences deviate from the ground truth.

Given that each function has a score representing the deviation from the ground truth, we can sort all functions by their score—analogous to the other detection metrics—and manually inspect the top 10% of these functions.
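
A sketch of this scoring, again assuming the Binary Ninja Python API and a precomputed ground truth set of opcode 3-grams (illustrative; the plugin's actual implementation differs in details):

def uncommon_score(function, ground_truth):
    # Fraction of the function's opcode 3-grams that are not part of the ground truth.
    opcodes = [tokens[0].text for tokens, _ in function.instructions]
    grams = [tuple(opcodes[i:i + 3]) for i in range(len(opcodes) - 2)]
    if not grams:
        return 0.0
    return sum(1 for gram in grams if gram not in ground_truth) / len(grams)

def most_uncommon_functions(bv, ground_truth):
    # Rank all functions by their score and return the top 10% for manual inspection.
    ranked = sorted(bv.functions, key=lambda f: uncommon_score(f, ground_truth), reverse=True)
    return ranked[:max(1, len(ranked) // 10)]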

Architecture-agnostic Implementation

To realize this heuristic in an architecture-agnostic way, the implementation relies on Binary Ninja’s low-level intermediate language (LLIL). This intermediate language is a common representation of assembly code for all architectures supported by Binary Ninja. During analysis, Binary Ninja lifts assembly instructions into LLIL, which is then used for further analysis passes, up to the decompiler output.

To exemplify, the architecture-dependent x86-64 sequence

mov
mov
mov

translates to LLIL as follows:

LowLevelILOperation.LLIL_SET_REG
LowLevelILOperation.LLIL_SET_REG
LowLevelILOperation.LLIL_SET_REG

where the operation LLIL_SET_REG represents an architecture-independent assignment to a register.
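
On the LLIL level, the 3-gram extraction sketched above works on operation names instead of mnemonics; a minimal sketch, assuming Binary Ninja's Python API (function.llil iterates over basic blocks of lifted instructions, each carrying an operation attribute):

from collections import Counter

def llil_operation_trigrams(function):
    # Use the architecture-independent LLIL operation of each lifted
    # instruction instead of the architecture-specific mnemonic.
    operations = [instr.operation.name for block in function.llil for instr in block]
    return Counter(tuple(operations[i:i + 3]) for i in range(len(operations) - 2))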

We can use this implementation to detect uncommon instruction sequences on all architectures supported by Binary Ninja. However, this approach has a downside: Since it relies on LLIL, it forces the lifting of all instructions in the binary, which increases the analysis time. In contrast, an analysis on the assembly level is significantly faster, but requires a ground truth for each architecture.

As a consequence, the implementation is a trade-off between universality and speed: For the architectures x86, x86-64, ARM32 and ARM64, there are custom ground truths and analyses based on the disassembly opcodes; for all other architectures, the implementation relies on Binary Ninja’s LLIL.

To build a representative ground truth for a wide variety of common applications, the Linux coreutils and binutils, the compiler GCC, the debugger GDB, the image manipulation program GIMP as well as the multimedia player mplayer were used. To cover 3-gram variations of different architectures, all projects have been compiled for the four architectures mentioned above and the corresponding 3-grams have been computed.

Similarities and Differences between Architectures

Before we evaluate the heuristic on various real-world scenarios, let’s take some time and explore some similarities and differences between the architectures, just for fun. However, keep in mind that the following observations only scratch the surface and are meant to give a better intuition. That said, let us take a closer look at the computed ground truths.

At first glance, we can see that, across all architectures, the most common instruction sequences are data movement chains, sometimes combined with arithmetic operations or function calls. Furthermore, we can see that all architectures include 3-grams of architecture-specific push and pop operations, mostly found in function prologues and epilogues. In addition, they all include several architecture-specific add chains, due to offset calculations and pointer arithmetic.

Some of the differences, however, are more interesting. For instance, the ARM-based architectures contain significantly more load/store operations; the reason is that, in most cases, these architectures read and write memory using dedicated instructions, while the x86 and x86-64 architectures can access memory in nearly all data movement and arithmetic operations. Another difference can be seen between the x86 and the x86-64 architecture: The 32-bit architecture contains significantly more push ; call chains. This is due to differences in the calling conventions: the 32-bit architecture uses the stack to pass function arguments, while the 64-bit architecture mostly uses registers.

In short, we can say that many low-level patterns such as function management, calls, data movement and pointer arithmetic are architecture-agnostic, while there are architecture-specific concepts which lead to differences in the most common instruction patterns. This might also have an impact on the architecture-agnostic ground truth in LLIL, if the boundary for the most common 3-grams is set too low. For this reason, the ground truth in LLIL stores the 3,000 most common LLIL 3-grams obtained for all architectures, while the architecture-specific implementations only store the 1,000 most common 3-grams.

Finally, let us now look at some real-world binaries and evaluate what kind of instruction sequences can be identified by the heuristic.

Evaluation

To understand what kind of code locations these heuristics identify, we take a closer look at several binaries that are only partially obfuscated; this means, we use binaries which mostly contain compiler-generated code, but are obfuscated in specific parts to hide underlying computations. In the following sections, we analyze a malware sample, a module from the Windows kernel, an anti-cheat system and a mobile DRM system. Note that some of these binaries will not be specified further; however, although we omit some details, the discussion will still provide some intuition and strategies for what to expect and how to proceed. So, to reproduce the results, feel free to pick the anti-cheat software or DRM system of your choice and follow along. Now, let’s get into it and discuss the results.

Malware

The chosen malware sample belongs to the Emotet family and has already been described in a previous blog post, in which we analyzed its control-flow flattening implementation. If we use the heuristic to identify uncommon instruction sequences, it identifies 14 functions; while some of them implement control-flow flattening, the majority (11) of them share a structure similar to the following code snippet:

push    ecx {var_4_6}
mov     dword [esp {var_4}], 0x8f8e
shl     dword [esp {var_4_1}], 0x3  {0x47c70}
xor     dword [esp {var_4_2}], 0xe2f0cbde  {0xe2f4b7ae}
xor     dword [esp {var_4_3}], 0x3a546a44  {0xd8a0ddea}
add     dword [esp {var_4_4}], 0xffff377d  {0xd8a01567}
or      dword [esp {var_4_5}], 0xe5ad4c54  {0xfdad5d77}
xor     dword [esp], 0xfdad5c37  {0x140}
mov     eax, dword [esp]  {0x140}
pop     ecx  {0x140}
retn     {__return_addr}

As we can see, the function builds a constant on the stack through a chain of shift, xor, add and or operations (ending up with 0x140) and returns it in eax. If we analyze where and how these functions are called, we learn that they are part of the control-flow flattening obfuscation and are used to update the state variable which encodes the control flow.
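
For illustration, the constant chain from the snippet can be replayed in a few lines of Python (masking to 32 bits, since the operations work on a 32-bit stack slot); the intermediate values match the annotations in the listing above:

MASK = 0xFFFFFFFF

value = 0x8F8E
value = (value << 3) & MASK           # 0x47C70
value ^= 0xE2F0CBDE                   # 0xE2F4B7AE
value ^= 0x3A546A44                   # 0xD8A0DDEA
value = (value + 0xFFFF377D) & MASK   # 0xD8A01567
value |= 0xE5AD4C54                   # 0xFDAD5D77
value ^= 0xFDAD5C37                   # 0x140
print(hex(value))                     # 0x140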

Overall, every function identified by the heuristic is either obfuscated or a helper function managing the obfuscated state.

Windows Kernel

Next, let us take a closer look at ci.dll, the Windows kernel module responsible for code integrity and file authentication. As documented by the Airbus Security Lab, this kernel module implements an AES white-box, which is protected by a virtualization-based obfuscator, the Warbird virtual machine. Note that the concept of virtualization-based obfuscation (also known as virtual machines or VMs) and how to break them has been covered in a previous blog post.

With our heuristic, we identify 206 out of 2,065 functions overall (the top 10%, as described above). Nearly half of these functions are related to cryptographic operations, as we can deduce from their symbol names (for example SymCryptSha256AppendBlocks_shani); however, 125 of them follow a different pattern: While their symbol names do not reveal any information, their overall code structure is similar to the following code snippet:

mov     r10, rcx
lea     r9d, [rdx+0x62d7be82]
lea     eax, [rdx+0x3853ae71]
movzx   ecx, r9b
xor     eax, r9d
xor     rcx, 0x76
add     eax, 0x64cf8a2f
mov     r8, rdx
shr     r8, 0x20
sub     r8d, edx
shl     rax, 0x20
mov     rcx, qword [r10+rcx*8]
add     r8d, 0x4547fd6
shl     r8, 0x20
or      rax, 0x5d304937
or      r8, 0x1d570d9f
cmp     qword [r10+0x128], rcx
cmove   rax, r8
retn     {__return_addr}

These functions implement handlers of the Warbird VM. As we can see, the handler encodes its data flow using constant unfolding (e.g., writing x + 5 as x + 10 - 5) and conditional moves. However, if we take a closer look, we notice [r10+rcx*8], which might access a 64-bit jump table at address r10 indexed by rcx, probably computing the address of the next VM handler.

In summary, we can say that the heuristic identified uncommon instruction sequences; 61% of these functions are part of the Warbird VM, while the others implement cryptographic operations.

Anti-Cheat

After having examined a malware sample and a Windows kernel module, let us now look into an anti-cheat software, which is used to detect cheating attempts in online games. Due to its size, the heuristic this time finds 3,433 out of 34,321 functions in total. While we do not want to manually inspect 3,433 functions, we can immediately identify a common structure by exploring some of them:

pop     rsp
lea     rsi, [rel 0x141f2c3e6]
mov     rbx, r11
movq    rdi, xmm6
sub     r9, 0x4
movq    xmm7, rbx
movq    rbp, xmm7
mov     qword [rbp], rdi
movsxd  rax, dword [rsp]
add     rsp, 0x4
add     rsi, rax
jmp     rsi

We have hundreds of identified functions which end in an indirect jump and compute the target address by adding an offset to a fixed constant. This seems to be some kind of control-flow encoding; however, if you have spent more time with virtualization-based obfuscation, this pattern might also remind you of a technique known as threaded code: At the end of each VM handler, the dispatcher is inlined and computes the address of the next handler. While we cannot know for sure without further analysis, there is a good chance that the anti-cheat software is protected via virtualization-based obfuscation, which has been located by the heuristic.
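
To give an intuition for threaded code, here is a hypothetical, heavily simplified Python model (unrelated to the actual anti-cheat code): there is no central dispatcher loop; every handler fetches the next opcode itself and transfers control directly to the corresponding handler, analogous to the jmp rsi at the end of the snippet above.

def handler_add(vm):
    vm["acc"] += vm["code"][vm["pc"]]
    vm["pc"] += 1
    # inlined dispatch at the handler's end (the "jmp rsi" equivalent)
    opcode = vm["code"][vm["pc"]]
    vm["pc"] += 1
    return HANDLERS[opcode](vm)

def handler_halt(vm):
    return vm["acc"]

HANDLERS = {0: handler_add, 1: handler_halt}

def run(code):
    # enter the VM by jumping to the handler of the first opcode
    vm = {"code": code, "acc": 0, "pc": 1}
    return HANDLERS[code[0]](vm)

print(run([0, 5, 0, 7, 1]))  # adds 5 and 7, then halts: prints 12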

Mobile DRM

Finally, let us take a look at a mobile DRM system which plays encrypted multimedia content. Given its size, we again identify a large number of functions (approximately 3,000) with our heuristic. This time, we have a hard time manually inspecting the functions; randomly looking at some of them does not reveal any patterns with regard to similar code structures, graph layouts or the like. Without further knowledge, considerably more analysis effort is required to get a better understanding.

However, this does not mean that the heuristic did not identify interesting functions. For example, we identified a lot of functions using a large number of floating-point instructions, which is not uncommon for multimedia applications. Furthermore, we pinpointed functions making extensive use of hardware-based encryption instructions, indicating that these functions might perform some kind of content decryption or key derivation. Finally, we also identified some functions protected with arithmetic obfuscation or mixed Boolean-Arithmetic (MBA) expressions, indicating that we are looking at the right places.

Overall, we can summarize that the heuristic reliably pinpoints uncommon instruction sequences; in certain binaries, this is often obfuscated code. Furthermore, we have seen that choosing the top 10% of all functions might be a bit overwhelming for large binaries. In some cases, the sheer amount of functions can be managed by detecting repetitive obfuscation patterns, which allows an effortless classification of the identified functions. In other cases, significantly more effort is required. However, even in these cases, the heuristic can be useful: Digging into large binaries is always like looking for a needle in a haystack, and each heuristic which performs a reasonable pre-selection speeds up reverse engineering.

Closing Remarks

In this blog post, I tried to give an intuition for the power of statistical analysis, not only in the context of code obfuscation. I’ve always been fascinated by statistical analysis and its ability to reveal underlying structures, which makes it a good fit for reverse engineering. Although I assumed after my last blog post on obfuscation detection that this series would be finished, I came up with the idea for this heuristic in a discussion with friends and implemented a prototype; after some experiments, I implemented it properly in an architecture-agnostic way. While its results certainly overlap (at least partially) with the other heuristics, it is the first heuristic which does not rely on complexity metrics, adding another dimension to the analysis. Furthermore, it is fun to explore the differences and oddities between architectures, while learning that, overall, they are very similar on a higher level.

A further goal of this post is to show how to classify the results identified by such heuristics. While we stopped the analysis rather quickly, the intention was to provide some insights and strategies for how one can reason about obfuscation, starting from scratch. Further analysis up to deobfuscation and understanding of the underlying protected code is an iterative process, which I partially covered in other blog posts, such as the post on breaking virtualization-based obfuscation and the post on simplifying MBA-based obfuscation. Furthermore, I cover such topics and strategies in-depth in my training classes on code deobfuscation.

Contact

For questions, feel free to reach out via Twitter @mr_phrazer, mail tim@blazytko.to or various other channels.