Desquirr Master Thesis
Software Engineering
Thesis no: MSE-2002:17
June 2002
David Eriksson
Department of
Software Engineering and Computer Science
Blekinge Institute of Technology
Box 520
SE-372 25 Ronneby
This thesis is submitted to the Department of Software Engineering and Computer Science at Blekinge
Institute of Technology in partial fulfilment of the requirements for the degree of Master of Science in
Software Engineering. The thesis is equivalent to 20 weeks of full time studies.
Contact Information:
Author:
David Eriksson
Address: Folkparksvägen 14:32, 372 40 Ronneby
E-mail: david@2good.nu
University advisor:
Lars Lundberg
Department of Software Engineering and Computer Science
Decompilation, or reverse compilation, takes a computer program and produces high-level code that
works like the original source code. This makes it easier to understand a computer program when source
code is not available. However, there are very few tools for decompilation available today. This report
describes the design and implementation of Desquirr, a decompilation plug-in for Interactive Disassembler
Pro. Desquirr has an object-oriented design and performs basic decompilation of programs running on
Intel x86 processors.
The low-level analysis uses knowledge about specialized compiler constructs, called idioms, to perform
a more accurate decompilation. Desquirr implements data flow analysis, meaning the conversion from
primitive machine code instructions into code in a high-level language. The major part of the data flow
analysis is the Register Copy Propagation which builds high-level expressions from primitive instructions.
Control flow analysis, meaning to restore high-level language constructs such as if/else and for loops, is
not implemented.
A high level representation of a piece of machine code contains the same information as an assembly
language representation of the same machine code, but in a format that is easier to comprehend. Symbols
such as ’*’ and ’+’ are used in high-level language expressions, compared to instructions such as ”mul”
and ”add” in assembly language. Two small test cases that compare decompiled code with assembly
language show promising results in reducing the amount of information needed to comprehend a program.
1 Introduction
  1.1 Reverse compilation
  1.2 Goal and objectives
  1.3 Outline for remaining chapters
2 Method
  2.1 Make a basic decompiler for 32-bit 80386 machine code
  2.2 Make the decompiler object-oriented
5 Results
  5.1 Decompiling the Fibonacci calculation
  5.2 Decompiling the palindrome test
  5.3 Object-orientation of decompiler
6 Discussion
7 Conclusions
List of Figures
1.1 Relations between high-level language, assembly language and machine code
3.1 Restore stack with one POP for each function call
3.2 Restore stack with one ADD ESP for each function call
3.3 Restore stack for two calls with one ADD ESP for each function call
3.4 Memcpy an unknown number of bytes
3.5 Memcpy a constant number of bytes
3.6 Compare and set boolean; assembly language version
3.7 Compare and set boolean; translated version
3.8 Question-mark-colon operator
Chapter 1
Introduction
Figure 1.1: Relations between high-level language, assembly language and machine code.
The foundations of reverse compilation come from compilation, especially the optimization part of a
compiler.
Machine code is relatively easily translated into assembly language with a disassembler. There are simple
disassemblers such as Objdump in GNU Binutils [3] and advanced disassemblers such as Interactive
Disassembler (IDA) Pro [4]. A simple disassembler starts at address zero and disassembles everything
sequentially. An advanced disassembler begins disassembling at the program starting location and continues
by looking at the execution flow. The assembly language is a long list of primitive instructions, one per
line. In a high-level language such as C, many primitive instructions are joined together into an expression
on (most often) only a single line. This suggests that the greatest benefit of reverse compilation is the
reduced effort required to comprehend a computer program that is only available in machine language. By
converting machine code to a high-level language (such as C) the number of lines of information may be
reduced significantly and the context of the program becomes much clearer.
Decompilers have existed since the early 1960s, but reverse compilation is still a relatively small area of
academic research. It is related to the reverse engineering and reengineering disciplines. One reason for the
limited research in this area may be legal issues, as a reverse compilation tool may be abused. More about
legal issues can be found in [5]; they are not discussed further in this thesis. Before Desquirr, the only
available decompiler for Intel x86 machine code was limited to 16-bit Intel 286 machine code compiled for
MS-DOS [6].
In my personal experience, a modern implementation of an application should use an object-oriented
development process. Complex software, such as a decompiler, places high demands on flexibility and
extensibility. An object-oriented development process supports a more structured approach to solving
the problem. The use of object-oriented techniques when developing decompilers will allow more complex
and flexible systems in the future.
This report is intended for people with knowledge of Intel 80386 assembler and basic compiler tech-
niques. Object-oriented design skills are also useful.
The latest information about Desquirr is available at http://www.2good.com/software/desquirr/.
Chapter 2
Method
• Data flow analysis covers the conversion from primitive machine code instructions to expressions in
a high-level language.
• Control flow analysis converts conditional and unconditional jump instructions into high-level
language control constructs such as if/else, switch/case, for loops and while loops.
Interactive Disassembler (IDA) Pro from DataRescue is a commercial tool for disassembly of binary
programs. It runs on the MS Windows and OS/2 platforms but is capable of loading and disassembling
binaries for many different platforms. The first version of IDA was released in 1991. IDA Pro is currently
on version 4.21 and now supports many common CPUs (such as the Intel x86 series) and virtual machines
(such as Java).
IDA Pro also has a plugin interface which allows developers to add custom processing of loaded ma-
chine code. This plugin interface was used to create the Desquirr decompiler. The benefit of using IDA Pro
as a base for developing a decompiler is that it already handles loading of binary files and identification of
library functions. This saves a lot of work, but creates a dependency on a commercial entity.
The effectiveness of the Desquirr decompiler plugin will be measured by counting non-empty lines
in code listings.1 One line represents one instruction. This kind of comparison may seem rather coarse,
but a high-level instruction is easier to understand than the sequence of assembly language instructions
representing the high-level instruction. The reason for this is that a high level representation of a piece of
machine code contains the same information as an assembly language representation of the same machine
code, but in a format that is easier to comprehend. Symbols such as ’*’ and ’+’ are used in high-level
language expressions, compared to separate instructions such as ”mul” and ”add” in assembly language.
1 Exact line number calculation: grep -cv '^$' filename
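The line metric can also be sketched in C++. The function below is only an illustration of what the grep command in the footnote computes; the name count_nonempty_lines is hypothetical and is not part of Desquirr.

```cpp
#include <cstddef>
#include <sstream>
#include <string>

// Counts the lines that are not completely empty, mirroring `grep -cv '^$'`.
// A line that contains only spaces still counts, just as it does for grep.
std::size_t count_nonempty_lines(const std::string& text) {
    std::istringstream in(text);
    std::string line;
    std::size_t count = 0;
    while (std::getline(in, line)) {
        if (!line.empty()) ++count;
    }
    return count;
}
```

Applied to a listing with two instructions separated by a blank line, the function reports two lines.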
3.1 Overview
Decompilation in Desquirr is performed in the following steps:
1. Perform low-level analysis. Desquirr starts by making a list of machine code instructions for the
current function. This information is gathered from IDA Pro. This list is then iterated to find idioms
and convert low-level instructions into high-level instructions and expressions. See section 3.3 for
details about idioms.
2. Separate list of instructions into basic blocks. These are important elements in compiler and decom-
piler theory [10]. A basic block is a sequence of instructions inside a function. A basic block ends
with a conditional jump, unconditional jump, return from a function or a ”fall through” if the next
basic block is the target of a jump.
3. Calculate uses and definitions for all instructions. A register is said to be Defined when a value is
assigned to this register. A register is said to be Used when it is referenced but not modified by an
instruction.
4. Perform live register analysis. Live registers denote processor registers that are defined in one basic
block and used in another basic block. The live registers are useful to detect whether an optimization
is safe or not. LiveIn for a basic block is a list of Live registers on the entry to this basic block, and
LiveOut is a list of the registers that are Live at the exit from the basic block. The algorithm for
calculating the LiveIn and LiveOut lists is provided by Cifuentes [7].
5. Find DU-chains. DU is an abbreviation of Definition-Use and DU-chains are an important aid in
both compiler optimization and decompilation [10]. A DU-chain is used to connect the definition of
a processor register with the uses of this definition. A DU-chain is created by taking each definition
of a register in an instruction and finding every use of the register before it is redefined or the basic block
ends. Cifuentes [7] describes DU-chains in more detail.
6. Register Copy Propagation. This is the first part of the data flow analysis. This requires DU-chains
and Live registers and is a simple but very effective algorithm. Register copy propagation means that
when a definition only has one use, we replace the use with the definition. This algorithm is also
provided by Cifuentes [7].
7. Find function call parameters. This part of the data flow analysis takes stack push instructions and
converts them into function call parameters.
8. Code generation. Print the basic blocks in decompiled format. The format of decompiled code is
made to look like C code.
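Step 2 above can be sketched as follows. This is a minimal illustration under stated assumptions, not Desquirr's actual implementation: the Insn record, the Kind enumeration and the function split_basic_blocks are hypothetical, and jump targets are given as instruction indexes rather than addresses.

```cpp
#include <cstddef>
#include <iterator>
#include <set>
#include <utility>
#include <vector>

// Hypothetical, simplified instruction record (not Desquirr's classes).
enum class Kind { Other, Jump, ConditionalJump, Return };

struct Insn {
    Kind kind;
    int target;  // index of the jump target, or -1 if there is none
};

// A basic block as a half-open index range [first, last).
typedef std::pair<std::size_t, std::size_t> Block;

std::vector<Block> split_basic_blocks(const std::vector<Insn>& code) {
    std::set<std::size_t> leaders;  // indexes where a basic block starts
    if (!code.empty()) leaders.insert(std::size_t(0));
    for (std::size_t i = 0; i < code.size(); ++i) {
        const Insn& insn = code[i];
        // A jump target starts a block, so the block before it "falls through".
        if (insn.target >= 0 && static_cast<std::size_t>(insn.target) < code.size())
            leaders.insert(static_cast<std::size_t>(insn.target));
        // Jumps and returns end a block, so the next instruction starts one.
        if (insn.kind != Kind::Other && i + 1 < code.size())
            leaders.insert(i + 1);
    }
    std::vector<Block> blocks;
    for (std::set<std::size_t>::const_iterator it = leaders.begin();
         it != leaders.end(); ++it) {
        std::set<std::size_t>::const_iterator next = std::next(it);
        blocks.push_back(Block(*it, next == leaders.end() ? code.size() : *next));
    }
    return blocks;
}
```

A function consisting of a conditional jump over one instruction followed by a return is split into three blocks: the part up to and including the jump, the skipped instruction, and the jump target.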
If control flow analysis had been implemented, it would have occurred after data flow analysis and
before code generation. It is also worth noting that no step in the data flow analysis adds more instructions
than it removes. This guarantees that the number of instructions in the decompiled version of a program is
always less than or equal to the number of instructions in the machine code.
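The register copy propagation step, and the reason it never increases the instruction count, can be illustrated with a small sketch. It is not the algorithm from Cifuentes [7] as implemented in Desquirr: the Assign record, the flat token representation of expressions and the function propagate are assumptions made for this example, and the sketch handles a single basic block with register assignments only.

```cpp
#include <algorithm>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Hypothetical simplified instruction "dst = rhs"; rhs is a flat token list.
struct Assign {
    std::string dst;
    std::vector<std::string> rhs;
};

static std::size_t count_uses(const Assign& a, const std::string& reg) {
    return static_cast<std::size_t>(std::count(a.rhs.begin(), a.rhs.end(), reg));
}

// Replaces every occurrence of reg in `use` with the defining expression.
static void substitute(Assign& use, const std::string& reg,
                       const std::vector<std::string>& def) {
    std::vector<std::string> out;
    for (const std::string& tok : use.rhs) {
        if (tok != reg) { out.push_back(tok); continue; }
        if (def.size() > 1) out.push_back("(");
        out.insert(out.end(), def.begin(), def.end());
        if (def.size() > 1) out.push_back(")");
    }
    use.rhs = out;
}

// Register copy propagation inside one basic block. live_out names the
// registers that are still needed after the block ends.
std::vector<Assign> propagate(std::vector<Assign> block,
                              const std::set<std::string>& live_out) {
    std::size_t i = 0;
    while (i < block.size()) {
        const Assign def = block[i];
        std::size_t uses = 0, use_at = 0;
        bool redefined = false;
        // Follow the DU-chain: uses of def.dst until it is redefined.
        for (std::size_t j = i + 1; j < block.size() && !redefined; ++j) {
            std::size_t n = count_uses(block[j], def.dst);
            if (n > 0) { uses += n; use_at = j; }
            redefined = (block[j].dst == def.dst);
        }
        bool live = !redefined && live_out.count(def.dst) > 0;
        // The operands of the definition must keep their values up to the use.
        bool clobbered = false;
        for (std::size_t k = i + 1; uses == 1 && k < use_at; ++k)
            clobbered = clobbered || count_uses(def, block[k].dst) > 0;
        if (uses == 1 && !live && !clobbered) {
            // One instruction is merged into its single use and then removed.
            substitute(block[use_at], def.dst, def.rhs);
            block.erase(block.begin() + static_cast<std::ptrdiff_t>(i));
        } else {
            ++i;
        }
    }
    return block;
}
```

Given the three assignments ax = si, ax = ax - 1 and arg = ax, with only arg live at the end of the block, the sketch produces the single assignment arg = ( si - 1 ). Each successful propagation removes exactly one instruction and adds none, which is why the instruction count can only shrink or stay the same.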
• Primarily Desquirr lets IDA Pro decide what is code and what is data. This analysis handles most
cases.
• If the plugin encounters unknown bytes within a function, it asks IDA Pro to try to generate code
from those bytes.
• Due to the interactive nature of IDA Pro, the user is able to manually decide what is code and what
is data.
3.3 Idioms
An idiom is a special way of performing a certain operation. An example of a simple idiom is to use
xor eax, eax to clear register eax instead of mov eax, 0. Both versions require two clock cycles
on the 80386 CPU. The former instruction occupies two bytes of memory, but the latter
occupies five bytes, making it clear that this is a size optimization only. Knowledge of idioms is vital for
adequate decompilation.
Cifuentes defines an idiom as a sequence of instructions that has a logical meaning which cannot be
derived from the individual instructions [7]. She describes certain idioms that she found during her work
with the dcc decompiler. This section aims at adding to her collection of idioms.
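The simple xor idiom mentioned above could be recognized as sketched below. The LowLevel record here is a hypothetical stand-in with three string fields and does not reflect the interface of the class with the same name in Desquirr; the function to_high_level is likewise invented for this illustration.

```cpp
#include <string>

// Hypothetical stand-in for a disassembled two-operand instruction.
struct LowLevel {
    std::string mnemonic;
    std::string op1;
    std::string op2;
};

// Translates one instruction into a C-like statement. The special case
// `xor r, r` is an idiom for clearing r and is emitted as an assignment of 0.
std::string to_high_level(const LowLevel& in) {
    if (in.mnemonic == "xor" && in.op1 == in.op2)
        return in.op1 + " = 0;";
    if (in.mnemonic == "xor")
        return in.op1 + " = " + in.op1 + " ^ " + in.op2 + ";";
    if (in.mnemonic == "mov")
        return in.op1 + " = " + in.op2 + ";";
    return "/* " + in.mnemonic + " is not handled in this sketch */";
}
```

With this rule, xor eax, eax and mov eax, 0 both become eax = 0; in the decompiled output, while a xor of two different registers is still emitted as an exclusive-or expression.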
Figure 3.1: Restore stack with one POP for each function call
Figure 3.2: Restore stack with one ADD ESP for each function call
However, an optimized version of this was found in code generated by the gcc compiler, where the
stack was not cleared immediately after each call:
Figure 3.3: Restore stack for two calls with one ADD ESP for each function call
The first two constructions make it very simple to deduce the number of parameters to a function.
The latter, however, leads an unaware decompiler to wrongly assume that the function call before add
esp takes all the parameters restored from the stack and that the previous function call has no parameters.
Desquirr attempts to guess the correct number of parameters, but the most reliable stack parameter
count comes from analyzing the called functions. Desquirr is capable of reading such information from
IDA Pro, but does not attempt to generate this information on its own.
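The difference between the constructions can be made concrete with a small sketch. The StackEvent record and both functions are hypothetical. The first function implements the naive rule described above. The second is only one conceivable guess (counting the pushes seen since the previous call) and is not claimed to be the heuristic that Desquirr uses.

```cpp
#include <vector>

// Hypothetical event stream for one function body (not Desquirr's classes).
struct StackEvent {
    enum Type { Push, Call, Cleanup } type;
    int bytes;  // bytes pushed, or bytes restored by POP / ADD ESP; 0 for Call
};

// Naive rule: every stack clean-up is charged to the most recent call.
std::vector<int> counts_from_cleanup(const std::vector<StackEvent>& events,
                                     int word_size) {
    std::vector<int> counts;
    for (const StackEvent& e : events) {
        if (e.type == StackEvent::Call)
            counts.push_back(0);
        else if (e.type == StackEvent::Cleanup && !counts.empty())
            counts.back() += e.bytes / word_size;
    }
    return counts;
}

// Alternative guess (an assumption of this sketch): each call is charged
// with the pushes seen since the previous call.
std::vector<int> counts_from_pushes(const std::vector<StackEvent>& events,
                                    int word_size) {
    std::vector<int> counts;
    int pending = 0;
    for (const StackEvent& e : events) {
        if (e.type == StackEvent::Push) {
            pending += e.bytes / word_size;
        } else if (e.type == StackEvent::Call) {
            counts.push_back(pending);
            pending = 0;
        }
    }
    return counts;
}
```

For a sequence of two pushes, a call, one push, a second call and a single clean-up of 12 bytes with a word size of 4, the naive rule reports 0 and 3 parameters, whereas counting pushes reports 2 and 1. The push-based guess has its own weakness: pushes that save registers rather than pass parameters are counted as well, which is one reason why information about the called function is more reliable.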
Figure 3.6: Compare and set boolean; assembly language version
    neg eax
    sbb eax, eax
    and eax, 8h

Figure 3.7: Compare and set boolean; translated version
    if (eax)
        eax = 8;
    else
        eax = 0;

Figure 3.8: Question-mark-colon operator
    eax = (eax ? 8 : 0)
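That the three-instruction sequence really computes eax ? 8 : 0 can be checked by emulating it. The sketch below models the 32-bit register and the carry flag by hand; the function names neg_sbb_and and translated are hypothetical and exist only for this illustration.

```cpp
#include <cstdint>

// Emulates `neg eax; sbb eax, eax; and eax, 8h` on a 32-bit register value.
std::uint32_t neg_sbb_and(std::uint32_t eax) {
    // neg eax: eax = 0 - eax; the carry flag is set unless eax was zero.
    bool carry = (eax != 0);
    eax = 0u - eax;
    // sbb eax, eax: eax = eax - eax - carry, which is 0 or 0xFFFFFFFF.
    eax = eax - eax - (carry ? 1u : 0u);
    // and eax, 8h: keep only the bit with value 8.
    eax &= 8u;
    return eax;
}

// The high-level form produced by the idiom translation.
std::uint32_t translated(std::uint32_t eax) {
    return eax ? 8u : 0u;
}
```

Both functions return 0 for an input of 0 and 8 for every other input, so the idiom can be replaced by the conditional expression without changing the result.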
Desquirr currently consists of a little more than 5000 lines of C++ code, not counting empty lines or lines
beginning with comments.1
Assignment (=)
Chapter 5
Results
#include <stdio.h>
int main()
{
int i, numtimes, number;
unsigned value, fib();
sub_10291:
    printf("Input number of iterations: ");
    ax = scanf("%d", & var_2);
    si = 1;
    goto loc_102DD;
loc_102AF:
    printf("Input number: ");
    scanf("%d", & var_4);
    var_6 = sub_102EB(var_4);
    ax = printf("fibonacci(%d) = %u\n", var_4, var_6);
    si = si + 1;
loc_102DD:
    if (si <= var_2) goto loc_102AF;
    exit(0);
    return ax;

sub_102EB:
    if (arg_0 <= 2) goto loc_10313;
loc_10318:
    return ax;
    enter 6, 0
    push si
    push offset aInputNumberOfI ; format
    call printf
    pop cx
    lea ax, [bp+var_2]
    push ax
    push offset aD ; format
    call scanf
    add sp, 4
    mov si, 1
    jmp short loc_102DD

loc_102AF: ; CODE XREF: sub_10291+4F
    push offset aInputNumber ; format
    call printf
    pop cx
    lea ax, [bp+var_4]
    push ax
    push offset aD_0 ; format
    call scanf
    add sp, 4
    push [bp+var_4]
    call sub_102EB
    pop cx
    mov [bp+var_6], ax
    push [bp+var_6]
    push [bp+var_4]
    push offset aFibonacciDU ; format
    call printf
    add sp, 6
    inc si

loc_102DD: ; CODE XREF: sub_10291+1C
    cmp si, [bp+var_2]
    jle loc_102AF
    push 0 ; status
    call exit
    pop cx
    pop si
    leave
    retn
sub_10291 endp ; sp = -0Ch
    push bp
    mov bp, sp
    push si
    mov si, [bp+arg_0]
    cmp si, 2
    jle loc_10313
    mov ax, si
    dec ax
    push ax
    call sub_102EB
    pop cx
    push ax
    mov ax, si
    add ax, 0FFFEh
    push ax
    call sub_102EB
    pop cx
    pop dx
    add dx, ax
    mov ax, dx
    jmp short loc_10318

loc_10313: ; CODE XREF: sub_102EB+A
    mov ax, 1
    jmp short $+2
#include <stdio.h>
#include <string.h>
#include <malloc.h>
    if (argc < 2)
    {
        original = "nitalarbralatin";
    }
    else
    {
        original = argv[1];
    }
    reverse = malloc(strlen(original)+1);
    rev(original, reverse);
    if (0 == strcmp(original, reverse))
    {
        printf("%s is a palindrome\n", original);
    }
    else
    {
        printf("Try again!\n");
    }
    free(reverse);

    return 0;
}
sub_401150:
    bx = arg_0;
    ax = strlen(bx) + arg_4;
    * ax = 0;
    ax = ax - 1;
    goto loc_40116E;
loc_401167:
    dl = * bx;
    bx = bx + 1;
    ax = ax - 1;
    * (ax + 1) = dl;
loc_40116E:
    if ((* bx) != 0) goto loc_401167;
    return ax;

main:
    if (argc >= 2) goto loc_401188;
    bx = "nitalarbralatin";
    goto loc_40118E;
loc_401188:
    bx = * (argv + 4);
loc_40118E:
    si = malloc( strlen(bx) + 1);
    sub_401150(bx, si);
    if ( strcmp(bx, si) != 0) goto loc_4011C7;
loc_4011C7:
    printf("Try again!\n");
loc_4011D2:
    free(si);
    return 0;
    push ebp
    mov ebp, esp
    push ebx
    push esi
    cmp [ebp+argc], 2
    jge short loc_401188
    mov ebx, offset aNitalarbralati ; "nitalarbralatin"
    jmp short loc_40118E
loc_401188: ; CODE XREF: main+9
    mov eax, [ebp+argv]
    mov ebx, [eax+4]
loc_40118E: ; CODE XREF: main+10
    push ebx ; s
    call strlen
    pop ecx
    inc eax
    push eax ; size
    call malloc
    pop ecx
    mov esi, eax
    push esi ; int
    push ebx ; s
    call sub_401150
    add esp, 8
    push esi ; s2
    push ebx ; s1
    call strcmp
    add esp, 8
    test eax, eax
    jnz short loc_4011C7
    push ebx
    push offset aSIsAPalindrome ; format
    call printf
    add esp, 8
    jmp short loc_4011D2
loc_4011C7: ; CODE XREF: main+3F
    push offset aTryAgain ; format
    call printf
    pop ecx
loc_4011D2: ; CODE XREF: main+4F
    push esi ; block
    call free
    pop ecx
    xor eax, eax
    pop esi
    pop ebx
    pop ebp
    retn
main endp
    push ebp
    mov ebp, esp
    push ebx
    mov ebx, [ebp+arg_0]
    push ebx ; s
    call strlen
    pop ecx
    add eax, [ebp+arg_4]
    mov byte ptr [eax], 0
    dec eax
    jmp short loc_40116E
Chapter 6
Discussion
The primary source of knowledge that the Desquirr decompiler is based on is the Ph.D thesis [7] by Cristina
Cifuentes. Her thesis is a major work in the field of reverse compilation. As part of the Ph.D thesis a
decompiler called dcc was created [6]. Dcc provides a valuable reference regarding tool support for reverse
compilation.
Cifuentes has continued to work in the field of reverse compilation, including work on the University
of Queensland Binary Translator (UQBT) [11]. UQBT is capable of taking a binary for Solaris running on
a SPARC CPU and translating it into a binary for Linux running on a Pentium CPU. The UQBT project will
be used in a new open source decompiler project called Boomerang [12]. Unfortunately, the source
code to UQBT has not yet been released to the public and has therefore not been studied as part of this
thesis. An interesting future task would be to compare Desquirr with UQBT.
In order to support certain compiler optimizations we need to adapt the decompiler for each optimiza-
tion. To implement this support we have to test the decompiler on many programs and see how the compiler
has optimized the machine code and how well the decompiler handles the optimization. C++ templates and
optimizations such as function inlining and loop unrolling may be impossible, or at least very complicated,
to detect. It would therefore be equally difficult to decompile such optimizations in a way that is
similar to the original code.
When converting machine code to high-level code we need to detect and propagate data types through
the program. This is not implemented in Desquirr. Single integers are easy to handle, but arrays, struct
and class data types are much harder. There is a risk that different high-level constructions are compiled
into a single construction in machine code. This makes it harder to correctly reverse the compilation.
The lack of data type handling in Desquirr is also the reason that Desquirr does not produce compilable
C code. The format of decompiled code is made to look like C code, although it is not compilable without
editing.
Another problem seen during decompilation is that idioms sometimes become ”scrambled” due to
compiler optimizations, probably because of pipelining. The current implementation of idiom handling in
Desquirr requires that the instructions of an idiom are consecutive.
IDA Pro is capable of finding and describing certain switch/case idioms. Therefore, switch and case
instructions were added to Desquirr and the switch/case information provided by IDA Pro was used to
create switch and case statements in the decompiled code. Structure data types declared by IDA Pro are
also used by Desquirr, but support for enumerations is not yet implemented.
The RTL System uses Static Single Assignment (SSA) form, but Desquirr does not. In SSA form,
each variable is assigned only once, and so-called φ-instructions (phi-instructions) are used
to join variables at the beginning of a basic block. The purpose is to make optimizations easier. For an
introduction to SSA and references for further reading, see section 8.11 in Advanced Compiler Design and
Implementation [13]. Using SSA form in Desquirr is a possible future project.
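The renaming part of SSA form can be sketched for straight-line code as shown below. The sketch deliberately leaves out φ-instructions, which are only needed where control flow joins, and it is not taken from the RTL System or from Desquirr. The Assign record and the function to_ssa are hypothetical.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical simplified instruction "dst = rhs"; rhs is a flat token list.
struct Assign {
    std::string dst;
    std::vector<std::string> rhs;
};

// Renames registers in straight-line code so that each name is assigned
// exactly once. Names that are never assigned (inputs, operators, literals)
// are left unchanged.
std::vector<Assign> to_ssa(const std::vector<Assign>& code) {
    std::map<std::string, int> version;  // latest version number per register
    std::vector<Assign> out;
    for (const Assign& a : code) {
        Assign renamed;
        // Uses refer to the most recent version of each register.
        for (const std::string& tok : a.rhs) {
            std::map<std::string, int>::const_iterator it = version.find(tok);
            renamed.rhs.push_back(
                it == version.end() ? tok : tok + "_" + std::to_string(it->second));
        }
        // Every definition creates a new version.
        int v = ++version[a.dst];
        renamed.dst = a.dst + "_" + std::to_string(v);
        out.push_back(renamed);
    }
    return out;
}
```

The two assignments ax = si and ax = ax - 1 become ax_1 = si and ax_2 = ax_1 - 1. After renaming, every use has exactly one definition, which is what makes analyses such as DU-chain construction simpler.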
The addition of control flow analysis would make Desquirr an even more useful decompiler. The
Analysis abstract base class may have to be refactored in order to support control flow analysis and other
kinds of analyses not currently implemented in Desquirr. We may also need new methods on other data
structures. Further development may warrant the use of the Visitor pattern from [14] for the data structures
in Desquirr.
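How the Visitor pattern could be applied to the expression classes is sketched below. The class names Expression, NumericLiteral, Register and BinaryExpression are taken from the class hierarchy in the appendix, but every member, constructor and method shown here is an assumption made for this illustration and does not reflect Desquirr's actual interfaces. ExpressionVisitor and Printer are hypothetical.

```cpp
#include <memory>
#include <string>

class NumericLiteral;
class Register;
class BinaryExpression;

// One visit method per concrete expression class.
class ExpressionVisitor {
public:
    virtual ~ExpressionVisitor() {}
    virtual void visit(const NumericLiteral& e) = 0;
    virtual void visit(const Register& e) = 0;
    virtual void visit(const BinaryExpression& e) = 0;
};

class Expression {
public:
    virtual ~Expression() {}
    virtual void accept(ExpressionVisitor& v) const = 0;
};

class NumericLiteral : public Expression {
public:
    explicit NumericLiteral(long v) : value(v) {}
    void accept(ExpressionVisitor& v) const override { v.visit(*this); }
    long value;
};

class Register : public Expression {
public:
    explicit Register(const std::string& n) : name(n) {}
    void accept(ExpressionVisitor& v) const override { v.visit(*this); }
    std::string name;
};

class BinaryExpression : public Expression {
public:
    BinaryExpression(const std::string& o, std::shared_ptr<Expression> l,
                     std::shared_ptr<Expression> r)
        : op(o), left(l), right(r) {}
    void accept(ExpressionVisitor& v) const override { v.visit(*this); }
    std::string op;
    std::shared_ptr<Expression> left, right;
};

// A new analysis or output format is a new visitor; the expression
// classes themselves do not need new methods.
class Printer : public ExpressionVisitor {
public:
    std::string text;
    void visit(const NumericLiteral& e) override { text += std::to_string(e.value); }
    void visit(const Register& e) override { text += e.name; }
    void visit(const BinaryExpression& e) override {
        text += "(";
        e.left->accept(*this);
        text += " " + e.op + " ";
        e.right->accept(*this);
        text += ")";
    }
};
```

Building a BinaryExpression for eax + 8 and passing a Printer to its accept method leaves the text (eax + 8) in the visitor. The trade-off of the pattern is that adding a new expression class requires a new visit method in every visitor, so it fits best when the class hierarchy is stable and the set of analyses is growing.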
Chapter 7
Conclusions
The use of Interactive Disassembler Pro as a foundation for this decompiler was a good decision and
allowed me to concentrate on the reverse compilation parts. However, the dependency on a commercial
entity may be undesirable.
With the aid of a few simple algorithms we can make it significantly easier to understand
a program that is only available in machine code, but it would be a monstrous task to make a decompiler
capable of perfectly decompiling most applications.
Desquirr was capable of decompiling two small example programs into about as many lines of decom-
piled code as lines of C code in the original program. The decompiled code does not include curly braces
or variable declarations and only has goto statements instead of loop structures.
No step in the decompilation process adds more instruction lines than it removes. This means that the
number of lines of decompiled code is guaranteed to be less than or equal to the number of machine code
lines.
The framework created for this plugin can serve as a base for future work in making a more sophisticated
decompiler. A more advanced decompiler may still utilize the same data structures with few modifications.
References
Node
Label Class that represents a label, that is a destination of a Jump. This instruction has no operands.
LowLevel This instruction represents a low-level instruction as it is represented by IDA Pro. It has to be
processed into other instructions before code generation.
UnaryInstruction Abstract parent class for single-operand instructions.
BinaryInstruction Abstract parent class for dual-operand instructions.
Assignment Assigns from any expression to a primitive expression.
ConditionalJump Jumps to a location if a condition is met.
Jump Unconditional jump to a location.
Push Push an operand on the stack. All push instructions should be eliminated before code generation.
Pop Pop an operand from the stack. All pop instructions should be eliminated before code generation.
Return Return a value from a function.
Switch Represents the header of a switch statement. The Switch instruction is present because IDA Pro is
capable of detecting certain types of switch constructions.
Case This is a case label that is part of a switch statement.
Instruction
Figure A.3: Class diagram for the Expression class hierarchy (classes shown: Expression, BinaryExpression,
Call, Dummy, Location, NumericLiteral, Register, StringLiteral, TernaryExpression)
Figure A.4: Class diagram for the Location part of the Expression class hierarchy