Sec24summer Prepub 346 He
Sec24summer Prepub 346 He
Abstract
Binary code similarity detection (BCSD) has garnered sig-
nificant attention in recent years due to its crucial role in
various binary code-related tasks, such as vulnerability search
and software plagiarism detection. Currently, BCSD systems
are typically based on either instruction streams or control
flow graphs (CFGs). However, these approaches have limita- (a) Instruction Stream-based (b) CFG-based
tions. Instruction stream-based approaches treat binary code
as natural languages, overlooking well-defined semantic struc- Figure 1: Two lines of works on code semantics learning.
tures. CFG-based approaches exploit only the control flow
structures, neglecting other essential aspects of code. Our
key insight is that unlike natural languages, binary code has and identifying malware code snippets [4]. Nevertheless, it
well-defined semantic structures, including intra-instruction is very challenging to develop general and efficient detec-
structures, inter-instruction relations (e.g., def-use, branches), tion methods. On one hand, two semantically identical code
and implicit conventions (e.g. calling conventions). Motivated snippets may exhibit entirely different syntax representations,
by that, we carefully examine the necessary relations and particularly when a piece of source code is compiled into
structures required to express the full semantics and expose different instruction set architectures. On the other hand, two
them directly to the deep neural network through a novel semantically different snippets can possess similar syntax
semantics-oriented graph representation. Furthermore, we representations. Thus, understanding the high-level semantic
propose a lightweight multi-head softmax aggregator to effec- features of code is the key to effectively performing BCSD.
tively and efficiently fuse multiple aspects of the binary code. Existing works can generally be classified into two direc-
Extensive experiments show that our method significantly tions, i.e., instruction streams-based and control flow graph
outperforms the state-of-the-art (e.g., in the x64-XC retrieval (CFG)-based, as illustrated in Figure 1.
experiment with a pool size of 10000, our method achieves a Instruction streams-based methods treat instruction
recall score of 184%, 220%, and 153% over Trex, GMN, and streams as if they were natural language sentences and
jTrans, respectively). consecutively introduce natural language processing (NLP)
techniques, such as long short-term memory (LSTM) net-
works [45], self-attentive networks [29], large language mod-
1 Introduction els [1], and sophisticated pre-training techniques [33, 40, 43].
The state-of-the-art methods in this direction leverage spe-
Binary code similarity detection (BCSD) is a fundamen-
cially designed pre-training tasks, such as jump target predic-
tal task that determines the semantic similarity between
tion [40] and block inside graph detection [43], to enable the
two binary functions. It serves as a crucial component in
deep neural network models to grasp code semantics. How-
addressing various important challenges, including retriev-
ever, the pre-training process is expensive as it relies on large
ing known vulnerable functions in third-party libraries or
language models and large-scale datasets. In addition, the
firmware [11, 15, 26, 29, 42], recovering library function sym-
pre-training tasks can usually be solved reliably and fast by
bols in statically linked binaries [9, 10, 27], detecting software
traditional program analysis algorithms. Therefore, instead
plagiarism [25], detecting software license violations [17],
of teaching models to solve these tasks based on a low-level
B Corresponding author. representation, it may be more beneficial to recover these
well-defined semantic structures using traditional methods sesses the aforementioned capabilities. Based on SOG, we
and present them to deep neural networks to learn higher-level adopt a GNN-based approach for semantics learning. While
semantics. Moreover, existing work focuses on learning only replacing the CFG with the SOG in the existing GNN-based
limited aspects of code, such as control flow structures or framework offers certain advantages, it fails to unleash the
contextualized syntax probabilities. full potential of the SOG. We further identify a short slab of
CFG-based methods mainly leverage the graph neural the existing GNN-based framework: the graph aggregation
networks (GNN) and typically take CFGs as the input, which module. SOG encodes multiple aspects of code. Considering
are widely recognized as crucial features in analyzing binary that different aspects of the SOG should be useful to distin-
code [2,10,11,14]. Recent works [18,20,26,28,43,44] in this guish one function from different types of other functions, we
direction combine NLP-based methods to learn basic block propose a novel multi-head softmax aggregator to fuse them
features with GNNs to capture control flow characteristics. simultaneously. This aggregator is powerful yet lightweight.
Some of them have achieved state-of-the-art [26, 44]. How- By integrating the SOG and the multi-head softmax aggre-
ever, the reliance on NLP-based methods makes them suffer gator, we build an effective and efficient binary code similarity
from the same predicament as the previous direction of work. detection solution, HermesSim. HermesSim first lifts a binary
Other works seek to exploit additional code semantics beyond function to the SOG representation and then utilizes the graph
the control flow, such as enhancing the CFG with inter-basic- neural network to aggregate neighbor information for each
block data flow edges [15] and incorporating data flow graphs node. After that, the multi-head softmax module is applied to
(DFGs) [16]. These works have taken an involuntary step generate a graph embedding vector. The similarity between
towards utilizing full semantic structures of code. functions is then approximated by the similarity between their
Overall, the stream representation with NLP-based meth- embedding vectors. Besides, HermesSim adopts the margin-
ods is adopted by state-of-the-art works in both directions. based pairwise loss [21] along with the distance-weighted
Although the instruction streams and natural language sen- negative sampling strategy [41] for training. It is noteworthy
tences are similar in syntax, they differ implicitly: that HermesSim has two orders of magnitude fewer parame-
(i) First, natural languages are ambiguous and weakly struc- ters than previous methods based on large language models.
tured while the binary code has well-defined structures, se- In summary, we make the following contributions:
mantics, and conventions. For instance, accurately parsing • We propose a novel binary code representation named
the syntactic dependency in natural language sentences is semantics-oriented graph (SOG) for BCSD, and we detail
challenging, while it is easy in binary code. its construction. This representation not only reveals com-
(ii) Second, reordering instructions or moving instructions plete semantic structures of binary code but also discards
between basic blocks is feasible, which suggests that binary semantically independent information. To the best of our
code should be represented in a more flexible representation knowledge, this is the first investigation in this direction.
rather than sequences. In fact, instructions are often reordered • We design a novel multi-head softmax aggregator to prop-
and moved around basic blocks by the compiler for optimiza- erly aggregate multiple aspects of SOG. This module sig-
tion. However, there is no well-defined way to reorder words nificantly enhances the performance of our system.
in natural languages while preserving the semantics. • We design and implement HermesSim, an effective yet effi-
(iii) Third, natural languages are designed for effectively cient solution for binary code similarity detection.
exchanging information while binary code is designed to ease • We perform extensive experiments and demonstrate the ef-
the machine execution. For instance, most ‘words’ in assem- fectiveness of SOG over previous mainstream binary code
ble languages are ‘pronouns’. When instructions use registers representations. Moreover, we conduct both laboratory ex-
or memory slots, they indeed use the value temporarily cached periments and real-world 1-day vulnerability searches, es-
inside them. The exact registers or memory slots used to cache tablishing HermesSim’s significantly superior performance
the value are semantics-independent. over the state-of-the-art methods.
These differences suggest that treating code as natural lan- Open Source. All our artifacts are available at https://
guage is suboptimal. In light of this, we suggest the devel- github.com/NSSL-SJTU/HermesSim.
opment of a binary code representation that is capable of (1)
revealing intra-instruction structures that show how operands
are used by operators, (2) revealing inter-instruction relations 2 Background and Motivation
such as def-use, control flow, and necessary execution or-
der, (3) excluding semantics-independent elements such as This section aims to answer two questions: why do we pro-
the registers used to temporarily cache data and unnecessary pose the SOG, and what is it? We first discuss the additional
execution order restriction, and (4) encoding other implicit semantics that the sequence representation imposes on NLP-
knowledge, e.g., calling conventions. (See §2 for detail.) based methods to learn (§2.2). Then we elaborate on the
In this paper, we propose a novel binary code represen- implicit structures of binary code (§2.3). Finally, we present
tation, named semantics-oriented graph (SOG), which pos- an intuitive explanation of the SOG (§2.4).
2.2 Semantically Equivalent Variants
NLP-based methods consume a sequence of tokens. Ab-
stractly, each token contains two aspects of information, i.e.,
(a) Original instructions with positional encodings. its token content and its position in the sequence. Take the
most famous Transformer-based [38] models as an example.
For each input token, its content as well as its position in the
sequence are first transformed into two embedding vectors
respectively and then summed, as shown in Figure 2a. And
(b) Swapping the first two instructions. the resulting embedding set of tokens is fed to the subsequent
networks. Thus, modifying either token content or token posi-
tion results in different inputs of NLP-based models. If such
modification does not change code semantics, these models
need to learn from it.
(c) Shifting instructions.
Figure 2 shows three trivial semantically equivalent trans-
formations that the sequence representation imposes on NLP-
based models to learn. Trans.1, the first two instructions of
this code snippet can be swapped without modifying the code
semantics. However, this transformation results in changes
in the produced embedding sets. For instance, the token em-
(d) Replacing the uses of the register r0 with r2. bedding of the token LOAD is now added with the position
embedding of the index 5 (as shown in Figure 2b) while it
Figure 2: Three semantically equivalent variants of a simple
is originally added with the position embedding of the in-
instruction stream snippet.
dex 2. Thus, the model needs to learn that the order of some
instructions can be adjusted without modifying the seman-
tics while others cannot. Trans.2, the entire code snippet is
To avoid unnecessary complexity, we present ideas on a toy placed at a different position in the sequence, e.g., due to
intermediate representation (IR) (§2.1). The ideas developed the insertion of some dummy instructions at the beginning
can be easily extended to real-world representations. of the sequence. In figure 2c, the position embeddings of all
tokens are changed, resulting in a totally different embedding
set. The model needs to learn that the same sub-sequence of
tokens in different positions are semantically equal. Trans.3,
2.1 A Toy IR all the uses of the register r0 are substituted by a previously
unused register r2. The model needs to learn that the choice
The toy IR is mostly self-explanatory. As shown in Figure of exact registers used are semantics-independent. On the
3a, the toy IR includes four types of tokens, i.e. registers contrary, the data flow passing through registers is an integral
which are of the form ri (e.g. r1), instruction operators which aspect of code semantics.
appear as the first token of the instruction or the first token on For these transformation rules to be learned by models and
the right of the equation symbol (e.g. STORE, CALL), integer extended to real-world cases, a lot of related samples need to
literals and labels (e.g. L1). Labels mark the start of basic be fed. For instance, if only code snippets in Figure 2a and
blocks and are used by branch instructions. Figure 2c are given, models may learn that the given code
snippet at position 0 and position k are semantically equiv-
The semantics of the code snippet in Figure 3a is ex-
alent, while it is not necessary for them to learn if similar
plained below. Labels L1 at line 1, and L2 at line 7 indi-
rules hold when another code snippet is given or the snippet is
cate the start of two basic blocks separately. The instruction
placed at another position. In addition, it costs extra network
r0 = LOAD 0x1000 loads the value of the memory slot at
parameters and layers. Overall, learning these additional se-
address 0x1000 and places it into the register r0. Following
mantics makes the NLP-based approaches more expensive
that, r1 = ADD r1, 2 stores the value of expression r1 + 2
and more difficult to generalize.
into the register r1. STORE r0, r1 stores the value of r1
into the memory address indicated by r0. The instruction
r3 = CALL Foo, r0 invokes the subroutine Foo with r0 2.3 Implicit Structures of Binary Code
as the only argument and places the returned value into r3.
BR L2 jumps to the basic block marked by the label L2 di- The transformation 3, mentioned in Section 2.2, suggests
rectly. Finally, RET r3 returns the control to the caller with that the def-use relations between instructions constitute the
the return value set as the value stored in r3. entities of the code semantics. Apart from these relations,
Foo may use or modify the same memory slot as the STORE.
To express the third type of relations, an alternative method
that mimics the traditional control flow graph representation
sequences the instructions in each basic block to restrict the
execution order. However, this representation imposes exces-
sive restrictions. For example, in Figure 3a, the instruction on
line 4 cannot be swapped with any of the preceding two due
to the constraint of the def-use relations (the instruction on
(a) Linear Representation in (b) ISCG: Instruction-based line 4 uses r0 and r1, which are modified by the instructions
the toy IR Semantics-Complete Graph
on line 2 and line 3 respectively). Meanwhile, the instructions
on line 2 and line 3 can be swapped without affecting the code
semantics, which suggests a more flexible representation.
For the pursuit of both effectiveness and efficiency, our
goal is to incorporate only the necessary execution order
into our final representation. Thus, we adopt the effect model,
which models the additional execution order restriction as
potential data flows. In the example given above, the STORE
instruction modifies a memory slot, while the subroutine Foo
invoked by the CALL instruction may read from or write to the
same memory slot, which composes a potential data flow. To
(c) TSCG: Token-based (d) SOG: Semantics-Oriented reveal such relations, inspired by the Click’s IR [5, 6], we use
Semantics-Complete Graph Graph an abstract temporary to represent each set of memory slots
concerned. Instructions that may read values from or write
Figure 3: A proof-of-concept example of the step-by-step values to that set of memory slots are considered to use or
lifting of code from the linear representation to the Semantics- define the corresponding temporary.
Oriented Graph. It is worth noting that the constructed effect flows are re-
lated to the analysis ability. For instance, in the last example,
if we figure out that the invoked subroutine does not interact
binary code contains other semantic structures, which can with the memory or that the STORE instruction only modifies
be grouped into three categories: intra-instruction structures, the current stack frame which will not be accessed by any sub-
inter-instruction relations, and function-level conventions. routines, no effect relations needed to be introduced between
Intra-instruction Structures. Instructions have internal them.
structures. For instance, the instruction add r1, r2, r3 Function Level Conventions. Most real-world binary func-
in MIPS can be interpreted as the add operator uses the tions adhere to various conventions, including calling conven-
value cached in the register r2 and r3 and stores the pro- tions, the layout of stack frames, and other details needed for
duced value into the register r1. Meanwhile, the instruction a binary function to work properly on a certain system. These
add rax, rdx in x86-64 can be interpreted as the add op- conventions are also indispensable parts of code semantics.
erator uses the value cached in the register rax and rdx, but Specifically, calling conventions define which registers are
stores its outputs in both rax and eflags registers. served as the call arguments and the return values, which aid
Inter-instruction Relations. Inspired by the ideas devel- in the recovery of the def-use relations of call instructions. Be-
oped in the sea of nodes IR [37], we divide relations between sides, functions store temporary variables in their own stack
instructions into three categories: data (def-use relations), frames which will not be accessed by other functions. This
control (branches), and effect (execution order). understanding contributes to refining the effect relations.
Data relations reveal the fact that some instructions use
values defined by others. Control relations define the control 2.4 Step to the Semantics-Oriented Graph
flow on the basic block level. Effect relations, which have
been overlooked by previous work, establish restrictions on The semantics-preserving transformations listed in section 2.2
the execution order between instructions. Effect relations are provide hints on properties that an ideal binary code repre-
necessary since data and control relations do not comprise sentation should possess. First, the representation should not
the full code semantics. For instance, in the code snippet assign a position identifier to each instruction or token, as
depicted in Figure 3a, no (explicit) data or control relations the semantics is independent of position. Therefore, a graph
exist between the instructions on line 4 and line 5. If not representation would be more appropriate. Second, the repre-
introducing additional restrictions, we could swap these two sentation should not fully adopt the execution order to restrict
instructions. However, this is dangerous, as the invocation of instructions since a significant portion of execution order re-
strictions is machine-dependent but semantics-independent. stores actually carry semantic information and are preserved.
Thus, we adopt the effect model to restrict only the neces- SOG is capable of encoding various function-level con-
sary execution order. Third, some token contents are indepen- ventions. For example, identified calling arguments can be
dent of semantics, while the relations between instructions revealed by simply adding data edges from the calling node to
or tokens carry the semantics. We thus aim to eliminate the the argument nodes. To reveal the conventions of stack frames,
semantics-independent tokens and model semantic relations. we can model the effect using two abstract temporaries. One
Based on these analyses, we propose a graph-based bi- stands for the stack frame of current invocation and the other
nary code representation, named semantics-oriented graph stands for all other memory regions. Thus, no effect relations
(SOG). SOG reveals well-defined semantic structures, purges will be built between the instructions that access only the
semantics-independent elements, and is capable of encoding current stack frame and the instructions that do not access it.
implicit conventions of binary functions. Intuitively, SOG can
be constructed from a linear representation in three steps:
First, we lift the sequence of instructions into a graph rep- 3 Details of SOG and its Construction 1
resentation and reveal the relations between instructions as
The construction of the SOG can be finished in a similar
edges. Figure 3b shows an example of such a representation
procedure that is used to convert linear IRs to the SSA form.
lifted from the code snippet in Figure 3a. This graph can be
To promote reader understanding, we divide the SOG into
obtained by enhancing the data flow graph (DFG) with addi-
three subgraphs and detail their construction separately.
tional control flow and effect flow edges. We name this rep-
resentation the instruction-based semantics-complete Graph To construct the control subgraph, dummy BR instructions
(ISCG), as it carries exactly the same semantic information are inserted at the end of basic blocks when necessary to
as the original linear representation and takes instructions as ensure that each basic block ends with a branch instruction.
nodes. One interesting thing is that we do not mark which Branch instructions are then treated as prolocutors for basic
basic block an instruction belongs to, because instructions blocks, and are lifted to the nodes of the graph. Control flow
can actually float over basic blocks without changing the se- edges are added from branch targets to branches. To avoid
mantics, as long as these three relations are respected. ambiguity, each node is treated as defining only one value for
each type of relation. For each conditional or indirect branch
Second, we reveal intra-instruction structures by splitting
instruction that has multiple successors, we treat its outputs
instructions into tokens. This step simplifies the generation
as a concatenated value and incorporate a PROJ node for each
of node embeddings and eases the elimination of semantics-
successor to project the concatenated value into a unit one for
independent elements. The intra-instruction structures can be
that particular successor. Figure 4a shows an example of such
interpreted as another kind of data relations. For example, we
a case, where CBR denotes a conditional branch.
can interpret the instruction r0 = LOAD 0x1000 as the LOAD
The data subgraph can be constructed in the procedure of
token using the 0x1000 token and the r0 token using the pro-
def-use analysis. During the analysis of the def-use relations
cessed result of the LOAD token. By further refining the inter-
of each instruction, we first construct a node with the instruc-
instruction relations on the exact instruction tokens which
tion operator as the node type, and then add directed edges,
introduce them (i.e., control and effect relations are defined
one per operand, from this node to some preceding node that
on instruction operators, while data relations are defined on
defines the value of that operand. If such a preceding node
the register tokens or others that temporarily cache data), and
cannot be found, that is, the operand is an integer literal or
labeling the positions of the operands on the edges, we obtain
a register that has not been defined before, a new node that
the graph shown in Figure 3c. Edges with omitted labels in
represents the integer literal or the uninitialized register is
this graph have the default label 1 (i.e., these relations are built
introduced. After that, we mark the corresponding operand
on their first operands). Figure 3c reveals semantically related
as defined. In this way, all instructions referring to the same
inter-instruction relations and intra-instruction structures. We
undefined operand connect to the same operand node.
name this representation the token-based semantics-complete
graph (TSCG). Analyzing the def-use relations at the binary code level
faces the alignment challenge. That is, the value in one
Third, since the choice of temporary stores (e.g., registers
operand may be defined by multiple instructions simulta-
or stack slots) to cache data between instructions is semanti-
neously. For instance, consider the following code snip-
cally independent, we further remove these nodes and connect
pet: mov eax, 0xAAA; mov al, 1; mov edx, eax. The
their inputs and outputs directly, forming Figure 3d, the final
value of the eax register referred by the third instruction is
representation that this study targets. It is worth noting that we
defined by both the preceding two instructions (in that code
preserve the use of uninitialized stores. Uninitialized stores
site, eax=0xA01). To address this issue, another type of node,
mostly carry the value passed from the caller routine. If the
named PIECE, is introduced to abstractly concatenate multi-
calling convention of the current function is known, the names
of uninitialized stores can be used to infer the position of the 1 Inspiredby the sea of nodes IR [37] (the PROJ node), the P-code of
argument among the argument list. Thus, the uninitialized Ghidra [31] (the PIECE node), and the Click’s IR [6] (structure of PHIs).
we combine our graph embedding model into a Siamese net-
work [3], enabling it to approximate the similarity score of a
pair of functions by the similarity of their graph embeddings.
Training We adopt margin-based pairwise loss [21] with the
(a) CBR+PROJ (b) PIECE (c) PHI distance-weighted negative sampling strategy [41] for train-
ing. Mini-batches are gathered by first sampling N different
Figure 4: Examples of special nodes in SOG. function symbols and then sampling 2 functions with different
compilation settings for each symbol. The negative sampling
process of each function gba is formulated below, where a or k
ple values. The lifted graph of the above code snippet can be indicates the index of a function symbol in the mini-batch, b
found in Figure 4b. or l indicates the index of a binary function of that symbol,
The effect subgraph can be constructed in the same way as ωb,a
l,k denotes the probability of choosing the sample gk as the
l
memory effect. And instructions that may alter the state of the b,a b,a
[ωl,k ]l∈{1,2}, k̸=a = softmax([wl,k ]l∈{1,2}, k̸=a ) (3)
memory are considered as both using and defining the mem-
ory effect. According to this definition, STORE and CALL are We use the negative of cosine similarity as the distance met-
the only instructions in the toy IR that both use and define the ric (d in Eq. 2). The training loss is calculated as follows:
memory effect and the LOAD instructions also use the memory
pos j 3− j
effect. The I/O effect can be constructed in a similar manner. si, j = 1 − m − sim(gi , gi ) (4)
Phi Nodes. It is worth noting that the SOG is naturally of neg
si, j
j
= sim(gi ,
j
neg_sampling(gi )) − m (5)
the static single assignment (SSA) form, as it defines one
N 2
node for each instruction. To keep the conciseness, we need L = ∑ ∑ ( max(si,posj , 0) + max(sneg
i, j , 0) ) (6)
to introduce necessary phi instructions for both data values i=1 j=1
and effect values. Since the semantics of the phi instructions
depends on the control flow, the first use of the phi node where m ∈ [0, 1] is a margin parameter, j ∈ {1, 2}.
is set as the branch node of that basic block where the phi
instruction resides in. Other uses of the phi node are set in 4.2 Graph Normalization and Encoding
the same order as the order of outgoing edges of the branch
node. Thus, when a branch node receives control flow from its Before we can harness the power of the graph neural network,
i-th outgoing edge, the phi node which uses that branch node we first need to transform node and edge attributes into em-
can select the i+1-th input as the output. An example of phi bedding vectors. In SOG, each node is attributed with a token.
node is shown in Figure 4c. The corresponding code snippet We can directly map each token to a learnable embedding vec-
is r0=0x1; CBR A, L3; L2: r0=0x2; BR L3; L3: XXX. tor. Each edge has a type attribute (data, control, or effect) and
Refer to Appendix A for the pseudo-code of our graph a position attribute (the index of the corresponding operand).
construction algorithm. We separately transform the type and the position attributes
into two learnable embeddings and add them to form the final
edge embedding. The model needs to learn a vocabulary of
4 BCSD With SOG tokens and labels.
However, too many different tokens (e.g. different integer
4.1 Framework literals) and position labels exist in binary code so it is nearly
impossible to include all of them in the vocabulary and to
Figure 5 illustrates the overall framework of our system. Each require them to be presented in the training dataset. Thus,
input binary function is first lifted to the proposed semantics- some tokens or labels are actually out of the vocabulary.
oriented graph, in which each node is attributed with a token To address this problem, we assign each token a token type.
and each edge has a type label and a position label. After Specifically, we identify three types of tokens: instruction
normalization, we transform each node and edge into learn- tokens, integer literals, and register tokens. Since different
able embeddings. Following the common practice of previous architectures have different conventions in register use, we
work [16, 21, 44], we use GGNN [22] to aggregate neigh- further divide register tokens into several sub-types accord-
bor structures of each node. Next, we employ the multi-head ing to the architectures. For each type of token, we identify
softmax aggregator to generate the graph embedding. Finally, the most common tokens and include them in the vocabulary
Figure 5: The overall framework of HermesSim.
XA XO XC XM x64-XO x64-XC
N Params
100 100 10000 100 100 10000
SAFE 13.4/26.4 21.1/27.5 20.1/27.6 9.9/18.9 1.4/2.32 18.4/26.2 17.2/24.9 8.1/9.5 8.93M
Asm2Vec - 24.6/30.1 25.8/31.7 - - 31.8/37.7 29.0/35.0 13.5/16.6 -
Trex 31.2/42.1 46.8/53.1 45.4/52.5 24.4/34.4 8.6/11.1 51.5/57.7 45.9/53.2 26.2/30.1 61.8M
GMN 72.6/81.7 50.3/58.1 52.3/59.8 44.7/53.7 10.5/15.9 52.4/60.2 48.0/56.2 21.9/26.7 60.5K
jTrans - - - - - 66.9/76.0 65.0/73.8 31.4/37.4 87.9M
HermesSim 95.5/97.5 81.0/85.3 78.0/83.2 74.5/80.2 43.8/50.8 81.9/86.0 75.6/80.7 48.1/54.6 388K
study [27]. To compare with jTrans, we finetune the released in the XA subtask. GMN generally outperforms SAFE,
pre-trained model on our dataset using the default settings. Asm2Vec, and Trex, especially in the XA subtask. This can
be attributed to the cross-architecture nature of CFGs. How-
Results. As shown in Table 1, HermesSim outperforms all ever, GMN performs worse than Trex in XO and XC subtasks
baselines by a large margin in all settings. Specifically, on when applying large pool sizes. One reason could be that
the x64 dataset, HermesSim achieves a recall rate that is CFGs only explicitly encode the control flow aspect, and are
22%, 16%, and 53% higher than the state-of-the-art approach, thus likely to collide when searching in large pools.
jTrans, in XO, XC (poolsize=100), and XC (poolsize=10000) The last column of Table 1 shows the number of total pa-
experiments, respectively. As illustrated in Figure 7e and 7f, rameters of each approach (excluding parameters of optimiz-
jTrans can only achieve comparative results to HermesSim ers). NLP-based methods, namely, jTrans, Trex, and SAFE
when the pool size is extremely small (i.e., 16), which is not have at least one order of magnitude more parameters than
common in real-world applications. On the full dataset, Her- graph representation-based methods. Trex and jTrans, which
mesSim improves the recall by 132%, 161%, 149%, 167%, are based on large language models, have 159 and 226 times
and 417% over GMN in XA, XO, XC, XM (poolsize=100), more parameters than our model, respectively.
and XM (poolsize=10000) experiments, respectively. Figure 7 demonstrates that HermesSim’s performance re-
NLP-based methods (SAFE and Trex) perform poorly mains more stable than baseline approaches as the pool size
Table 2: Results of the ablation study on the full dataset for XA, XO, XC, and XM subtasks, and on the x64 dataset for XO and
XC subtasks. The second row lists the pool sizes. The scores (%) are RECALL@1/MRR.
XA XO XC XM x64-XO x64-XC
100 100 10000 100 100 10000
CFG-OPC200 92.8/95.6 67.7/73.9 67.1/73.2 62.9/69.3 32.0/38.7 69.9/75.7 64.2/70.3 36.8/42.9
MSoft CFG-PalmTree - - - - - 70.8/76.4 66.1/72.3 36.3/43.0
CFG-HBMP 94.3/96.6 69.5/75.8 68.9/74.9 65.3/71.9 33.7/40.6 72.0/77.4 67.2/73.0 39.0/45.5
P-DFG 93.4/96.2 76.4/81.6 74.1/79.8 69.7/76.0 37.4/44.7 76.3/81.6 69.9/75.7 42.8/49.3
P-CDFG 94.6/96.9 77.2/82.5 75.3/80.8 71.1/77.3 38.6/45.9 77.2/82.4 72.4/77.8 43.6/50.5
MSoft
P-ISCG 95.1/97.2 78.0/82.9 74.8/80.3 71.3/77.2 40.1/47.2 78.1/82.6 71.9/77.1 44.2/50.9
P-TSCG 95.5/97.4 79.2/83.6 76.2/81.7 73.2/78.9 41.3/48.8 79.1/83.7 73.5/78.7 46.4/52.9
Set2Set 89.5/93.0 67.2/72.2 64.5/69.9 61.2/66.9 29.8/36.6 69.2/73.7 60.7/66.1 37.3/43.1
Softmax P-SOG 90.9/94.3 72.9/78.1 71.0/76.6 64.3/71.2 32.6/39.4 74.0/78.7 66.9/72.6 41.2/47.2
Gated 93.5/96.1 75.6/80.3 72.5/77.6 68.3/74.2 38.4/44.9 76.3/80.7 69.0/74.3 43.9/49.9
MSoft P-SOG 95.5/97.5 81.0/85.3 78.0/83.2 74.5/80.2 43.8/50.8 81.9/86.0 75.6/80.7 48.1/54.6
increases from 2 to 8192. The relative performance of Hermes- through a bidirectional GRU layer end-to-end.
Sim becomes even better than the state-of-the-art baselines • P-CDFG is similar to the P-DFG representation except that
(i.e., jTrans and GMN) as the pool size increases (e.g., from it integrates the control flow relations as additional edges.
4.5 to 35.0 in XM subtask compared to the GMN as the pool • P-ISCG is the first semantics-complete graph proposed in
size increases from 2 to 8192). Section 2.4. The difference between ISCG and CDFG lies
in the introduction of effect flow edges. An example of
ISCG can be found in Figure 2b.
5.3 Ablation Study • P-TSCG is the second semantics-complete graph pro-
Baselines. To demonstrate the efficiency of the SOG repre- posed. Compared to the ISCG, it additionally reveals intra-
sentation for BCSD, we select several promising representa- instruction structures. Figure 2c is an example of TSCG.
tions from previous work that are not surpassed by others: Our multi-head softmax aggregator is compared with:
• CFG-opc200 is a CFG based binary function representation • Softmax Aggregator [19] is the base model of our proposed
using the opc200 manually crafted features as basic block aggregator. It is a generalized version of both the Mean
attributes. This representation is proposed by Marcelli et Aggregator and the Max Aggregator.
al. [27] and shows advantages over previous Word2vec [30] • Gated Aggregator is proposed along with the GGNN by Li
based methods [28]. et al. [22] and is used by several previous studies [21, 27].
• CFG-PalmTree aggregates unsupervised instruction embed- • Set2set Aggregator is proposed by Vinyals et al. [39] and is
dings generated by the PalmTree model [20] as the basic used by the related work [44]. This aggregator is based on
block embedding. PalmTree is the state-of-the-art unsu- the iterative attention mechanism.
pervised instruction embedding network. We follow the To compare with those baselines, we keep other parts of
practice of the original paper to use mean pooling as the HermesSim, tune only the hyper-parameters tightly related to
aggregator to obtain basic block embedding. This baseline these methods, and report the best results found. For repre-
only supports the X86 architecture. sentations, we tune the hyper-parameters of graph encoders
• CFG-HBMP use HBMP model [36] to compute the block and the batch sizes within the constraint of GPU memory.
embedding end-to-end. This method is proposed by Yu For aggregators, we tune the hyper-parameters inside these
et al. [44] and performs better than previous pre-training modules, the final graph embedding size, and the batch sizes.
methods [28, 43] using Bert [7] or Word2vec [30]. Observing that the result scores of some baselines are close,
we repeat the experiments in this subsection 10 times to miti-
In addition, we compare the SOG representation with straw-
gate the effect of randomness. Mean values are reported.
man representations mentioned in Section 2.4:
• P-DFG3 takes instructions as nodes and def-use relations Results. Table 2 shows the results of the ablation study. The
between instructions as edges. We build the DFG based first section of the table contains the results of baseline repre-
on Ghidra’s Pcode IR and generate instruction embeddings sentations, in which the CFG-HBMP method stably outper-
forms the other two methods. The second section shows the
3 The ‘P-’ prefix refers to ‘Pcode-based’. results of the straw-man representations mentioned in Sec-
tion 2.4. The third section tests the effectiveness of baseline
aggregators on the proposed SOG representation. The last
section shows the results of our proposed method.
Compared to the baseline representation in the first section
of Table 2, the proposed SOG representation improves the
recall by around 10 percent in all subtasks except XA. XA is
the easiest subtask for the GNN and the graph representation-
based methods, in which even the CFG-opc200 method can
achieve a recall rate as high as 92.8%. And we observe that
the relative performance of the SOG over the CFG-HBMP
becomes better as the pool size increases from 100 to 10000
in both the XM subtask (from 9.2% to 10.1% in RECALL@1)
and the x64-XC subtask (from 8.4% to 9.1% in RECALL@1).
In the second section, the RECALL@1 and MRR scores of
P-DFG, P-CDFG, P-ISCG, and P-TSCG generally increase
successively, demonstrating that the reveal of control flow
relations, effect flow relations, and intra-instruction structures Figure 8: Validation MRR scores of different representations
can indeed improve the efficacy. The introduction of effect on the XM task.
flow edges slightly hurt the performance in the XC subtask
on both datasets when setting the pool size as 100. This may Table 3: Efficiency of HermesSim in terms of average time
be attributed to the dirty effect problem which we will discuss cost per 103 functions in our testing dataset. Lifting and
later in Section 6. The performance in all subtasks benefits searching time is evaluated on Intel(R) Xeon(R) Silver 4214
from the introduction of control flow edges and the reveal of @2.20GHz CPU but in a single process. Training and infer-
the intra-instruction structures. ring time are evaluated on one NVIDIA RTX3080 GPU.
The P-DFG method outperforms the CFG-HBMP method
in nearly all subtasks except XA, which demonstrates the
Training Lifting Inferring Searching
superiority of CFGs in the cross architectures scenario and
the superiority of DFGs in the cross-optimization and the 62mins (20 epochs) 3.20s 0.35s 1.50ms
cross-compiler subtasks. In addition, the superiority of the
SOG over the TSCG supports the hypothesis that keeping the
semantics-independent elements in the representation hurts Table 4: Average number of nodes and edges in different
the generalization ability of the model. graph representations and the inferring time cost.
The multi-head softmax aggregator achieves the RE-
CALL@1 and MRR scores by 134% and 129% respectively Num. Nodes Num. Edges Inferring
over the original softmax aggregator on the XM subtask with CFG(-HBMP) 30.6 40.8 0.14s
a pool size of 10000. Additionally, the multi-head softmax ag-
gregator significantly outperforms other baseline aggregators. P-DFG 442.7 0.20s
Figure 8 illustrates how the MRR scores of different repre- P-CDFG 427.7 551.7 0.25s
sentations in the validation dataset increase during the training P-ISCG 688.9 0.27s
campaigns. The SOG representation achieves the best MRR P-TSCG 803.2 1310.7 0.54s
score throughout the training campaigns. The ISCG achieves P-SOG 542.7 1010.7 0.35s
better scores than TSCG in the early stages but fails to retain
its advantage later. The CFG-HBMP method, which utilizes
the HBMP NLP model to learn basic block attributes, achieves
comparable validation scores as the P-CDFG method but fails against a pool of other embeddings.
to generalize as well to the larger testing dataset. In addition, we show the sizes of different graph represen-
tations as well as their inferring time cost in Table 4. CFGs
5.4 Runtime Efficiency and ISCGs are smaller than SOGs on average, which implies
both lower memory and time costs during the local structure
Table 3 shows the runtime cost of HermesSim. The lifting capture stage, but at the cost of complicating the node at-
time is the time cost to lift a binary function from the lin- tributes extraction. SOGs have 32% fewer nodes and 23%
ear representation to the SOG. The inferring time consists fewer edges than TSCGs and are 35% faster in terms of the
of encoding the SOG to a numeric embedding vector. The inferring time, showing the purge of semantics-independent
searching time is the cost of comparing a function embedding elements meaningfully improves the efficiency as well.
5.5 Real-world Vulnerability Search Table 5: Results of real-world vulnerability searching exper-
iments. The Tot. columns show the number of ground truth
In this experiment, we collect 12 RTOS firmware images from functions in two categories respectively for each query func-
three vendors (TP-Link, Mercury, and Fast) and perform the tion. The SAFE, Trex, GMN, and Our columns represent the
1-day vulnerability search task. We first build a repository that number of ground truth functions that fail to be recalled by
consists of all functions in these images and manually identify each respective method.
5 CVEs and 10 related vulnerable functions in the TP-Link
WDR7620 firmware. Then we use these vulnerable functions Tot. SAFE Trex GMN Oura
as queries to search for similar functions in the repository. #
c1 c2 c1 c2 c1 c2 c1 c2 c1 c2
Our repository contains 62605 functions in total, which are
of two different architectures, i.e., ARM32 and MIPS32. 0 7 1 3 1 3 0 3 0 0.0 0.0
For each query function, we categorize the functions in the 1 1 8 0 8 0 4 0 5 0.0 0.0
repository into three groups: c1: functions built from exactly 2 9 0 5 0 4 0 4 0 0.0 0.0
the same source code as the query function. These functions 3 3 2 2 2 1 2 1 2 0.0 0.0
are previously unknown vulnerable functions that need to be 4 4 7 2 7 1 5 2 7 0.5 0.1
retrieved. c2: functions built from a slightly different source 5 4 7 2 7 1 7 2 7 0.4 0.5
code (e.g., from different versions) but are of the same symbol 6 6 4 2 4 2 4 1 4 0.2 0.0
as the query function (when the function symbol is available). 7 7 5 4 4 3 4 3 3 3.0 0.0
Functions in this categorization are potentially vulnerable 8 2 3 0 3 0 3 0 3 0.0 0.1
functions that need manual identification. c3: other functions 9 3 2 1 2 1 2 1 0 0.0 0.0
(i.e., functions that are compiled from different source code). XAb 15 28 15 28 15 28 15 26 4.1 0.7
We identify the ground truth by performing similarity searches SAb 31 11 6 10 1 3 2 5 0.0 0.0
using HermesSim and other compared baselines and manually Tot. 46 39 21 38 16 31 17 31 4.1 0.7
examine the top 20 results of all methods.
R1c 54 3 65 21 63 21 91 98
We evaluate the results using RECALL@1 and MRR as MRR 55 3 67 22 73 21 94 99
well. Our goal is to retrieve all functions in c1 and c2. Ideally,
the BCSD system should rank functions in c1 before functions a As with previous experiments in Section 5.3, the results of our method
in c2 and c3, and rank functions in c2 before those in c3. is derived from averaging the results of 10 independent runs.
b XA: Cross-architecture. SA: Single Architecture.
Thus, we calculate the rank of each ground truth function c RECALL@1 (%)
by adding the number of more dissimilar functions that have
higher similarity scores by 1. For example, for a ground truth
function in c1, its rank is the number of functions that have 6 Discussion
higher similarity scores but are categorized in c2 and c3 plus
one. The ideal rank of each ground truth function is 1. The semantics-oriented graph is a concise and semantics-
Table 5 shows the results of HermesSim and other base- complete graph representation for binary code. Although we
lines that support the cross-architecture task. HermesSim only focus on utilizing this representation for BCSD in this
outperforms other baselines in terms of the number of fail- paper, it should also be applied to other binary code-related
ures, RECALL@1, and MRR by a large margin. HermesSim tasks that require understanding the code semantics. Besides,
can handle most cases except that it ranks a slightly modified this representation is very suitable for encoding multiple extra
version of the function 7 before three functions in c1. analysis abilities. We discuss potential improvements to this
In the SA scenario, SAFE, Trex, and GMN perform poorer representation in more detail below.
on recalling functions in c2. For instance, GMN can recall Dirty Effect Problem. The effect flow in the source code is
29 out of 31 functions in c1, but can only recall 6 out of 11 mostly kept as is during compilation and optimization because
functions that have slight modifications at the source code the compiler does not know what will happen if the execution
level. This indicates that baseline methods are ineffective in order of the effect-related instructions is changed. Thus, due
ranking functions according to the semantics similarity. In to its cross-optimization and cross-architecture nature, such
contrast, HermesSim can recall all of them. effect flow should be useful for BCSD.
Furthermore, NLP methods, i.e., SAFE and Trex, are unable However, some effect-related instructions are introduced
to recall any of the 43 ground truth functions that are of an during compilation, such as the use of the stack frame, which
architecture different from the query function (i.e., ineffective is transparent to the compiler and can be manipulated. Mean-
in the XA scenarios). Even GMN can only recall 2 of these while, the number of stack temporary variables is architecture
functions due to the large pool size. In contrast, our proposed and optimization level dependent. Thus, including stack mem-
method can recall on average 38.3 such functions. ory accesses in the effect flow may not be beneficial. An
See Appendix B for details about the vulnerabilities. additional pass of the load-store elimination analysis can be
applied to clean up this pollution, which is left as future work. Recently, some researchers leverage more advanced NLP
I/O Effect Model. We only investigate the memory effect techniques and pre-trained models to automatically extract
model in this paper. The I/O effect model pertains to instruc- latent representations of binary code at either the basic block
tions that interact with input/output (I/O) devices, such as the level or function level. Zuo et al. [45] model the basic
LOAD instructions that read from the memory-mapped I/O blocks as natural sentences and design a cross-assemble-
regions. Different from the memory effect model, receiving lingual basic block embedding model based on word embed-
data from I/O devices can change the states of these devices ding and LSTM. Perdisci et al. [29] combine the skip-gram
as well. The I/O effect model should be useful when the tested method [30] and the self-attentive network [23] to generate
functions directly communicate with I/O devices. function embeddings. Yu et al. [43], Guo et al. [16] and Luo
et al. [26] specially design several pre-training tasks for bi-
Extra Information and Encoding Ability. References to
nary code and train a BERT-based [8] large language model
strings, integers, external function symbols, and other related
to generate basic block embeddings. Pei et al. [33], Ahn et
entities are not handled by this study. They can be integrated
al. [1], and Wang et al. [40] use a Transformer-based model to
by properly developing modules that encode these features
generate function-level embeddings directly. However, these
into a unified embedding space as other node tokens. NLP-
methods treat the disassembled binary code as natural lan-
based techniques should be suitable for encoding these ele-
guages and fail to exploit the well-defined code semantics.
ments, as demonstrated in previous studies [18, 44].
Besides, the use of increasingly large models results in signif-
Addressing Analysis Failures. Currently, traditional pro- icantly higher training and inference costs.
gram analysis algorithms may fail to recover the full control
flow relations of the indirect branches, which is still an open Binary Code Representation for BCSD. The raw repre-
problem. This limitation affects not only our SOG represen- sentation of a binary function is a stream of bytes, which is
tation but also the linear representation used by NLP-based featureless. Thus, it is crucial to design an effective repre-
methods and the CFG representation. In one scenario, due sentation for BCSD. In the early stage, CFG is widely used
to control flow recovery failures, basic blocks at the poten- for binary code representation due to its cross-architecture
tial indirect branch targets are not included in the scope of nature [2, 10, 14]. Later, Gao et al. [15] enhance the CFG
the function. This causes all representations to be equally with inter-basic-block data flow edges. Guo et al. [16] man-
incomplete. In the other scenario, only the control flow rela- age to learn an embedding from DFG along with CFG and
tions are lost while the affected basic blocks are detected (e.g. make use of the def-use relations of code. Wang et al. [40]
because other paths lead to them). The linear representation embed control flow information into Transformer through pa-
is insensible to such failures because it does not exploit the rameter sharing and get promising results. Previous work has
control flow relations, while the CFG and the SOG primarily taken an involuntary step toward exploiting full code struc-
miss some edges. Nevertheless, this does not imply that the tures. Our work examines full code semantics and reveals
linear representation is superior because it offers no more it through a novel graph representation. Similar representa-
information than graph representations to neural networks for tions [5, 6, 12, 37] exist in compiler research to ease the op-
inferring the missed control flow. timization, but they are built from the source code and thus
We argue that it is impossible to understand the code seman- cannot be directly employed in binary code related tasks.
tics without first figuring out the control flow. And therefore, BCSD beyond the Code. Other studies attempt to integrate
it may be worthwhile to explore how to provide the necessary related information beyond the code semantics. For instance,
information (e.g., referred data) to deep neural networks to Yu et al. [44] embed string and integer references separately
enable them to deduce the control flow of indirect branches. and concatenate them into the final function embedding. Kim
et al. [18] introduce the binary disassembly graph to include
external function references and string literal references as
7 Related Work features. Luo et al. [26] first predict the compiler and the op-
timization level using an entropy-based technique, and then
This section surveys the learning-based BCSD approaches.
transfer function embeddings from different compilation set-
We divide the development in this direction into three lines.
tings into a unified embedding space. While our work focuses
Deep learning (DL) for BCSD. With the development of on exploiting the code semantics, techniques developed in
artificial intelligence techniques, DL algorithms with more this line can be integrated into our system as well (see §6).
powerful feature extraction capabilities are applied to BCSD.
For example, GNN is widely used due to its ability to effec-
tively capture structural information [15, 21, 24, 42, 44]. Xu et 8 Conclusion
al. [42] introduce the Structure2vec model to learn features
from control flow structures and achieve significantly better In this paper, we propose a semantics-complete binary
results than previous methods. Li et al. [21] propose the use of code representation, the semantics-oriented graph (SOG), for
more advanced GNN models, i.e., the GGNN and the GMN. BCSD. This representation not only exploits the well-defined
code structures, semantics, and conventions but also purges [4] Silvio Cesare, Yang Xiang, and Wanlei Zhou. Control
the semantics-independent elements embedded in the low- flow-based malware variantdetection. IEEE Transac-
level machine code. We detail the construction of the SOG tions on Dependable and Secure Computing, 11(4):307–
and discuss potential improvements to this representation.
To unleash the potential of the SOG for BCSD, we propose a novel multi-head softmax aggregator, which allows for the effective fusion of multiple aspects of the graph. By integrating the proposed techniques, we build an effective and efficient BCSD solution, HermesSim, which relies on a GNN model to capture the structural information of the SOG and adopts advanced training strategies.
Extensive experiments demonstrate that HermesSim significantly outperforms state-of-the-art approaches in both laboratory experiments and real-world vulnerability searches. In addition, our evaluation confirms the value of exposing the full semantic structures of binary code and of removing semantics-independent elements. We also demonstrate the effectiveness of the proposed aggregator and the efficiency of HermesSim.

Acknowledgments

We thank the anonymous reviewers of this work for their helpful feedback. We thank Marcelli et al. [27] for their valuable work in reviewing and reproducing prior work. This research was supported, in part, by the National Natural Science Foundation of China under Grant No. 62372297, the Ant Group Research Fund, the Science and Technology Commission of Shanghai Municipality Research Program under Grant No. 20511102002, and the National Radio and Television Administration Laboratory Program (TXX20220001ZSB002). Yuede Ji was supported by the University of North Texas faculty startup funding.

References

[1] Sunwoo Ahn, Seonggwan Ahn, Hyungjoon Koo, and Yunheung Paek. Practical Binary Code Similarity Detection with BERT-based Transferable Similarity Learning. In Proceedings of the 38th Annual Computer Security Applications Conference, pages 361–374, Austin, TX, USA, December 2022. ACM.

[2] Saed Alrabaee, Paria Shirani, Lingyu Wang, and Mourad Debbabi. SIGMA: A Semantic Integrated Graph Matching Approach for identifying reused functions in binary code. Digital Investigation, 12:S61–S71, March 2015.

[3] Jane Bromley, Isabelle Guyon, Yann LeCun, Eduard Säckinger, and Roopak Shah. Signature Verification using a "Siamese" Time Delay Neural Network. In Advances in Neural Information Processing Systems, volume 6. Morgan-Kaufmann, 1993.

[5] Cliff Click. From Quads to Graphs: An Intermediate Representation's Journey. Technical Report CRPC-TR93366-S, Center for Research on Parallel Computation, Rice University, 1993.

[6] Cliff Click and Michael Paleczny. A simple graph-based intermediate representation. ACM SIGPLAN Notices, 30(3):35–49, 1995.

[7] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

[8] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Jill Burstein, Christy Doran, and Thamar Solorio, editors, Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pages 4171–4186. Association for Computational Linguistics, 2019.

[9] Steven H. H. Ding, Benjamin C. M. Fung, and Philippe Charland. Asm2Vec: Boosting Static Representation Robustness for Binary Clone Search against Code Obfuscation and Compiler Optimization. In 2019 IEEE Symposium on Security and Privacy (SP), pages 472–489, San Francisco, CA, USA, May 2019. IEEE.

[10] Thomas Dullien and Rolf Rolles. Graph-based comparison of executable objects. SSTIC, 5(1):3, 2005.

[11] Qian Feng, Rundong Zhou, Chengcheng Xu, Yao Cheng, Brian Testa, and Heng Yin. Scalable Graph-based Bug Search for Firmware Images. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pages 480–491, Vienna, Austria, October 2016. ACM.

[12] Jeanne Ferrante, Karl J. Ottenstein, and Joe D. Warren. The program dependence graph and its use in optimization. ACM Transactions on Programming Languages and Systems, 9(3):319–349, July 1987.

[13] Matthias Fey and Jan E. Lenssen. Fast graph representation learning with PyTorch Geometric. In ICLR Workshop on Representation Learning on Graphs and Manifolds, 2019.
[14] Debin Gao, Michael K. Reiter, and Dawn Song. BinHunt: Automatically Finding Semantic Differences in Binary Programs. In Liqun Chen, Mark D. Ryan, and Guilin Wang, editors, Information and Communications Security, volume 5308 of Lecture Notes in Computer Science, pages 238–255. Springer, Berlin, Heidelberg, 2008.

[15] Jian Gao, Xin Yang, Ying Fu, Yu Jiang, and Jiaguang Sun. VulSeeker: A semantic learning based vulnerability seeker for cross-platform binary. In Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering, pages 896–899, Montpellier, France, September 2018. ACM.

[16] Yixin Guo, Pengcheng Li, Yingwei Luo, Xiaolin Wang, and Zhenlin Wang. Exploring GNN based program embedding technologies for binary related tasks. In Proceedings of the 30th IEEE/ACM International Conference on Program Comprehension, pages 366–377, Virtual Event, May 2022. ACM.

[17] Armijn Hemel, Karl Trygve Kalleberg, Rob Vermaas, and Eelco Dolstra. Finding software license violations through binary code clone detection. In Proceedings of the 8th Working Conference on Mining Software Repositories, MSR '11, pages 63–72, New York, NY, USA, 2011. Association for Computing Machinery.

[18] Geunwoo Kim, Sanghyun Hong, Michael Franz, and Dokyung Song. Improving cross-platform binary analysis using representation learning via graph alignment. In Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis, pages 151–163, Virtual, South Korea, July 2022. ACM.

[19] Guohao Li, Chenxin Xiong, Ali Thabet, and Bernard Ghanem. DeeperGCN: All you need to train deeper GCNs. arXiv preprint arXiv:2006.07739, 2020.

[20] Xuezixiang Li, Yu Qu, and Heng Yin. PalmTree: Learning an Assembly Language Model for Instruction Embedding. In Proceedings of the 2021 ACM SIGSAC Conference on Computer and Communications Security, pages 3236–3251, Virtual Event, Republic of Korea, November 2021. ACM.

[21] Yujia Li, Chenjie Gu, Thomas Dullien, Oriol Vinyals, and Pushmeet Kohli. Graph Matching Networks for Learning the Similarity of Graph Structured Objects. In Proceedings of the 36th International Conference on Machine Learning, pages 3835–3845. PMLR, May 2019.

[22] Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard Zemel. Gated graph sequence neural networks. arXiv preprint arXiv:1511.05493, 2015.

[23] Zhouhan Lin, Minwei Feng, Cicero Nogueira dos Santos, Mo Yu, Bing Xiang, Bowen Zhou, and Yoshua Bengio. A Structured Self-Attentive Sentence Embedding. In International Conference on Learning Representations, 2017.

[24] Shangqing Liu. A unified framework to learn program semantics with graph neural networks. In Proceedings of the 35th IEEE/ACM International Conference on Automated Software Engineering, pages 1364–1366, Virtual Event, Australia, December 2020. ACM.

[25] Lannan Luo, Jiang Ming, Dinghao Wu, Peng Liu, and Sencun Zhu. Semantics-based obfuscation-resilient binary code similarity comparison with applications to software and algorithm plagiarism detection. IEEE Transactions on Software Engineering, 43(12):1157–1177, 2017.

[26] Zhenhao Luo, Pengfei Wang, Baosheng Wang, Yong Tang, Wei Xie, Xu Zhou, Danjun Liu, and Kai Lu. VulHawk: Cross-architecture Vulnerability Detection with Entropy-based Binary Code Search. In Proceedings 2023 Network and Distributed System Security Symposium, San Diego, CA, USA, 2023. Internet Society.

[27] Andrea Marcelli, Mariano Graziano, Xabier Ugarte-Pedrero, Yanick Fratantonio, Mohamad Mansouri, and Davide Balzarotti. How machine learning is solving the binary function similarity problem. In 31st USENIX Security Symposium (USENIX Security 22), pages 2099–2116, Boston, MA, August 2022. USENIX Association.

[28] Luca Massarelli, Giuseppe A. Di Luna, Fabio Petroni, Leonardo Querzoni, and Roberto Baldoni. Investigating Graph Embedding Neural Networks with Unsupervised Features Extraction for Binary Analysis. In Proceedings 2019 Workshop on Binary Analysis Research, San Diego, CA, 2019. Internet Society.

[29] Luca Massarelli, Giuseppe Antonio Di Luna, Fabio Petroni, Roberto Baldoni, and Leonardo Querzoni. SAFE: Self-Attentive Function Embeddings for Binary Similarity. In Roberto Perdisci, Clémentine Maurice, Giorgio Giacinto, and Magnus Almgren, editors, Detection of Intrusions and Malware, and Vulnerability Assessment, volume 11543 of Lecture Notes in Computer Science, pages 309–329. Springer International Publishing, Cham, 2019.

[30] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. Distributed Representations of Words and Phrases and their Compositionality. In Advances in Neural Information Processing Systems, volume 26. Curran Associates, Inc., 2013.

[31] NSA. Ghidra. https://ghidra-sre.org/.
[32] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc., 2019.

[33] Kexin Pei, Zhou Xuan, Junfeng Yang, Suman Jana, and Baishakhi Ray. Learning Approximate Execution Semantics From Traces for Binary Function Similarity. IEEE Transactions on Software Engineering, 49(4):2776–2790, April 2023.

[34] Nguyen Anh Quynh. Capstone - the ultimate disassembly framework. https://www.capstone-engine.org/.

[35] Hex-Rays. IDA Pro. https://hex-rays.com/ida-pro/.

[36] Aarne Talman, Anssi Yli-Jyrä, and Jörg Tiedemann. Sentence Embeddings in NLI with Iterative Refinement Encoders. Natural Language Engineering, 25(4):467–482, July 2019.

[37] Google V8 Team. TurboFan. https://v8.dev/docs/turbofan.

[38] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is All you Need. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.

[39] Oriol Vinyals, Samy Bengio, and Manjunath Kudlur. Order matters: Sequence to sequence for sets. arXiv preprint arXiv:1511.06391, 2015.

[40] Hao Wang, Wenjie Qu, Gilad Katz, Wenyu Zhu, Zeyu Gao, Han Qiu, Jianwei Zhuge, and Chao Zhang. jTrans: Jump-aware transformer for binary code similarity detection. In Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis, pages 1–13, Virtual, South Korea, July 2022. ACM.

[41] Chao-Yuan Wu, R. Manmatha, Alexander J. Smola, and Philipp Krahenbuhl. Sampling Matters in Deep Embedding Learning. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 2859–2867, Venice, October 2017. IEEE.

[42] Xiaojun Xu, Chang Liu, Qian Feng, Heng Yin, Le Song, and Dawn Song. Neural Network-based Graph Embedding for Cross-Platform Binary Code Similarity Detection. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, pages 363–376, Dallas, TX, USA, October 2017. ACM.

[43] Zeping Yu, Rui Cao, Qiyi Tang, Sen Nie, Junzhou Huang, and Shi Wu. Order Matters: Semantic-Aware Neural Networks for Binary Code Similarity Detection. Proceedings of the AAAI Conference on Artificial Intelligence, 34(01):1145–1152, April 2020.

[44] Zeping Yu, Wenxin Zheng, Jiaqi Wang, Qiyi Tang, Sen Nie, and Shi Wu. CodeCMR: Cross-Modal Retrieval For Function-Level Binary Source Code Matching. In Advances in Neural Information Processing Systems (NeurIPS), 2020.

[45] Fei Zuo, Xiaopeng Li, Patrick Young, Lannan Luo, Qiang Zeng, and Zhexin Zhang. Neural Machine Translation Inspired Binary Code Similarity Comparison beyond Function Pairs. In Proceedings 2019 Network and Distributed System Security Symposium, San Diego, CA, 2019. Internet Society.

A Algorithm for the Construction of SOG

Algorithm 1 shows the overall structure of our graph construction algorithm, where defState is an elaborate interface that provides all the control, data, and effect definition states at the current code point. The defState supports putting variables with partial overlap, in which case it divides the variables into smaller units that either do not overlap or are exactly the same; it is also able to piece multiple defined variables together to form a larger requested variable. For each defined variable, the defState maintains a stack that records recent definitions of that variable. Meanwhile, it logs all definition operations to support commit and revert. The MayWrapWithProj procedure wraps the use of conditional or indirect branches that have multiple successors, as discussed in Section 3.
Algorithm 1 SOG Construction Algorithm
 1: procedure ConstructSOG(f)
 2:     ▷ f: a binary function in assembly language or a linear IR.
 3:     defState ← new DefState();
 4:     g ← new SOGConstructor();
 5:     effectNodes ← GetAbstractEffectNodes();
 6:     dt ← GetDominatorTree(f.cfg);
 7:     f ← InsertPhiNodes(f, dt);
 8:     f ← PrepareControl(f);
 9:     ProcessBlock(g, defState, f.start, effectNodes);
10:     return g;
11:
12: procedure PrepareControl(f, defState)
13:     for bb in f do
14:         if bb does not end with a BRANCH instruction then
15:             bb.append(new DummyBranch());
16:         node ← SOGNodeFromInst(bb.getLastInst());
17:         defState.put(bb, node);
18:
19: procedure ProcessBlock(g, defState, bb, effectNodes)
20:     defState.commit();    ▷ Commit the current definition state.
21:     BuildInBlock(g, defState, bb, effectNodes);
22:     for child in dt.getChildren(bb) do
23:         ProcessBlock(g, defState, child, effectNodes);
24:     defState.revert();    ▷ Revert to the last commit point.
25:     return;
26:
27: procedure BuildInBlock(g, defState, bb, effectNodes)
28:     for inst in bb do
29:         node ← SOGNodeFromInst(inst);
30:         g.add(node);
31:         ▷ Recover data flow relations.
32:         for input in inst.inputs do
33:             inpNode ← defState.peekOrNew(input);
34:             node.addDataUse(inpNode);
35:         for output in inst.outputs do
36:             defState.put(output, node);
37:         ▷ Recover effect flow relations.
38:         for effect in effectNodes do
39:             if UseEffect(node, effect) then
40:                 node.addEffectUse(defState.peek(effect));
41:             if DefineEffect(node, effect) then
42:                 defState.put(effect, node);
43:         ▷ Handle control flow relations.
44:         if IsControlNode(node) then
45:             for pred in bb.predecessors do
46:                 prenode ← defState.peek(pred);
47:                 prenode ← MayWrapWithProj(prenode, bb);
48:                 node.addControlUse(prenode);
49:             for succ in bb.successors do
50:                 inOrder ← succ.getPrecedingIndex(bb);
51:                 for phi in succ.phiNodes do
52:                     phi.setPhiUse(inOrder, defState.peekOrNew(phi.store));
53:     return;

B Details about the Vulnerability Searching

Table 6 details the vulnerable functions found and examined in the real-world vulnerability search experiment (§5.5). The results show that even models from different vendors share a significant amount of closed-source code bases. These findings highlight the pressing need for reliable binary code similarity detection techniques.

Furthermore, we find the function symbols in the TL-WDR7620 firmware, while other firmware images, such as TL-SG2206 and FAST-FAC1200RQ, do not contain these symbols. With the help of BCSD techniques, we can transfer the function symbols found in one firmware image to the others, making reverse engineering much easier.

// Original Vulnerable Function.
int sub_404D34C4(int a1, int a2, unsigned __int16 a3)
{
  if ( !a1 || !a2 || !a3 )
    return 1;
  ......
}

// The slightly different version.
int sub_40477EB0(int a1, int a2, unsigned __int16 a3, int a4, int a5)
{
  if ( !a1 || !a2 || !a3 || !a4 || !a5 )
    return 1;
  ......
}

Figure 9: False Positive Samples.

False Positive Case. Figure 9 shows the only example that consistently appears as a false positive in 10 independent runs of HermesSim. The slightly modified version of the function is ranked above the function compiled from the same source code for a different architecture. All compared baselines, i.e., SAFE, Trex, and GMN, fail in this case as well. This problem could be mitigated by additionally introducing slightly modified function pairs into the training dataset (e.g., by mutating the source code).

Figure 10: Kernel density estimation of the number of SOG nodes in the training and testing sets, respectively.
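The curves in Figure 10 are Gaussian kernel density estimates over per-function SOG node counts. The following sketch is not the plotting code used for the paper (the grid range and the default bandwidth are assumptions); it only shows how such a curve can be computed with SciPy.

import numpy as np
from scipy.stats import gaussian_kde

def node_count_density(node_counts, grid_max=4000, points=400):
    # node_counts: number of SOG nodes for each function in one split.
    kde = gaussian_kde(np.asarray(node_counts, dtype=float))
    xs = np.linspace(0, grid_max, points)
    return xs, kde(xs)   # densities to plot against the grid positions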
Table 6: Details of the 1-day vulnerability search.
(a) Identified functions in these firmware images are in category c2, i.e., they are from slightly different versions of the source code.

C Impact of Function Sizes

Distribution of function sizes. Figure 10 exhibits the distribution of function sizes in the training and testing sets in terms of the number of SOG nodes. The peaks of the two KDEs are both around the 25% thresholds, with values of 148 and 168 for the training and testing sets, respectively. Since the functions of the training and testing sets come from different projects, these curves should resemble the real-world distribution of functions.

Performance on sets with extremely large and extremely small functions. Another two sub-datasets are created for testing: x64-XC-Small and x64-XC-Large, which include only the smallest 1% of functions (fewer than 71 nodes) and the largest 1% of functions (more than 3676 nodes) in the testing set, respectively. Due to the limited number of functions in these sub-datasets, we sample only 200 functions as queries for each task and use a pool size of 100.

Table 7: Results of the study on the impact of function sizes. Queries and pools are separately sampled from three testing sub-datasets: x64-XC, x64-XC-Small, and x64-XC-Large. The scores (%) are MRR.

Queries sampled from   xXC (a)  xXC-S (b)  xXC-L  xXC-S  xXC-L
Pools sampled from     xXC      xXC        xXC    xXC-S  xXC-L
SAFE                   24.9     45.4       37.4   27.8   24.4
Asm2Vec                35.0     69.0       76.8   60.7   56.2
Trex                   53.2     91.9       72.6   81.9   67.0
GMN                    56.2     89.9       86.3   65.0   58.4
jTrans                 73.8     93.4       87.8   79.9   82.4
HermesSim              80.7     97.8       96.2   85.2   92.6

(a) xXC stands for 'x64-XC' in this table; we omit x64 for conciseness.
(b) S and L are abbreviations for 'Small' and 'Large'.
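The scores in Table 7 are mean reciprocal ranks (MRR). For reference, the sketch below shows one way to compute MRR for a query set against a pool of candidate functions; it assumes cosine similarity over precomputed function embeddings, and all variable names are illustrative rather than taken from the HermesSim code.

import numpy as np

def mrr(query_embs, pool_embs, gt_index):
    """Mean reciprocal rank of the ground-truth match for each query.

    query_embs: (Q, D) array of query function embeddings.
    pool_embs:  (P, D) array of pool function embeddings.
    gt_index:   length-Q array; gt_index[i] is the pool row holding the
                function compiled from the same source as query i.
    """
    # Cosine similarity between every query and every pool function.
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    p = pool_embs / np.linalg.norm(pool_embs, axis=1, keepdims=True)
    sim = q @ p.T                                    # (Q, P)
    # Rank of the ground truth: 1 + number of pool entries scored higher.
    gt_scores = sim[np.arange(len(gt_index)), gt_index]
    ranks = 1 + (sim > gt_scores[:, None]).sum(axis=1)
    return float(np.mean(1.0 / ranks))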
As demonstrated in Table 7, a significant performance improvement can be observed for all approaches when limiting the queries to extremely small or extremely large functions. This seems intuitive, since distinguishing those outlier functions from others is easy. When this restriction is further applied to the functions in the pools, performance suffers due to the increased similarity between the negative functions and the queries.
SAFE, Trex, and GMN all demonstrate inadequacies in handling large functions. Asm2Vec shows improved performance in recalling large functions compared to small functions in general pools, but struggles to do so in pools with similarly sized functions. jTrans exhibits the opposite behavior to Asm2Vec. HermesSim behaves similarly to jTrans, but with better scalability in handling large functions. We believe the improved ability of HermesSim to handle large functions (relative to GMN) can be attributed to the abundant semantics of the SOG and the powerful aggregator.
The insufficiency of NLP-based methods (e.g., Trex and jTrans) in handling large functions may stem from their truncation mechanism: they receive only a fixed-length input and discard any portion of the sequence that exceeds this limit.
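Illustratively, with a hypothetical 512-token cap (the actual limits differ across models):

# Illustrative only: a hypothetical 512-token cap. Instructions of a
# large function beyond the cap never reach the model, so their
# semantics cannot influence the resulting embedding.
MAX_TOKENS = 512
tokens = ["push", "rbp", "mov", "rbp", "rsp"] * 200   # 1000 tokens from a large function
model_input = tokens[:MAX_TOKENS]                     # the remaining 488 tokens are silently dropped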