Graph-Based Threat Hunting
Abstract—To defend against Advanced Persistent Threats on the endpoint, threat hunting employs security knowledge such as cyber threat intelligence to continuously analyze system audit

and learning-based [11]–[17]. The policies presented by rule-based systems are difficult to sustain in a constantly changing system environment, analysts must frequently update the rule
arXiv:2501.05793v1 [cs.CR] 10 Jan 2025
of alerts, thereby hindering its ability to effectively identify and mitigate truly malicious activities (e.g., discerning the semantic differences in the access to sensitive files such as /etc/passwd between normal processes and malicious processes requires consideration of causal relationships). This disregard may lead to more imprecise detection, rendering the system ineffective in countering sophisticated threats.

- Data explosion Dilemmas (C3). How to minimize memory overhead and enhance the efficiency of threat hunting. Existing solutions assume an ideal scenario for datasets, that is, researchers assume that a complete attack can be discovered within a single batch of data. However, APTs are persistent, and attack chains may span different batches of data (e.g., data from the first and third days). Repeated scans on ever-expanding datasets introduce significant overhead.

In this work, we propose ACTMINER, a threat hunting system that combines causality tracking and incremental aligning to efficiently and accurately uncover attack chains. To tackle C1, ACTMINER constructs a heuristic search strategy based on equivalent semantic transfer to counter phenomena such as attack camouflage, persistence, and evasion. We fuse the data information of inter-entity interactions through entities and their contextual semantics in order to capture malicious behaviors accurately. To address C2, ACTMINER constructs a filtering mechanism based on the causal relationships of attack behaviors and ignores unreasonable entity context relationships. ACTMINER employs the causal motivation behind attacks to guide threat hunting, ensuring the interpretability of hunting results and minimizing false positives. In other words, we provide a more accurate hunting result by excluding unreasonable (attack-irrelevant) paths based on the causal relationship through temporal sequences. To deal with C3, ACTMINER constructs a tree structure to incrementally update the alignment results, thereby avoiding the significant overhead caused by rescanning multiple batches of redundant data.

We evaluate the effectiveness and efficiency of ACTMINER on the dataset provided by the DARPA TC program [20], [21]. Our results reveal that ACTMINER surpasses existing provenance-based threat hunting systems in terms of detection precision and recall. Moreover, ACTMINER can reduce the computational overhead and eliminate redundant searches. By deploying ACTMINER, security analysts are able to effectively analyze attack chains and formulate countermeasures, significantly alleviating their workload. In summary, the main contributions of our work are as follows:

• Unlike traditional attack detection methods, we propose a provenance-based threat hunting system, ACTMINER, to accurately capture attack chains.
• We introduce a heuristic search strategy based on equivalent semantic transfer and a filtering mechanism based on causal relationships of attack behaviors to ensure the precision and recall of ACTMINER.
• We propose a tree structure to incrementally update the alignment results, effectively addressing persistent APT attacks and the continuous growth of graph data.
• We comprehensively evaluate our system and SOTA POIROT [2] on the dataset from the DARPA TC program. The results demonstrate the efficiency of ACTMINER in capturing the attack chain and highlight its resistance to adversarial attacks.

II. BACKGROUND KNOWLEDGE

A. Provenance Graph

Provenance graphs possess potent semantic expressiveness and contextual association capabilities, embodying the concrete manifestation of kernel audit logs. They model all system entities within the logs as nodes and the interactions between entities as edges, where both nodes and edges bear attribute information. The nodes within the provenance graph are categorized into subjects and objects based on the direction of data movement. Edges represent the causal relationships between system entities, such as read/write file operations, execute executable file operations, create/clone process operations, and so forth. By leveraging provenance graphs, security professionals can associate malicious entities with attack behaviors through causal analysis, unveiling the complete picture of an attack.

B. CTI Report and Query Graph

Cyber threat intelligence (CTI) reports [22]–[24] encompass comprehensive information related to cyber attacks and attackers, with a particular emphasis on capturing detailed attack procedures, that is, the intricate sequences of steps and techniques employed in multi-stage attacks. These reports provide in-depth representations of attack scenarios, potential impacts on target hosts, as well as the complex chains of causally-linked events that characterize APTs. Security professionals leverage CTI reports to formulate more targeted defense rules for preventing and identifying malicious attack behaviors. In recent years, substantial research [2], [11], [25]–[27] has demonstrated the successful application of CTI reports in threat detection and threat hunting. In this paper, we construct a directed graph, termed the query graph, from the offensive and defensive knowledge [28] (attack entities and their causal relationships) extracted from CTI reports. Similar to provenance graphs, query graphs are directed graphs with attribute information.

C. Graph Alignment

Graph alignment refers to the problem of detecting potential cyber intrusion behaviors by establishing an optimal subgraph mapping between a provenance graph ($G_p$) representing system activities across the entire system, and a query graph ($G_q$) representing attack pattern activities. The provenance graph $G_p = (V_p, E_p)$ consists of a node set $V_p$ representing system entities and events, and an edge set $E_p$. The query graph $G_q = (V_q, E_q)$ comprises a node set $V_q$ representing attack patterns and an edge set $E_q$. The goal of graph alignment is to find a subgraph $G_m$ in $G_p$ that maximizes the matching degree with $G_q$:

$$G_m = \underset{G' \subseteq G_p}{\arg\max}\; M(G', G_q) \tag{1}$$
Here, $M$ is the Matches function that calculates the matching degree between $G_q$ and a subgraph $G'$ of $G_p$. By solving this optimization problem, the best mapping from the attack query graph to the activity graph is obtained, enabling the detection and tracking of cyber intrusion behaviors.

III. MOTIVATION

A. Motivating Example

Scenario: Consider the following scenario where an attacker exploits the feature of automatically executing login scripts (Reg.exe and %temp%\art.bat_2) during login initialization (mal) to establish persistence by adding the malicious script path to the registry (HKCU\Environment R2). Subsequently, the attacker searches for network shares on the compromised computer to locate files and then collects sensitive data (/etc/passwd) from remote locations via shared network drives (host shared directories, network file servers, etc.). Finally, the data is transmitted over the network (162.66.239.75).

As illustrated in Figure 1, this example includes two graphs: the top-left depicts the attack query graph manually extracted from a cyber threat intelligence (CTI) report, following the approach outlined in POIROT.

The attack initiates by leveraging a malicious executable, malicious.exe, to obtain unauthorized system access. It then employs Registry modifications to establish persistence. Once a foothold is secured, the attacker can remotely issue commands and execute them on the system, executing an art.bat_2 file in the temporary folder and facilitating actions such as exfiltrating sensitive data to an external IP, exemplified by the transfer of /etc/passwd containing user account information. On the right side is the provenance graph constructed from actual system logs capturing the observed execution behavior.

The fragmented nature of attack scenarios, coupled with the hop-count limit imposed by existing threat hunting approaches, can lead to imprecise or incomplete results during the threat hunting process. In this paper, we transform the threat-hunting problem into finding the attack query graph within the provenance graph.

Threat hunting methods centered around POIROT encounter several significant challenges:

False Negative. Due to the complexity of real enterprise environments, semantic gaps exist between provenance graphs and attack query graphs. The manifestations of the same attack type may differ across systems, and attackers may utilize the same tools in diverse ways. For instance, entity names in the attack query graph may have varying representations in the underlying logs of different systems. In POIROT, regular expressions are employed to instantiate node names from the attack query graph for hunting searches in the provenance graph. However, if attackers modify their tactics, introducing technical variations, threat hunting systems struggle to detect different mutated attack samples (e.g., over 100 versions of the Carbanak malware were described in CTI reports). Furthermore, attackers can evade security detection through obfuscation, persistence, and evasion techniques. As illustrated, the attack query graph only describes data exfiltration over the network, whereas in the real environment, the attacker creates processes to read files, followed by another process reading those files, and finally transmitting them over the network. Overreliance on the simplistic approach of threat hunting based on a predetermined number of hops may inadvertently overlook malicious activities that align with the intrinsic characteristics of attacks, ultimately causing detection to fail. As shown in Figure 1, in a simulated scenario, as described in [2] Section 5, we set $C_{thr}$ to 3, but find that this limitation resulted in an incomplete capture of the attack chain. Consequently, it was unable to detect the art_bat file. Moreover, with such hop count restrictions, attackers aware of the imposed limits could potentially evade hunting more easily across different scenarios, resulting in potential harm. Manually adjusting the hop limit for different scenarios poses significant challenges. The same problem exists in other path-based detection efforts [29]–[31].

False Positive. Within real organizations, many legitimate user operations resemble attack behaviors in log data. If hunting rules are overly broad or incomplete, normal behaviors may be misclassified as malicious. For example, when a user downloads network files through a browser, the browser collects user data and transmits it to its cloud server, while the downloaded network files may be flagged as "suspicious files" by the system, resembling malicious attack behaviors and triggering false alarms from the hunting system. Additionally, attackers may leverage tools and techniques to deceive hunting systems, also leading to false positives. As highlighted in Figure 1, a suspicious process Mal.exe exhibits two paths for reading/writing sensitive files. According to the attack query graph, sensitive file access should occur before network transmission, while Path_1 occurs after transmission. Therefore, Path_2 represents the attacker's actual operations in the environment.

High overhead and inefficiency. Government and enterprise organizations typically need to collect data simultaneously from thousands of machines, easily amassing petabyte-scale data volumes. This massive data not only imposes substantial storage overhead but also significantly reduces hunting efficiency. Traditional hunting methods require offline storage and continuous backscanning of system log data, resulting in immense computational overhead for each hunting operation. Referring to Figure 1, assume that all operations before node E23.txt at time 114 have already occurred. When a security analyst attempts to hunt for threats solely based on the data collected after this time, the incremental data segment alone cannot effectively support the reconstruction of the complete attack chain represented by its query graph. When examining the issue holistically, the newly acquired data lacks the necessary evidence to capture the earlier stages of the multi-step intrusion. Consequently, subsequent data collection would necessitate rescanning the previously available information, redundantly recomputing the provenance of data prior to a specific timeframe. These redundant computations across multiple hunting activities introduce an unsustainable overhead, hindering the system's efficiency and scalability and potentially allowing malicious activities to persist undetected for extended periods.
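The hop-limit problem described under False Negative can be illustrated with a minimal sketch. It is not POIROT's or ACTMINER's implementation: the graph, the node names, and the helper `reachable_within` are hypothetical and only mimic a search that stops after a fixed number of hops.

```python
from collections import deque

# Hypothetical causal chain, loosely modeled on the motivating example;
# the node names and edges are illustrative, not taken from real logs.
PROVENANCE = {
    "mal.exe": ["reg.exe"],
    "reg.exe": ["registry_key"],
    "registry_key": ["explorer.exe"],
    "explorer.exe": ["art.bat"],
}


def reachable_within(graph, start, max_hops):
    """Breadth-first search that stops expanding nodes at depth max_hops."""
    depth = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if depth[node] == max_hops:
            continue  # hop budget exhausted along this path
        for successor in graph.get(node, ()):
            if successor not in depth:
                depth[successor] = depth[node] + 1
                queue.append(successor)
    return set(depth)


within_three = reachable_within(PROVENANCE, "mal.exe", 3)
within_four = reachable_within(PROVENANCE, "mal.exe", 4)
print("art.bat" in within_three)  # False: the script sits four hops away
print("art.bat" in within_four)   # True once the limit is raised
```

With a limit of 3 the last node of the chain is never visited, which mirrors the incomplete capture reported above; raising the limit fixes this toy case but, as the text notes, the right value differs from scenario to scenario.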
[Figure 1 graphic: the attack Query Graph (top left) and the Provenance Graph (right), with Path_1 and Path_2 marked on the provenance graph.]
Fig. 1: Motivating Example. The red nodes and edges depict the truly malicious behavior. In contrast, the blue outlines
encompass false positive detection, where POIROT incorrectly identified benign system entities as malicious. The specific
nodes are the points with green borders. Furthermore, the orange outlines highlight the instances of missed detection or false
negatives, where POIROT failed to identify nodes that were indeed part of the attack chain.
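The ordering argument behind the false positive in Fig. 1 (sensitive file access must precede network transmission) can be sketched as a simple temporal check. This is an assumption-laden illustration rather than the authors' filter: the function `respects_query_order`, the operation names, and the timestamps are hypothetical, and only the earliest occurrence of each operation is compared.

```python
def respects_query_order(events, query_order):
    """Check that the earliest occurrence of each operation follows query_order.

    events: list of (timestamp, operation) pairs observed along one path.
    query_order: operations in the order required by the query graph.
    """
    earliest = {}
    for timestamp, operation in events:
        earliest[operation] = min(timestamp, earliest.get(operation, timestamp))
    if any(op not in earliest for op in query_order):
        return False  # the path lacks a step the query graph requires
    stamps = [earliest[op] for op in query_order]
    return all(a < b for a, b in zip(stamps, stamps[1:]))


# Illustrative values only: sensitive-file access must precede the send.
QUERY_ORDER = ["access_sensitive_file", "send"]
PATH_1 = [(118, "send"), (120, "access_sensitive_file")]
PATH_2 = [(114, "access_sensitive_file"), (118, "send")]

print(respects_query_order(PATH_1, QUERY_ORDER))  # False
print(respects_query_order(PATH_2, QUERY_ORDER))  # True
```

A path that touches the sensitive file only after the transmission is rejected, while the path whose access precedes the send is kept, matching the Path_1 versus Path_2 reasoning in the motivating example.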
IV. SYSTEM DESIGN

This section first introduces the overall architecture of the ACTMINER system, followed by a detailed description of each module presented in ACTMINER.

A. System Overview

Data Preparation Module (§ IV-B). The attack query graphs are extracted from threat intelligence reports, and provenance graphs are constructed based on extensive underlying logs. Duplicate events and orphan nodes within the provenance graphs undergo filtering, which is a necessary and common practice in existing work [32].

Causal Relation and Semantic Processing Module (§ IV-C). When a new attack query graph or provenance graph is generated, ACTMINER first categorizes the entities into four classes. The provenance graph is then delivered to the next module, while the attack query graph still needs to be processed. Next, ACTMINER merges analogous actions in the attack query graph. Finally, ACTMINER employs Equivalent Semantic Transfer, which traces potentially overlooked attack chains by tracking malicious semantics, to identify the suspicious actions in the next module.

Threat Hunting and Incremental Aligning Module (§ IV-D). ACTMINER hunts attack-related scenarios by chaining suspicious semantic nodes and generating a suspicious semantic tree. As time progresses, batch data is continuously fed into ACTMINER, persistently updating the suspicious semantic trees and unveiling more latent malicious attacks. To control memory consumption, ACTMINER stores tree branches that have not been updated in the database until behavior related to the branch occurs.

The basic architecture of ACTMINER is shown in Figure 2, which can be divided into three modules: (I) the Data Preparation Module, (II) the Causal Relation and Semantic Processing Module, and (III) the Threat Hunting and Incremental Aligning Module. It is important to note that ACTMINER continuously runs the above three modules as time progresses. Details of the system design for each module are given in Section IV-B, Section IV-C, and Section IV-D, respectively.

B. Data Preparation Module

This section describes the data preprocessing module for provenance graphs and query graphs.

1) Provenance Preparation: Provenance graphs are composed of log data collected from various platforms by data collectors. In this work, we employ open-source tools such as eAuditd [33], Kellect [34], and Event Tracing for Windows (ETW) [35] to gather relevant system logs from Linux and Windows environments. ACTMINER transforms each event into a directed, time-stamped edge, in which the source node represents the object being acted upon. For any event $e_t \in E$, ACTMINER represents it as a tuple $\langle UID_s, UID_o, OP, T_i \rangle$. $UID_s$ and $UID_o$ are unique identifiers for the subject and object of $e_t$, respectively. $OP$ denotes the type of $e_t$, and $T_i$ denotes the time when $e_t$ occurred.

Directly processing such massive raw log data is extremely challenging. To address this, we perform pruning operations on
[Figure 2 graphic: (I) Data Preparation Module; (II) Causal Relation and Semantic Processing Module; (III) Threat Hunting and Incremental Aligning Module.]
Fig. 2: The architecture of ACTMINER, which consists of three core modules that synergistically facilitate comprehensive threat
hunting and attack chain construction capabilities.
the low-level log data. Specifically, redundant events without context are removed [9], [36]. This means that if the subject UID ($UID_s$), object UID ($UID_o$), and operation ($OP$) are identical and the timestamps ($T_i$) are consecutive, only the most recent timestamp is preserved. Furthermore, our methodology includes the removal of isolated nodes within the provenance graph. Isolated nodes are entities that lack any incoming or outgoing edges. For example, we find that the data contains many nodes without any parent or child nodes: no event contains a subject UID ($UID_s$) or object UID ($UID_o$) matching the UID of the node, so these nodes lack contextual information and fail to provide meaningful insights. Removing them therefore does not compromise the integrity of the graph representation.

Concurrently, during the process of constructing provenance graphs, the node types are categorized based on the entity type contained within the logs. For instance, Fa represents files involving user and system-sensitive information, such as the boot.ini file on Windows and the /etc/passwd file on Linux. The specific categorization is detailed in Table I.

As shown in Table I, inspired by APTShield [9] and Conan [8], we refine the classification heuristics and optimize the classification method obtained from POIROT. Furthermore, we extensively gather CTI reports from various online sources and network channels, identifying specific file paths that exhibit heightened susceptibility to attacks. Consequently, we adapt the importance degree of these paths based on their frequency of occurrence to finally obtain ten distinct labels.

TABLE I: A categorization of distinct entities and their corresponding label assignments.
Entity | Tag | Description
Process | P | Processes, threads spawned by system calls
User Configuration Sensitive Files | Fa | Sensitive files containing user configuration information, such as /etc/passwd
Application Configuration Sensitive Files | Fb | Sensitive files that contain configuration information about the application, such as /etc/mysql/my.cnf
Log-sensitive documents | Fc | Sensitive files containing logging information, e.g. /etc/httpd/logs
Library file | Fd | A collection of pre-compiled methods with extensions such as .lib, .a, .dll, .so, etc.
Executable file | Fe | Files that can be loaded and executed by the operating system, with extensions such as .exe, .vbs, etc.
Temporary document | Ff | Temporary files generated by the system, e.g., /tmp/*
Other documents | Fg | A collection of other types of files, such as plain text files, plain graph files, plain zip files, etc.
Registry | R | Unified management of hardware and software configurations, including HKLM, HKCU, HKCR, HKCC, HKU, etc.
Socket | S | Refers to a host on the Internet or a process in a host, e.g. 127.0.0.1

By analyzing CDM18 and CDM19, which refer to the data definitions for DARPA's E3 and E4 programs, respectively, we adopt events based on a few general fields in the CDM (i.e., the events of read, write, fork, clone, create, execute, load, and inject), which were most commonly used in previous studies such as Holmes [3], Sleuth [10], and Morse [4]. The details are shown in Table II.

2) Query Graph Preparation: CTI reports describe attacks that have already occurred. We collect the latest threat intelligence from websites such as Microsoft and Symantec. Leveraging open-source tools like Extractor [28], we extract attack query graphs ($G_q$) from cyber threat intelligence (CTI) reports. Analogous to provenance graphs, upon extracting attack query graphs, the constituent entities undergo a corresponding mapping process.

C. Causal Relation and Semantic Processing Module

In this section, we sequentially describe the process of Module II in Figure 2, i.e., merging analogous actions and equivalent semantic transfer. In a nutshell, analogous actions in attack query graphs are merged while employing an equivalent semantic transfer strategy. This process enhances attack
TABLE III: Equivalent Semantic Transitivity Policies in the Context of Generalized Attack Pattern Identification and Matching.
Subject | Object | Direction | Requisites
P | P | forward | ∃p.semantics ∈ {SuspiciousLabel} ∧ [Event_Fork(p, p') ∨ Event_Create(p, p') ∨ Event_Clone(p, p')]: p'.semantics.add("SuspiciousLabel")
P | P | forward | ∃p.semantics ∈ {SuspiciousLabel} ∧ Event_Inject(p, p'): p'.semantics.add("SuspiciousLabel")
P | F | forward | ∃p.semantics ∈ {SuspiciousLabel} ∧ Event_Write(p, f): f.semantics.add("SuspiciousLabel")
P | F | backward | ∃f.semantics ∈ {SuspiciousLabel} ∧ f.tag ∈ {Fd, Fe} ∧ [Event_Execute(p, f) ∨ Event_Load(p, f)]: p.semantics.add("SuspiciousLabel")
P | F | backward | ∃f.semantics ∈ {SuspiciousLabel} ∧ Event_Read(p, f): p.semantics.add("SuspiciousLabel")

Section IV-D.

D. Threat Hunting and Incremental Aligning Module

1) Suspicious Semantic Tree Construction: An event contains the interaction information between entities and can be transformed into an information flow, which can be further classified into data flows and control flows. Data flows indicate dependencies in data content, reflecting the data propagation path (e.g., a process reading a file), while control flows primarily refer to process creation relationships (e.g., a parent process creating a child process). In the threat hunting module, data flows and control flows are jointly abstracted into a suspicious semantic tree. The process of generating a suspicious semantic tree consists of the following three steps, as shown in Algorithm 1:

Step 1: Finding Candidate Nodes. To capture malicious behaviors in the provenance graph constructed from low-level system logs that match the patterns in the corresponding query graph, our system first searches for all nodes in the provenance graph with attributes identical to those of the entity nodes in the query graph. These candidate nodes are collected into a list, referred to as the candidate set FC(i), which is associated with the query node (Line 7).

Step 2: Confirming Attack Intent. As the query graph $G_q$ carries clear temporal features and causal relationships, we leverage this information to guide the attack detection and reconstruction processes. This enables us to quickly determine the initial intrusion location, relevant entity nodes, and the sequence of attack events. Such a query graph can assist analysts in searching for malicious behaviors effectively. The ExtractRelevantNodes function is subsequently invoked to identify the critical nodes within the query graph that are essential for comprehending the attack methodology (Line 8). This function operates by determining the next potential action nodes in $G_p$ based on the preceding step Fc in $G_q$. The ReconstructAttackSequence algorithm reconstructs the attack sequence (Line 9), carefully aligning with the temporal and causal patterns within $G_q$. Unlike indiscriminate search methods, the approach is strategically guided by specific target nodes, precisely capturing the attacker's intended progression. The function determines the matching order by meticulously tracing the sequence of attacks delineated in the graph. ACTMINER evaluates whether nodes introduce malicious semantics, which is identified by first fixing the target nodes and then analyzing behavior within $G_q$. This process involves tracing actions from the fixed nodes to determine patterns indicating malicious intent, forming the basis for further analysis.

To optimize the traversal and analysis process, the algorithm stores context information for each node (Lines 11-15). This step avoids redundant traversals of the graph and enables the algorithm to order the edges based on timestamps, resulting in an ordered hunting sequence (Lines 16-19). This sequence is instrumental in guiding the subsequent steps of the algorithm. The algorithm calculates the reciprocal of the length of the shortest path between nodes in $G_q$ and $G_p$ as the path score. For each node in $G_q$, the candidate node with the highest contribution value is selected as the fixed node.

Step 3: Creating Tree Nodes. The final step of the algorithm involves the creation of tree nodes (Lines 19-27). Given the disparity in size between the query graph and the provenance graph, the CreateTreeNode function (Line 31) is utilized to map each query graph to one or more subgraphs in the provenance graph that exhibit similar patterns. The function specifically creates a branch in the tree for the attack entry point. This mapping is facilitated by the hunting sequence obtained in the previous step, enabling efficient navigation of the large and complex dataset during detection. After creating the tree nodes, when the attack progression exceeds the total attack sequence, the system raises an alert to notify security analysts of this anomalous situation.

To illustrate the aforementioned methodology, we present an example. We aim to find suspicious subgraphs similar to the query graph in the provenance graph shown in Figure 4. As observed, the representation above illustrates a concrete instantiation of the graph, whereas the depiction below presents an abstraction of the model. Assuming we start from the process P1 within the red box as the starting point of the attack chain (corresponding to powershell1.exe in the concrete graph above), according to the query graph, the next step should be to find an executable file associated with P1, with the edge semantics being a write event. In the graph, only the Fe1 node (corresponding to update.ps1 above) satisfies this condition, so we can generate a tree node that stores a variety of data, including the unique identifier of the query graph, event type, and relevant temporal information to assist subsequent hunting tasks. This step is largely analogous for both approaches, with only marginal differences in storage efficiency and temporal performance; details are given in Section V-D.

Similarly, P2 is also identified and a tree node is generated (corresponding to powershell2.exe above). However, when the flows in $G_q$ come to be associated with the socket, multiple similar scenarios may arise. In $G_p$, the path from P2 to the socket IP2 (indicated by the red arrow) satisfies the previously defined equivalent semantic transitivity policy, indicating that P2 and P3 share the same semantic information. Therefore, IP2 can be retained as a suspicious node, while generating a tree node and preserving the temporal relationship from P3 to IP2. In the above scenario, however, without this strategy and under rigid aligning rules, a false positive result is produced. Likewise, the path from P2 to IP3 also satisfies the equivalent semantic transitivity policy (it is important to note that intermediate nodes will be represented in the form of equivalent semantic attributes within the initial P1 node's properties). However, for the path from P2 to IP1, although
[Figure 4 graphic: Provenance graph and Query graph panels.]
Fig. 4: A case study comparing strategies such as POIROT with ACTMINER in the same scenario. Dashed segments, whether black or red, indicate that multi-step behaviour lies in between; red lines and red boxes indicate the entities and events actually captured.
it also represents P→IP with the edge semantics of a connect allows for the efficient retrieval and reinstatement of these
operation, it occurs before the previous node (Fe1→P2 at time nodes into memory.
104), violating the sequence of the attack, and thus, this path For the former, through the affected tree nodes, we can
is ignored. Fa can be found as the same. obtain the current attack progress and their mapped nodes
Next, we need to find P→Ff (with the edge semantics of in the provenance graph and query graph. Then, through the
a write operation), but no files of the Ff type in the current sequence of suspicious candidate nodes, we can determine
provenance graph (assume the current time has not yet reached the next suspicious entity to hunt. Finally, we judge whether
116), so the system needs to wait for new data to arrive. the suspicious states are met in the candidate node set of the
2) Incremental Aligning: As time progresses, the log data generated by hosts continues to grow. Traditional threat hunting systems must re-scan the entire, ever larger dataset whenever new logs arrive. To address this inefficiency in analyzing incremental streaming data, we adopt an incremental graph computation method to hunt for attacks and update suspicious semantic trees. First, we search for new candidate nodes in the newly arrived provenance graph based on the attributes of nodes in the query graph. Next, we divide the impact of the new data on the suspicious subtrees into two cases: the new data affects the existing suspicious semantic subtrees, or the new data is unrelated to the existing suspicious subtrees. Furthermore, to manage memory overhead effectively, we implement a forgetting rate to reduce memory consumption. We transfer nodes that have not been updated for 6 hours (adjustable according to circumstances) into the database and create a corresponding index for them. The index includes the attributes of the node itself and of its parent node, enabling rapid localization of relevant nodes in the event of a subsequent occurrence. This

suspicious entity, and if so, we construct a new tree node. For the latter, we first determine whether the node has candidate nodes. If there are no candidate nodes, this part of the data has no entry point for attacks and is considered benign. However, if there are candidate nodes, we start rebuilding the suspicious subtree from its candidate nodes. As an example, the shaded part in Figure 4 represents the newly added data. For the new data, we determine whether it affects the existing results. The query graph (Gq) awaits the arrival of a pattern where process P2 writes the file Ff1. If so, a new node representing Ff1 is added to the graph. According to the query graph, this P→Ff1 (write) is the desired one-hop attack path. In the graph above, it corresponds to powershell.exe → profile (write). Although in the query graph P←Fg (read) occurs before P→Ff (write), the earlier operation reads an ordinary file with weak attack relevance; if the provenance graph does not contain the corresponding connected flow and nodes, the operation did not introduce new suspicious attack semantics, and the next attack target should be further explored. Then, we search for the target
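The forgetting mechanism described above can be illustrated with a minimal Python sketch. It is not the ACTMINER implementation: the names `SemanticTreeNode`, `NodeCache`, `evict_stale` and `recall` are hypothetical, an in-memory dictionary stands in for the database, and the index key (attributes of the node and of its parent) simply follows the description in the text.

```python
from dataclasses import dataclass, field

FORGET_AFTER_SECONDS = 6 * 3600  # 6-hour window from the text; adjustable


@dataclass
class SemanticTreeNode:
    node_id: str
    attributes: tuple         # e.g. ("process", "powershell.exe")
    parent_attributes: tuple  # attributes of the parent tree node, () for a root
    last_updated: float       # timestamp (seconds) of the latest update


@dataclass
class NodeCache:
    """In-memory tree nodes plus a stand-in 'database' for forgotten nodes."""
    active: dict = field(default_factory=dict)    # node_id -> node
    archived: dict = field(default_factory=dict)  # index key -> list of nodes

    def touch(self, node: SemanticTreeNode) -> None:
        """Register a node as created or updated."""
        self.active[node.node_id] = node

    def evict_stale(self, now: float) -> int:
        """Move nodes not updated within the window into the archive."""
        stale = [n for n in self.active.values()
                 if now - n.last_updated >= FORGET_AFTER_SECONDS]
        for n in stale:
            # Assumed index: attributes of the node and of its parent.
            key = (n.attributes, n.parent_attributes)
            self.archived.setdefault(key, []).append(n)
            del self.active[n.node_id]
        return len(stale)

    def recall(self, attributes: tuple, parent_attributes: tuple) -> list:
        """Bring archived nodes back when related data arrives later."""
        nodes = self.archived.pop((attributes, parent_attributes), [])
        for n in nodes:
            self.active[n.node_id] = n
        return nodes
```

Calling `evict_stale` periodically keeps the in-memory tree bounded, while `recall` shows how the index allows a forgotten node to be located again without scanning all archived data.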
Algorithm 1 Threat Hunting Algorithm
Require: Gq, Gp
Ensure: Suspicious Subgraph Gs
1: /* Data Processing */
2: Gq ← MergeSimilarEntity(Gq)
3: LSe ← GetQuerySequences(Gq)
4: f ← GetSetQueryGraphFlow(Gq)
5: Gp ← MergeSimilarEvent(Gp)
6: /* Threat Hunting and Incremental Aligning */
7: FC ← FindCandidateNodes(Gq, Gp)
8: relevant_nodes ← ExtractRelevantNodes(Gq)
9: attack_sequence ← ReconstructAttackSequence(Gq)
10: hunting_sequence ← ∅
11: for n ∈ relevant_nodes do
12:     context ← RetrieveNodeContext(n)
13:     hunting_sequence.add(context)
14: end for
15: hunting_sequence.sort(key = lambda x: x.timestamp)
16: visited ← ∅
17: fixed_candidates ← ∅
18: seqNum ← 0
19: for context ∈ hunting_sequence do
20:     q_node ← context.corresponding_query_node
21:     if q_node ∉ visited then
22:         candidates ← FC[q_node]

TABLE IV: The detail of the attack and benign datasets.
Scenario             Behavior
E4-Trace case1       Malicious file download and execute
E4-Trace case2       Information gather and exfiltration
E4-Trace case3       Malicious file download and sensitive file exfiltration
E4-Trace case4       In-memory attack with firefox
E3-FiveDir case1     Pine backdoor
E3-FiveDir case2     Phishing E-mail Link with macro viruses
E3-Trace case1       Firefox backdoor and load malicious software
E3-Trace case2       Firefox backdoor and deploy malicious programme
E3-Trace case3       Phishing E-mail
E3-Theia case1       Firefox Backdoor and privilege escalation
Win Benign Data      Account operation, network communication and application activity
Linux Benign Data    User Login, application operation and network interaction

TABLE V: The summary of the experimental dataset. Column 1 specifies the name of dataset, and Column 2 denotes the corresponding duration. Columns 3 and 4 indicate the number of nodes and edges, respectively. Column 5 represents the number of attack nodes.
Datasets             Duration Time   #N         #E         % of Attack Nodes
E3-Trace             310h            1.950M     9.053M     37890
E3-FiveDirections    210h            1.287M     2.577M     2956
E3-THEIA             168h            960.357K   2.352M     14781
E4-Trace             8h              3.035M     13.586M    39582
Benign Linux         240h            2.385M     3.891M     0
Benign Windows       192h            5.324M     12.856M    0
Avg                  234.667h        2.490M     7.386M     0.669%

• RQ2: How robust is ACTMINER against adversarial attacks?
• RQ3: How important are the components we design for assisting threat hunting?
POIROT? There are already several threat hunting studies [2], [11], [25], [39], [40]. We broadly categorise them into two groups based on the techniques they use: machine learning-based approaches [11], [39], [40] and search-based approaches [2], [25]. Both DeepHunter [39] and ProvG-Searcher [11] conduct model training by constructing positive samples (attack graphs extracted manually from the provenance graph) and crafted negative samples, while MEGR-APT [40] utilizes a graph matching model to compute similarity scores between the query graph's embedding vector and the embedding vectors of detected subgraphs. However, a fundamental limitation of these coarse-grained methods is their inability to consider the relationships between nodes in the attack chain (e.g., DeepHunter solely considers the relationships between IOCs rather than the comprehensive association information of all attack nodes). In other words, the results obtained from DeepHunter and ProvG-Searcher may not necessarily represent complete attack chains. Furthermore, our initial attempt to re-implement ProvG-Searcher revealed that the core component responsible for processing both provenance graphs and query graphs is not available as an open-source solution. Hence, we do not compare our work with them. ThreatRaptor [25] uses NLP technology to extract threat behaviour graphs from CTI reports and transforms the graphs into the TBQL query language using specific algorithms. Unlike ACTMINER, it stores audit log data in a database, allowing for the retrieval of individual attack behaviors. The single-point matching results obtained from the TBQL query statement do not correspond well to the contextual content of the attack process described in the CTI report, so we do not compare with it.

Here, we compare the performance of ACTMINER with POIROT [2], which is the most relevant in terms of level and methodology for our evaluation. POIROT aligns the query graph with the provenance graph based on node type and name regularization. Because the query graphs manually extracted by the authors of POIROT are unavailable, for fairness our query graphs are uniformly extracted by the Extractor [28] in our experimental setting. However, during the extraction process, we encounter graph disconnections, incomplete attack chains, etc. Therefore, we use the state-of-the-art method CRUcialG [41] to assist in obtaining the query graph.

A. RQ1: How effectively can ACTMINER detect the attacks especially in terms of false alarms?

Table VI presents the performance of ACTMINER and POIROT on our evaluation datasets. ACTMINER consistently surpasses POIROT, achieving superior precision and recall values and a lower number of FN/FP. In comparison to POIROT, ACTMINER utilizes entity attributes and action semantics to generate a more generalizable query graph. This provides brief, abstracted entity information for graph alignment, subsequently reducing false negatives and enhancing precision and recall. As the POIROT paper lacks an evaluation on the E4 dataset, we execute POIROT on E4 to obtain evaluation results. The findings demonstrate that ACTMINER significantly outperforms POIROT, as E4 attacks are more challenging to detect due to well-blended malicious activity. POIROT relies solely on name and simple types as the query graph's features, which is insufficient. On the other hand, the entity attributes and action semantics employed by ACTMINER offer more semantic information for each node and are less sensitive to nodes with certain types of characteristics, making it harder for attack nodes to conceal themselves.

At first sight, ACTMINER shows incremental improvement in comparison to POIROT in terms of FN. This is attributed to ACTMINER considering the correlation between semantics of multi-hop nodes. Table VI shows the performance of ACTMINER and POIROT on all datasets. ACTMINER does not miss any malicious nodes, i.e., ACTMINER's average false negatives (FNs) are 0, reduced by 61 compared to POIROT. On average, the number of false positive nodes generated by ACTMINER (∼389 nodes) is 1.91× lower than that of POIROT. ACTMINER demonstrates a notable improvement in precision over POIROT, achieving a 2.96% higher precision score, while also outperforming POIROT in terms of recall with a 1.94% increase.

B. RQ2: How robust is ACTMINER against adversarial attacks?

When an attack occurs, the attacker's behavior pattern may be highly similar or even nearly identical to the normal system behavior in a regular environment. From a technical perspective, attackers can mimic normal processes at the API call level or employ code injection techniques to make their behavior patterns indistinguishable from normal processes in low-level system logs. This poses a challenge for provenance-based threat hunting. However, by considering richer contextual semantics, differentiation can still be achieved. Goyal et al. [42] devise three adversarial strategies against anomaly detection systems that operate at graph-level granularity. Enlightened by their work, we design experiments to assess ACTMINER's resilience against adversarial attacks.

To assess ACTMINER's resilience, we perform adversarial mimicry attacks on provenance-based graph-alignment threat hunting systems. Using the DARPA datasets as a reference, we modify and add attack steps in the provenance graph, primarily considering two scenarios.

Scenario I: The attacker inserts a large number of invalid attack paths into the actual attack chain, attempting to disrupt and mislead the detection algorithm. Test results show that POIROT suffered severe missed detections in this scenario, while ACTMINER successfully detected all real attack nodes, resisting the attacker's disruptive attacks.

Scenario II: The attacker uses normal programs to perform operations similar to attack patterns, attempting to introduce false positives. Tests found that POIROT exhibits varying degrees of false positives, marking normal processes as attack nodes. Our system, on the other hand, effectively distinguishes the true intent of normal programs through behavior association and intent analysis, avoiding false positives. Based on the above two scenarios, we designed the following three strategies:

1) Strategy I: Insert additional unrelated benign process read/write file flows between process and file read/write operations, as shown in Figure 5;
TABLE VI: Performance of ACTMINER and POIROT. FN denotes the false negative, which occurs when a genuine attack pattern is incorrectly classified as benign. Conversely, FP represents the false positive, where a benign event or data point is mistakenly identified as an attack one. The notation Prec. denotes precision.
                        POIROT                     ActMiner without EST†      ActMiner
ATTACK                  FN/FP     Recall  Prec.    FN/FP    Recall  Prec.     FN/FP   Recall  Prec.
Darpa4 Trace case1      19/89     99.79   99.00    21/43    99.76   99.51     0/79    100.00  99.11
Darpa4 Trace case2      58/1005   99.62   93.89    37/532   99.76   96.62     0/781   100.00  95.18
Darpa4 Trace case3      77/82     94.58   94.24    63/41    95.51   97.03     0/65    100.00  95.39
Darpa4 Trace case4      96/1352   98.51   88.61    74/775   99.30   93.16     0/912   100.00  92.15
Darpa3 FiveDir case1    18/254    98.52   82.49    7/88     99.42   93.19     0/142   100.00  89.40
Darpa3 FiveDir case2    49/49     96.36   96.36    32/30    97.59   97.74     0/37    100.00  97.32
Darpa3 Trace case1      77/1146   99.47   92.62    54/412   99.62   97.20     0/613   100.00  95.91
Darpa3 Trace case2      59/1952   99.67   90.20    35/742   99.80   95.95     0/996   100.00  94.80
Darpa3 Trace case3      78/233    94.72   85.72    62/103   95.76   93.14     0/154   100.00  90.08
Darpa3 Theia            85/1263   99.37   91.38    63/578   99.53   95.85     0/746   100.00  94.72
Avg                     61/742    98.06   91.45    45/334   98.60   95.94     0/452   100.00  94.41
† EST: Equivalent Semantic Transfer
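As a reminder of how the columns of Table VI relate to one another, the short Python sketch below computes recall and precision (in percent) from true-positive, false-negative and false-positive counts. The true-positive counts are not reported in the table, so the numbers in the usage lines are purely illustrative and not taken from the evaluation.

```python
def recall(tp: int, fn: int) -> float:
    """Share of genuine attack nodes that were detected, in percent."""
    return 100.0 * tp / (tp + fn) if (tp + fn) else 0.0


def precision(tp: int, fp: int) -> float:
    """Share of reported nodes that are genuine attack nodes, in percent."""
    return 100.0 * tp / (tp + fp) if (tp + fp) else 0.0


# Illustrative counts only (hypothetical, not from Table VI).
tp, fn, fp = 900, 0, 100
print(f"recall={recall(tp, fn):.2f} precision={precision(tp, fp):.2f}")
```

With zero false negatives, recall is 100% regardless of the number of false positives, which is why a system can reach full recall while its precision still varies from case to case.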
Fig. 5: Strategy I. Read/Write operations Insert. [Figure omitted: process (P) and file (F) nodes linked by fork/clone and read/write edges.]

The experimental results, as illustrated in Table VII, depict the percentage of nodes and edges incrementally added to the attack graph using the three aforementioned strategies on the

with a decreasing trend approaching approximately 1%. This is determined by the design characteristic of our system. In contrast, POIROT exhibits a diminishing recall rate as the proportion increases, with the rate of decline accelerating with higher proportions added. This can be observed from the AVG of R., where the descent shifts from an initial 1.5% at full scale to around 4% towards the latter end of the scale. Regarding
TABLE VII: Performance comparison between ACTMINER and POIROT across four scenarios (i.e., four query graphs) from the E4-Trace dataset, where varying proportions of edges and nodes are added to the provenance graph. Each addition consists of an equal number of the three types of nodes/edges. The percentage added is the number of nodes added to the provenance graph relative to the number of attack nodes.
Proportion added: 0%, 25%, 50% (for each proportion: POIROT, then ACTMINER)
Query Graph Precision Recall Precision Recall Precision Recall Precision Recall Precision Recall Precision Recall
case1 99.00 99.79 100.00 100.00 98.34 99.23 100.00 100.00 96.24 97.35 99.26 100.00
case2 93.98 99.62 97.37 100.00 93.28 96.55 97.58 100.00 92.26 94.98 96.33 100.00
case3 94.24 94.58 98.32 100.00 93.21 94.52 98.42 100.00 91.35 92.25 96.47 100.00
case4 88.61 98.51 95.08 100.00 87.25 94.31 95.52 100.00 83.21 90.44 94.35 100.00
AVG 93.96 98.13 97.69 100.00 93.02 96.13 97.88 100.00 90.77 93.76 96.60 100.00
Proportion added: 75%, 100%, 125% (for each proportion: POIROT, then ACTMINER)
Query Graph Precision Recall Precision Recall Precision Recall Precision Recall Precision Recall Precision Recall
case1 93.55 95.26 97.56 100.00 91.41 94.32 96.46 99.98 89.74 90.25 95.84 99.98
case2 90.04 93.52 94.32 100.00 87.43 90.78 94.16 100.00 82.36 87.32 92.52 99.98
case3 89.21 91.56 95.36 100.00 88.47 90.24 95.13 99.97 86.25 84.33 94.78 99.97
case4 80.32 87.03 93.42 100.00 77.22 85.41 92.74 100.00 74.32 81.36 92.04 99.97
AVG 88.28 91.84 95.17 100.00 86.13 90.19 94.62 99.99 83.17 85.82 93.80 99.98
Proportion added: 150%, 175%, 200% (for each proportion: POIROT, then ACTMINER)
Query Graph Precision Recall Precision Recall Precision Recall Precision Recall Precision Recall Precision Recall
case1 87.21 88.41 92.36 99.98 82.34 81.64 91.50 99.96 77.28 76.88 90.03 98.74
case2 77.23 83.42 89.64 99.47 72.14 80.21 88.75 99.48 61.45 76.33 87.25 99.52
case3 83.52 81.02 93.74 99.86 79.42 77.31 92.36 99.23 75.04 73.98 91.11 99.13
case4 70.30 79.52 91.14 99.97 68.23 75.21 90.02 99.95 65.23 72.06 88.94 99.32
AVG 79.57 83.10 91.72 99.82 75.53 78.59 90.66 99.66 69.75 74.81 89.33 99.17
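Table VII varies how much benign noise is injected. Strategy I, the only strategy spelled out in this section, can be pictured with a small Python sketch that pads an attack path with unrelated benign read/write flows. The edge representation (source, operation, target, timestamp), the function name `insert_benign_flows` and the naming of the benign entities are assumptions made for illustration; the sketch does not reproduce the exact injection procedure used in the experiments.

```python
from typing import List, Tuple

Edge = Tuple[str, str, str, float]  # (source, operation, target, timestamp)


def insert_benign_flows(attack_path: List[Edge], per_gap: int) -> List[Edge]:
    """Insert `per_gap` benign read/write flows between consecutive attack events."""
    padded: List[Edge] = []
    for i, edge in enumerate(attack_path):
        padded.append(edge)
        if i == len(attack_path) - 1:
            break  # nothing to insert after the final attack event
        t0, t1 = edge[3], attack_path[i + 1][3]
        step = (t1 - t0) / (per_gap + 1)  # spread the noise evenly in time
        for k in range(per_gap):
            op = "read" if k % 2 == 0 else "write"
            padded.append((f"benign_proc_{i}_{k}", op,
                           f"benign_file_{i}_{k}", t0 + step * (k + 1)))
    return padded


# Hypothetical two-step attack path used only to exercise the function.
attack = [
    ("firefox", "write", "/tmp/ex.txt", 10.0),
    ("firefox", "execute", "/tmp/ex.txt", 20.0),
]
noisy = insert_benign_flows(attack, per_gap=2)
```

The real attack events keep their order and timestamps; only the number of events between them grows, which is what stresses a matcher that limits how many hops it will bridge.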
behavior, and that behavior may be mistaken as benign. The attacker can also insert a small number of irrelevant nodes between attack stages, keeping the hop number of the attack chain for each stage within POIROT's detection threshold of three. Although the inserted nodes themselves are harmless, the edges connecting them to the real attack nodes cause POIROT to erroneously judge them as part of the attack, resulting in false negatives. Through further investigation, we analyze the reasons behind the superior performance of ACTMINER: POIROT uses the query graph directly for regular matching of node types and their attribute names, and selects appropriate nodes based on path scores, neglecting temporal and causal relationships between nodes, which results in many false alarms.

The robustness of ACTMINER results from two reasons. First, during the construction of the suspicious semantic tree, attack intent information is confirmed. Inserted behaviors such as fork/clone do not influence the score of a given attack path, because the multi-hop strategy considers these nodes to carry the same suspicious semantics. Second, ACTMINER does not raise alarms unless nodes exceeding the attack sequence length are detected. This ensures that our system does not generate excessive false positives when attackers conduct multi-point blasting to affect the hunting system, while also enhancing its resilience against Scenario I.

C. RQ3: How important are the components we design for assisting threat hunting?

To demonstrate the impact of our system components, particularly the query graph processing (QGP) and equivalent semantic transfer (EST) components, on system performance and efficiency, we conduct ablation experiments on them separately. The QGP step is responsible for generating the required query graphs and facilitating the subsequent operations of our system. The EST step, on the other hand, plays a crucial role in the threat hunting module.

1) Efficacy of Equivalent Semantic Transfer. We test the performance of ACTMINER without EST using the same settings as in RQ1. As demonstrated in Table VI, without the EST component the number of FNs increases by 1.4% and the number of false positives is 83.96% of POIROT's, while it represents an increase compared to ACTMINER. We attribute this to the following reason: the process of equivalent semantic transfer inherently involves, to some extent, the causal logic and temporal relationships of the attack. Therefore, removing this component may introduce a small portion of false positives.

2) Effectiveness of Query Graph Processing. As part of our study, we directly utilize the query graphs extracted by the Extractor and CRUcialG on the same provenance graphs used in RQ1 for threat hunting. The sizes of the query graphs before and after QGP for all situations are presented in Table VIII, along with the results and the corresponding analysis. In the QGP process, we use attribute abstraction for nodes, so the number of nodes is reduced. Our analysis explores the impact of incorporating QGP on the system's accuracy and timeliness. We use the Trace dataset from E3 to study the effect of QGP. The result, presented in Figure 8, shows a significant reduction in time consumption when QGP is used. This improvement results from QGP further merging the extracted nodes. Ignoring query graph processing makes efficiently hunting these nodes challenging. However, incorporating QGP allows the system
Fig. 8: Ratio of running time between GQ and OGQ. OGQ denotes the original query graph, and GQ denotes the query graphs we use in ACTMINER. [Plot omitted: percentage of runtime on the E3 dataset.]

TABLE IX: Overhead of ACTMINER and POIROT.

Fig. 9: Average memory consumption in different time states. [Plot omitted: memory usage (MB) versus time (h) for POIROT and ACTMINER.]

TABLE VIII: The edges and nodes in the query graph. OV and OE denote the number of nodes and edges in the graph before QGP, and V and E denote the number of nodes and edges after processing, respectively.
            E4                        E3
Scenario    T-1   T-2   T-3   T-4    Tr-1  Tr-2  Tr-3  Th-1  W-1   W-2
OV          9     13    16    9      14    10    15    12    13    10
OE          8     13    17    8      16    11    14    14    14    11
V           6     4     6     8      6     8     5     6     5     7
E           6     3     5     7      5     9     4     5     4     6
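The node reduction reported in Table VIII comes from attribute abstraction during QGP. A minimal Python sketch of that idea follows. The abstraction rule used here (files collapse to their extension, other entities to a lower-cased name) and the names `abstract` and `merge_similar_entities` are assumptions chosen for illustration, not the rules ACTMINER actually applies.

```python
import os
from typing import Dict, List, Set, Tuple

Node = Tuple[str, str]       # (type, name), e.g. ("file", "/tmp/a.txt")
Edge = Tuple[str, str, str]  # (source id, operation, target id)


def abstract(node: Node) -> Node:
    """Hypothetical abstraction: files keep only their extension, others a lower-cased name."""
    ntype, name = node
    if ntype == "file":
        ext = os.path.splitext(name)[1].lower() or "<none>"
        return (ntype, ext)
    return (ntype, name.lower())


def merge_similar_entities(nodes: Dict[str, Node], edges: List[Edge]):
    """Merge nodes whose abstracted attributes are equal and rewrite the edges."""
    rep: Dict[Node, str] = {}   # abstracted attributes -> representative node id
    remap: Dict[str, str] = {}  # original node id -> representative node id
    for nid, node in nodes.items():
        remap[nid] = rep.setdefault(abstract(node), nid)
    merged_nodes = {rid: key for key, rid in rep.items()}
    # A set removes edges that become duplicates after merging.
    merged_edges: Set[Edge] = {(remap[s], op, remap[t]) for s, op, t in edges}
    return merged_nodes, sorted(merged_edges)


# Hypothetical query graph: one process writing two similar files.
nodes = {
    "p1": ("process", "Firefox"),
    "f1": ("file", "/tmp/a.txt"),
    "f2": ("file", "/tmp/b.txt"),
}
edges = [("p1", "write", "f1"), ("p1", "write", "f2")]
merged_nodes, merged_edges = merge_similar_entities(nodes, edges)
```

In this toy graph, three nodes and two edges shrink to two nodes and one edge, mirroring the OV/OE to V/E reduction in the table at a much smaller scale.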
TABLE X: Experimental results on different benign datasets.
DataSet    Test Duration   Platform            Hosts   FP
OpTC       01d00h00m       Windows             3       0
Our Lab    01d08h13m       Ubuntu 12.04 x64    5       0

reports provided by DARPA. We achieved results superior to the SOTA on well-known APT attack datasets (i.e., E3 and E4). However, limited by scarce APT attack samples, we were unable to conduct large-scale experiments and analyses. In practice, accurately extracting query graphs from numerous CTI reports and automatically generating diverse and rational ones can improve threat hunting accuracy in the future.

The interpretability of ACTMINER. The outputs of ACTMINER inherently provide interpretability, as analysts can derive attack-related information from the query graph reports. However, due to the dynamic nature of APT attacks, the query graphs and hunting results are not entirely consistent. Therefore, in the future, we can utilize generative artificial intelligence (e.g., LLMs) to automatically transform the attack chains obtained from threat hunting into CTI reports that analysts can clearly understand, aiding in better response.

The robustness of ACTMINER. The use of graph processing approaches may lead to the loss of fine-grained details. To validate the robustness of ACTMINER, we conducted cross-validation by employing query graphs extracted from diverse CTI reports against distinct provenance graphs. The experimental findings consistently demonstrate that our system effectively avoids generating false alarms across all tested origin query graphs. The identified suspicious nodes do not meet the threshold for triggering an alert.

VII. RELATED WORK

A. Provenance Graph-based Threat Hunting

The provenance graph contains causal relationships between system events, and it is a data structure that can be effectively utilized for cyber threat hunting. DeepHunter [39] utilizes Neural Tensor Networks (NTN) to judge the subgraph relationship through graph embedding. Due to the need for many-to-many traversal comparisons between subgraphs, the efficiency of DeepHunter decreases as the size of the provenance graph increases. ProvG-Searcher [11] establishes coarse-grained threat hunting by employing a graph representation learning method on the subgraph, aiming to enhance hunting efficiency. ThreatRaptor [25] extracts structured threat behaviors from OSCTI and automatically synthesizes query statements to search for malicious activities. Unlike the above studies, ACTMINER, inspired by POIROT [2], tries to align attack graphs extracted from CTI reports with provenance graphs from system logs to find the complete attack chain. However, ACTMINER is able to significantly reduce false positives, false negatives, and system overhead through an optimized graph alignment algorithm.

B. Graph Matching Algorithms

Graph pattern matching refers to the problem of finding similarities between a small query graph and a large target graph, with widespread applications (e.g., database analysis, search engines, software plagiarism detection, and social networks). Based on whether the matching results are completely consistent, algorithms can be divided into two categories: graph isomorphism matching and graph approximate matching. The main idea of graph isomorphism matching algorithms is to iteratively map nodes from the query graph to the target graph one by one. Ullmann [43] proposes a backtracking-based algorithm that enumerates all subgraphs satisfying the matching requirements using a depth-first search approach. However, as the scale of the graph gradually increases, the enumeration range also expands, resulting in low algorithm efficiency. To address this issue, Cordella et al. [44] propose the VF2 algorithm, which improves algorithm efficiency by incorporating the verification order of query nodes. However, in practical applications, its time complexity is superlinear, and the need for secondary filtering consumes a significant amount of time. Shasha et al. [45] introduce the GraphGrep algorithm, which achieves fast matching by encoding node semantic information. Several other studies [46]–[48] use graph mining techniques to find subgraphs from databases and then employ filtering and optimization strategies to prune incorrect nodes. However, due to the diversity and complexity of APT techniques, we cannot always assume that the nodes and edges of the query graph can be fully mapped to the target graph. Graph approximate matching algorithms often rely on heuristic methods to identify important nodes and then gradually expand to neighboring nodes [2], [49]–[51]. Tian et al. [49] use a graph distance model to measure the similarity between graphs. He et al. [52] introduce an index-based algorithm to support subgraph queries and similarity queries. Several other works [51], [53] consider the shape and edge attributes of the query graph. However, the aforementioned research overlooks adversarial knowledge. Milajerdi et al. [2] propose POIROT, which is similar to our work and utilizes node attributes and information flows between nodes for approximate matching. However, if an attacker intentionally takes a detour to achieve their goal, it may result in missed detections. Furthermore, if the similarity score exceeds the threshold, POIROT stops hunting, and the obtained attack subgraph may not represent the optimal attack behavior. Therefore, we develop a new matching technique to address these challenges.

C. Incremental Graph Computation

In real-world scenarios, graphs are typically large in scale and frequently updated over time. When graphs are updated, traditional batch processing methods require starting the computation from scratch, which is extremely time-consuming. In contrast to traditional batch processing algorithms, incremental matching only analyzes and matches the updated portion, utilizing previous matching results to maximize the reduction of redundant computations. Fan et al. propose the IncSIMMatch [54] algorithm, which effectively reduces redundant computations by creating an index for the pattern graph. They then introduce an incremental computation method, IncISO [55], where only the set of nodes within d hops of the updated node in the data graph needs to be rematched as the affected region
when nodes and edges change in the graph. Subsequently, researchers [56]–[62] introduced the idea of incremental computation and incorporated query search optimization strategies. For example, Sutanay et al. [57] construct the pattern graph as a binary tree and decompose it for storage in tree nodes, with data updates performed by searching leaf nodes. However, storing a large number of indexes consumes memory. Sun et al. [63] propose an exploration-based approximate matching technique, but if the initially selected node is inappropriate, a large number of useless intermediate results will be generated. ACTMINER constructs a tree structure to incrementally update the alignment results and introduces a forgetting rate to maintain stable memory overhead.

VIII. CONCLUSION

We propose ACTMINER, a system that enables the detection of complete APT attack chains by applying causality tracking and incremental aligning. It overcomes the issues of low precision, low recall, and low efficiency that existed in previous work. Experimental results show that ACTMINER exhibits better detection rates and resilience against adversarial attacks compared to the SOTA.

REFERENCES

[1] A. Bates et al., “Trustworthy Whole-System provenance for the linux kernel,” in USENIX, 2015, pp. 319–334.
[2] S. M. Milajerdi et al., “Poirot: Aligning attack behavior with kernel audit records for cyber threat hunting,” in CCS, 2019, pp. 1795–1812.
[3] S. M. Milajerdi et al., “Holmes: real-time apt detection through correlation of suspicious information flows,” in S&P. IEEE, 2019, pp. 1137–1152.
[4] M. N. Hossain et al., “Combating dependence explosion in forensic analysis using alternative tag propagation semantics,” in S&P. IEEE, 2020, pp. 1139–1155.
[5] Y. Xie et al., “Pagoda: A hybrid approach to enable efficient real-time provenance based intrusion detection in big data environments,” TDSC, 2018.
[6] M. Bishop et al., Introduction to computer security. Addison-Wesley Boston, 2005, vol. 50.
[7] C. Kruegel et al., Intrusion detection and correlation: challenges and solutions. Springer Science & Business Media, 2004, vol. 14.
[8] C. Xiong et al., “Conan: A practical real-time apt detection system with high accuracy and efficiency,” TDSC, 2020.
[9] T. Zhu et al., “Aptshield: A stable, efficient and real-time apt detection system for linux hosts,” TDSC, 2023.
[10] M. N. Hossain et al., “SLEUTH: Real-time attack scenario reconstruction from COTS audit data,” in USENIX, 2017.
[11] E. Altinisik et al., “Provg-searcher: A graph representation learning approach for efficient provenance graph search,” in CCS, 2023.
[12] J. Zeng et al., “Shadewatcher: Recommendation-guided cyber threat analysis using system audit records,” in S&P. IEEE, 2022.
[13] S. Wang et al., “Threatrace: Detecting and tracing host-based threats in node level through provenance graph learning,” TIFS, vol. 17, pp. 3972–3987, 2022.
[14] X. Han et al., “Unicorn: Runtime provenance-based detector for advanced persistent threats,” pp. 1–18, 2020.
[15] E. Manzoor et al., “Fast memory-efficient anomaly detection in streaming heterogeneous graphs,” in Dblp, 2016, pp. 1035–1044.
[16] M. U. Rehman et al., “Flash: A comprehensive approach to intrusion detection via provenance graph representation learning,” in S&P. IEEE, 2024, pp. 139–139.
[17] F. Yang et al., “PROGRAPHER: An anomaly detection system based on provenance graph embedding,” in USENIX, 2023, pp. 4355–4372.
[18] “Threat report,” https://www.crowdstrike.com/global-threat-report/.
[19] “Lateral movement,” https://www.crowdstrike.com/cybersecurity-101/lateral-movement/.
[20] A. D. Keromytis, “Transparent computing engagement 3 data release,” 2018, https://github.com/darpa-i2o/Transparent-Computing/blob/master/README-E3.md.
[21] “Darpa transparent computing engagement,” 2020, https://www.darpa.mil/program/transparent-computing.
[22] “mandiant/openioc_1.1,” https://github.com/mandiant/OpenIOC_1.1/.
[23] “Introduction to stix,” https://oasis-open.github.io/cti-documentation/stix/intro.html.
[24] “Misp - open source threat intelligence platform & open standards for threat information sharing,” https://www.misp-project.org.
[25] P. Gao et al., “Enabling efficient cyber threat hunting with cyber threat intelligence,” in ICDE. IEEE, 2021, pp. 193–204.
[26] G. Husari et al., “Ttpdrill: Automatic and accurate extraction of threat actions from unstructured text of cti sources,” in ACSAC, 2017.
[27] X. Liao et al., “Acing the ioc game: Toward automatic discovery and analysis of open-source cyber threat intelligence,” in CCS, 2016.
[28] K. Satvat et al., “Extractor: Extracting attack behavior from threat reports,” in EuroS&P. IEEE, 2021, pp. 598–615.
[29] W. U. Hassan et al., “Nodoze: Combatting threat alert fatigue with automated provenance triage,” in NDSS, 2019.
[30] W. U. Hassan et al., “Tactical provenance analysis for endpoint detection and response systems,” in S&P. IEEE, 2020, pp. 1172–1189.
[31] Q. Wang et al., “You are what you do: Hunting stealthy malware via data provenance analysis,” in NDSS, 2020.
[32] M. A. Inam et al., “Sok: History is a vast early warning system: Auditing the provenance of system intrusions,” in S&P. IEEE, 2023, pp. 2620–2638.
[33] R. Sekar et al., “eaudit: A fast, scalable and deployable audit data collection system,” in S&P. IEEE, 2023, pp. 87–87.
[34] T. Chen et al., “Kellect: a kernel-based efficient and lossless event log collector,” arXiv preprint arXiv:2207.11530, 2022.
[35] “Windows event tracing,” https://docs.microsoft.com/en-us/windows/desktop/ETW/event-tracing-portal.
[36] T. Zhu et al., “General, efficient, and real-time data compaction strategy for apt forensic analysis,” TIFS, vol. 16, pp. 3312–3325, 2021.
[37] “Darpa3-cdm,” https://drive.google.com/drive/folders/1gwm2gAlKHQnFvETgPA8kJXLLm3L-Z3H1.
[38] A. Gehani et al., “Spade: Support for provenance auditing in distributed environments,” in Middleware 2012: ACM/IFIP/USENIX 13th International Middleware Conference, Montreal, QC, Canada, December 3-7, 2012. Proceedings 13. Springer, 2012, pp. 101–120.
[39] R. Wei et al., “Deephunter: A graph neural network based approach for robust cyber threat hunting,” in SecureComm. Springer, 2021, pp. 3–24.
[40] A. Aly, S. Iqbal, A. Youssef, and E. Mansour, “Megr-apt: A memory-efficient apt hunting system based on attack representation learning,” IEEE Transactions on Information Forensics and Security, vol. 19, pp. 5257–5271, 2024.
[41] W. Cheng, T. Zhu, T. Chen, Q. Yuan, J. Ying, H. Li, C. Xiong, M. Li, M. Lv, and Y. Chen, “Crucialg: Reconstruct integrated attack scenario graphs by cyber threat intelligence reports,” arXiv preprint arXiv:2410.11209, 2024.
[42] A. Goyal et al., “Sometimes, you aren’t what you do: Mimicry attacks against provenance graph host intrusion detection systems,” in NDSS, 2023.
[43] J. R. Ullmann, “An algorithm for subgraph isomorphism,” JACM, vol. 23, no. 1, pp. 31–42, 1976.
[44] L. P. Cordella et al., “A (sub) graph isomorphism algorithm for matching large graphs,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 26, no. 10, pp. 1367–1372, 2004.
[45] D. Shasha et al., “Algorithmics and applications of tree and graph searching,” in PODS, 2002, pp. 39–52.
[46] J. Cheng et al., “Fg-index: towards verification-free query processing on graph databases,” in SIGMOD, 2007, pp. 857–872.
[47] P. Zhao et al., “Graph indexing: tree + delta <= graph,” in VLDB. Citeseer, 2007, pp. 938–949.
[48] X. Yan et al., “Graph indexing: a frequent structure-based approach,” in SIGMOD, 2004, pp. 335–346.
[49] Y. Tian et al., “Saga: a subgraph matching tool for biological graphs,” Bioinformatics, vol. 23, no. 2, pp. 232–239, 2007.
[50] Y. Tian et al., “Tale: A tool for approximate large graph matching,” in ICDE. IEEE, 2008, pp. 963–972.
[51] H. Tong et al., “Fast best-effort pattern matching in large attributed graphs,” in Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, 2007, pp. 737–746.
[52] H. He et al., “Closure-tree: An index structure for graph queries,” in ICDE. IEEE, 2006, pp. 38–38.