
Graph-Based Threat Hunting

The document presents ACTMINER, a novel threat hunting system designed to improve the detection of Advanced Persistent Threats (APTs) by addressing high false negatives and false positives in existing systems. ACTMINER utilizes a heuristic search strategy and a filtering mechanism based on causal relationships to enhance detection accuracy and efficiency, while also employing a tree structure for incremental updates to combat persistent attacks. Evaluation results indicate that ACTMINER significantly outperforms existing methods, reducing false positives and eliminating false negatives in threat detection.



ACTMINER: Applying Causality Tracking and Increment Aligning for Graph-based Threat Hunting
Mingjun Ma, Tiantian Zhu*, Tieming Chen, Shuang Li, Jie Ying, Aohan Zheng, Chunlin Xiong, Mingqi Lv, and
Yan Chen, IEEE Fellow

Abstract: To defend against Advanced Persistent Threats on the endpoint, threat hunting employs security knowledge such as cyber threat intelligence to continuously analyze system audit logs through retrospective scanning, querying, or pattern matching, aiming to uncover attack patterns/graphs that traditional detection methods (e.g., recognition for Point of Interest) fail to capture. However, existing threat hunting systems based on provenance graphs face challenges of high false negatives, high false positives, and low efficiency when confronted with diverse attack tactics and voluminous audit logs. To address these issues, we propose a system called ACTMINER, which constructs query graphs from descriptive relationships in cyber threat intelligence reports for precise threat hunting (i.e., graph alignment) on provenance graphs. First, we present a heuristic search strategy based on equivalent semantic transfer to reduce false negatives. Second, we establish a filtering mechanism based on causal relationships of attack behaviors to mitigate false positives. Finally, we design a tree structure to incrementally update the alignment results, significantly improving hunting efficiency. Evaluation on the DARPA Engagement dataset demonstrates that compared to the SOTA POIROT, ACTMINER reduces false positives by 39.1%, eliminates all false negatives, and effectively counters adversarial attacks.

Index Terms: Threat Hunting, Advanced Persistent Threat, Attack Scenario Graph, Data Provenance.

arXiv:2501.05793v1 [cs.CR] 10 Jan 2025

This work is supported by the following grants: National Natural Science Foundation of China under Grant No. U22B2028 and 62372410. The Fundamental Research Funds for the Provincial Universities of Zhejiang under Grant No. RF-A2023009. Zhejiang Provincial Natural Science Foundation of China under Grant No. LZ23F020011.
M. Ma, T. Zhu, T. Chen, S. Li, J. Ying, Q. Yuan and M. Lv are with the College of Computer Science and Technology, Zhejiang University of Technology, Hangzhou 310023, China. E-mail: zjutmmj@zjut.edu.cn, ttzhu@zjut.edu.cn*, tmchen@zjut.edu.cn, lish@zjut.edu.cn, jieying@zjut.edu.cn, zjutzah@zjut.edu.cn, mingqilv@zjut.edu.cn. *corresponding author
C. Xiong is with the China Unicom (Guangdong) Industrial Internet Co., Ltd., Guangzhou 510555, China. E-mail: chunlinxiong@gmail.com.
Y. Chen is with Department of Electrical Engineering and Computer Science, Northwestern University, Evanston, IL 60208, USA. E-mail: ychen@northwestern.edu.

I. INTRODUCTION

Advanced Persistent Threats (APTs) aim to infiltrate specific institutions to obtain critical asset information and sensitive data, posing immense threats and impacts. In order to detect and investigate APT attacks on hosts, data provenance [1] is widely used to analyze basic events (e.g., a sensitive file written by a malicious process) step-by-step. Researchers have increasingly begun to employ provenance graphs in the field of APT attack detection on hosts.

Existing provenance-based systems for detecting APTs fall into the following two main categories: rule-based [2]–[10], and learning-based [11]–[17]. The policies presented by rule-based systems are difficult to sustain in a constantly changing system environment: analysts must frequently update the rule base to adapt to new types of attacks, and the lag in defense results in frequent occurrences of false negatives. From the perspective of detection granularity, learning-based detection methods can be categorized into graph-level [11], [14], [15], [17] and node/edge-level [12], [13], [16]. Graph-level detection typically involves learning the characteristics of benign graphs and using techniques such as clustering to discern abnormal ones. However, in the anomalous subgraph [14], [15], [17], not all nodes/edges are necessarily associated with the attack. Experts still need further analysis to pinpoint the malicious attack path accurately. In contrast, node/edge-level detection methods are able to obtain the point of interest (POI), which makes them more direct and effective than graph-level ones in locating the attack candidate. However, alerts that solely focus on nodes and edges do not fully reveal the panorama of the attack, and analysts still need to spend a significant amount of time evaluating whether the generated POIs are false positives [12], [13], [16]. According to CrowdStrike's 2024 Global Threat Report [18], the time to compromise a host has gone from 84 minutes in 2022 to 62 minutes in 2023. This means that if an attack is not detected and responded to in a timely manner, the attacker will likely perform lateral movement [19] that will cause more hazards.

Considering the shortcomings of the above detection systems, POIROT [2] has proposed a threat hunting approach based on graph alignment. It extracts the query graph from cyber threat intelligence (CTI) reports and then performs graph matching on the provenance graph to capture malicious behaviors. Due to the inherent interpretability of the query graph, its matching results can reflect comprehensive information about the attack, enabling analysts to respond accurately and promptly to the attack.

But POIROT still faces the following three limitations:
- Semantic Gap (C1). How to apply the attack knowledge extracted from CTI reports to solve the problems of attack camouflage and persistence. The extracted query graphs are often difficult to map directly one-to-one to the provenance graph (e.g., files with the same type but different names). These mismatches between the two graphs can lead to inaccurate detection outcomes.
- Temporal & Causality Missing (C2). How to detect causal relationships of attacks in dynamic scenarios to minimize erroneous alerts. Solely concentrating on a single behavior can overwhelm the hunting system with an influx
of alerts, thereby hindering its ability to effectively identify and mitigate truly malicious activities (e.g., discerning the semantic differences in the access to sensitive files such as /etc/passwd between normal processes and malicious processes requires consideration of causal relationships). This disregard may lead to more imprecise detection, rendering the system ineffective in countering sophisticated threats.
- Data Explosion Dilemmas (C3). How to minimize memory overhead and enhance the efficiency of threat hunting. Existing solutions assume an ideal scenario for datasets, that is, researchers assume that a complete attack can be discovered within a single batch of data. However, APTs exhibit persistence, and attack chains may span different batches of data (e.g., data from the first and third days). Repeated scans on ever-expanding datasets introduce significant overhead.

In this work, we propose ACTMINER, a threat hunting system that combines causality tracking and incremental aligning to efficiently and accurately uncover attack chains. To tackle C1, ACTMINER constructs a heuristic search strategy based on equivalent semantic transfer to counter phenomena such as attack camouflage, persistence, and evasion. We fuse the data information of inter-entity interactions through entities and their contextual semantics in order to achieve the accurate capture of malicious behaviors. To address C2, ACTMINER constructs a filtering mechanism based on the causal relationships of attack behaviors, and ignores unreasonable entity context relationships. ACTMINER employs the causal motivation behind attacks to guide threat hunting, ensuring the interpretability of hunting results and minimizing false positives. In other words, we provide a more accurate hunting result by excluding unreasonable (attack-irrelevant) paths based on the causal relationship through temporal sequences. To deal with C3, ACTMINER constructs a tree structure to incrementally update the alignment results, thereby avoiding the significant overhead caused by rescanning multiple batches of redundant data.

We evaluate the effectiveness and efficiency of ACTMINER on the dataset provided by the DARPA TC program [20], [21]. Our results reveal that ACTMINER surpasses existing provenance-based threat hunting systems in terms of detection precision and recall. Moreover, ACTMINER can reduce the computational overhead and eliminate redundant searches. By deploying ACTMINER, security analysts are able to effectively analyze attack chains and formulate countermeasures, significantly alleviating the workload. In summary, the main contributions of our work are as follows:
• Unlike traditional attack detection methods, we propose a provenance-based threat hunting system ACTMINER to accurately capture attack chains.
• We introduce a heuristic search strategy based on equivalent semantic transfer and a filtering mechanism based on causal relationships of attack behaviors to ensure the precision and recall of ACTMINER.
• We propose a tree structure to incrementally update the alignment results, effectively addressing persistent APT attacks and the continuous growth of graph data.
• We comprehensively evaluate our system and SOTA POIROT [2] on the dataset from the DARPA TC program. The results demonstrate the efficiency of ACTMINER in capturing the attack chain and highlight its resistance to adversarial attacks.

II. BACKGROUND KNOWLEDGE

A. Provenance Graph
Provenance graphs possess potent semantic expressiveness and contextual association capabilities, embodying the concrete manifestation of kernel audit logs. They model all system entities within the logs as nodes and the interactions between entities as edges, where both nodes and edges bear attribute information. The nodes within the provenance graph are categorized into subjects and objects based on the direction of data movement. Edges represent the causal relationships between system entities, such as read/write file operations, execute executable file operations, create/clone process operations, and so forth. By leveraging provenance graphs, security professionals can associate malicious entities with attack behaviors through causal analysis, unveiling the complete picture of an attack.

B. CTI Report and Query Graph
Cyber threat intelligence (CTI) reports [22]–[24] encompass comprehensive information related to cyber attacks and attackers, with a particular emphasis on capturing detailed attack procedures, i.e., the intricate sequences of steps and techniques employed in multi-stage attacks. These reports provide in-depth representations of attack scenarios, potential impacts on target hosts, as well as the complex chains of causally-linked events that characterize APTs. Security professionals leverage CTI reports to formulate more targeted defense rules for preventing and identifying malicious attack behaviors. In recent years, substantial research [2], [11], [25]–[27] has demonstrated the successful application of CTI reports in threat detection and threat hunting. In this paper, we construct a directed graph, termed the query graph, from the offensive and defensive knowledge [28] (attack entities and their causal relationships) extracted from CTI reports. Similar to provenance graphs, query graphs are directed graphs with attribute information.

C. Graph Alignment
Graph alignment refers to the problem of detecting potential cyber intrusion behaviors by establishing an optimal subgraph mapping between a provenance graph (G_p) representing system activities across the entire system, and a query graph (G_q) representing attack pattern activities. The provenance graph G_p = (V_p, E_p) consists of a node set V_p representing system entities and events, and an edge set E_p. The query graph G_q = (V_q, E_q) comprises a node set V_q representing attack patterns and an edge set E_q. The goal of graph alignment is to find a subgraph G_m in G_p that maximizes the matching degree with G_q:

G_m = \arg\max_{G' \subseteq G_p} M(G', G_q)    (1)

Here, M is the Matches function that calculates the matching degree between G_q and a subgraph G' of G_p. By solving this optimization problem, the best mapping from the attack query graph to the activity graph is obtained, enabling the detection and tracking of cyber intrusion behaviors.

III. MOTIVATION

A. Motivating Example
Scenario: Consider the following scenario where an attacker exploits the feature of automatically executing login scripts (Reg.exe and %temp%\art.bat_2) during login initialization (mal) to establish persistence by adding the malicious script path to the registry (HKCU\Environment R2). Subsequently, the attacker searches for network shares on the compromised computer to locate files and then collects sensitive data (/etc/passwd) from remote locations via shared network drives (host shared directories, network file servers, etc.). Finally, the data is transmitted over the network (162.66.239.75).

As illustrated in Figure 1, this example includes two graphs: the top-left depicts the attack query graph manually extracted from a cyber threat intelligence (CTI) report, following the approach outlined in POIROT.

The attack initiates by leveraging a malicious executable, malicious.exe, to obtain unauthorized system access. It then employs Registry modifications to establish persistence. Once a foothold is secured, the attacker can remotely issue commands and execute them on the system, executing an art.bat_2 file in the temporary folder and facilitating actions such as exfiltrating sensitive data to an external IP, exemplified by the transfer of /etc/passwd containing user account information. On the right side is the provenance graph constructed from actual system logs capturing the observed execution behavior.

The fragmented nature of attack scenarios, coupled with the hop-count limit imposed by existing threat hunting approaches, can lead to imprecise or incomplete results during the threat hunting process. In this paper, we transform the threat-hunting problem into finding the attack query graph within the provenance graph.

Threat hunting methods centered around POIROT encounter several significant challenges:

False Negative. Due to the complexity of real enterprise environments, semantic gaps exist between provenance graphs and attack query graphs. The manifestations of the same attack type may differ across systems, and attackers may utilize the same tools in diverse ways. For instance, entity names in the attack query graph may have varying representations in the underlying logs of different systems. In POIROT, regular expressions are employed to instantiate node names from the attack query graph for hunting searches in the provenance graph. However, if attackers modify their tactics, introducing technical variations, threat hunting systems struggle to detect different mutated attack samples (e.g., over 100 versions of the Carbanak malware were described in CTI reports). Furthermore, attackers can evade security detection through obfuscation, persistence, and evasion techniques. As illustrated, the attack query graph only describes data exfiltration over the network, whereas in the real environment, the attacker creates processes to read files, followed by another process reading those files, and finally transmitting them over the network. Overreliance on the simplistic approach of threat hunting based on a predetermined number of hops may inadvertently overlook malicious activities that align with the intrinsic characteristics of attacks, ultimately causing detection to fail. As shown in Figure 1, in a simulated scenario, as described in [2] Section 5, we set C_thr to 3, but find that this limitation resulted in an incomplete capture of the attack chain. Consequently, it was unable to detect the art_bat file. Moreover, with such hop count restrictions, attackers aware of the imposed limits could potentially evade hunting more easily across different scenarios, resulting in potential harm. The manual adjustment of the hop limit according to different scenarios poses significant challenges. The same problem exists in other path-based detection efforts [29]–[31].

False Positive. Within real organizations, extensive legitimate user operations exhibit similarities with attack behaviors in log data. If hunting rules are overly broad or incomplete, normal behaviors may be misclassified as malicious. For example, when a user downloads network files through a browser, the browser collects user data and transmits it to its cloud server, while the downloaded network files may be flagged as "suspicious files" by the system, resembling malicious attack behaviors and triggering false alarms from the hunting system. Additionally, attackers may leverage tools/techniques to deceive hunting systems, also leading to false positives. As highlighted in Figure 1, a suspicious process Mal.exe exhibits two paths for reading/writing sensitive files. According to the attack query graph, sensitive file access should occur before network transmission, while Path_1 occurs after transmission. Therefore, Path_2 represents the attacker's actual operations in the environment.

High overhead and inefficiency. Government and enterprise organizations typically need to collect data simultaneously from thousands of machines, easily amassing petabyte-scale data volumes. This massive data not only imposes substantial storage overhead but also significantly reduces hunting efficiency. Traditional hunting methods require offline storage and continuous backscanning of system log data, resulting in immense computational overhead for each hunting operation. Referring to Figure 1, assume that all operations before node E23.txt at time 114 have already occurred. When a security analyst attempts to hunt for threats solely based on the data collected after this time, the incremental data segment alone cannot effectively support the reconstruction of the complete attack chain represented by its query graph. When examining the issue holistically, the newly acquired data lacks the necessary evidence to capture the earlier stages of the multi-step intrusion. Consequently, subsequent data collection would necessitate rescanning the previously available information, redundantly recomputing the provenance of data prior to a specific timeframe. These redundant computations across multiple hunting activities introduce an unsustainable overhead, hindering the system's efficiency and scalability and potentially allowing malicious activities to persist undetected for extended periods.
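The false-positive case in the motivating example (Path_1 versus Path_2) comes down to checking the order of events against the order required by the query graph. A minimal Python sketch of such a check follows. The `Event` record, the `respects_order` helper, the `access` operation label, and the timestamps are illustrative assumptions loosely modelled on the edge labels of Fig. 1; they are not ACTMINER's actual implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Event:
    """Hypothetical event record: who did what to which entity, and when."""
    subject: str
    obj: str
    op: str
    ts: int  # logical timestamp, as in the numbered edges of Fig. 1


def respects_order(events: List[Event], required_ops: List[str]) -> bool:
    """Return True if required_ops occur in events in that temporal order.

    Greedy scan: for each required operation, pick the earliest matching
    event that is strictly later than the previously matched one.
    """
    last_ts = float("-inf")
    for op in required_ops:
        later = [e.ts for e in events if e.op == op and e.ts > last_ts]
        if not later:
            return False
        last_ts = min(later)
    return True


# Illustrative timestamps (assumed): the send happens at time 118.
SEND = Event("Mal.exe", "162.66.239.75", "send", 118)
path_1 = [SEND, Event("Mal.exe", "/etc/passwd", "access", 120)]
path_2 = [Event("Mal.exe", "/etc/passwd", "access", 116), SEND]

# The query graph requires sensitive file access before transmission.
QUERY_ORDER = ["access", "send"]

print(respects_order(path_1, QUERY_ORDER))  # False: access happens after send
print(respects_order(path_2, QUERY_ORDER))  # True: access precedes send
```

Under these assumed timestamps, Path_1 is rejected because its file access is stamped after the send, while Path_2 is kept as consistent with the query graph.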

[Figure 1 image: the attack query graph (top left) and the provenance graph (right); Path_1 and Path_2 are marked in the provenance graph.]

Fig. 1: Motivating Example. The red nodes and edges depict the truly malicious behavior. In contrast, the blue outlines
encompass false positive detection, where POIROT incorrectly identified benign system entities as malicious. The specific
nodes are the points with green borders. Furthermore, the orange outlines highlight the instances of missed detection or false
negatives, where POIROT failed to identify nodes that were indeed part of the attack chain.

IV. SYSTEM DESIGN

This section first introduces the overall architecture of the ACTMINER system, followed by a detailed description of each module presented in ACTMINER.

A. System Overview
Data Preparation Module (§ IV-B). The attack query graphs are extracted from threat intelligence reports, and provenance graphs are constructed based on extensive underlying logs. Duplicate events and orphan nodes within the provenance graphs undergo filtering, which is a necessary and common practice in existing work [32].

Causal Relation and Semantic Processing Module (§ IV-C). When a new attack query graph or provenance graph is generated, ACTMINER first categorizes the entities into four classes. The provenance graph is then delivered to the next module, while the attack query graph still needs to be processed. Next, ACTMINER merges analogous actions in the attack query graph. Finally, ACTMINER employs Equivalent Semantic Transfer, which traces potentially overlooked attack chains by tracking malicious semantics, to identify the suspicious actions in the next module.

Threat Hunting and Incremental Aligning Module (§ IV-D). ACTMINER hunts attack-related scenarios by chaining suspicious semantic nodes and generating suspicious semantic trees. As time progresses, batch data is continuously fed into ACTMINER, persistently updating the suspicious semantic trees and unveiling more latent malicious attacks. To control memory consumption, ACTMINER stores tree branches that are no longer being updated in the database until some behavior related to the branch occurs.

The basic architecture of ACTMINER is shown in Figure 2, which can be divided into three modules: (I) the Data Preparation Module, (II) the Causal Relation and Semantic Processing Module, and (III) the Threat Hunting and Incremental Aligning Module. It is important to note that ACTMINER continuously runs the above three modules as time progresses. Details of the system design for each module are given in Section IV-B, Section IV-C, and Section IV-D, respectively.

B. Data Preparation Module
This section describes the data preprocessing module for provenance graphs and query graphs.

1) Provenance Preparation: Provenance graphs are composed of log data collected from various platforms by data collectors. In this work, we employ open-source tools such as eAuditd [33], Kellect [34], and Event Tracing for Windows (ETW) [35] to gather relevant system logs from Linux and Windows environments. ACTMINER transforms each event into a directed, time-stamped edge, in which the source node represents the object being acted upon. For any event e_t ∈ E, ACTMINER represents it as a tuple ⟨UID_s, UID_o, OP, T_i⟩. UID_s and UID_o are unique identifiers for the subject and object of e_t, respectively. OP denotes the type of e_t, and T_i denotes the time when e_t occurred. Directly processing such massive raw log data is extremely challenging. To address this, we perform pruning operations on
[Figure 2 image: (I) Data Preparation Module (query graph extraction with Extractor from sources such as Microsoft, Symantec, and Cisco; open-source collectors eAuditd and Kellect on Windows, Linux, and BSD; filtering of redundant events; provenance graph built from data batches T, T+1, T+2); (II) Causal Relation and Semantic Processing Module (attribute abstraction, merging analogous actions, equivalent semantic transfer); (III) Threat Hunting and Incremental Aligning Module (suspicious semantic tree construction, candidate node identification, attack intent confirmation, dead attack chain pruning, tree node creation, database of unupdated tree branches, alert and investigate).]

Fig. 2: The architecture of ACTMINER, which consists of three core modules that synergistically facilitate comprehensive threat
hunting and attack chain construction capabilities.

the low-level log data. Specifically, redundant events without context are removed [9], [36]. This means that if the subject UID (UID_s), object UID (UID_o), and operation (OP) are identical, and the timestamps (T_i) are consecutive, only the event with the most recent timestamp is preserved. Furthermore, our methodology includes the removal of isolated nodes within the provenance graph. Isolated nodes are entities that lack any incoming or outgoing edges. For example, we find that the data contains many such nodes: they have no parent or child nodes, no event contains a subject UID (UID_s) or object UID (UID_o) matching the node's UID, and they lack contextual information and fail to provide meaningful insights. Removing them therefore does not compromise the integrity of the graph representation.

TABLE I: A categorization of distinct entities and their corresponding label assignments.

Entity | Tag | Description
Process | P | Processes, threads spawned by system calls
User Configuration Sensitive Files | Fa | Sensitive files containing user configuration information, such as /etc/passwd
Application Configuration Sensitive Files | Fb | Sensitive files that contain configuration information about the application, such as /etc/mysql/my.cnf
Log-sensitive documents | Fc | Sensitive files containing logging information, e.g. /etc/httpd/logs
Library file | Fd | A collection of pre-compiled methods with extensions such as .lib, .a, .dll, .so, etc.
Executable file | Fe | Files that can be loaded and executed by the operating system, with extensions such as .exe, .vbs, etc.
Temporary document | Ff | Temporary files generated by the system, e.g., /tmp/*
Other documents | Fg | A collection of other types of files, such as plain text files, plain graph files, plain zip files, etc.
Registry | R | Unified management of hardware and software configurations, including HKLM, HKCU, HKCR, HKCC, HKU, etc.
Socket | S | Refers to a host on the Internet or a process in a host, e.g. 127.0.0.1
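The two pruning steps described in this subsection (collapsing repeated events and dropping isolated nodes) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' code: the `Event` record and both function names are hypothetical, and "consecutive timestamps" is interpreted here as events that are adjacent in the time-ordered stream.

```python
from typing import Iterable, List, NamedTuple


class Event(NamedTuple):
    """Hypothetical event record with the fields UID_s, UID_o, OP, T_i."""
    uid_s: str  # subject identifier
    uid_o: str  # object identifier
    op: str     # operation type
    ts: int     # timestamp


def prune_redundant(events: Iterable[Event]) -> List[Event]:
    """Collapse runs of adjacent events that share (uid_s, uid_o, op).

    Within each run only the event with the most recent timestamp is kept.
    Assumption: "consecutive" means adjacent in the time-ordered stream.
    """
    pruned: List[Event] = []
    for ev in sorted(events, key=lambda e: e.ts):
        if pruned and pruned[-1][:3] == ev[:3]:
            pruned[-1] = ev  # same subject, object, and op: keep the newer one
        else:
            pruned.append(ev)
    return pruned


def drop_isolated(nodes: Iterable[str], events: Iterable[Event]) -> List[str]:
    """Keep only nodes that appear as subject or object of some event."""
    events = list(events)
    connected = {e.uid_s for e in events} | {e.uid_o for e in events}
    return [n for n in nodes if n in connected]


raw = [
    Event("p1", "f1", "write", 1),
    Event("p1", "f1", "write", 2),  # repeats the previous event
    Event("p1", "f2", "read", 3),
    Event("p1", "f1", "write", 4),  # not adjacent to the earlier run
]
kept = prune_redundant(raw)
print([e.ts for e in kept])                                # [2, 3, 4]
print(drop_isolated(["p1", "f1", "f2", "orphan"], kept))   # ['p1', 'f1', 'f2']
```

A stricter reading of "consecutive" (for example, per subject-object pair rather than over the whole stream) would merge more events; the choice here is only one plausible interpretation.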
Concurrently, during the process of constructing provenance graphs, the node types are categorized based on the entity type contained within the logs. For instance, Fa represents files involving user and system-sensitive information, such as the boot.ini file on Windows and the /etc/passwd file on Linux. The specific categorization is detailed in Table I.

As shown in Table I, inspired by APTShield [9] and Conan [8], we refined the classification heuristics and optimized the classification method obtained from POIROT. Furthermore, we extensively gather CTI reports from various online sources and network channels, identifying specific file paths that exhibit heightened susceptibility to attacks. Consequently, we adapt the importance degree of these paths based on their frequency of occurrence to finally obtain ten distinct labels.

By analyzing CDM18 and CDM19, which refer to the data definitions for DARPA's E3 and E4 programs, respectively, we adopt events based on a few general fields in the CDM (i.e., the events of read, write, fork, clone, create, execute, load, and inject), which were most commonly used in previous studies such as HOLMES [3], Sleuth [10], and Morse [4]. The details are shown in Table II.

2) Query Graph Preparation: CTI reports describe attacks that have already occurred. We collect the latest threat intelligence from websites such as Microsoft and Symantec. Leveraging open-source tools like Extractor [28], we extract attack query graphs (G_q) from cyber threat intelligence (CTI) reports. Analogous to provenance graphs, upon extracting attack query graphs, the constituent entities undergo a corresponding mapping process.

C. Causal Relation and Semantic Processing Module
In this section, we sequentially describe the process of Module II in Figure 2, i.e., merging analogous actions and equivalent semantic transfer. In a nutshell, analogous actions in attack query graphs are merged while employing an equivalent semantic transfer strategy. This process enhances attack

TABLE II: Classification of different event types.

Number | EventType | Subject | Object | Direction | Description
1 | accept | P | S | backward | Accepting socket connections
2 | inject | P | P | forward | Running arbitrary code in the address space of independent activity processes
3 | clone | P | P | forward | Cloned subject
4 | connect | P | S | forward | Connect to a socket type guest
5 | execute | P | F | forward | The subject invokes and executes the object
6 | fork | P | P | forward | Creating the process body
7 | load | P | F | backward | Load file to current workspace data
8 | write | P | F | forward | Open a file or directory (Object) and write information
9 | receive | P | S | backward | Receive data through a connected socket port
10 | send | P | S | forward | Sending data through a connected socket port
11 | exit | P | P | forward | Process exit
12 | unlink | P | F | forward | Deletion of individual files

Note: However, if the attacker changes tactics and introduces technical variations (for example, the node marked with a red triangle in the figure), existing methods take a roundabout route and ignore this node, so such attack mutations undermine existing hunting methods.
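One way to read the Direction column of Table II is as the direction of information flow between subject and object. The short Python sketch below encodes the twelve listed event types and orients an edge accordingly. It rests on an assumption not stated in the text: that "forward" means information flows from subject to object and "backward" means the reverse. The names `DIRECTION` and `flow_edge` are hypothetical.

```python
from typing import Tuple

# Direction column of Table II, copied for the twelve listed event types.
DIRECTION = {
    "accept": "backward",
    "inject": "forward",
    "clone": "forward",
    "connect": "forward",
    "execute": "forward",
    "fork": "forward",
    "load": "backward",
    "write": "forward",
    "receive": "backward",
    "send": "forward",
    "exit": "forward",
    "unlink": "forward",
}


def flow_edge(subject: str, obj: str, event_type: str) -> Tuple[str, str]:
    """Return (source, target) of the information flow for one event.

    Assumption: "forward" is subject -> object, "backward" is object -> subject.
    Raises KeyError for event types that Table II does not list.
    """
    if DIRECTION[event_type] == "forward":
        return (subject, obj)
    return (obj, subject)


print(flow_edge("P", "S", "send"))     # ('P', 'S')
print(flow_edge("P", "S", "receive"))  # ('S', 'P')
```

Keeping the table as data rather than branching logic makes it easy to extend if further event types are adopted from the CDM.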

hunting capabilities while simultaneously preparing the data for input to Module III.

[Figure 3 image: (a) P, write, Fa; (b) P1, inject, P2, write, Fe, read, P3, fork, P4, write, Fa.]
Fig. 3: An example of equivalent semantic transitivity.

1) Merging Analogous Actions: When ACTMINER adopts the attack query graphs to perform threat hunting tasks, it encounters the issue of rigid matching. This makes the alignment between attack query graphs and provenance graphs hard to achieve and ultimately results in high false negative and false positive rates in Module III.
Additionally, the directly extracted G_q is fixed, which limits its ability to generalize and effectively defend against variants of known attacks, as it is specifically tailored to documented attack patterns. To address this challenge, a key insight of ACTMINER is merging analogous actions. This method integrates nodes with similar operations. For instance, instead of representing each individual file with a distinct node, the generalized G_q may group files based on their types (e.g., Fa, Fb, Fc in Table I) in the attack chain. Simultaneously, we consider the temporal relationships of attack events. Processes can be merged based on their functional attributes or the operations they perform, rather than being strictly tied to specific executable paths or process names. In essence, this method transforms the information flow from the source entity to the same target entity while preserving the semantic meaning of the source entity, so that equivalent events can be removed as redundant information, thereby achieving efficient compression of the data.

2) Equivalent Semantic Transfer: During the attack process, malicious processes controlled by the attacker interact with other entities [8], causing malicious behaviors and effects to proliferate across the system through the intricate web of entity interactions and information propagation pathways, thereby expanding the attacker's control scope.

Based on the above finding, we construct the equivalent semantics transitivity strategy to address the false negative issue. The key insight of this strategy is that the semantics of malicious

finally, process P4 with suspicious semantics performs a write operation on sensitive files. Through this series of operations, the ultimate behavior of a suspicious process tampering with sensitive files Fa is achieved. Although the types of controlled entities change during the attack process, the ultimate goal remains unchanged. The operations in (b) are essentially the same as (a) and align with the attacker's intent, making the two paths semantically equivalent.

To comprehensively capture attack intents and mitigate attack evasion while addressing semantic gaps between G_q and G_p, we integrated log data from the DARPA project, leveraged attack stage theories proposed in previous works like Conan [8] and APTShield [9], and incorporated relevant descriptions from the CDM document [37] to extract six equivalent semantic transitivity policies, as illustrated in Table III. Therefore, ACTMINER can track and locate suspicious behavior on the host in real time by analyzing the data flow transmission, while systems like HOLMES [3] cannot do this (according to Table 8 in HOLMES). In contrast, Sleuth [10] can detect suspicious behavior in specific steps through initial label propagation, but its label propagation mechanism may fail if the attacker uses legitimate tools (such as the command line) to operate, and it cannot accurately identify malicious intent. Based on the context information of entities, these policies automatically determine whether an event is attack-related. The subject represents a process, the object represents
behavior propagate with taking actions. As shown in Fig- different types of entities connected with the subject, and the
ure 3.(a), it depicts a short attack path described in the query direction indicates the information flow between subject and
graph, where a process controlled by the attacker tampers with object. For example, the third policy represents process →
sensitive files. Figure 3.(b) illustrates one of the attacker’s file: if the process in the write event is considered attack-
specific implementation approaches: First, a process P1 with related, suspicious semantics will propagate from the pro-
suspicious semantics injects malicious code to create process cess to the file. If another process reads the file containing
P2; then, process P2 writes a suspicious file Fe; next, another suspicious semantics, the suspicious semantics will further
malicious process P3 accesses the tainted file; subsequently, propagate from the file to that process. Then, we use a heuristic
the process P3 with suspicious semantics creates process P4; search algorithm integrating equivalent semantic transitivity in
7

Section IV-D.

TABLE III: Equivalent Semantic Transitivity Policies in the Context of Generalized Attack Pattern Identification and Matching.
Subject | Object | Direction | Requisites
P | P | forward | ∃p.semantics ∈ {SuspiciousLabel} ∧ [Event_Fork(p, p’) | Event_Create(p, p’) | Event_Clone(p, p’)]: p’.semantics.add("SuspiciousLabel")
P | P | forward | ∃p.semantics ∈ {SuspiciousLabel} ∧ Event_Inject(p, p’): p’.semantics.add("SuspiciousLabel")
P | F | forward | ∃p.semantics ∈ {SuspiciousLabel} ∧ Event_Write(p, f): f.semantics.add("SuspiciousLabel")
P | F | backward | ∃f.semantics ∈ {SuspiciousLabel} ∧ f.tag ∈ {Fd, Fe} ∧ [Event_Execute(p, f) | Event_Load(p, f)]: p.semantics.add("SuspiciousLabel")
P | F | backward | ∃f.semantics ∈ {SuspiciousLabel} ∧ Event_Read(p, f): p.semantics.add("SuspiciousLabel")

D. Threat Hunting and Incremental Aligning Module

1) Suspicious Semantic Tree Construction: An event contains the interaction information between entities and can be transformed into an information flow, which can be further classified into data flows and control flows. Data flows indicate dependencies in data content, reflecting the data propagation path (e.g., a process reading a file), while control flows primarily refer to process creation relationships (e.g., a parent process creating a child process). In the threat hunting module, data flows and control flows are jointly abstracted into a suspicious semantic tree. Generating a suspicious semantic tree consists of the following three steps, as shown in Algorithm 1:

Step 1: Finding Candidate Nodes. To capture malicious behaviors in the provenance graph constructed from low-level system logs that match the patterns in the corresponding query graph, our system first searches for all nodes in the provenance graph with attributes identical to those of an entity node in the query graph. These candidate nodes are collected into a list, referred to as the candidate set FC(i), which is associated with the query node (Line 7).

Step 2: Confirming Attack Intent. As the query graph Gq carries clear temporal features and causal relationships, we leverage this information to guide the attack detection and reconstruction processes. This enables us to quickly determine the initial intrusion location, relevant entity nodes, and the sequence of attack events. Such a query graph can assist analysts in searching for malicious behaviors effectively. The ExtractRelevantNodes function is subsequently invoked to identify the critical nodes within the query graph that are essential for comprehending the attack methodology (Line 8). This function operates by determining the next potential action nodes in Gp based on the preceding step Fc in Gq. The ReconstructAttackSequence algorithm reconstructs the attack sequence (Line 9), carefully aligning with the temporal and causal patterns within Gq. Unlike indiscriminate search methods, the approach is strategically guided by specific target nodes, precisely capturing the attacker’s intended progression. The function determines the matching order by tracing the sequence of attacks delineated in the graph. ACTMINER evaluates whether nodes introduce malicious semantics, which is identified by first fixing the target nodes and then analyzing behavior within Gq. This process involves tracing actions from the fixed nodes to determine patterns indicating malicious intent, forming the basis for further analysis.

To optimize the traversal and analysis process, the algorithm stores context information for each node (Lines 11-15). This step avoids redundant traversals of the graph and enables the algorithm to order the edges based on timestamps, resulting in an ordered hunting sequence (Lines 16-19). This sequence guides the subsequent steps of the algorithm. The algorithm calculates the reciprocal of the length of the shortest path between nodes in Gq and Gp as the path score. For each node in Gq, the candidate node with the highest contribution value is selected as the fixed node.

Step 3: Creating Tree Nodes. The final step of the algorithm involves the creation of tree nodes (Lines 19-27). Given the disparity in size between the query graph and the provenance graph, the CreateTreeNode function (Line 31) is utilized to map each query graph to one or more subgraphs in the provenance graph that exhibit similar patterns. The function specifically creates a branch in the tree for the attack entry point. This mapping is facilitated by the hunting sequence obtained in the previous step, enabling efficient navigation of the large and complex dataset during detection. After creating the tree nodes, when the attack progression exceeds the total attack sequence, the system raises an alert to notify security analysts of this anomalous situation.

To illustrate the aforementioned methodology, we present an example. We aim to find suspicious subgraphs similar to the query graph in the provenance graph shown in Figure 4. The upper part of the figure is a concrete instantiation of the graph, whereas the lower part is an abstraction of the model. Assuming we start from the process P1 within the red box as the starting point of the attack chain (powershell1.exe in the upper part), according to the query graph, the next step should be to find an executable file associated with P1, with the edge semantics being a write event. In the graph, only the Fe1 node (update.ps1 in the upper part) satisfies this condition, so we can generate a tree node that stores a variety of data, including the unique identifier of the query graph, event type, and relevant temporal information to assist subsequent hunting tasks. This step is largely analogous for both approaches, with only marginal differences in storage efficiency and temporal performance; the details can be found in Section V-D.

Similarly, P2 will also be identified and generate a tree node (powershell2.exe in the upper part). However, when the flows in Gq come to be associated with the socket, multiple similar scenarios may arise. In Gp, the path from P2 to the socket IP2 (indicated by the red arrow) satisfies the previously defined equivalent semantic transitivity policy, indicating that P2 and P3 share the same semantic information. Therefore, IP2 can be retained as a suspicious node, while generating a tree node and preserving the temporal relationship from P3 to IP2. In the above scenario, however, without this strategy and under rigid aligning rules, a false positive result is produced. Likewise, the path from P2 to IP3 also satisfies the equivalent semantic transitivity policy (it is important to note that intermediate nodes will be represented in the form of equivalent semantic attributes within the initial P1 node’s properties). However, for the path from P2 to IP1, although

[Figure: a provenance graph and a query graph, shown as a concrete instance (upper part; nodes such as powershell1, Update.ps1, /etc/passwd, /etc/group, firefox, cmd.exe and external IPs) and as an abstraction (lower part; nodes such as P1, Fe1, Fa1, Ff1, Fg1, IP1 to IP4), with timestamped edges such as 101.write, 104.Load, 103.connect, 107.connect and 116.write.]

Fig. 4: A case study comparing strategies such as POIROT with ACTMINER in the same scenario. Dashed parts, whether black or red, indicate that multi-step behaviour lies in between; the red lines and red boxes indicate the entities and events actually captured.
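
The walk-through around Fig. 4 rests on two checks: suspicious semantics spread along events in the manner of Table III, and a candidate edge is accepted only if it does not occur before the previously matched step. The following is a minimal Python sketch of those two checks, not the authors' implementation. The `Event` record, the `propagate` and `next_hops` helpers, the grouping of operations into forward and backward sets, and the four sample events (loosely modeled on the timestamps in Fig. 4) are all assumptions made for illustration, and the `f.tag ∈ {Fd, Fe}` condition on execute/load events is omitted for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    ts: int        # timestamp / sequence number of the event
    op: str        # operation name, e.g. "write", "load", "connect"
    subject: str   # acting process
    obj: str       # file, process, or socket acted upon


# Hypothetical grouping of operations by information-flow direction.
FORWARD_OPS = {"fork", "create", "clone", "inject", "write"}  # subject -> object
BACKWARD_OPS = {"read", "load", "execute"}                    # object -> subject


def propagate(events, seeds):
    """Replay events in time order and return every entity that ends up labeled."""
    labeled = set(seeds)
    for ev in sorted(events, key=lambda e: e.ts):
        if ev.op in FORWARD_OPS and ev.subject in labeled:
            labeled.add(ev.obj)
        elif ev.op in BACKWARD_OPS and ev.obj in labeled:
            labeled.add(ev.subject)
    return labeled


def next_hops(events, labeled, op, previous_ts):
    """Events of type `op` from a labeled subject that do not precede the last matched step."""
    return [ev for ev in events
            if ev.op == op and ev.subject in labeled and ev.ts >= previous_ts]


# Illustrative events only; names follow the abstract graph in Fig. 4.
events = [
    Event(101, "write", "P1", "Fe1"),
    Event(103, "connect", "P2", "IP1"),
    Event(104, "load", "P2", "Fe1"),
    Event(107, "connect", "P2", "IP2"),
]
labeled = propagate(events, seeds={"P1"})
hops = next_hops(events, labeled, op="connect", previous_ts=104)
print(sorted(labeled))
print([ev.obj for ev in hops])
```

In this toy input the connect to IP1 at time 103 is discarded because it precedes the load at time 104, which mirrors the ordering argument made in the text.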

it also represents P→IP with the edge semantics of a connect operation, it occurs before the previous node (Fe1→P2 at time 104), violating the sequence of the attack, and thus this path is ignored. Fa can be found in the same way.

Next, we need to find P→Ff (with the edge semantics of a write operation), but there are no files of the Ff type in the current provenance graph (assume the current time has not yet reached 116), so the system needs to wait for new data to arrive.

2) Incremental Aligning: As time progresses, the log data generated by hosts continues to grow. Traditional threat hunting systems must re-scan the entire, now larger, dataset whenever new logs are added. To address the inefficiency of traditional threat hunting in analyzing incremental streaming data, we adopt an incremental graph computation method to hunt for attacks and update suspicious semantic trees. First, we search for new candidate nodes in the newly arrived provenance graph based on the attributes of nodes in the query graph. Next, we divide the impact of the new data on the suspicious subtrees into two cases: the new data affects the existing suspicious semantic subtrees, or the new data is unrelated to the existing suspicious subtrees. Furthermore, to manage memory overhead effectively, we implement a forgetting rate to reduce memory consumption. We transfer the nodes that remain un-updated for a period of 6 hours (adjustable according to circumstances) into the database and create a corresponding index for them. The index includes the attributes of the node itself and its parent node, enabling rapid localization of relevant nodes in the event of a subsequent occurrence. This allows for the efficient retrieval and reinstatement of these nodes into memory.

For the former, through the affected tree nodes, we can obtain the current attack progress and their mapped nodes in the provenance graph and query graph. Then, through the sequence of suspicious candidate nodes, we can determine the next suspicious entity to hunt. Finally, we judge whether the suspicious states are met in the candidate node set of the suspicious entity, and if so, we construct a new tree node.

For the latter, we first determine whether the node has candidate nodes. If there are no candidate nodes, this part of the data has no entry point for attacks and is therefore considered benign. However, if there are candidate nodes, we start rebuilding the suspicious subtree from its candidate nodes.

As an example, the shaded part in Figure 4 represents the newly added data. For the new data, we determine whether it affects the existing results. The query graph (Gq) awaits the arrival of a pattern where process P2 writes the file Ff1. If so, a new node representing Ff1 is added to the graph. According to the query graph, this P→Ff1 (write) is the desired one-hop attack path. In the upper graph, this corresponds to powershell.exe → profile (write). Although in the query graph P←Fg (read) occurs before P→Ff (write), the previous operation read an ordinary file with weak attack relevance; if the provenance graph does not contain the corresponding connected flow and nodes, the operation did not introduce new suspicious attack semantics, and the next attack target should be further explored. Then, we search for the target

corresponding to P→P (execute). In the P2 process node, there exists an execute action to another process P4 (the dashed line indicates that there is no direct connection between the nodes and that there are multiple hops), generating new tree nodes corresponding to P2. Similarly, in the newly added data in the right shaded part, there are interactions with sockets. As described above, the semantics represented by P6 are the same as those of P2, and the P2 node can be mapped to the P node in the query graph P→IP (connect), so P6→IP4 (connect) is equivalent to P2→IP4 (connect), allowing the generation of a new tree node.

Algorithm 1 Threat Hunting Algorithm
Require: Gq, Gp
Ensure: Suspicious Subgraph Gs
1: /* Data Processing */
2: Gq ← MergeSimilarEntity(Gq)
3: LSe ← GetQuerySequences(Gq)
4: f ← GetSetQueryGraphFlow(Gq)
5: Gp ← MergeSimilarEvent(Gp)
6: /* Threat Hunting and Incremental Aligning */
7: FC ← FindCandidateNodes(Gq, Gp)
8: relevant_nodes ← ExtractRelevantNodes(Gq)
9: attack_sequence ← ReconstructAttackSequence(Gq)
10: hunting_sequence ← ∅
11: for n ∈ relevant_nodes do
12:     context ← RetrieveNodeContext(n)
13:     hunting_sequence.add(context)
14: end for
15: hunting_sequence.sort(key = lambda x: x.timestamp)
16: visited ← ∅
17: fixed_candidates ← ∅
18: seqNum ← 0
19: for context ∈ hunting_sequence do
20:     q_node ← context.corresponding_query_node
21:     if q_node ∉ visited then
22:         candidates ← FC[q_node]
23:         fixed_candidates[q_node] ← candidates
24:         visited[q_node] ← True
25:     else
26:         candidates ← fixed_candidates[q_node]
27:     end if
28:     for c ∈ candidates do
29:         if IsMaliciousSemantic(c) then
30:             seqNum ← seqNum + 1
31:             node ← CreateTreeNode(c, seqNum)
32:             Gs.add(node)
33:         end if
34:     end for
35: end for
36: return Gs

V. EVALUATION

Our evaluation aims to answer the following five questions:
• RQ1: How effectively can ACTMINER detect the attacks, especially in terms of false alarms?
• RQ2: How robust is ACTMINER against adversarial attacks?
• RQ3: How important are the components we design for assisting threat hunting?
• RQ4: How efficient is ACTMINER compared with the SOTA in terms of runtime overhead?
• RQ5: How robust is ACTMINER on benign datasets?

TABLE IV: The detail of the attack and benign datasets.
Scenario | Behavior
E4-Trace case1 | Malicious file download and execute
E4-Trace case2 | Information gather and exfiltration
E4-Trace case3 | Malicious file download and sensitive file exfiltration
E4-Trace case4 | In-memory attack with firefox
E3-FiveDir case1 | Pine backdoor
E3-FiveDir case2 | Phishing E-mail Link with macro viruses
E3-Trace case1 | Firefox backdoor and load malicious software
E3-Trace case2 | Firefox backdoor and deploy malicious programme
E3-Trace case3 | Phishing E-mail
E3-Theia case1 | Firefox Backdoor and privilege escalation
Win Benign Data | Account operation, network communication and application activity
Linux Benign Data | User Login, application operation and network interaction

TABLE V: The summary of the experimental dataset. Column 1 specifies the name of the dataset, and Column 2 denotes the corresponding duration. Columns 3 and 4 indicate the number of nodes and edges, respectively. Column 5 represents the number of attack nodes.
Datasets | Duration Time | #N | #E | % of Attack Nodes
E3-Trace | 310h | 1.950M | 9.053M | 37890
E3-FiveDirections | 210h | 1.287M | 2.577M | 2956
E3-THEIA | 168h | 960.357K | 2.352M | 14781
E4-Trace | 8h | 3.035M | 13.586M | 39582
Benign Linux | 240h | 2.385M | 3.891M | 0
Benign Windows | 192h | 5.324M | 12.856M | 0
Avg | 234.667h | 2.490M | 7.386M | 0.669%

DataSet. ACTMINER is evaluated using three datasets: DARPA E3 [20], DARPA E4, and Simulated Environments. The DarpaE3 dataset is open-source, whereas the DarpaE4 dataset is not publicly available. Theia and FiveDirections are both from DARPA Engagement 3, and Trace is from DARPA Engagement 3 and 4. The data of Theia and Trace was collected from Linux, while the data of FiveDirections was collected from Windows 7. As shown in Table V, the duration encompasses both benign activities and attack-related activities within a dataset, wherein only a small portion of the total time frame involves actual attacks. A detailed description of both actual attack behaviors and benign operations can be found in Table IV.

As shown in Table IV, these attacks include the following scenarios: malicious file downloads and execution, information collection and exfiltration, Firefox memory attacks, backdoor extensions, and phishing emails. Furthermore, we employ Kellect [34] to extract benign data from the Windows platform, while we utilize SPADE [38] to acquire benign log data from Linux. All benign data includes various daily system operations, such as network browsing and file operations. We also use the OpTC dataset, which contains benign activities of 500 Windows hosts over seven days. These benign datasets contain billions of audit records on Windows, Linux, and FreeBSD.

Detector for Comparison. To evaluate ACTMINER, we use a graph alignment-based threat hunting system as the benchmark: POIROT. Why are we only comparing ACTMINER to

POIROT? There are already several threat hunting studies [2], [11], [25], [39], [40]. We broadly categorise them into the following two groups based on the techniques they use: machine learning-based approaches [11], [39], [40] and search-based approaches [2], [25]. Both DeepHunter [39] and ProvG-Searcher [11] conduct model training by constructing positive samples (attack graphs extracted manually from the provenance graph) and crafted negative samples, while MEGR-APT [40] utilizes a graph matching model to compute similarity scores between the query graph’s embedding vector and the embedding vectors of detected subgraphs. However, a fundamental limitation of these coarse-grained methods is their inability to consider the relationships between nodes in the attack chain (e.g., DeepHunter solely considers the relationships between IOCs rather than the comprehensive association information of all attack nodes). In other words, the results obtained from DeepHunter and ProvG-Searcher may not necessarily represent complete attack chains. Furthermore, our initial attempt to re-implement ProvG-Searcher revealed that the core component responsible for processing both provenance graphs and query graphs is not available as an open-source solution. Hence, we do not compare our work with them. ThreatRaptor [25] uses NLP technology to extract threat behaviour graphs from CTI reports and transforms the graphs into the TBQL query language using specific algorithms. Unlike ACTMINER, it stores audit log data in a database, allowing for the retrieval of individual attack behaviors. The single-point matching results obtained from the TBQL query statement do not correspond well to the contextual content of the attack process described in the CTI report, so we do not compare with it.

Here, we compare the performance of ACTMINER with POIROT [2], which is the most relevant in terms of level and methodology for our evaluation. POIROT aligns the query graph with the provenance graph based on node type and name regularization. Because the query graphs manually extracted by the authors of POIROT are unavailable, for fairness, our query graph is uniformly extracted by the Extractor [28] in our experimental setting. However, during the extraction process, we encounter graph disconnections, incomplete attack chains, etc. Therefore, we use the state-of-the-art method CRUcialG [41] to assist in obtaining the query graph.

A. RQ1: How effectively can ACTMINER detect the attacks, especially in terms of false alarms?

Table VI presents the performance of ACTMINER and POIROT on our evaluation datasets. ACTMINER consistently surpasses POIROT, achieving superior precision and recall values and a lower number of FN/FP. In comparison to POIROT, ACTMINER utilizes entity attributes and action semantics to generate a more generalizable query graph. This supplies concise, abstracted entity information for graph alignment, which in turn reduces false negatives and improves precision and recall. As the POIROT paper lacks an evaluation on the E4 dataset, we execute POIROT on E4 to obtain evaluation results. The findings demonstrate that ACTMINER significantly outperforms POIROT, as E4 attacks are more challenging to detect due to well-blended malicious activity. POIROT relies solely on name and simple types as the query graph’s features, which is insufficient. On the other hand, the entity attributes and action semantics employed by ACTMINER offer more semantic information for each node and are less sensitive to nodes with certain types of characteristics, making it harder for attack nodes to conceal themselves.

At first sight, ACTMINER shows incremental improvement in comparison to POIROT in terms of FN. This is attributed to ACTMINER considering the correlation between semantics of multi-hop nodes. Table VI shows the performance of ACTMINER and POIROT on all datasets. ACTMINER does not miss any malicious nodes, i.e., ACTMINER’s average false negatives (FNs) are 0, reduced by 61 compared to POIROT. On average, ACTMINER generates 1.91× fewer false positive nodes (∼389 nodes) than POIROT. ACTMINER demonstrates a notable improvement in precision over POIROT, achieving a 2.96% higher precision score, while also outperforming POIROT in terms of recall with a 1.94% increase.

B. RQ2: How robust is ACTMINER against adversarial attacks?

When an attack occurs, the attacker’s behavior pattern may be highly similar or even nearly identical to normal system behavior in a regular environment. From a technical perspective, attackers can mimic normal processes at the API call level or employ code injection techniques to make their behavior patterns indistinguishable from normal processes at the level of low-level system logs. This poses a challenge for provenance-based threat hunting. However, by considering richer contextual semantics, differentiation can still be achieved. Goyal et al. [42] devise three strategies for adversarial detection of anomaly detection systems based on graph-level granularity. Enlightened by their work, we design experiments to assess ACTMINER’s resilience against adversarial attacks.

To assess ACTMINER’s resilience against adversarial attacks, we perform adversarial mimicry attacks on provenance-based graph alignment threat hunting systems. Using the Darpa datasets as a reference, we modify and add attack steps in the provenance graph, primarily considering two scenarios.

Scenario I: The attacker inserts a large number of invalid attack paths into the actual attack chain, attempting to disrupt and mislead the detection algorithm. Test results show that POIROT suffered severe missed detections in this scenario, while ACTMINER successfully detected all real attack nodes, resisting the attacker’s disruptive attacks.

Scenario II: The attacker uses normal programs to perform operations similar to attack patterns, attempting to introduce false positives. Tests found that POIROT exhibits varying degrees of false positives, marking normal processes as attack nodes. Our system, on the other hand, effectively distinguishes the true intent of normal programs through behavior association and intent analysis, avoiding false positives. Based on the above two scenarios, we designed the following three strategies:
1) Strategy I: Insert additional unrelated benign process read/write file flows between process and file read/write operations, as shown in Figure 5;

2) Strategy II: Insert additional benign processes between process and process execution operations, as shown in Figure 6;
3) Strategy III: Insert additional processes communicating with sockets between processes and socket communication operations, as shown in Figure 7.

TABLE VI: Performance of ACTMINER and POIROT. FN denotes false negatives, which occur when a genuine attack pattern is incorrectly classified as benign. Conversely, FP denotes false positives, where a benign event or data point is mistakenly identified as an attack. Prec. denotes precision.
ATTACK | POIROT FN/FP | POIROT Recall | POIROT Prec. | ActMiner without EST FN/FP | ActMiner without EST Recall | ActMiner without EST Prec. | ActMiner FN/FP | ActMiner Recall | ActMiner Prec.
Darpa4 Trace case1 | 19/89 | 99.79 | 99.00 | 21/43 | 99.76 | 99.51 | 0/79 | 100.00 | 99.11
Darpa4 Trace case2 | 58/1005 | 99.62 | 93.89 | 37/532 | 99.76 | 96.62 | 0/781 | 100.00 | 95.18
Darpa4 Trace case3 | 77/82 | 94.58 | 94.24 | 63/41 | 95.51 | 97.03 | 0/65 | 100.00 | 95.39
Darpa4 Trace case4 | 96/1352 | 98.51 | 88.61 | 74/775 | 99.30 | 93.16 | 0/912 | 100.00 | 92.15
Darpa3 FiveDir case1 | 18/254 | 98.52 | 82.49 | 7/88 | 99.42 | 93.19 | 0/142 | 100.00 | 89.40
Darpa3 FiveDir case2 | 49/49 | 96.36 | 96.36 | 32/30 | 97.59 | 97.74 | 0/37 | 100.00 | 97.32
Darpa3 Trace case1 | 77/1146 | 99.47 | 92.62 | 54/412 | 99.62 | 97.20 | 0/613 | 100.00 | 95.91
Darpa3 Trace case2 | 59/1952 | 99.67 | 90.20 | 35/742 | 99.80 | 95.95 | 0/996 | 100.00 | 94.80
Darpa3 Trace case3 | 78/233 | 94.72 | 85.72 | 62/103 | 95.76 | 93.14 | 0/154 | 100.00 | 90.08
Darpa3 Theia | 85/1263 | 99.37 | 91.38 | 63/578 | 99.53 | 95.85 | 0/746 | 100.00 | 94.72
Avg | 61/742 | 98.06 | 91.45 | 45/334 | 98.60 | 95.94 | 0/452 | 100.00 | 94.41
† EST: Equivalent Semantic Transfer
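
The Avg row of Table VI can be re-derived from the per-scenario FN/FP counts. The short Python sketch below does this for the POIROT and ActMiner columns; the counts are copied from the table in row order, while the helper name `mean_fn_fp` and the variable names are illustrative choices rather than anything defined in the paper.

```python
# (FN, FP) per scenario, copied from Table VI in row order.
poirot = [(19, 89), (58, 1005), (77, 82), (96, 1352), (18, 254),
          (49, 49), (77, 1146), (59, 1952), (78, 233), (85, 1263)]
actminer = [(0, 79), (0, 781), (0, 65), (0, 912), (0, 142),
            (0, 37), (0, 613), (0, 996), (0, 154), (0, 746)]


def mean_fn_fp(rows):
    """Average the FN and FP counts separately over all scenarios."""
    n = len(rows)
    mean_fn = sum(fn for fn, _ in rows) / n
    mean_fp = sum(fp for _, fp in rows) / n
    return mean_fn, mean_fp


poirot_avg = mean_fn_fp(poirot)
actminer_avg = mean_fn_fp(actminer)
print("POIROT   avg FN/FP:", poirot_avg)
print("ActMiner avg FN/FP:", actminer_avg)
```

The same helper can be applied to the "ActMiner without EST" column to compare the effect of the equivalent semantic transfer step on false negatives.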
[Figure: a process-to-file read/write edge (P → F) padded with inserted fork/clone and read/write steps.]
Fig. 5: Strategy I. Read/Write operations Insert.

[Figure: a process-to-process execute edge padded with inserted fork/clone and execute steps, with an example chain involving Firefox, /usr/lib, SubProcess and Socket.]
Fig. 6: Strategy II. Execute operations Insert.

[Figure: a process-to-socket connect/sendto/recvfrom edge (P → S) padded with inserted fork/clone, execute and connect/sendto/recvfrom steps.]
Fig. 7: Strategy III. Connect/Send/Recv operations Insert.

The experimental results in Table VII show the effect of incrementally adding nodes and edges, by percentage, to the attack graph using the three aforementioned strategies on the Darpa 4 dataset across all attacks. Both nodes and edges are incrementally added at a constant rate.

In terms of recall, our system consistently maintains a high average as the proportion of added nodes increases, with a decreasing trend approaching approximately 1%. This characteristic is determined by the design of our system. In contrast, POIROT exhibits a diminishing recall as the proportion increases, with the rate of decline accelerating at higher proportions. This can be observed from the AVG recall values, where the descent shifts from an initial 1.5% to around 4% towards the latter end of the scale. Regarding precision, as the proportion increases, both our system and POIROT reach their respective lowest points of 89.33% and 69.75%. Through further analysis, we conclude that our system is capable of detecting evasion attacks based on the equivalent semantic approach we have devised.

Why can’t POIROT defend against adversarial attacks? The attacker inserts a large number of irrelevant nodes and connecting edges between the real attack steps, increasing the hop count of a single attack stage to more than three hops. This causes POIROT to fail to associate different stages as a single attack event, resulting in missed detection. POIROT decomposes the graph through bounded branches (γ) and depth (β), separating graph neighborhoods from each other. The authors of POIROT consider the optimal depth to be three. Therefore, even if suspicious nodes have a common malicious ancestor, if the hop distance of that behavior from the earliest attack node exceeds β hops, the anomaly score obtained by that behavior will be lower than the score of the previous malicious

TABLE VII: Performance comparison between ACTMINER and POIROT across four scenarios (i.e., four query graphs) from the E4-Trace dataset, where varying proportions of edges and nodes are added to the provenance graph. Each addition consists of an equal number of three types of nodes/edges. The percentage added is the number of nodes added to the provenance graph relative to the number of attack nodes.

Added        0%                                      25%                                     50%
System       POIROT              ACTMINER            POIROT              ACTMINER            POIROT              ACTMINER
Query Graph  Precision Recall    Precision Recall    Precision Recall    Precision Recall    Precision Recall    Precision Recall
case1        99.00     99.79     100.00    100.00    98.34     99.23     100.00    100.00    96.24     97.35     99.26     100.00
case2        93.98     99.62     97.37     100.00    93.28     96.55     97.58     100.00    92.26     94.98     96.33     100.00
case3        94.24     94.58     98.32     100.00    93.21     94.52     98.42     100.00    91.35     92.25     96.47     100.00
case4        88.61     98.51     95.08     100.00    87.25     94.31     95.52     100.00    83.21     90.44     94.35     100.00
AVG          93.96     98.13     97.69     100.00    93.02     96.13     97.88     100.00    90.77     93.76     96.60     100.00

Added        75%                                     100%                                    125%
System       POIROT              ACTMINER            POIROT              ACTMINER            POIROT              ACTMINER
Query Graph  Precision Recall    Precision Recall    Precision Recall    Precision Recall    Precision Recall    Precision Recall
case1        93.55     95.26     97.56     100.00    91.41     94.32     96.46     99.98     89.74     90.25     95.84     99.98
case2        90.04     93.52     94.32     100.00    87.43     90.78     94.16     100.00    82.36     87.32     92.52     99.98
case3        89.21     91.56     95.36     100.00    88.47     90.24     95.13     99.97     86.25     84.33     94.78     99.97
case4        80.32     87.03     93.42     100.00    77.22     85.41     92.74     100.00    74.32     81.36     92.04     99.97
AVG          88.28     91.84     95.17     100.00    86.13     90.19     94.62     99.99     83.17     85.82     93.80     99.98

Added        150%                                    175%                                    200%
System       POIROT              ACTMINER            POIROT              ACTMINER            POIROT              ACTMINER
Query Graph  Precision Recall    Precision Recall    Precision Recall    Precision Recall    Precision Recall    Precision Recall
case1        87.21     88.41     92.36     99.98     82.34     81.64     91.50     99.96     77.28     76.88     90.03     98.74
case2        77.23     83.42     89.64     99.47     72.14     80.21     88.75     99.48     61.45     76.33     87.25     99.52
case3        83.52     81.02     93.74     99.86     79.42     77.31     92.36     99.23     75.04     73.98     91.11     99.13
case4        70.30     79.52     91.14     99.97     68.23     75.21     90.02     99.95     65.23     72.06     88.94     99.32
AVG          79.57     83.10     91.72     99.82     75.53     78.59     90.66     99.66     69.75     74.81     89.33     99.17
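
The evasion that the surrounding discussion attributes to POIROT can be illustrated with a small sketch. It is not POIROT's implementation: it only models a depth bound β = 3 with a breadth-first search. The graph edges, the helper names (`hops`, `reachable_within`, `insert_padding`), and the filler nodes are assumptions made for illustration, and the node names are borrowed from the labels in Fig. 6.

```python
from collections import deque

def hops(graph, src, dst):
    """Breadth-first search: hop distance from src to dst, or None if unreachable."""
    seen, queue = {src}, deque([(src, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == dst:
            return dist
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None

def reachable_within(graph, src, dst, beta):
    """True if dst lies within beta hops of src (a stand-in for a bounded-depth search)."""
    dist = hops(graph, src, dst)
    return dist is not None and dist <= beta

def insert_padding(graph, src, dst, k):
    """Copy graph and replace the edge src -> dst by a chain of k irrelevant filler nodes."""
    padded = {node: list(succ) for node, succ in graph.items()}
    padded[src] = [n for n in padded[src] if n != dst]
    prev = src
    for i in range(k):
        filler = f"filler_{i}"          # hypothetical benign node
        padded[prev].append(filler)
        padded[filler] = []
        prev = filler
    padded[prev].append(dst)
    return padded

BETA = 3  # depth bound assumed in this sketch

# Hypothetical attack chain: three hops from the first to the last attack node.
attack = {
    "MainProcess": ["Firefox"],
    "Firefox": ["/tmp/ex.txt"],
    "/tmp/ex.txt": ["SubProcess"],
    "SubProcess": [],
}

padded = insert_padding(attack, "Firefox", "/tmp/ex.txt", 3)

print(reachable_within(attack, "MainProcess", "SubProcess", BETA))  # original chain
print(reachable_within(padded, "MainProcess", "SubProcess", BETA))  # padded chain
```

In this toy graph the original chain stays within the bound, while three filler nodes push the last attack node to six hops, so a search limited to β hops no longer connects the two stages.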

behavior, and that behavior may be mistaken as benign. The attacker can also insert a small number of irrelevant nodes between attack stages, keeping the hop count of the attack chain for each stage within POIROT’s detection threshold of three. Although the inserted nodes themselves are harmless, the edges connecting them to the real attack nodes cause POIROT to erroneously judge them as part of the attack, resulting in false positives. Through further investigation, we analyze the reasons behind the superior performance of ACTMINER: POIROT uses the query graph directly for regular matching of node types and their attribute names, and selects appropriate nodes based on path scores, neglecting temporal and causal relationships between nodes, which results in many false alarms.

The robustness of ACTMINER results from two factors. First, during the construction of the suspicious semantic tree, attack intent information is confirmed. Inserted behaviors such as fork/clone do not influence the score of a given attack path, because the multiple-hops strategy treats these nodes as carrying the same suspicious semantics. Second, ACTMINER does not raise alarms unless nodes exceeding the attack sequence length are detected. This ensures that our system does not generate excessive false positives when attackers attempt multi-point blasting to disrupt the hunting system, while also enhancing its resilience against Scenario I.

C. RQ3: How important are the components we design for assisting threat hunting?

To demonstrate the impact of our system components, particularly the query graph processing (QGP) and equivalent semantic transfer (EST) components, on system performance and efficiency, we conduct ablation experiments on them separately. The QGP step generates the required query graphs and facilitates the subsequent operations of our system. The EST step, on the other hand, plays a crucial role in the threat hunting module.

1) Efficacy of Equivalent Semantic Transfer. We test the performance of ACTMINER without EST using the same settings as in RQ1. As demonstrated in Table VI, without the EST component the number of FN increases by 1.4%, and the number of false positives is 83.96% of POIROT’s, which is an increase compared to the full ACTMINER. We attribute this to the following reason: the process of equivalent semantic transfer also inherently involves the causal logic and temporal relationships of the attack to some extent. Therefore, removing this component may introduce a small portion of false positives.

2) Effectiveness of Query Graph Processing. As part of our study, we directly utilize the query graphs extracted by EXTRACTOR and CRUcialG on the same provenance graphs used in RQ1 for threat hunting. The sizes of the query graphs before and after QGP for all scenarios are presented in Table VIII, along with the results and the corresponding analysis. In the QGP process, we use attribute abstraction for nodes, so the number of nodes is reduced. Our analysis explores the impact of incorporating QGP on the system’s accuracy and timeliness. We use the Trace dataset from E3 to study the effect of QGP. The result, presented in Figure 8, shows a significant reduction in time consumption when QGP is applied. This improvement results from QGP further merging the extracted nodes. Without query graph processing, hunting these nodes efficiently is challenging. However, incorporating QGP allows the system
[Figure: FNR and FPR for case1 through case4 at added percentages of 0%, 25%, 50%, 75%, 100%, 125%, 150%, 175%, and 200%.]

Fig. 8: Ratio of running time between GQ and OGQ. OGQ denotes the original query graph, and GQ denotes the query graphs we use in ACTMINER. (Y-axis: percentage of runtime; x-axis: E3 dataset scenarios Trace-1, Trace-2, Trace-3, Theia-4, FD-1, FD-2.)

TABLE VIII: The edges and nodes in the query graph. OV and OE denote the number of nodes and edges in the graph before QGP, and V and E denote the number of nodes and edges after processing, respectively.

Dataset   E4    E4    E4    E4    E3    E3    E3    E3    E3    E3
Scenario  T-1   T-2   T-3   T-4   Tr-1  Tr-2  Tr-3  Th-1  W-1   W-2
OV        9     13    16    9     14    10    15    12    13    10
OE        8     13    17    8     16    11    14    14    14    11
V         6     4     6     8     6     8     5     6     5     7
E         6     3     5     7     5     9     4     5     4     6

TABLE IX: Overhead of ACTMINER and POIROT.

                  POIROT                        ACTMINER
Attack            Ti.(s)    Mem.(MB)  CPU(%)    Ti.(s)    Mem.(MB)  CPU(%)
E4-Trace case1    2931      65.2      28.3      5         86.6      25.7
E4-Trace case2    2311      95.6      28.6      82        118.1     25.7
E4-Trace case3    102       27.8      26.4      28        35.8      26.0
E4-Trace case4    4284      69.4      28.3      69        105       25.7
E3-FiveDir case1  5         53.5      26        26        46.2      21.4
E3-FiveDir case2  73        31.0      29        6         50.2      23.7
E3-Trace case1    7814      178.1     26.5      15927     112       27.2
E3-Trace case2    39701     348.8     25.5      9982      155.4     25.4
E3-Trace case3    62        27.8      25        149       55.9      25.5
E3-Theia case1    3416      280.3     26.8      9629      105.3     25.0
Avg               6070      117.8     27.1      3590      87.1      25.1

Fig. 9: Average memory consumption in different time states. (Y-axis: memory usage (MB); x-axis: time (h); series: POIROT and ACTMINER.)
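
As a quick arithmetic check of the overhead comparison, the sketch below recomputes the average runtimes from the Ti.(s) columns of Table IX and takes their ratio. The variable and function names are illustrative assumptions, and the numbers are copied from the table.

```python
# Per-case runtimes in seconds, copied from the Ti.(s) columns of Table IX.
cases = [
    "E4-Trace case1", "E4-Trace case2", "E4-Trace case3", "E4-Trace case4",
    "E3-FiveDir case1", "E3-FiveDir case2",
    "E3-Trace case1", "E3-Trace case2", "E3-Trace case3",
    "E3-Theia case1",
]
poirot_s = [2931, 2311, 102, 4284, 5, 73, 7814, 39701, 62, 3416]
actminer_s = [5, 82, 28, 69, 26, 6, 15927, 9982, 149, 9629]

def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)

poirot_avg = mean(poirot_s)
actminer_avg = mean(actminer_s)
ratio = poirot_avg / actminer_avg

# Cases in which ACTMINER needs more time than POIROT.
slower = [c for c, p, a in zip(cases, poirot_s, actminer_s) if a > p]

print(f"POIROT avg: {poirot_avg:.1f} s, ACTMINER avg: {actminer_avg:.1f} s")
print(f"ratio: {ratio:.2f}")
print("ACTMINER slower in:", slower)
```

With these numbers the ratio rounds to 1.69, and ACTMINER is slower than POIROT in four of the ten cases, including E3-Trace case1.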

to identify these nodes more effectively, leading to improved threat hunting performance.

D. RQ4: How efficient is ACTMINER compared with the SOTA in terms of runtime overhead?

We evaluate how efficiently ACTMINER can detect APT attacks in a timely manner by measuring its running time. As shown in Table IX, in terms of overall time overhead, POIROT consumes 1.69 times as much time as ACTMINER. This suggests that ACTMINER has better timeliness than POIROT. However, in some cases, especially E3-Trace case1, ACTMINER takes longer than POIROT. Through further investigation, we have discovered a key cause of this discrepancy. ACTMINER employs a mechanism to offload data instances spanning more than 6 hours from memory to the database. While beneficial for memory management, this storage approach incurs extra overhead in situations that require repeated interactions between the database and memory, resulting in higher time expenditure compared to POIROT.

Furthermore, the incremental hunting module allows our system to take data in segments for hunting, rather than having to take all the data at once for each hunt. Specifically, enterprises only need to feed the newly added data to the system for each batch (e.g., the data from each day), instead of combining it with data from previous days and rescanning. At the same time, we have designed a storage mechanism that stores branches in a database. When an edge interacting with a node of a stored branch appears again, the branch is retrieved from the database. Instead of loading all historical data for each new hunt, only the newly collected information needs to be processed and indexed. As illustrated in Figure 9, POIROT’s mean memory requirements gradually increase as more data is collected from hosts over time. In contrast, our incremental hunting module keeps memory consumption growing at a steady rate, which is primarily determined by our storage mechanism.

E. RQ5: How robust is ACTMINER on benign datasets?

To comprehensively evaluate the robustness of ACTMINER, we conducted experiments on a sizeable benign dataset, introduced in Section V. This dataset was collected from multiple users performing typical non-malicious actions such as downloading and uploading files, taking backups, browsing the web, and installing or uninstalling software. Additionally, we included the OpTC dataset, which involves benign activities like website browsing, checking emails, and SSH log-ins. We randomly selected three hosts from the OpTC dataset and tested their benign data collected over a period of one day, together with five datasets acquired from our own laboratory.

We applied ACTMINER to all benign datasets and searched for the query graphs extracted from the TC reports. Although these logs were attack-free, they shared many nodes and events with our query graphs, such as critical system files and processes related to email clients and text editing tools. As illustrated in Table X, despite these similarities, ACTMINER generated zero false alerts throughout the experiment.

VI. DISCUSSION & FUTURE WORK

The accuracy of ACTMINER. Because the accuracy of threat hunting relies on the quality of query graphs, we utilized EXTRACTOR to automate the extraction of CTI
reports provided by DARPA. We achieved results superior to the SOTA on well-known APT attack datasets (i.e., E3 and E4). However, limited by scarce APT attack samples, we were unable to conduct large-scale experiments and analyses. In practice, accurately extracting query graphs from numerous CTI reports and automatically generating diverse and rational ones can improve threat hunting accuracy in the future.

TABLE X: Experimental results on different benign datasets.

DataSet   Test Duration  Platform          Hosts  FP
OpTC      01d00h00m      Windows           3      0
Our Lab   01d08h13m      Ubuntu 12.04 x64  5      0

The interpretability of ACTMINER. The outputs of ACTMINER inherently provide interpretability, as analysts can derive attack-related information from the query graph reports. However, due to the dynamic nature of APT attacks, the query graphs and hunting results are not entirely consistent. Therefore, in the future, we can utilize generative artificial intelligence (e.g., LLMs) to automatically transform the attack chains obtained from threat hunting into CTI reports that analysts can clearly understand, aiding in better response.

The robustness of ACTMINER. The use of graph processing approaches may lead to the loss of fine-grained details. To validate the robustness of ACTMINER, we conducted cross-validation by employing query graphs extracted from diverse CTI reports against distinct provenance graphs. The experimental findings consistently demonstrate that our system avoids generating false alarms across all tested origin query graphs. The identified suspicious nodes do not meet the threshold for triggering an alert.

VII. RELATED WORK

A. Provenance Graph-based Threat Hunting

The provenance graph contains causal relationships between system events, and it is a data structure that can be effectively utilized for cyber threat hunting. DeepHunter [39] utilizes Neural Tensor Networks (NTN) to judge the subgraph relationship through graph embedding. Due to the need for many-to-many traversal comparisons between subgraphs, the efficiency of DeepHunter decreases as the size of the provenance graph increases. ProvG-Searcher [11] establishes coarse-grained threat hunting by employing a graph representation learning method on the subgraph, aiming to enhance hunting efficiency. ThreatRaptor [25] extracts structured threat behaviors from OSCTI and automatically synthesizes query statements to search for malicious activities. Unlike the above studies, ACTMINER, inspired by POIROT [2], tries to align attack graphs extracted from CTI reports with provenance graphs from system logs to find the complete attack chain. Moreover, ACTMINER is able to significantly reduce false positives, false negatives, and system overhead through an optimized graph alignment algorithm.

B. Graph Matching Algorithms

Graph pattern matching refers to the problem of finding similarities between a small query graph and a large target graph, with widespread applications (e.g., database analysis, search engines, software plagiarism detection, and social networks). Based on whether the matching results are completely consistent, algorithms can be divided into two categories: graph isomorphism matching and graph approximate matching. The main idea of graph isomorphism matching algorithms is to iteratively map nodes from the query graph to the target graph one by one. Ullmann [43] proposes a backtracking-based algorithm that enumerates all subgraphs satisfying the matching requirements using a depth-first search approach. However, as the scale of the graph gradually increases, the enumeration range also expands, resulting in low algorithm efficiency. To address this issue, Cordella et al. [44] propose the VF2 algorithm, which improves algorithm efficiency by incorporating the verification order of query nodes. However, in practical applications, its time complexity is superlinear, and the need for secondary filtering consumes a significant amount of time. Shasha et al. [45] introduce the GraphGrep algorithm, which achieves fast matching by encoding node semantic information. Several other studies [46]–[48] use graph mining techniques to find subgraphs from databases and then employ filtering and optimization strategies to prune incorrect nodes. However, due to the diversity and complexity of APT techniques, we cannot always assume that the nodes and edges of the query graph can be fully mapped to the target graph. Graph approximate matching algorithms often rely on heuristic methods to identify important nodes and then gradually expand to neighboring nodes [2], [49]–[51]. Tian et al. [49] use a graph distance model to measure the similarity between graphs. He et al. [52] introduce an index-based algorithm to support subgraph queries and similarity queries. Several other works [51], [53] consider the shape and edge attributes of the query graph. However, the aforementioned research overlooks adversarial knowledge. Milajerdi et al. [2] propose POIROT, which is similar to our work and utilizes node attributes and information flows between nodes for approximate matching. However, if an attacker intentionally takes a detour to achieve their goal, it may result in missed detections. Furthermore, if the similarity score exceeds the threshold, POIROT stops hunting, and the obtained attack subgraph may not represent the optimal attack behavior. Therefore, we develop a new matching technique to address these challenges.

C. Incremental Graph Computation

In real-world scenarios, graphs are typically large in scale and frequently updated over time. When graphs are updated, traditional batch processing methods require starting the computation from scratch, which is extremely time-consuming. In contrast to traditional batch processing algorithms, incremental matching only analyzes and matches the updated portion, utilizing previous matching results to minimize redundant computations. Fan et al. propose the IncSIMMatch [54] algorithm, which effectively reduces redundant computations by creating an index for the pattern graph. They then introduce an incremental computation method, IncISO [55], where only the set of nodes within d hops of the updated node d in the data graph needs to be rematched as the affected region

when nodes and edges change in the graph. Subsequently, researchers [56]–[62] introduced the idea of incremental computation and incorporated query search optimization strategies. For example, Choudhury et al. [57] construct the pattern graph as a binary tree and decompose it for storage in tree nodes, with data updates performed by searching leaf nodes. However, storing a large number of indexes consumes memory. Sun et al. [63] propose an exploration-based approximate matching technique, but if the initially selected node is inappropriate, a large number of useless intermediate results will be generated. ACTMINER constructs a tree structure to incrementally update the alignment results and introduces a forgetting rate to maintain stable memory overhead.

VIII. CONCLUSION

We propose ACTMINER, a system that enables the detection of complete APT attack chains by applying causality tracking and increment aligning. It overcomes the issues of low precision, low recall, and low efficiency that existed in previous work. Experimental results show that ACTMINER exhibits better detection rates and resilience against adversarial attacks compared to the SOTA.

REFERENCES

[1] A. Bates et al., “Trustworthy Whole-System provenance for the linux kernel,” in USENIX, 2015, pp. 319–334.
[2] S. M. Milajerdi et al., “Poirot: Aligning attack behavior with kernel audit records for cyber threat hunting,” in CCS, 2019, pp. 1795–1812.
[3] S. M. Milajerdi et al., “Holmes: real-time apt detection through correlation of suspicious information flows,” in S&P. IEEE, 2019, pp. 1137–1152.
[4] M. N. Hossain et al., “Combating dependence explosion in forensic analysis using alternative tag propagation semantics,” in S&P. IEEE, 2020, pp. 1139–1155.
[5] Y. Xie et al., “Pagoda: A hybrid approach to enable efficient real-time provenance based intrusion detection in big data environments,” TDSC, 2018.
[6] M. Bishop et al., Introduction to computer security. Addison-Wesley Boston, 2005, vol. 50.
[7] C. Kruegel et al., Intrusion detection and correlation: challenges and solutions. Springer Science & Business Media, 2004, vol. 14.
[8] C. Xiong et al., “Conan: A practical real-time apt detection system with high accuracy and efficiency,” TDSC, 2020.
[9] T. Zhu et al., “Aptshield: A stable, efficient and real-time apt detection system for linux hosts,” TDSC, 2023.
[10] M. N. Hossain et al., “SLEUTH: Real-time attack scenario reconstruction from COTS audit data,” in USENIX, 2017.
[11] E. Altinisik et al., “Provg-searcher: A graph representation learning approach for efficient provenance graph search,” in CCS, 2023.
[12] J. Zeng et al., “Shadewatcher: Recommendation-guided cyber threat analysis using system audit records,” in S&P. IEEE, 2022.
[13] S. Wang et al., “Threatrace: Detecting and tracing host-based threats in node level through provenance graph learning,” TIFS, vol. 17, pp. 3972–3987, 2022.
[14] X. Han et al., “Unicorn: Runtime provenance-based detector for advanced persistent threats,” pp. 1–18, 2020.
[15] E. Manzoor et al., “Fast memory-efficient anomaly detection in streaming heterogeneous graphs,” in Dblp, 2016, pp. 1035–1044.
[16] M. U. Rehman et al., “Flash: A comprehensive approach to intrusion detection via provenance graph representation learning,” in S&P. IEEE, 2024, pp. 139–139.
[17] F. Yang et al., “PROGRAPHER: An anomaly detection system based on provenance graph embedding,” in USENIX, 2023, pp. 4355–4372.
[18] “Threat report,” https://www.crowdstrike.com/global-threat-report/.
[19] “Lateral movement,” https://www.crowdstrike.com/cybersecurity-101/lateral-movement/.
[20] A. D. Keromytis, “Transparent computing engagement 3 data release,” 2018, https://github.com/darpa-i2o/Transparent-Computing/blob/master/README-E3.md.
[21] “Darpa transparent computing engagement,” 2020, https://www.darpa.mil/program/transparent-computing.
[22] “mandiant/openioc 1.1,” https://github.com/mandiant/OpenIOC/.
[23] “Introduction to stix,” https://oasis-open.github.io/cti-documentation/stix/intro.html.
[24] “Misp - open source threat intelligence platform & open standards for threat information sharing,” https://www.misp-project.org.
[25] P. Gao et al., “Enabling efficient cyber threat hunting with cyber threat intelligence,” in ICDE. IEEE, 2021, pp. 193–204.
[26] G. Husari et al., “Ttpdrill: Automatic and accurate extraction of threat actions from unstructured text of cti sources,” in ACSAC, 2017.
[27] X. Liao et al., “Acing the ioc game: Toward automatic discovery and analysis of open-source cyber threat intelligence,” in CCS, 2016.
[28] K. Satvat et al., “Extractor: Extracting attack behavior from threat reports,” in EuroS&P. IEEE, 2021, pp. 598–615.
[29] W. U. Hassan et al., “Nodoze: Combatting threat alert fatigue with automated provenance triage,” in NDSS, 2019.
[30] W. U. Hassan et al., “Tactical provenance analysis for endpoint detection and response systems,” in S&P. IEEE, 2020, pp. 1172–1189.
[31] Q. Wang et al., “You are what you do: Hunting stealthy malware via data provenance analysis,” in NDSS, 2020.
[32] M. A. Inam et al., “Sok: History is a vast early warning system: Auditing the provenance of system intrusions,” in S&P. IEEE, 2023, pp. 2620–2638.
[33] R. Sekar et al., “eaudit: A fast, scalable and deployable audit data collection system,” in S&P. IEEE, 2023, pp. 87–87.
[34] T. Chen et al., “Kellect: a kernel-based efficient and lossless event log collector,” arXiv preprint arXiv:2207.11530, 2022.
[35] “Windows event tracing.” [Online]. Available: https://docs.microsoft.com/en-us/windows/desktop/ETW/event-tracing-portal
[36] T. Zhu et al., “General, efficient, and real-time data compaction strategy for apt forensic analysis,” TIFS, vol. 16, pp. 3312–3325, 2021.
[37] “Darpa3-cdm.” [Online]. Available: https://drive.google.com/drive/folders/1gwm2gAlKHQnFvETgPA8kJXLLm3L-Z3H1
[38] A. Gehani et al., “Spade: Support for provenance auditing in distributed environments,” in Middleware 2012: ACM/IFIP/USENIX 13th International Middleware Conference, Montreal, QC, Canada, December 3-7, 2012. Proceedings 13. Springer, 2012, pp. 101–120.
[39] R. Wei et al., “Deephunter: A graph neural network based approach for robust cyber threat hunting,” in SecureComm. Springer, 2021, pp. 3–24.
[40] A. Aly, S. Iqbal, A. Youssef, and E. Mansour, “Megr-apt: A memory-efficient apt hunting system based on attack representation learning,” IEEE Transactions on Information Forensics and Security, vol. 19, pp. 5257–5271, 2024.
[41] W. Cheng, T. Zhu, T. Chen, Q. Yuan, J. Ying, H. Li, C. Xiong, M. Li, M. Lv, and Y. Chen, “Crucialg: Reconstruct integrated attack scenario graphs by cyber threat intelligence reports,” arXiv preprint arXiv:2410.11209, 2024.
[42] A. Goyal et al., “Sometimes, you aren’t what you do: Mimicry attacks against provenance graph host intrusion detection systems,” in NDSS, 2023.
[43] J. R. Ullmann, “An algorithm for subgraph isomorphism,” JACM, vol. 23, no. 1, pp. 31–42, 1976.
[44] L. P. Cordella et al., “A (sub) graph isomorphism algorithm for matching large graphs,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 26, no. 10, pp. 1367–1372, 2004.
[45] D. Shasha et al., “Algorithmics and applications of tree and graph searching,” in PODS, 2002, pp. 39–52.
[46] J. Cheng et al., “Fg-index: towards verification-free query processing on graph databases,” in SIGMOD, 2007, pp. 857–872.
[47] P. Zhao et al., “Graph indexing: tree + delta <= graph,” in VLDB. Citeseer, 2007, pp. 938–949.
[48] X. Yan et al., “Graph indexing: a frequent structure-based approach,” in SIGMOD, 2004, pp. 335–346.
[49] Y. Tian et al., “Saga: a subgraph matching tool for biological graphs,” Bioinformatics, vol. 23, no. 2, pp. 232–239, 2007.
[50] Y. Tian et al., “Tale: A tool for approximate large graph matching,” in ICDE. IEEE, 2008, pp. 963–972.
[51] H. Tong et al., “Fast best-effort pattern matching in large attributed graphs,” in Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, 2007, pp. 737–746.
[52] H. He et al., “Closure-tree: An index structure for graph queries,” in ICDE. IEEE, 2006, pp. 38–38.

[53] D. J. Pohly et al., “Hi-fi: collecting high-fidelity whole-system provenance,” in ACSAC, 2012, pp. 259–268.
[54] W. Fan et al., “Incremental graph pattern matching,” TODS, vol. 38, no. 3, pp. 1–47, 2013.
[55] W. Fan et al., “Incremental graph computations: Doable and undoable,” in SIGMOD, 2017, pp. 155–169.
[56] J.-S. Kao et al., “Distributed incremental pattern matching on streaming graphs,” in HPGP, 2016, pp. 43–50.
[57] S. Choudhury et al., “A selectivity based approach to continuous pattern detection in streaming graphs,” arXiv preprint arXiv:1503.00849, 2015.
[58] C. Kankanamge et al., “Graphflow: An active graph database,” in SIGMOD, 2017, pp. 1695–1698.
[59] M. Idris et al., “The dynamic yannakakis algorithm: Compact and efficient query processing under updates,” in SIGMOD, 2017.
[60] M. Idris et al., “General dynamic yannakakis: conjunctive queries with theta joins under updates,” The VLDB Journal, vol. 29, no. 2-3, pp. 619–653, 2020.
[61] K. Kim et al., “Turboflux: A fast continuous subgraph matching system for streaming graph data,” in SIGMOD, 2018, pp. 411–426.
[62] S. Min et al., “Symmetric continuous subgraph matching with bidirectional dynamic programming,” arXiv preprint arXiv:2104.00886, 2021.
[63] X. Sun et al., “An in-depth study of continuous subgraph matching,” Proceedings of the VLDB Endowment, pp. 1403–1416, 2022.
