BreachSeek: A Multi-Agent Automated Penetration Tester

Majed Bamardouf 1*, Alaqsa Akbar 1*
majedTB12@gmail.com, alaqsaakbar@hotmail.com

1 King Fahd University of Petroleum and Minerals (KFUPM)
* Equal contribution
the results, providing a robust solution to the ever-evolving landscape of cybersecurity threats.

One of the key technical innovations in BreachSeek is the use of multiple AI agents, each with a distinct focus, to manage the complexity and breadth of tasks involved in penetration testing. This approach keeps the system from exhausting the context window, a common limitation in LLMs, and allows for the separation of concerns. Each agent is tasked with a specific aspect of the testing process, ensuring a high level of specialization and accuracy. This design principle not only optimizes the performance of individual agents but also contributes to the overall efficiency and effectiveness of the platform.

The platform’s scalability further enhances its utility, enabling it to be deployed in a wide range of environments, from small to large-scale networks. By deploying multiple agents in different containers, BreachSeek can efficiently manage large volumes of data and complex network architectures, making it adaptable to various cybersecurity needs. This scalability is particularly beneficial for organizations that operate in sectors with high security demands, such as finance, healthcare, and government, where the ability to rapidly and accurately identify vulnerabilities is crucial.

In summary, BreachSeek represents a significant advancement in the field of automated cybersecurity penetration testing. By combining the power of AI-driven agents with the flexibility and scalability required in modern network environments, BreachSeek offers a comprehensive solution that addresses the limitations of traditional penetration testing methods. As cyber threats continue to evolve, tools like BreachSeek will become increasingly vital in ensuring the security and resilience of digital infrastructure.

2 Literature Review

Recent advancements in large language models (LLMs) have significantly impacted the field of cybersecurity, particularly in the automation of penetration testing. Traditionally, penetration testing has been a manual and labor-intensive process, requiring significant expertise and time. However, the introduction of tools like PentestGPT marks a turning point in how these tasks can be automated. PentestGPT leverages the extensive knowledge embedded in LLMs to perform tasks traditionally handled by human penetration testers. This tool has been evaluated using a benchmark created from popular platforms like HackTheBox and VulnHub, which includes 182 sub-tasks aligned with the OWASP Top 10 vulnerabilities. The results indicate a remarkable improvement in task completion rates, with PentestGPT outperforming previous models like GPT-3.5 and GPT-4 by significant margins. This underscores its effectiveness in maintaining context throughout complex testing scenarios, a critical challenge in the application of LLMs to penetration testing tasks [1].

In a broader context, the use of generative AI in penetration testing offers both opportunities and challenges. On one hand, generative models can quickly identify vulnerabilities and generate test scenarios that might be missed by human testers. For example, tools like Mayhem utilize techniques such as fuzzing and symbolic execution to uncover vulnerabilities in a fraction of the time it would take a human tester. These models also bring a level of creativity to the process, simulating novel attack vectors that enhance the robustness of penetration testing. On the other hand, challenges remain, particularly regarding the models’ ability to fully grasp the broader context of testing scenarios. This can lead to incomplete or inaccurate results, highlighting the need for further refinement of these models to ensure they meet the specific needs of different organizations [2]. BreachSeek addresses some of these challenges by employing multiple AI agents to manage context windows, ensuring a more comprehensive understanding throughout the penetration testing process. Unlike other tools, BreachSeek does not just generate text-based outputs but also executes commands within a terminal, directly interacting with the target environment.

LLMs are not only transforming penetration testing but are also being integrated into various aspects of cybersecurity. Their applications extend to defensive measures, such as risk management and automated vulnerability fixing. In these areas, LLMs help automate complex tasks, reducing the need for human intervention and allowing for faster, more efficient responses to security threats. However, the effectiveness of LLMs is often limited by their ability to maintain context over extended interactions, a challenge that continues to be a focal point in ongoing research. Future advancements are expected to improve the adaptability of LLMs to specific organizational environments, enabling them to continuously learn and remain effective against evolving cybersecurity threats [3]. Additionally, BreachSeek uniquely contributes to this space by generating a comprehensive, formatted PDF report that captures the entire journey of the penetration testing process, providing valuable insights that are automatically documented and ready for review.

The integration of LLMs into cybersecurity, particularly in automated penetration testing, represents a significant step forward in enhancing security measures. However, these advancements
come with their own set of challenges that researchers and practitioners must continue to address to fully realize the potential of these technologies. The continued refinement of tools like PentestGPT, alongside broader applications of generative AI in cybersecurity, will likely shape the future of how organizations defend against increasingly sophisticated cyber threats.

Figure 1: The general workflow of such models

3.3 Specific Architecture for Penetration Testing

For this study, we implemented a specialized architecture (Figure 2) that adheres to the general workflow while incorporating task-specific agents:

1. Recorder: Maintains a summary of actions and generates a final report when prompted
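
As a rough illustration of how a recorder agent of this kind might behave, the sketch below keeps a bounded list of one-line action summaries and renders a plain-text report on request. It is a stand-in written in plain Python, not BreachSeek's LangGraph implementation: the class name `Recorder`, its methods, the `max_entries` cap, and the sample actions are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Recorder:
    """Hypothetical recorder agent: keeps a bounded log of action summaries."""

    max_entries: int = 50  # assumed cap so the running summary stays small
    entries: list[str] = field(default_factory=list)

    def record(self, agent: str, action: str, outcome: str) -> None:
        # Store a one-line summary rather than the full tool output.
        self.entries.append(f"[{agent}] {action} -> {outcome}")
        # Drop the oldest summaries once the cap is exceeded.
        if len(self.entries) > self.max_entries:
            self.entries = self.entries[-self.max_entries:]

    def report(self, title: str = "Penetration test summary") -> str:
        # Render the retained summaries as a numbered plain-text report.
        lines = [title, "=" * len(title)]
        lines += [f"{i}. {entry}" for i, entry in enumerate(self.entries, start=1)]
        return "\n".join(lines)


# Illustrative usage with made-up actions.
recorder = Recorder(max_entries=2)
recorder.record("pentester", "port scan", "open services listed")
recorder.record("pentester", "probe ftp service", "anonymous login allowed")
recorder.record("supervisor", "request report", "handed to recorder")
print(recorder.report())
```

Keeping only short summaries, rather than raw command output, is one simple way to stop a long session from filling an agent's context window; a real system would more likely compress older entries than discard them.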
vulnerabilities on this machine, providing a realistic scenario for assessing its penetration testing capabilities.

focus on integrating a user permission system that prompts for approval before executing specific tools or commands. This feature will allow users to maintain oversight and intervene when necessary, ensuring that critical actions are only performed with explicit consent. This approach not only increases the security of the testing process but also provides a safeguard against unintended or potentially harmful operations.
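
A minimal sketch of such an approval gate is shown below, assuming a Python wrapper around command execution. The names `run_with_approval`, `approve`, and `ask_on_terminal` are hypothetical and are not taken from the BreachSeek code base; the demo command simply asks the Python interpreter to print a string.

```python
import shlex
import subprocess
import sys
from typing import Callable, Optional


def run_with_approval(
    command: list[str],
    approve: Callable[[str], bool],
    timeout: float = 30.0,
) -> Optional[subprocess.CompletedProcess]:
    """Run `command` only if `approve` returns True for its printable form."""
    printable = shlex.join(command)
    if not approve(printable):
        # Denied: nothing is executed.
        return None
    # Passing an argument list (no shell) avoids shell-injection issues.
    return subprocess.run(command, capture_output=True, text=True, timeout=timeout)


def ask_on_terminal(printable: str) -> bool:
    # Interactive approver: only an explicit "y" counts as consent.
    return input(f"Run `{printable}`? [y/N] ").strip().lower() == "y"


# Non-interactive demo: the same harmless command, approved once and denied once.
harmless = [sys.executable, "-c", "print('ok')"]
allowed = run_with_approval(harmless, approve=lambda _: True)
denied = run_with_approval(harmless, approve=lambda _: False)
print(allowed.stdout.strip(), denied)
```

Passing the approver as a callback keeps the gate independent of the interface: `ask_on_terminal` could be swapped for a web UI prompt or a policy that only asks about selected tools.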
only responding when the conversation is directly related to the task at hand.

Whether a user prefers a witty companion, a laid-back conversationalist, or a task-focused professional, BreachSeek will adapt to meet these preferences while still providing top-notch security testing services. By expanding the scope of responses and introducing a range of interaction styles, BreachSeek will maintain both its relevance to security tasks and its appeal to a broader audience.

5.5 Multi-Modality

To further expand the capabilities of BreachSeek, future iterations will introduce multi-modal input support, allowing users to submit images and videos as part of the penetration testing process. This feature will enable the system to analyze visual content, such as screenshots of network setups or video recordings of security camera feeds, providing a more comprehensive analysis and enabling more sophisticated testing scenarios. By incorporating multiple data types, BreachSeek will be better equipped to handle a broader range of penetration testing challenges.

6 Conclusion

BreachSeek, a multi-agent automated penetration testing platform, addresses critical gaps in traditional cybersecurity practices by leveraging Large Language Models through LangGraph. Its graph-based architecture, comprising specialized agents like the supervisor, pentester, and recorder, enables efficient task distribution and mitigates context window limitations. Deployed in a Docker-based Kali Linux environment, BreachSeek demonstrated its effectiveness by successfully exploiting a Metasploitable 2 machine within 150,000 tokens. While initially evaluated qualitatively, future work will incorporate quantitative measures using benchmarks like OWASP WSTG and OSCP exam content. Planned enhancements include a user permission system for human oversight, fine-tuning with specialized cybersecurity data, integration of Retrieval-Augmented Generation (RAG), enhanced dynamic and responsive interactions according to user preference, and multi-modal input support. These advancements, coupled with BreachSeek’s ability to generate comprehensive security reports, position it as a powerful, adaptable tool in the evolving landscape of AI-driven cybersecurity solutions, promising continued innovation in automated penetration testing and defense against sophisticated cyber threats.

The code used for the model can be found here: https://github.com/snow10100/pena/

References

[1] G. Deng, Y. Liu, V. Mayoral-Vilches, et al., “PentestGPT: An LLM-empowered automatic penetration testing tool,” arXiv (Cornell University), Jan. 1, 2023. doi: 10.48550/arxiv.2308.06782. [Online]. Available: https://arxiv.org/abs/2308.06782.
[2] E. Hilario, S. Azam, J. Sundaram, K. I. Mohammed, and B. Shanmugam, “Generative AI for pentesting: The good, the bad, the ugly,” International Journal of Information Security, vol. 23, no. 3, pp. 2075–2097, Mar. 15, 2024. doi: 10.1007/s10207-024-00835-x. [Online]. Available: https://doi.org/10.1007/s10207-024-00835-x.
[3] F. N. Motlagh, M. Hajizadeh, M. Majd, P. Najafi, F. Cheng, and C. Meinel, “Large language models in cybersecurity: State-of-the-art,” arXiv (Cornell University), Jan. 30, 2024. doi: 10.48550/arxiv.2402.00891. [Online]. Available: https://arxiv.org/abs/2402.00891.
[4] “OWASP Web Security Testing Guide — OWASP Foundation.” (Dec. 3, 2020), [Online]. Available: https://owasp.org/www-project-web-security-testing-guide/.
[5] OffSec. “PEN-200: Penetration testing certification with Kali Linux — OffSec.” (Aug. 26, 2024), [Online]. Available: https://www.offensive-security.com/pwk-oscp/.
A Appendix
Figure 3: The clean web UI when you start chatting with the model
Figure 5: The web UI when the task is done