Workshop on Cyber Security & Resilience in the Internet of Things (CSRIoT @ IEEE Services), Milan, Italy, July 2019
A crawler architecture for harvesting the clear, social, and dark web for IoT-related cyber-threat intelligence
Abstract—The clear, social, and dark web have lately been identified as rich sources of valuable cyber-security information that, given the appropriate tools and methods, may be identified, crawled, and subsequently leveraged to actionable cyber-threat intelligence. In this work, we focus on the information gathering task, and present a novel crawling architecture for transparently harvesting data from security websites in the clear web, security forums in the social web, and hacker forums/marketplaces in the dark web. The proposed architecture adopts a two-phase approach to data harvesting. Initially a machine learning-based crawler is used to direct the harvesting towards websites of interest, while in the second phase state-of-the-art statistical language modelling techniques are used to represent the harvested information in a latent low-dimensional feature space and rank it based on its potential relevance to the task at hand. The proposed architecture is realised using exclusively open-source tools, and a preliminary evaluation with crowdsourced results demonstrates its effectiveness.

Keywords: IoT; cyber-security; cyber-threat intelligence; crawling architecture; machine learning; language models

This work has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement no. 786698. The work reflects only the authors' view and the Agency is not responsible for any use that may be made of the information it contains.

I. INTRODUCTION

Over the years cyber-threats have increased in number and sophistication; adversaries now use a vast set of tools and tactics to attack their victims, with motivations ranging from intelligence collection to destruction or financial gain. Lately, the utilisation of IoT devices in a number of applications, ranging from home automation to the monitoring of critical infrastructures, has created an even more complicated cyber-defense landscape. The sheer number of IoT devices deployed globally, most of which are readily accessible and easily hacked, allows threat actors to use them as the cyber-weapon delivery system of choice in many of today's cyber-attacks, ranging from botnet-building for DDoS attacks to malware spreading and spamming.

Trying to stay on top of these evolving cyber-threats has become an increasingly difficult task, and timeliness in the delivery of relevant cyber-threat related information is essential for appropriate protection and mitigation. Such information is typically leveraged from collected data, and includes zero-day vulnerabilities and exploits, indicators (system artefacts or observables associated with an attack), security alerts, threat intelligence reports, as well as recommended security tool configurations, and is often referred to as cyber-threat intelligence (CTI). To this end, with the term CTI we typically refer to any information that may help an organization identify, assess, monitor, and respond to cyber-threats. In the era of big data, it is important to note that the term intelligence does not typically refer to the data itself, but rather to information that has been collected, analysed, leveraged and converted to a series of actions that may be followed upon, i.e., has become actionable.

While CTI may be collected by resorting to a variety of means (e.g., monitoring cyber-feeds) and from a variety of sources, we are particularly interested in gathering CTI from the clear, social, and dark web, where threat actors collaborate, communicate, and plan cyber-attacks. Such an approach allows us to provide visibility into a number of sources that are preferred by threat actors, and to identify timely CTI including zero-day vulnerabilities and exploits. To do so, we envision an integrated framework that encompasses key technologies for pre-reconnaissance CTI gathering, analysis and sharing through the use of state-of-the-art tools and technologies. In this context, newly discovered data from various sources will be inspected for their relevance to the task (gathering), and discovered CTI in the form of vulnerabilities, exploits, threat actors, or cyber-crime tools will be identified (analysis) and stored in a vulnerability database using existing formats like CVE1 and CPE1 (sharing).

1 https://www.mitre.org

In this work we focus on the gathering part of the envisioned framework, and present a novel architecture that is able to transparently provide a crawling infrastructure for a variety of CTI sources in the clear, social, and dark web. Our approach employs a thematically focused crawler for directing the crawl towards websites of interest to the CTI gathering task. This is realised by resorting to a combination of machine learning techniques (for the open domain crawl) and regex-based link filtering (for structured domains like forums). The retrieved content is stored in an efficient NoSQL datastore and is retrieved for further inspection in order to decide its usefulness to the task. This is achieved by employing statistical language modelling techniques [1] to represent all information in a latent low-dimensional feature space, and a ranking-based approach to the collected content (i.e., rank it according to its potential to be useful).
These techniques allow us to train our language model to (i) capture and exploit the most salient words for the given task by building upon user conversations, (ii) compute the semantic relatedness between the crawled content and the task at hand by leveraging the identified salient words, and (iii) classify the crawled content according to its relevance/usefulness based on its semantic similarity to CTI gathering. Notice that the post-mortem inspection of the crawled content is necessary, since the thematically focused crawl is forced to make a crude decision on the link relevance (and whether it should be visited or not), as it resorts to a limited feature space (e.g., the alt-text of the link, words in the URL, or the relevance of the parent page).

To the best of our knowledge, this is the first approach to CTI gathering that views the crawling task as a two-stage process, where a crude classification is initially used to prune the crawl frontier, while a more refined approach based on the collected content is used to decide on its relevance to the task. This novel approach to CTI gathering is integrated into an infrastructure that is entirely open-source and able to transparently monitor the clear, social, and dark web.

The rest of the paper is organised as follows. In the next section, we present the architecture and provide details on several design and implementation choices, while in Section III we present our evaluation plan and a preliminary effectiveness evaluation using both crowdsourced results and anecdotal examples. Finally, Section IV outlines related work, while Section V discusses future research directions.

II. ARCHITECTURE

The proposed architecture consists of two major components: the crawling module and the content ranking module. The idea behind this two-stage approach is the openness of the topic at hand, which cannot be accurately modelled by a topical crawler that mainly focuses on staying within the given topic. The difficulty emerges from websites that, although relevant to the topic (e.g., discussing IoT security in general), have no actual information that may be leveraged to actionable intelligence (e.g., they do not mention any specific IoT-related vulnerability). To overcome this challenge, a focused crawler that employs machine learning techniques to direct the crawl is aided by advanced language models to decide on the usefulness of the harvested content. To this end, we designed a novel module that employs a ranking-based approach to the assessment of the usefulness of a website's content using language models. To capture the background knowledge regarding vocabulary and term correlations, we resorted to latent topic models [1], while ranking is modelled as vector similarity to the task.

The proposed architecture has been entirely designed and developed using open-source software; it employs an open-source focused crawler2, an open-source implementation of word embeddings3 for the latent topic modeling, and an open-source NoSQL database4 for the storage of the topic models and the crawled content. The front-end modules are based on HTML, CSS and JavaScript. Figure 1 displays the complete architecture with all the sub-components that will be detailed in the following sections.

[Figure 1. The complete architecture: the crawler module (crawler configuration, clear web crawler) and the content ranking module (data preprocessor, model trainer, retriever, and the IoTsec vocabulary collection).]

A. Clear/social/dark web crawling

The crawling module is based upon the open-source implementation of the ACHE Crawler2 and contains three different sub-components: (i) a focused crawler for the clear web, (ii) an in-depth crawler for the social web (i.e., forums), and (iii) a TOR-based crawler for the dark web. Below, we outline the characteristics of the different sub-components and present the rationale behind our implementation choices.

Clear web crawler. This sub-component is designed to perform focused crawls on the clear web with the purpose of discovering new resources that may contain cyber-threat intelligence. To direct the crawl towards topically relevant websites we utilise an SVM classifier, which is trained by resorting to an equal number of positive and negative examples of websites that are used as input to the Model Builder component of ACHE. Table I contains some example entries that are specific to our task. Subsequently, the SeedFinder [2] component is utilised to aid the process of locating initial seeds for the focused crawl on the clear web; this is achieved by combining the classification model built previously with a user-provided query relevant to the topic; in our case we use the query “iot vulnerabilities”.

2 https://github.com/ViDA-NYU/ache
3 https://radimrehurek.com/gensim/
4 https://www.mongodb.com
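To make the two training inputs concrete, the following is a minimal sketch of the idea behind this page classifier: a linear SVM trained over TF-IDF representations of the positive and negative example pages. ACHE itself is a Java system, so this Python fragment is only an illustration of the approach, not ACHE's actual implementation; the load_pages helper and the file names are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical helper: fetch the raw text of the labelled example pages.
positive_pages = load_pages("positive_urls.txt")  # on-topic examples
negative_pages = load_pages("negative_urls.txt")  # off-topic examples

# TF-IDF features + linear SVM, mirroring the page classifier described above.
classifier = make_pipeline(TfidfVectorizer(stop_words="english"), LinearSVC())
classifier.fit(
    positive_pages + negative_pages,
    [1] * len(positive_pages) + [0] * len(negative_pages),
)

# During the crawl, only pages classified as on-topic are harvested.
def is_relevant(page_text: str) -> bool:
    return classifier.predict([page_text])[0] == 1
```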
Model trainer. Training is done by using the Gensim3 open-source implementation of word2vec (based on a neural network for word relatedness), set up with a latent space of 150 dimensions, a training window of 5 words, a minimum occurrence of 1 term instance, and 10 parallel threads. The result of the training is a 150-dimensional distributional vector for each term that occurs at least once in the training corpus.
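As a minimal sketch, the configuration above maps directly onto Gensim's Word2Vec constructor (parameter names follow Gensim 4, where the dimensionality argument is vector_size; earlier releases called it size). The corpus loader is a hypothetical stand-in for the preprocessed post collection.

```python
from gensim.models import Word2Vec

# Hypothetical helper: an iterable of tokenised posts,
# e.g. [["mirai", "botnet", "ddos", ...], ...]
corpus = load_tokenised_posts()

model = Word2Vec(
    sentences=corpus,
    vector_size=150,  # latent space of 150 dimensions
    window=5,         # training window of 5 words
    min_count=1,      # keep every term with at least 1 occurrence
    workers=10,       # 10 parallel training threads
)
model.save("iotsec_word2vec.model")
```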
Topic vocabulary creator. To automatically extract the set of salient words that will be used to represent the topic, we utilised the extracted user tags and augmented them with the set of N most related terms in the latent space for each user tag; term relatedness was provided by the trained language models and the corresponding word vectors. Table III shows an example of the most relevant terms to the DDoS user tag, for N = 5, 10, and 15. The resulting (expanded) vocabulary is stored in a separate NoSQL document store (mongoDB4).
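A sketch of this expansion step, assuming the model trained above and Gensim's most_similar neighbour lookup; the example tags and the MongoDB database/collection names are illustrative, not the system's actual configuration.

```python
from gensim.models import Word2Vec
from pymongo import MongoClient

model = Word2Vec.load("iotsec_word2vec.model")

def expand_tag(tag: str, n: int) -> list[str]:
    """Return the tag together with its N most related terms in the latent space."""
    return [tag] + [term for term, _ in model.wv.most_similar(tag, topn=n)]

user_tags = ["ddos", "botnet", "exploit"]  # illustrative user tags
vocabulary = sorted({term for tag in user_tags for term in expand_tag(tag, n=10)})

# Store the expanded vocabulary in the MongoDB document store.
MongoClient()["iotsec"]["vocabulary"].insert_one({"terms": vocabulary})
```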
Content ranking. To assess the relevance and usefulness of the crawled content we employ the content ranking sub-component; this component utilises the expanded vocabulary created in the previous phase to decide how similar a crawled post is to the topic by computing the similarity between the topic and post vectors. This is done as follows.

The topic vector $\vec{T}$ is constructed as the sum of the distributional vectors of all the topic terms $\vec{t}_i$ that exist in the topic vocabulary, i.e.,

$$\vec{T} = \sum_{\forall i} \vec{t}_i$$

Similarly, the post vector $\vec{P}$ is constructed as the sum of the distributional vectors of all the post terms $\vec{w}_j$ that are present in the topic vocabulary. To promote the impact of words related to the topic at hand, we introduce a topic-dependent weighting scheme for post vectors in the spirit of [3]. Namely, for a topic $T$ and a post containing the set of words $\{w_1, w_2, \ldots\}$, the post vector is computed as

$$\vec{P} = \sum_{\forall j} \cos(\vec{w}_j, \vec{T}) \cdot \vec{w}_j$$

Finally, after both vectors have been computed, the relevance score $r$ between the topic $T$ and a post $P$ is computed as the cosine similarity of their respective distributional vectors in the latent space:

$$r = \cos(\vec{T}, \vec{P})$$
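A compact sketch of this scoring scheme under the definitions above, using NumPy and the word vectors trained earlier; the guard for posts containing no vocabulary terms is our addition, as the paper does not specify that edge case.

```python
import numpy as np

def cos(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def topic_vector(wv, vocabulary) -> np.ndarray:
    # T = sum of the distributional vectors of all topic terms.
    return np.sum([wv[t] for t in vocabulary if t in wv], axis=0)

def post_vector(wv, post_tokens, vocabulary, T) -> np.ndarray:
    # P = sum over post terms in the vocabulary of cos(w_j, T) * w_j.
    terms = [w for w in post_tokens if w in vocabulary and w in wv]
    if not terms:  # assumption: a post with no vocabulary terms scores zero
        return np.zeros_like(T)
    return np.sum([cos(wv[w], T) * wv[w] for w in terms], axis=0)

def relevance(wv, post_tokens, vocabulary) -> float:
    T = topic_vector(wv, vocabulary)
    P = post_vector(wv, post_tokens, vocabulary, T)
    return 0.0 if not P.any() else cos(T, P)

# Usage with the Gensim model trained earlier:
#   score = relevance(model.wv, tokenised_post, vocabulary)
```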
Having computed a relevance score for every crawled post in the NoSQL datastore, the classification task of identifying relevant/useful posts is trivially reduced to either a thresholding or a top-k selection operation. Notice that the most/least relevant websites may also be used to reinforce the crawler model (i.e., as input to the Model Builder sub-component). Comparing the pros and cons of thresholding and top-k post selection in the context of assessing a continuous crawling task is an ongoing research effort.
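Both selection strategies reduce to a few lines once the scores are available; the cut-off values below are placeholders rather than tuned parameters.

```python
def select_by_threshold(scored_posts, threshold=0.75):
    """scored_posts: list of (post_id, relevance_score) pairs."""
    return [p for p in scored_posts if p[1] >= threshold]

def select_top_k(scored_posts, k=100):
    return sorted(scored_posts, key=lambda p: p[1], reverse=True)[:k]
```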
III. PRELIMINARY EVALUATION

In this section we present a preliminary evaluation of the architecture and the evaluation plan for a thorough investigation of each system component.

For the purposes of the evaluation, we ran a focused crawl with a model consisting of 7 positive and 7 negative URLs, a sample of which may be seen in Table I, and 21 seeds extracted by the SeedFinder component for the query “iot vulnerabilities”. The crawl harvested around 22K websites per hour per thread on a commodity machine; around 10% of the frontier links were considered relevant according to the crawler model and were thus harvested.

In order to determine the quality of the information within the harvested URLs that were considered relevant in the focused crawl, we developed a web-based evaluation tool (a snapshot of which is shown in Figure 2) for collecting crowdsourced data from human experts. Using this tool, human judges are provided with a random harvested website and are asked to assess its relevance on a 4-point scale; these assessments are expected to showcase the necessity of the language models and drive our classification task (i.e., help us identify appropriate thresholding/top-k values). Early results from a limited number of judgements on the clear web crawl show that about 1% of the harvested websites contain actionable cyber-threat intelligence, while (as expected) the percentage is higher for the social and dark web.

[Figure 2. Web-based evaluation tool for collecting crowdsourced data from human experts.]

Besides the evaluation tool, we have also implemented a visualisation component for the computation of the relevance score of posts. This component highlights vocabulary words on the actual crawled posts, and displays the computed relevance score according to our model. A sample output of the component for a random post is shown in Table IV.

Table IV. Sample output of the visualisation component for a random post.
“This IoT botnet was made possible by malware called Mirai. Once infected with Mirai, computers continually search the internet for vulnerable IoT devices and then use known default usernames and passwords to log in, infecting them with malware. These devices were things like digital cameras and DVR players.”
Relevance Score: 0.8563855440900794

IV. RELATED WORK

Web crawlers, typically also known as robots or spiders, are tightly connected to information gathering from online sources. In this section we review the state-of-the-art in crawling by (i) outlining typical architectural alternatives that fit the crawling task and (ii) categorising the different crawler types based on the type of the targeted content.