Workshop on Cyber Security & Resilience in the Internet of Things (CSRIoT @ IEEE Services), Milan, Italy, July 2019
A crawler architecture for harvesting the clear, social, and dark web for IoT-related cyber-threat intelligence
Abstract—The clear, social, and dark web have lately been identified as rich sources of valuable cyber-security information that, given the appropriate tools and methods, may be identified, crawled, and subsequently leveraged to actionable cyber-threat intelligence. In this work, we focus on the information gathering task, and present a novel crawling architecture for transparently harvesting data from security websites in the clear web, security forums in the social web, and hacker forums/marketplaces in the dark web. The proposed architecture adopts a two-phase approach to data harvesting. Initially a machine learning-based crawler is used to direct the harvesting towards websites of interest, while in the second phase state-of-the-art statistical language modelling techniques are used to represent the harvested information in a latent low-dimensional feature space and rank it based on its potential relevance to the task at hand. The proposed architecture is realised using exclusively open-source tools, and a preliminary evaluation with crowdsourced results demonstrates its effectiveness.

Keywords: IoT; cyber-security; cyber-threat intelligence; crawling architecture; machine learning; language models

This work has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement no. 786698. The work reflects only the authors' view and the Agency is not responsible for any use that may be made of the information it contains.

I. INTRODUCTION

Over the years cyber-threats have increased in number and sophistication; adversaries now use a vast set of tools and tactics to attack their victims, with motivations ranging from intelligence collection to destruction or financial gain. Lately, the utilisation of IoT devices in a number of applications, ranging from home automation to the monitoring of critical infrastructures, has created an even more complicated cyber-defense landscape. The sheer number of IoT devices deployed globally, most of which are readily accessible and easily hacked, allows threat actors to use them as the cyber-weapon delivery system of choice in many of today's cyber-attacks, ranging from botnet-building for DDoS attacks to malware spreading and spamming.

Trying to stay on top of these evolving cyber-threats has become an increasingly difficult task, and timeliness in the delivery of relevant cyber-threat related information is essential for appropriate protection and mitigation. Such information is typically leveraged from collected data, and includes zero-day vulnerabilities and exploits, indicators (system artefacts or observables associated with an attack), security alerts, threat intelligence reports, as well as recommended security tool configurations, and is often referred to as cyber-threat intelligence (CTI). To this end, with the term CTI we typically refer to any information that may help an organization identify, assess, monitor, and respond to cyber-threats. In the era of big data, it is important to note that the term intelligence does not typically refer to the data itself, but rather to information that has been collected, analysed, leveraged and converted to a series of actions that may be followed upon, i.e., has become actionable.

While CTI may be collected by resorting to a variety of means (e.g., monitoring cyber-feeds) and from a variety of sources, we are particularly interested in gathering CTI from the clear, social, and dark web, where threat actors collaborate, communicate, and plan cyber-attacks. Such an approach allows us to provide visibility into a number of sources that are preferred by threat actors, and to identify timely CTI including zero-day vulnerabilities and exploits. To do so, we envision an integrated framework that encompasses key technologies for pre-reconnaissance CTI gathering, analysis and sharing through the use of state-of-the-art tools and technologies. In this context, newly discovered data from various sources will be inspected for their relevance to the task (gathering), and discovered CTI in the form of vulnerabilities, exploits, threat actors, or cyber-crime tools will be identified (analysis) and stored in a vulnerability database using existing formats like CVE1 and CPE1 (sharing).

1 https://www.mitre.org

In this work we focus on the gathering part of the envisioned framework, and present a novel architecture that is able to transparently provide a crawling infrastructure for a variety of CTI sources in the clear, social, and dark web. Our approach employs a thematically focused crawler for directing the crawl towards websites of interest to the CTI gathering task. This is realised by resorting to a combination of machine learning techniques (for the open domain crawl) and regex-based link filtering (for structured domains like forums). The retrieved content is stored in an efficient NoSQL datastore and is retrieved for further inspection in order to decide its usefulness to the task. This is achieved by employing statistical language modelling techniques [1] to represent all information in a latent low-dimensional feature space, and a ranking-based approach to the collected content (i.e., rank it according to its potential to be useful).
These techniques allow us to train our language model to (i) capture and exploit the most salient words for the given task by building upon user conversations, (ii) compute the semantic relatedness between the crawled content and the task at hand by leveraging the identified salient words, and (iii) classify the crawled content according to its relevance/usefulness based on its semantic similarity to CTI gathering. Notice that the post-mortem inspection of the crawled content is necessary, since the thematically focused crawl is forced to make a crude decision on the link relevance (and whether it should be visited or not), as it resorts to a limited feature space (e.g., the alt-text of the link, words in the URL, or the relevance of the parent page).

To the best of our knowledge, this is the first approach to CTI gathering that views the crawling task as a two-stage process, where a crude classification is initially used to prune the crawl frontier, while a more refined approach based on the collected content is used to decide on its relevance to the task. This novel approach to CTI gathering is integrated into an infrastructure that is entirely open-source and able to transparently monitor the clear, social, and dark web.

The rest of the paper is organised as follows. In the next section, we present the architecture and provide details on several design and implementation choices, while in Section III we present our evaluation plan and a preliminary effectiveness evaluation using both crowdsourced results and anecdotal examples. Finally, Section IV outlines related work, while Section V discusses future research directions.

II. ARCHITECTURE

The proposed architecture consists of two major components: the crawling module and the content ranking module. The idea behind this two-stage approach is the openness of the topic at hand, which cannot be accurately modelled by a topical crawler that mainly focuses on staying within the given topic. The difficulty emerges from websites that, although relevant to the topic (e.g., discussing IoT security in general), have no actual information that may be leveraged to actionable intelligence (e.g., they do not mention any specific IoT-related vulnerability). To overcome this challenge, a focused crawler that employs machine learning techniques to direct the crawl is aided by advanced language models to decide on the usefulness of the harvested content. To this end, we designed a novel module that employs a ranking-based approach to the assessment of the usefulness of a website's content using language models. To capture the background knowledge regarding vocabulary and term correlations, we resorted to latent topic models [1], while ranking is modelled as vector similarity to the task.

The proposed architecture has been entirely designed and developed using open-source software; it employs an open-source focused crawler2, an open-source implementation of word embeddings3 for the latent topic modeling, and an open-source NoSQL database4 for the storage of the topic models and the crawled content. The front-end modules are based on HTML, CSS and JavaScript. Figure 1 displays the complete architecture with all the sub-components that will be detailed in the following sections.

[Figure 1. The complete architecture: the crawler module (crawler configuration, clear web crawler) and the content ranking module (data preprocessor, model trainer, retriever, and the IoTsec vocabulary collection).]

A. Clear/social/dark web crawling

The crawling module is based upon the open-source implementation of the ACHE Crawler2 and contains three different sub-components: (i) a focused crawler for the clear web, (ii) an in-depth crawler for the social web (i.e., forums), and (iii) a TOR-based crawler for the dark web. Below, we outline the characteristics of the different sub-components and present the rationale behind our implementation choices.

Clear web crawler. This sub-component is designed to perform focused crawls on the clear web with the purpose of discovering new resources that may contain cyber-threat intelligence. To direct the crawl towards topically relevant websites we utilise an SVM classifier, which is trained by resorting to an equal number of positive and negative examples of websites that are used as input to the Model Builder component of ACHE. Table I contains some example entries that are specific to our task. Subsequently, the SeedFinder [2] component is utilised to aid the process of locating initial seeds for the focused crawl on the clear web; this is achieved by combining the classification model built previously with a user-provided query relevant to the topic; in our case we use the query “iot vulnerabilities”.

2 https://github.com/ViDA-NYU/ache
3 https://radimrehurek.com/gensim/
4 https://www.mongodb.com
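To make the two training inputs concrete, the following is a minimal sketch of the idea behind this page classifier: a linear SVM trained over TF-IDF representations of the positive and negative example pages. ACHE itself is a Java system, so this Python fragment is only an illustration of the approach, not ACHE's actual implementation; the load_pages helper and the file names are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical helper: fetch the raw text of the labelled example pages.
positive_pages = load_pages("positive_urls.txt")  # on-topic examples
negative_pages = load_pages("negative_urls.txt")  # off-topic examples

# TF-IDF features + linear SVM, mirroring the page classifier described above.
classifier = make_pipeline(TfidfVectorizer(stop_words="english"), LinearSVC())
classifier.fit(
    positive_pages + negative_pages,
    [1] * len(positive_pages) + [0] * len(negative_pages),
)

# During the crawl, only pages classified as on-topic are harvested.
def is_relevant(page_text: str) -> bool:
    return classifier.predict([page_text])[0] == 1
```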
Model trainer. Training is done by using the Gensim3 open-source implementation of word2vec (based on a neural network for word relatedness), set up with a latent space of 150 dimensions, a training window of 5 words, a minimum occurrence of 1 term instance, and 10 parallel threads. The result of the training is a 150-dimensional distributional vector for each term that occurs at least once in the training corpus.
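As a minimal sketch, the configuration above maps directly onto Gensim's Word2Vec constructor (parameter names follow Gensim 4, where the dimensionality argument is vector_size; earlier releases called it size). The corpus loader is a hypothetical stand-in for the preprocessed post collection.

```python
from gensim.models import Word2Vec

# Hypothetical helper: an iterable of tokenised posts,
# e.g. [["mirai", "botnet", "ddos", ...], ...]
corpus = load_tokenised_posts()

model = Word2Vec(
    sentences=corpus,
    vector_size=150,  # latent space of 150 dimensions
    window=5,         # training window of 5 words
    min_count=1,      # keep every term with at least 1 occurrence
    workers=10,       # 10 parallel training threads
)
model.save("iotsec_word2vec.model")
```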
Topic vocabulary creator. To automatically extract the set of salient words that will be used to represent the topic, we utilised the extracted user tags and augmented them with the set of N most related terms in the latent space for each user tag; term relatedness was provided by the trained language models and the corresponding word vectors. Table III shows an example of the most relevant terms to the DDoS user tag, for N = 5, 10, and 15. The resulting (expanded) vocabulary is stored in a separate NoSQL document store (mongoDB4).
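A sketch of this expansion step, assuming the model trained above and Gensim's most_similar neighbour lookup; the example tags and the MongoDB database/collection names are illustrative, not the system's actual configuration.

```python
from gensim.models import Word2Vec
from pymongo import MongoClient

model = Word2Vec.load("iotsec_word2vec.model")

def expand_tag(tag: str, n: int) -> list[str]:
    """Return the tag together with its N most related terms in the latent space."""
    return [tag] + [term for term, _ in model.wv.most_similar(tag, topn=n)]

user_tags = ["ddos", "botnet", "exploit"]  # illustrative user tags
vocabulary = sorted({term for tag in user_tags for term in expand_tag(tag, n=10)})

# Store the expanded vocabulary in the MongoDB document store.
MongoClient()["iotsec"]["vocabulary"].insert_one({"terms": vocabulary})
```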
Content ranking. To assess the relevance and usefulness of the crawled content we employ the content ranking sub-component; this component utilises the expanded vocabulary created in the previous phase to decide how similar a crawled post is to the topic by computing the similarity between the topic and post vectors. This is done as follows.

The topic vector $\vec{T}$ is constructed as the sum of the distributional vectors of all the topic terms $\vec{t}_i$ that exist in the topic vocabulary, i.e.,

$$\vec{T} = \sum_{\forall i} \vec{t}_i$$

Similarly, the post vector $\vec{P}$ is constructed as the sum of the distributional vectors of all the post terms $\vec{w}_j$ that are present in the topic vocabulary. To promote the impact of words related to the topic at hand, we introduce a topic-dependent weighting scheme for post vectors in the spirit of [3]. Namely, for a topic $T$ and a post containing the set of words $\{w_1, w_2, \ldots\}$, the post vector is computed as

$$\vec{P} = \sum_{\forall j} \cos(\vec{w}_j, \vec{T}) \cdot \vec{w}_j$$

Finally, after both vectors have been computed, the relevance score $r$ between the topic $T$ and a post $P$ is computed as the cosine similarity of their respective distributional vectors in the latent space:

$$r = \cos(\vec{T}, \vec{P})$$
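A compact sketch of this scoring scheme under the definitions above, using NumPy and the word vectors trained earlier; the guard for posts containing no vocabulary terms is our addition, as the paper does not specify that edge case.

```python
import numpy as np

def cos(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def topic_vector(wv, vocabulary) -> np.ndarray:
    # T = sum of the distributional vectors of all topic terms.
    return np.sum([wv[t] for t in vocabulary if t in wv], axis=0)

def post_vector(wv, post_tokens, vocabulary, T) -> np.ndarray:
    # P = sum over post terms in the vocabulary of cos(w_j, T) * w_j.
    terms = [w for w in post_tokens if w in vocabulary and w in wv]
    if not terms:  # assumption: a post with no vocabulary terms scores zero
        return np.zeros_like(T)
    return np.sum([cos(wv[w], T) * wv[w] for w in terms], axis=0)

def relevance(wv, post_tokens, vocabulary) -> float:
    T = topic_vector(wv, vocabulary)
    P = post_vector(wv, post_tokens, vocabulary, T)
    return 0.0 if not P.any() else cos(T, P)

# Usage with the Gensim model trained earlier:
#   score = relevance(model.wv, tokenised_post, vocabulary)
```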
Having computed a relevance score for every crawled post in the NoSQL datastore, the classification task of identifying relevant/useful posts is trivially reduced to either a thresholding or a top-k selection operation. Notice that the most/least relevant websites may also be used to reinforce the crawler model (i.e., as input to the Model Builder sub-component). Comparing the pros and cons of thresholding and top-k post selection in the context of assessing a continuous crawling task is an ongoing research effort.
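Both selection strategies reduce to a few lines once the scores are available; the cut-off values below are placeholders rather than tuned parameters.

```python
def select_by_threshold(scored_posts, threshold=0.75):
    """scored_posts: list of (post_id, relevance_score) pairs."""
    return [p for p in scored_posts if p[1] >= threshold]

def select_top_k(scored_posts, k=100):
    return sorted(scored_posts, key=lambda p: p[1], reverse=True)[:k]
```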
III. PRELIMINARY EVALUATION

In this section we present a preliminary evaluation of the architecture and the evaluation plan for a thorough investigation of each system component.

For the purposes of the evaluation, we ran a focused crawl with a model consisting of 7 positive and 7 negative URLs, a sample of which may be seen in Table I, and 21 seeds extracted by the SeedFinder component for the query “iot vulnerabilities”. The crawl harvested around 22K websites per hour per thread on a commodity machine; around 10% of the frontier links were considered relevant according to the crawler model and were thus harvested.

In order to determine the quality of the information within the harvested URLs that were considered relevant in the focused crawl, we developed a web-based evaluation tool (a snapshot of which is shown in Figure 2) for collecting crowdsourced data from human experts. Using this tool, human judges are provided with a random harvested website and are asked to assess its relevance on a 4-point scale; these assessments are expected to showcase the necessity of the language models and drive our classification task (i.e., help us identify appropriate thresholding/top-k values). Early results from a limited number of judgements on the clear web crawl show that about 1% of the harvested websites contain actionable cyber-threat intelligence, while (as expected) the percentage is higher for the social and dark web.

[Figure 2. Web-based evaluation tool for collecting crowdsourced data from human experts.]

Besides the evaluation tool, we have also implemented a visualisation component for the computation of the relevance score of posts. This component highlights vocabulary words on the actual crawled posts, and displays the computed relevance score according to our model. A sample output of the component for a random post is shown in Table IV.

Table IV. Sample output of the visualisation component for a random post.
“This IoT botnet was made possible by malware called Mirai. Once infected with Mirai, computers continually search the internet for vulnerable IoT devices and then use known default usernames and passwords to log in, infecting them with malware. These devices were things like digital cameras and DVR players.”
Relevance Score: 0.8563855440900794

IV. RELATED WORK

Web crawlers, typically also known as robots or spiders, are tightly connected to information gathering from online sources. In this section we review the state-of-the-art in crawling by (i) outlining typical architectural alternatives that fit the crawling task and (ii) categorising the different crawler types based on the type of the targeted content.