
This paper is a preprint (arXiv:2109.06932v1 [cs.CR], 14 Sep 2021); it has been published in the 2019 IEEE World Congress on Services (SERVICES), Workshop on Cyber Security & Resilience in the Internet of Things (CSRIoT @ IEEE Services), Milan, Italy, July 2019.

IEEE copyright notice: ©2021 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

A crawler architecture for harvesting the clear, social, and dark web
for IoT-related cyber-threat intelligence

Paris Koloveas, Thanasis Chantzios, Christos Tryfonopoulos, Spiros Skiadopoulos


University of the Peloponnese, GR22131, Tripolis, Greece
{pkoloveas, tchantzios, trifon, spiros}@uop.gr

Abstract—The clear, social, and dark web have lately been identified as rich sources of valuable cyber-security information that – given the appropriate tools and methods – may be identified, crawled and subsequently leveraged to actionable cyber-threat intelligence. In this work, we focus on the information gathering task, and present a novel crawling architecture for transparently harvesting data from security websites in the clear web, security forums in the social web, and hacker forums/marketplaces in the dark web. The proposed architecture adopts a two-phase approach to data harvesting. Initially a machine learning-based crawler is used to direct the harvesting towards websites of interest, while in the second phase state-of-the-art statistical language modelling techniques are used to represent the harvested information in a latent low-dimensional feature space and rank it based on its potential relevance to the task at hand. The proposed architecture is realised using exclusively open-source tools, and a preliminary evaluation with crowdsourced results demonstrates its effectiveness.

Keywords—IoT; cyber-security; cyber-threat intelligence; crawling architecture; machine learning; language models

This work has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement no. 786698. The work reflects only the authors' view and the Agency is not responsible for any use that may be made of the information it contains.

I. INTRODUCTION

Over the years cyber-threats have increased in numbers and sophistication; adversaries now use a vast set of tools and tactics to attack their victims, with their motivations ranging from intelligence collection to destruction or financial gain. Lately, the utilisation of IoT devices in a number of applications, ranging from home automation to the monitoring of critical infrastructures, has created an even more complicated cyber-defense landscape. The sheer number of IoT devices deployed globally, most of which are readily accessible and easily hacked, allows threat actors to use them as the cyber-weapon delivery system of choice in many of today's cyber-attacks, ranging from botnet-building for DDoS attacks, to malware spreading and spamming.

Trying to stay on top of these evolving cyber-threats has become an increasingly difficult task, and timeliness in the delivery of relevant cyber-threat related information is essential for appropriate protection and mitigation. Such information is typically leveraged from collected data, and includes zero-day vulnerabilities and exploits, indicators (system artefacts or observables associated with an attack), security alerts, threat intelligence reports, as well as recommended security tool configurations, and is often referred to as cyber-threat intelligence (CTI). To this end, with the term CTI we typically refer to any information that may help an organization identify, assess, monitor, and respond to cyber-threats. In the era of big data, it is important to note that the term intelligence does not typically refer to the data itself, but rather to information that has been collected, analysed, leveraged and converted to a series of actions that may be followed upon, i.e., has become actionable.

While CTI may be collected by resorting to a variety of means (e.g., monitoring cyber-feeds) and from a variety of sources, we are particularly interested in gathering CTI from the clear, social, and dark web, where threat actors collaborate, communicate and plan cyber-attacks. Such an approach allows us to provide visibility into a number of sources that are of preference to threat-actors and to identify timely CTI including zero-day vulnerabilities and exploits. To do so, we envision an integrated framework that encompasses key technologies for pre-reconnaissance CTI gathering, analysis and sharing through the use of state-of-the-art tools and technologies. In this context, newly discovered data from various sources will be inspected for their relevance to the task (gathering), and discovered CTI in the form of vulnerabilities, exploits, threat actors, or cyber-crime tools will be identified (analysis) and stored in a vulnerability database using existing formats like CVE and CPE (https://www.mitre.org) (sharing).

In this work we focus on the gathering part of the envisioned framework, and present a novel architecture that is able to transparently provide a crawling infrastructure for a variety of CTI sources in the clear, social, and dark web. Our approach employs a thematically focused crawler for directing the crawl towards websites of interest to the CTI gathering task. This is realised by resorting to a combination of machine learning techniques (for the open domain crawl) and regex-based link filtering (for structured domains like forums). The retrieved content is stored in an efficient NoSQL datastore and is retrieved for further inspection in order to decide its usefulness to the task. This is achieved by employing statistical language modelling techniques [1] to represent all information in a latent low-dimensional feature space and a ranking-based approach to the collected content (i.e., rank it according to its potential to be useful).
Figure 1. A high-level view of the proposed architecture. (Diagram: the crawler module, comprising the Clear Web Crawler, Social Crawler, and Dark Web Crawler together with the Crawler Configuration, SeedFinder, Model Builder, and Content Parser sub-components, feeds the CTI Crawl collection; the content ranking module, comprising the Data Preprocessor (XML data parsing, Normalizer, MWE Tokenizer), Model Trainer, Topic Vocabulary Creator, Retriever, and Content Ranking sub-components, builds on Stack Exchange data and the IoTsec Vocabulary collection.)

These techniques allow us to train our language model to (i) capture and exploit the most salient words for the given task by building upon user conversations, (ii) compute the semantic relatedness between the crawled content and the task at hand by leveraging the identified salient words, and (iii) classify the crawled content according to its relevance/usefulness based on its semantic similarity to CTI gathering. Notice that the post-mortem inspection of the crawled content is necessary, since the thematically focused crawl is forced to make a crude decision on the link relevance (and whether it should be visited or not), as it resorts to a limited feature space (e.g., the alt-text of the link, words in the URL, or the relevance of the parent page).

To the best of our knowledge, this is the first approach to CTI gathering that views the crawling task as a two-stage process, where a crude classification is initially used to prune the crawl frontier, while a more refined approach based on the collected content is used to decide on its relevance to the task. This novel approach to CTI gathering is integrated into an infrastructure that is entirely open-source and able to transparently monitor the clear, social, and dark web.

The rest of the paper is organised as follows. In the next section, we present the architecture and provide details on several design and implementation choices, while in Section III we present our evaluation plan and a preliminary effectiveness evaluation using both crowdsourced results and anecdotal examples. Finally, Section IV outlines related work, while Section V discusses future research directions.

II. ARCHITECTURE

The proposed architecture consists of two major components: the crawling module and the content ranking module. The idea behind this two-stage approach is the openness of the topic at hand, which cannot be accurately modelled by a topical crawler that mainly focuses on staying within the given topic. The difficulty emerges from websites that, although relevant to the topic (e.g., discussing IoT security in general), have no actual information that may be leveraged to actionable intelligence (e.g., do not mention any specific IoT-related vulnerability). To overcome this challenge, a focused crawler that employs machine learning techniques to direct the crawl is aided by advanced language models to decide on the usefulness of the crawled websites by utilising the harvested content. To this end, we designed a novel module that employs a ranking-based approach to the assessment of the usefulness of a website's content using language models. To capture the background knowledge regarding vocabulary and term correlations, we resorted to latent topic models [1], while ranking is modelled as vector similarity to the task.

The proposed architecture has been entirely designed and developed using open-source software; it employs an open-source focused crawler (ACHE, https://github.com/ViDA-NYU/ache), an open-source implementation of word embeddings (Gensim, https://radimrehurek.com/gensim/) for the latent topic modeling, and an open-source NoSQL database (MongoDB, https://www.mongodb.com/) for the storage of the topic models and the crawled content. The front-end modules are based on HTML, CSS and JavaScript. Figure 1 displays the complete architecture with all the sub-components that will be detailed in the following sections.

A. Clear/social/dark web crawling

The crawling module is based upon the open-source implementation of the ACHE Crawler and contains three different sub-components: (i) a focused crawler for the clear web, (ii) an in-depth crawler for the social web (i.e., forums), and (iii) a TOR-based crawler for the dark web. Below, we outline the characteristics of the different sub-components and present the rationale behind our implementation choices.

Clear web crawler. This sub-component is designed to perform focused crawls on the clear web with the purpose of discovering new resources that may contain cyber-threat intelligence. To direct the crawl towards topically relevant websites we utilise an SVM classifier, which is trained by resorting to an equal number of positive and negative examples of websites that are used as input to the Model Builder component of ACHE. Table I contains some example entries that are specific to our task. Subsequently, the SeedFinder [2] component is utilised to aid the process of locating initial seeds for the focused crawl on the clear web; this is achieved by combining the classification model built previously with a user-provided query relevant to the topic; in our case we use the query "iot vulnerabilities".
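To make the page-classification idea concrete, the following is a minimal, illustrative sketch (not ACHE's internal implementation) of how a page relevance classifier could be trained from labelled positive/negative page examples using scikit-learn; the directory layout and the text-extraction helper are hypothetical placeholders.

```python
# Illustrative sketch of a focused-crawl page classifier (not ACHE's internal code).
# Assumes two directories of previously fetched HTML pages labelled positive/negative.
from pathlib import Path

from bs4 import BeautifulSoup                      # HTML -> visible text extraction
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def page_text(html_file: Path) -> str:
    """Strip tags and return the visible text of a stored HTML page."""
    return BeautifulSoup(html_file.read_text(errors="ignore"), "html.parser").get_text(" ")

# Hypothetical layout: training_data/positive/*.html and training_data/negative/*.html
docs, labels = [], []
for label in ("positive", "negative"):
    for f in Path("training_data", label).glob("*.html"):
        docs.append(page_text(f))
        labels.append(1 if label == "positive" else 0)

# TF-IDF features + linear SVM, in the spirit of ACHE's Model Builder component.
classifier = make_pipeline(TfidfVectorizer(stop_words="english"), LinearSVC())
classifier.fit(docs, labels)

# During the crawl, a fetched page can then be scored before its links enter the frontier.
def is_relevant(html: str) -> bool:
    return classifier.predict([BeautifulSoup(html, "html.parser").get_text(" ")])[0] == 1
```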


Table I. Positive & Negative Webpage Examples

Positive:
- IoT, Cloud, or Mobile: All Ripe for Exploit and Need Security's Attention | CSO Online
- New IoT Threat Exploits Lack of Encryption in Wireless Keyboards | eSecurity Planet
- Security Testing the Internet of Things: Dynamic testing (Fuzzing) for IoT security | Beyond Security

Negative:
- Scammers pose as CNN's Wolf Blitzer, target security professionals | CSO Online
- 19 top UEBA vendors to protect against insider threats and external attacks | eSecurity Planet
- Build your own cloud | SoftLayer

Table II. Regex-based Link Filters

- Whitelist: https://www.wilderssecurity.com/threads/* (crawl content only from the Threads section of the domain)
- Whitelist: https://blogs.oracle.com/security/* (crawl all articles about Security)
- Blacklist: https://www.wilderssecurity.com/members/* (crawl the entire domain apart from the Members area)
- Blacklist: https://www.securityforum.org/events/* (crawl the entire domain apart from Events)
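As a rough illustration of how whitelist/blacklist patterns such as those in Table II could be applied to candidate links, the sketch below uses plain wildcard-to-regex translation; it is a simplification (in practice such filters are applied per forum), not ACHE's own link-filter code.

```python
# Illustrative sketch of regex-based whitelist/blacklist link filtering
# (how patterns such as those in Table II could be applied; not ACHE's own filter code).
import re
from fnmatch import translate   # turns a "*" wildcard pattern into a regular expression

WHITELIST = [
    "https://www.wilderssecurity.com/threads/*",
    "https://blogs.oracle.com/security/*",
]
BLACKLIST = [
    "https://www.wilderssecurity.com/members/*",
    "https://www.securityforum.org/events/*",
]

_white = [re.compile(translate(p)) for p in WHITELIST]
_black = [re.compile(translate(p)) for p in BLACKLIST]

def should_follow(url: str) -> bool:
    """Keep a candidate link if it matches a whitelist pattern and no blacklist pattern."""
    if any(rx.match(url) for rx in _black):
        return False
    return any(rx.match(url) for rx in _white)

# Example: only the first of these URLs would be followed.
print(should_follow("https://www.wilderssecurity.com/threads/iot-botnets.12345/"))  # True
print(should_follow("https://www.wilderssecurity.com/members/some-user.678/"))      # False
```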
Social web crawler. This sub-component is used to perform in-depth crawls on specific selected forums on the social web, where the topic of discussion matches the given task. To this end, the social web crawler can be provided with links to discussion threads on IoT vulnerabilities and use them to traverse the forum structure and download all relevant discussions on the topic. Notice that, contrary to clear web crawling, in the forum crawl all links are considered relevant by default since they correspond to discussions over the given topic, so there is no need for utilising a page classification model. However, to filter out parts of the forum that are irrelevant or non-informative (e.g., user info pages), we employ regex-based link filters that may be applied within specific forums or in a cross-forum fashion. Table II contains some example link filters. Notice that this type of crawl is primarily used for thread/forum monitoring and is not aimed at identifying new websites.

Dark web crawler. This sub-component is used to perform in-depth crawls on specific websites on the dark web by utilising TOR proxies. To do so, the crawler is provided with a number of onion links that correspond to hacker forums or marketplaces selling cybercrime tools and zero-day vulnerabilities/exploits, and monitors the discussions for content of interest. To overcome user authentication mechanisms that are often in place in dark web forums/marketplaces, the dark web crawler requires an initial manual login. After a successful user authentication, the session cookies are stored and are utilised (via HTTP requests) in subsequent visits of the crawler to simulate user login. After each crawl completes its set interval, the crawled HTML pages are parsed by the content parser sub-component, which extracts the textual content along with useful metadata (e.g., the bitcoin value of sold cybercrime tools or user fame/activity/reputation level).
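The following is a hedged sketch of how stored session cookies and a TOR SOCKS proxy could be combined in plain HTTP requests to simulate the logged-in user; the onion URL, proxy port, and cookie file are placeholders, and this is not the paper's actual crawler code.

```python
# Illustrative sketch of re-using stored session cookies over a TOR SOCKS proxy
# (not the actual dark web crawler code; URL, port, and cookie values are placeholders).
import json
import requests   # requires the requests[socks] extra for SOCKS proxy support

# TOR usually exposes a local SOCKS5 proxy on port 9050; "socks5h" resolves .onion names via TOR.
TOR_PROXIES = {
    "http": "socks5h://127.0.0.1:9050",
    "https": "socks5h://127.0.0.1:9050",
}

# Cookies captured once, after a manual login to the forum/marketplace.
with open("session_cookies.json") as fh:
    saved_cookies = json.load(fh)      # e.g. {"session_id": "...", "csrf_token": "..."}

session = requests.Session()
session.proxies.update(TOR_PROXIES)
session.cookies.update(saved_cookies)  # simulate the logged-in user on subsequent visits

# Fetch a (placeholder) marketplace/forum page through TOR with the stored session.
response = session.get("http://exampleonionaddress.onion/forum/threads/iot-exploits")
html = response.text                   # handed to the content parser sub-component
```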
All content from the different (clear/social/dark) web crawling components is downloaded in its raw HTML format and stored in a NoSQL document store (MongoDB) for further processing, as discussed in the next section.

B. Content ranking and classification

Deciding whether a crawled website contains useful cyber-threat intelligence is a challenging task given the typically generic nature of many websites that discuss general security issues. To tackle this problem, we designed and implemented a novel content ranking module that assesses the relevance and usefulness of the crawled content. To do so, we represent the topic as a vocabulary distribution by utilising distributional vectors of related words; for example, a topic on IoT security could be captured by related words and phrases like "Mirai botnet", "IoT", or "exploit kits". Such salient phrases related to the topic may be obtained by un-/semi-supervised training of latent topic models over external datasets such as IoT and security related forums. In this way, we are able to capture semantic dependencies and statistical correlations among words for a given topic and represent them in a low-dimensional latent space by state-of-the-art latent topic models [1].

Since useful cyber-threat intelligence manifests itself in the form of cyber-security articles, user posts in security/hacker forums, or advertisement posts in cybercrime marketplaces, it can also be characterised as distributional vectors of salient words. Then, the similarity between the distributional vectors of harvested content and the given topic (i.e., IoT vulnerabilities) may be used to assess the content relevance to the topic.

To better capture the salient vocabulary that is utilised by users of the IoT security domain, we resorted to a number of different discussion forums within the Stack Exchange ecosystem to create a training dataset. To this end, we utilised the Stack Exchange Data Dump (https://archive.org/details/stackexchange) to get access to IoT and information security related discussion forums including Internet of Things, Information Security, Arduino, Raspberry Pi, and others. The utilised data dumps contain user discussions in Q&A form, including the text from posts, comments and related tags, and were used as input to the data preprocessor sub-component described below.

Data preprocessor. This sub-component is responsible for the data normalisation process, which involves a number of steps described in the following. The input data for the sub-component is typically XML-formatted, and at first an XML DOM parser is used to parse the data and keep only the useful part that contains user posts, comments, and tags. The parsed data are subsequently fed into the Normalizer, which performs typical normalisation (e.g., case folding, symbol removal) and anonymisation (e.g., username elimination) actions. Finally, the third step in the preprocessing phase (Multi-Word Expression tokenisation) includes the identification and characterisation of important multi-word terms (such as "exploit kits" or "Mirai botnet") in order to extend the functionality of the skip-gram model [1] to such terms.
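A minimal sketch of this kind of preprocessing pipeline is given below (XML parsing of a Stack Exchange Posts.xml dump, simple normalisation, and multi-word expression detection with Gensim's Phrases); the field names follow the public Stack Exchange dump schema, while the phrase-detection parameters are one plausible configuration rather than the paper's exact settings.

```python
# Minimal sketch of the preprocessing steps described above (XML parsing, normalisation,
# multi-word expression tokenisation); a simplification, not the paper's exact pipeline.
import re
import xml.etree.ElementTree as ET

from gensim.models.phrases import Phrases, Phraser
from gensim.utils import simple_preprocess   # case folding + simple tokenisation

def load_posts(path: str):
    """Yield the tokenised body text of each post in a Stack Exchange Posts.xml dump."""
    for row in ET.parse(path).getroot():
        body = row.attrib.get("Body", "")
        body = re.sub(r"<[^>]+>", " ", body)          # strip HTML tags from the post body
        yield simple_preprocess(body)                  # lowercasing, symbol removal

sentences = list(load_posts("Posts.xml"))

# Detect salient multi-word expressions (e.g. "mirai_botnet", "exploit_kits") so that
# the skip-gram model can later learn a single vector for each such phrase.
phrases = Phrases(sentences, min_count=5, threshold=10.0)
mwe_tokenizer = Phraser(phrases)
mwe_sentences = [mwe_tokenizer[s] for s in sentences]
```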
Model trainer. The preprocessed document corpus is subsequently utilised to train the language model [1]; this is done by using the Gensim open-source implementation of word2vec (based on a neural network for word relatedness), set up with a latent space of 150 dimensions, a training window of 5 words, a minimum occurrence of 1 term instance, and 10 parallel threads. The result of the training is a 150-dimensional distributional vector for each term that occurs at least once in the training corpus.

Topic vocabulary creator. To automatically extract the set of salient words that will be used to represent the topic, we utilised the extracted user tags and augmented them with the set of N most related terms in the latent space for each user tag; term relatedness was provided by the trained language models and the corresponding word vectors. Table III shows an example of the most relevant terms to the DDoS user tag, for N = 5, 10, and 15. The resulting (expanded) vocabulary is stored in a separate NoSQL document store (MongoDB).

Table III. Most Relevant Terms for Tag DDoS

#1 dos, #2 volumetric, #3 flooding, #4 cloudflare, #5 prolexic, #6 flood, #7 aldos, #8 floods, #9 ip spoofing, #10 radware, #11 slowloris, #12 botnet, #13 drdos, #14 blackholing, #15 amplification

Content ranking. To assess the relevance and usefulness of the crawled content we employ the content ranking sub-component; this component utilises the expanded vocabulary created in the previous phase to decide how similar a crawled post is to the topic by computing the similarity between the topic and post vectors. This is done as follows.

The topic vector \vec{T} is constructed as the sum of the distributional vectors of all the topic terms \vec{t}_i that exist in the topic vocabulary, i.e.,

    \vec{T} = \sum_{\forall i} \vec{t}_i

Similarly, the post vector \vec{P} is constructed as the sum of the distributional vectors of all the post terms \vec{w}_j that are present in the topic vocabulary. To promote the impact of words related to the topic at hand, we introduce a topic-dependent weighting scheme for post vectors in the spirit of [3]. Namely, for a topic T and a post containing the set of words \{w_1, w_2, \ldots\}, the post vector is computed as

    \vec{P} = \sum_{\forall j} \cos(\vec{w}_j, \vec{T}) \cdot \vec{w}_j

Finally, after both vectors have been computed, the relevance score r between the topic T and a post P is computed as the cosine similarity of their respective distributional vectors in the latent space:

    r = \cos(\vec{T}, \vec{P})
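The sketch below shows how this training and scoring could look with Gensim and NumPy under the settings stated above (150 dimensions, window 5, minimum count 1, 10 workers); it reuses the mwe_sentences token lists from the preprocessing sketch, the tag list is illustrative, and the parameter names follow recent Gensim versions (older releases use size instead of vector_size). It is a simplification, not the paper's exact code.

```python
# Sketch of the model training and relevance scoring described above
# (a simplification under the stated settings, not the paper's exact implementation).
import numpy as np
from gensim.models import Word2Vec

# Train the skip-gram model on the preprocessed Stack Exchange corpus
# (150-dimensional latent space, window of 5, minimum count 1, 10 worker threads).
model = Word2Vec(mwe_sentences, vector_size=150, window=5, min_count=1, workers=10, sg=1)
wv = model.wv

# Topic vocabulary: user tags expanded with the N most related terms per tag.
tags = ["ddos", "botnet", "exploit"]                 # illustrative tag set
vocabulary = set(tags)
for tag in tags:
    if tag in wv:
        vocabulary.update(term for term, _ in wv.most_similar(tag, topn=10))

def cos(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Topic vector: sum of the vectors of all vocabulary terms present in the model.
T = np.sum([wv[t] for t in vocabulary if t in wv], axis=0)

def relevance(post_tokens):
    """Weighted post vector and cosine relevance score, as in the formulas above."""
    P = np.zeros(150)
    for w in post_tokens:
        if w in vocabulary and w in wv:
            P += cos(wv[w], T) * wv[w]               # topic-dependent weighting
    return cos(T, P) if np.any(P) else 0.0
```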
Having computed a relevance score for every crawled post in the NoSQL datastore, the classification task of identifying relevant/useful posts is trivially reduced to either a thresholding or a top-k selection operation. Notice that the most/least relevant websites may also be used to reinforce the crawler model (i.e., as input to the Model Builder sub-component). Comparing the pros and cons of thresholding and top-k post selection in the context of assessing a continuous crawling task is an ongoing research effort.
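For illustration, both selection modes could be expressed against a MongoDB collection of scored posts as in the following pymongo snippet; the database, collection, and field names are hypothetical, not the paper's actual schema.

```python
# Illustration of the two selection modes over scored posts in MongoDB
# (database, collection, and field names are hypothetical placeholders).
from pymongo import DESCENDING, MongoClient

posts = MongoClient("mongodb://localhost:27017")["cti"]["crawled_posts"]

# Thresholding: keep every post whose relevance score exceeds a cut-off.
relevant_by_threshold = list(posts.find({"relevance_score": {"$gte": 0.75}}))

# Top-k selection: keep the k highest-scoring posts regardless of absolute score.
top_k = list(posts.find().sort("relevance_score", DESCENDING).limit(100))
```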
Figure 2. Web-based evaluation tool for collecting crowdsourced data from human experts

III. PRELIMINARY EVALUATION

In this section we present a preliminary evaluation of the architecture and the evaluation plan for a thorough investigation of each system component.

For the purposes of the evaluation, we ran a focused crawl with a model consisting of 7 positive and 7 negative URLs, a sample of which may be seen in Table I, and 21 seeds extracted by the SeedFinder component for the query "iot vulnerabilities". The crawl harvested around 22K websites per hour per thread on a commodity machine; around 10% of the frontier links were considered relevant according to the crawler model and were thus harvested.

In order to determine the quality of information within the harvested URLs that were considered relevant in the focused crawl, we developed a web-based evaluation tool (a snapshot of which is shown in Figure 2) for collecting crowdsourced data from human experts. Using this tool, human judges are provided with a random harvested website and are asked to assess its relevance on a 4-point scale; these assessments are expected to showcase the necessity of the language models and drive our classification task (i.e., help us identify appropriate thresholding/top-k values). Early results from a limited number of judgements on the clear web crawl show that about 1% of the harvested websites contain actionable cyber-threat intelligence, while (as expected) the percentage is higher for the social and dark web.

Besides the evaluation tool, we have also implemented a visualisation component for the computation of the relevance score for posts. This component highlights vocabulary words on the actual crawled posts, and displays the computed relevance score according to our model. A sample output of the component for a random post is shown in Table IV.

Table IV. Relevance Score Computation

Excerpt from: www.iotforall.com/5-worst-iot-hacking-vulnerabilities

"The Mirai Botnet (aka Dyn Attack) Back in October of 2016, the largest DDoS attack ever was launched on service provider Dyn using an IoT botnet. This lead to huge portions of the internet going down, including Twitter, the Guardian, Netflix, Reddit, and CNN. This IoT botnet was made possible by malware called Mirai. Once infected with Mirai, computers continually search the internet for vulnerable IoT devices and then use known default usernames and passwords to log in, infecting them with malware. These devices were things like digital cameras and DVR players."

Relevance Score: 0.8563855440900794

IV. RELATED WORK

Web crawlers, typically also known as robots or spiders, are tightly connected to information gathering from online sources. In this section we review the state-of-the-art in crawling by (i) outlining typical architectural alternatives that fit the crawling task and (ii) categorising the different crawler types based on the type of the targeted content.

A. Architectural typology

Depending on the crawling application, the available hardware, the desired scalability properties and the ability to scale up/out the existing infrastructure, related literature provides a number of architectural alternatives [4]–[6].

Centralized. Typically, special-purpose or small-scale crawlers follow a centralised architecture [4]; the page downloading, the URL manipulation, and the page storage modules reside on a single machine. This centralised architecture is, naturally, easier to implement, simpler to deploy, and straightforward to administer, but is limited by the capabilities of the hardware and thus cannot scale well. For this reason, the more sophisticated crawler designs put effort into scaling out, i.e., exploiting the inherently distributed nature of the web and adopting some form of decentralisation.

Hybrid. Hybrid crawler architectures (e.g., [7]) are the norm in the architectural typology, as they aim for a conceptually simple design that involves distributing some of the processes, while keeping others centralised. In such architectures, the page downloading module is typically distributed, while URL management data structures and modules are maintained at a single machine for consistency. Such designs aim at harnessing the control of a centralised architecture and the scalability of a distributed system; however, the centralised component usually acts as a bottleneck for the crawling procedure and represents a single point of failure.

Parallel/Distributed. A parallel crawler [8], [9] consists of multiple crawling processes (usually referred to as C-procs in crawler jargon), where each such process performs all the basic tasks of a crawler. To benefit from the parallelisation of the crawling task the frontier is typically split among the different processes, while to minimise overlap in the crawled space, links and other metadata are communicated between the processes. When all processes run on the same LAN we refer to an intra-site parallel crawler, while when C-procs run at geographically distributed locations connected by a WAN (or the Internet) we refer to a distributed crawler.

Peer-to-peer. The advent of peer-to-peer computing almost two decades ago introduced peer-to-peer search engines like Minerva [10]; this in turn gave rise to the concept of peer-to-peer crawlers [11], [12]. Peer-to-peer crawlers constitute a special form of distributed crawlers that are typically targeted to be run on machines at the edge of the Internet, as opposed to their distributed counterparts that are designed for clusters and server farms. To this end, peer-to-peer crawlers are lightweight processes that emphasise crawl personalisation and demonstrate large-scale collaboration, usually by means of an underlying distributed routing infrastructure [11].

Cloud-based. Lately, the requirement for more effective use of resources by means of elasticity gave rise to a new crawler paradigm: the cloud-based crawlers [13], [14], which revived known machinery with a renewed scope, versatility and options. Such architectures use cloud computing features alongside big data solutions like Map/Reduce and NoSQL databases, to allow for resource-adaptable web crawling and to serve the modern Data-as-a-Service (DaaS) concept.

B. Usage typology

Although the first web crawlers that set the pathway for the spidering technology were developed for the clear (or surface) web, in the course of time specialised solutions aiming at the different facets (social, deep, dark) of the web were gradually introduced. Below we organise crawlers in terms of their intended usage.

Clear/surface web. Since the introduction of the first crawler in 1993, the majority of the research work on crawlers has focused on the crawling of the surface web, initially on behalf of search engines, and gradually also for other tasks. There is an abundance of work on clear web crawling; some insightful surveys include [4], [15].

Web 2.0. The advent of the user-generated content philosophy and the participatory culture of Web 2.0 sites like blogs, forums and social media formed a new generation of specialised crawlers that focused on forum [16]–[19], blog/microblog [20]–[22], and social media [23], [24] spidering. The need for specialised crawlers for these websites emerged from (i) the content quality, (ii) the inherent structure that is present in forums/blogs, and (iii) the implementation particularities (e.g., Javascript-generated URLs) that make other crawler types inapplicable or inefficient.

Deep/Dark/Hidden web. The amount of information and the inherent interest in data that reside out of reach of major search engines, hidden either behind special access websites (deep web) or anonymisation networks like Tor (https://www.torproject.org/) and I2P (https://geti2p.net/) (dark web), gave rise to specialised crawlers [25]. To this end, over the last ten years, a number of works related to deep web crawling have been published [26]–[28], investigating also (i) different architectural [29], [30] and automation [31] options, (ii) quality issues such as sampling [32], query-based exploration [33], and duplicate elimination [34], and (iii) new application domains [35], [36].
Cloud. Finally, the elasticity of resources and the popularity of cloud-based services inspired a relatively new line of research focusing on crawler-based service configuration [37] and discovery in cloud environments [38].

V. OUTLOOK

We are currently working on a thorough evaluation of our architecture. Our future research plans involve the extraction of cyber-threat intelligence from the relevant harvested content, by utilising natural language understanding for named entity recognition/disambiguation.

REFERENCES

[1] T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean, "Distributed representations of words and phrases and their compositionality," in NIPS, 2013.
[2] K. Vieira, L. Barbosa, A. S. Silva, J. Freire, and E. Moura, "Finding seeds to bootstrap focused crawlers," WWW, 2016.
[3] J. Biega, K. Gummadi, I. Mele, D. Milchevski, C. Tryfonopoulos, and G. Weikum, "R-Susceptibility: An IR-centric approach to assessing privacy risks for users in online communities," in ACM SIGIR, 2016.
[4] M. Najork, "Web crawler architecture," in Encyclopedia of Database Systems, 2009.
[5] J. M. Hsieh, S. D. Gribble, and H. M. Levy, "The architecture and implementation of an extensible web crawler," in USENIX NSDI, 2010.
[6] A. Harth, J. Umbrich, and S. Decker, "Multicrawler: A pipelined architecture for crawling and indexing semantic web data," in ISWC, 2006.
[7] V. Shkapenyuk and T. Suel, "Design and implementation of a high-performance distributed web crawler," in ICDE, 2002.
[8] F. Ahmadi-Abkenari and A. Selamat, "An architecture for a focused trend parallel web crawler with the application of clickstream analysis," Inf. Sci., vol. 184, 2012.
[9] D. L. Quoc, C. Fetzer, P. Felber, E. Rivière, V. Schiavoni, and P. Sutra, "Unicrawl: A practical geographically distributed web crawler," in IEEE CLOUD, 2015.
[10] C. Zimmer, C. Tryfonopoulos, and G. Weikum, "Minervadl: An architecture for information retrieval and filtering in distributed digital libraries," in ECDL, 2007.
[11] O. Vikas, N. J. Chiluka, P. K. Ray, G. Meena, A. K. Meshram, A. Gupta, and A. Sisodia, "Webminer: Anatomy of super peer based incremental topic-specific web crawler," in ICN, 2007.
[12] B. Bamba, L. Liu, J. Caverlee, V. Padliya, M. Srivatsa, T. Bansal, M. Palekar, J. Patrao, S. Li, and A. Singh, "Dsphere: A source-centric approach to crawling, indexing and searching the world wide web," in ICDE, 2007.
[13] K. Gupta, V. Mittal, B. Bishnoi, S. Maheshwari, and D. Patel, "AcT: Accuracy-aware crawling techniques for cloud-crawler," WWW, 2016.
[14] Y. Li, L. Zhao, X. Liu, and P. Zhang, "A security framework for cloud-based web crawling system," in WISA, 2014.
[15] K. S. McCurley, "Incremental crawling," in Encyclopedia of Database Systems, 2009.
[16] J. Jiang, N. Yu, and C. Lin, "Focus: Learning to crawl web forums," in WWW, 2012.
[17] J. Yang, R. Cai, C. Wang, H. Huang, L. Zhang, and W. Ma, "Incorporating site-level knowledge for incremental crawling of web forums: A list-wise strategy," in ACM SIGKDD, 2009.
[18] Y. Wang, J. Yang, W. Lai, R. Cai, L. Zhang, and W. Ma, "Exploring traversal strategy for web forum crawling," in ACM SIGIR, 2008.
[19] R. Cai, J. Yang, W. Lai, Y. Wang, and L. Zhang, "irobot: An intelligent crawler for web forums," in WWW, 2008.
[20] M. Hurst and A. Maykov, "Social streams blog crawler," in ICDE, 2009.
[21] S. Agarwal and A. Sureka, "A topical crawler for uncovering hidden communities of extremist micro-bloggers on tumblr," in WWW, 2015.
[22] R. Ferreira, R. Lima, J. Melo, E. Costa, F. L. G. de Freitas, and H. P. L. Luna, "Retriblog: A framework for creating blog crawlers," in ACM SAC, 2012.
[23] F. Buccafurri, G. Lax, A. Nocera, and D. Ursino, "Crawling social internetworking systems," in ASONAM, 2012.
[24] A. Khan and D. K. Sharma, "Self-adaptive ontology based focused crawler for social bookmarking sites," IJIRR, 2017.
[25] G. Valkanas, A. Ntoulas, and D. Gunopulos, "Rank-aware crawling of hidden web sites," in WebDB, 2011.
[26] Y. Wang, J. Lu, J. Chen, and Y. Li, "Crawling ranked deep web data sources," WWW, vol. 20, no. 1, 2017.
[27] F. Zhao, J. Zhou, C. Nie, H. Huang, and H. Jin, "Smartcrawler: A two-stage crawler for efficiently harvesting deep-web interfaces," IEEE TSC, 2016.
[28] Q. Zheng, Z. Wu, X. Cheng, L. Jiang, and J. Liu, "Learning to crawl deep web," Inf. Syst., vol. 38, no. 6, 2013.
[29] J. Zhao and P. Wang, "Nautilus: A generic framework for crawling deep web," in ICDKE, 2012.
[30] Y. Li, Y. Wang, and E. Tian, "A new architecture of an intelligent agent-based crawler for domain-specific deep web databases," in IEEE/WIC/ACM WI, 2012.
[31] T. Furche, G. Gottlob, G. Grasso, C. Schallhart, and A. J. Sellers, "Oxpath: A language for scalable data extraction, automation, and crawling on the deep web," VLDB J., 2013.
[32] J. Lu, Y. Wang, J. Liang, J. Chen, and J. Liu, "An approach to deep web crawling by sampling," in IEEE/WIC/ACM WI, 2008.
[33] J. Liu, Z. Wu, L. Jiang, Q. Zheng, and X. Liu, "Crawling deep web content through query forms," in WEBIST, 2009.
[34] L. Jiang, Z. Wu, Q. Feng, J. Liu, and Q. Zheng, "Efficient deep web crawling using reinforcement learning," in PAKDD, 2010.
[35] Y. Li, Y. Wang, and J. Du, "E-FFC: An enhanced form-focused crawler for domain-specific deep web databases," JIIS, 2013.
[36] Y. He, D. Xin, V. Ganti, S. Rajaraman, and N. Shah, "Crawling deep web entity pages," in ACM WSDM, 2013.
[37] M. Menzel, M. Klems, H. A. Lê, and S. Tai, "A configuration crawler for virtual appliances in compute clouds," in IEEE IC2E, 2013.
[38] T. H. Noor, Q. Z. Sheng, A. Alfazi, A. H. H. Ngu, and J. Law, "CSCE: A crawler engine for cloud services discovery on the world wide web," in IEEE ICWS, 2013.
