
www.sciedu.ca/air Artificial Intelligence Research, 2013, Vol. 2, No. 1

ORIGINAL RESEARCH

Exploiting web scraping in a collaborative filtering-based approach to web advertising

Eloisa Vargiu 1,2, Mirko Urru 1

1. Dipartimento di Matematica e Informatica, Università di Cagliari, Italy. 2. Barcelona Digital Technology Centre, Spain

Correspondence: Eloisa Vargiu. Address: Barcelona Digital Technology Center, Italy. Email: evargiu@bdigital.org.

Received: June 30, 2012 Accepted: August 6, 2012 Online Published: December 5, 2012
DOI: 10.5430/air.v2n1p44 URL: http://dx.doi.org/10.5430/air.v2n1p44

Abstract
Web scraping is the set of techniques used to automatically get information from a website instead of manually
copying it. The goal of a Web scraper is to look for certain kinds of information, extract it, and aggregate it into new Web
pages. In particular, scrapers focus on transforming unstructured data and saving it in structured databases. In this
paper, among other kinds of scraping, we focus on those techniques that extract the content of a Web page. In particular,
we adopt scraping techniques in the Web advertising field. To this end, we propose a collaborative filtering-based Web
advertising system aimed at finding the most relevant ads for a generic Web page by exploiting Web scraping. To illustrate
how the system works in practice, a case study is presented.

Keywords
Web advertising, Collaborative filtering, Web scraping

1 Introduction
Web Advertising is an emerging research field, at the intersection of information retrieval, machine learning, optimization,
and microeconomics. It is one of the major sources of income for a large number of websites. Its main goal is to suggest
products and services to the ever growing population of Internet users.

There are two primary channels for distributing ads: sponsored search (or paid search advertising) and contextual
advertising (or content match). Sponsored search advertising displays ads on the page returned from a Web search engine
following a query, whereas contextual advertising displays ads within the content of a generic, third-party Web page. A
commercial intermediary, namely an ad network, is usually in charge of optimizing the selection of ads with the twofold goal
of increasing revenue and improving user experience. The ads are selected and served by automated systems based on the
content displayed to the user.

Web scraping (also called Web harvesting or Web data extraction) is a software technique aimed at extracting information
from websites [1]. Usually, Web scrapers simulate human exploration of the World Wide Web by either implementing
the low-level hypertext transfer protocol or embedding suitable Web browsers. Web scraping is closely related to Web
indexing, which is an information retrieval technique adopted by several search engines to index information on the Web
through a bot. In contrast, Web scraping focuses on the transformation of unstructured data on the Web, typically in
HTML format, into structured data that can be stored and analyzed in a central local database or spreadsheet. Web scraping
is also related to Web automation [2], which simulates human Web browsing using computer software. Web scraping is
currently used for online price comparison, weather data monitoring, website change detection, Web research, Web
mashups, and Web data integration. Several (commercial) software tools, aimed at personalizing websites by adopting
scraping techniques, are currently available.

ISSN 1927-6974 E-ISSN 1927-6982

In this paper, we present a collaborative filtering-based Web advertising system that exploits Web scraping techniques to
suggest suitable ads to a given Web page. In particular, we address Web advertising as an information filtering task and
devise our Web advertising system by exploiting collaborative filtering [3]. The proposed system first exploits
collaborative filtering and subsequently relies on Web scraping to extract the ads to be suggested. The idea of exploiting
collaborative filtering in Web advertising was proposed by Armano & Vargiu [4] and also adopted in Armano et
al. [5]. To the best of our knowledge, this is the first attempt to adopt Web scraping techniques to perform Web advertising. The
underlying motivation for adopting Web scraping is that ad datasets are available only to companies that operate
advertising systems (e.g., Yahoo!, Google, or Microsoft) and not for academic purposes. When no ad dataset is available,
this unsupervised approach can be adopted instead of building an ad-hoc dataset by hand.

The rest of the paper is organized as follows. Section 2 summarizes the main work on Web advertising, collaborative
filtering and Web scraping. In Section 3, we illustrate our proposed Web advertising system, focusing on its architecture
and the exploitation of collaborative filtering and Web scraping. In Section 4, we illustrate a case study aimed at
highlighting the effectiveness of the proposed approach. Section 5 ends the paper with conclusions and future research
directions.

2 Background
In this Section, we give a brief overview on the main topics addressed in this paper: Web advertising, collaborative
filtering, and Web scraping.

2.1 Web advertising
From the beginning of the Web era, companies have put graphical banner ads on Web pages at popular websites [6]. Banner
advertising is a form of Web advertising that entails embedding an ad into a Web page [7]. It is intended to attract traffic to
a website by linking it to the website of the advertiser. The ad is constructed from an image (GIF, JPEG, PNG), a JavaScript
program, or a multimedia object. Moreover, banner ads often employ animation, sound, or video to maximize presence. The
primary purpose of these ads was branding, i.e., to convey to the viewer a positive feeling about the brand of the company
placing the ad. These ads were typically priced on a cost per mille basis, i.e., the cost to the company of having its banner ad
displayed 1000 times. Some websites made contracts with their advertisers in which an ad was priced not by the number of
times it was displayed, but rather by the number of times it was clicked on by the user (cost per click model). In such cases,
clicking on the ad leads the user to a Web page set up by the advertiser, where the user is induced to make a purchase. Here,
the goal of the ad is not so much brand promotion as to induce a transaction.

To formulate the Web advertising problem, let P be the set of Web pages and let A be the set of ads that can be displayed.
The revenue of the network, given a page p, can be estimated as:

$$\mathrm{revenue}(p) = \sum_{i=1}^{k} \Pr(\mathrm{click} \mid p, a_i) \cdot \mathrm{price}(a_i, i) \qquad (1)$$

where k is the number of ads displayed on page p and price(ai, i) is the click-price of the current ad ai at position i. The
price in this model depends on the set of ads presented on the page. Several models have been proposed to determine the
price, most based on generalizations of second price auctions. For the sake of simplicity, we ignore the pricing model and
reformulate the Web advertising problem as follows: for each page p ∈ P we want to select the ad a' ∈ A that maximizes the
click probability. Formally:

$$\forall p \in P : \; a' = \arg\max_{a \in A} \Pr(\mathrm{click} \mid p, a) \qquad (2)$$

Published by Sciedu Press

An alternative way to formulate the problem is the following: let P be the set of Web pages and let A be the set of ads that can
be displayed. Let f be a utility function that measures the matching of a to p, i.e., f : P × A → R, where R is a totally ordered
set (e.g., non-negative integers or real numbers within a certain range). Then, for each page p ∈ P we want to select a' ∈ A
that maximizes the page utility function. More formally:

$$\forall p \in P : \; a' = \arg\max_{a \in A} f(p, a) \qquad (3)$$

Note that in this case the utility function f can also be viewed as an estimation of the probability that the corresponding ad
be clicked.
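
As an illustration of Equations (1) and (3), the following minimal Python sketch estimates the revenue of a page and selects the ad with the highest utility. The click probabilities, the position-based pricing rule, and all names (expected_revenue, best_ad, CLICK_PROB) are hypothetical and are not part of the proposed system; here the utility f is simply taken to be the estimated click probability.

```python
from typing import Callable, Sequence

# Hypothetical estimates of Pr(click | p, a) for one page and three ads.
CLICK_PROB = {
    ("events-page", "concert-tickets"): 0.05,
    ("events-page", "local-restaurant"): 0.03,
    ("events-page", "car-insurance"): 0.01,
}


def click_prob(page: str, ad: str) -> float:
    """Toy estimate of Pr(click | p, a); unknown pairs get probability 0."""
    return CLICK_PROB.get((page, ad), 0.0)


def price(ad: str, position: int) -> float:
    """Hypothetical click-price that decreases with the position of the ad."""
    return 1.0 / position


def expected_revenue(page: str, shown_ads: Sequence[str],
                     prob: Callable[[str, str], float],
                     cost: Callable[[str, int], float]) -> float:
    """Equation (1): sum over positions i of Pr(click | p, a_i) * price(a_i, i)."""
    return sum(prob(page, ad) * cost(ad, i)
               for i, ad in enumerate(shown_ads, start=1))


def best_ad(page: str, ads: Sequence[str],
            utility: Callable[[str, str], float]) -> str:
    """Equation (3): the ad a in A that maximizes f(p, a)."""
    return max(ads, key=lambda ad: utility(page, ad))


ads = ["car-insurance", "concert-tickets", "local-restaurant"]
top = best_ad("events-page", ads, click_prob)
revenue = expected_revenue("events-page",
                           ["concert-tickets", "local-restaurant"],
                           click_prob, price)
```

With these toy numbers, best_ad picks the ad with the largest estimated click probability, and expected_revenue weights each displayed ad by its position-dependent price.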

A significant part of Web advertising consists of textual ads, the ubiquitous short text messages usually marked as
sponsored links. There are two primary channels for distributing ads: sponsored search (or paid search advertising) and
contextual advertising (or content match). Sponsored search advertising displays ads on the page returned from a Web
search engine following a query. It can be thought of as a document retrieval problem, where ads are documents to be
retrieved in response to a query. Ads could be represented in part by their keywords. Carrasco et al. [8] approached the
problem of keyword suggestion by clustering bipartite advertiser-keyword graphs. Joachims [9] proposed to use click-data
to learn ranking functions for results of a search engine as an indicator of relevance. Ciaramita et al. [10] studied an
approach to learn and evaluate sponsored search systems based solely on click-data, focusing on the relevance of textual
content. Contextual advertising is a form of targeted advertising for ads appearing on Websites or other media, such as
content displayed in mobile browsers. The ads themselves are selected and served by automated systems based on the
content displayed to the user. Ribeiro-Neto et al. [11] examined a number of strategies to match pages and ads based on
extracted keywords. In a subsequent work, Lacerda et al. [12] proposed a method to learn the impact of individual features
using genetic programming. Broder et al. [13] classified both pages and ads into a given taxonomy and matched ads to the
page falling into the same node of the taxonomy. Starting from that work, Armano et al. [14] proposed a semantic
enrichment by adopting concepts. Furthermore, modern contextual advertising systems use text summarization techniques
in conjunction with the model developed in Broder et al. [13], see, for instance Anagnostopoulos et al. [15], Armano et al. [16],
Armano et al. [17]. Since bid phrases are basically search queries, another relevant approach is to view contextual
advertising as a problem of query expansion and rewriting [18, 19].

2.2 Collaborative filtering

Collaborative filtering consists of automatically making predictions (filtering) about the interests of a user by collecting
preferences or tastes from similar users (collaboration); the underlying idea is that those who agreed in the past tend to
agree again in the future.

Several collaborative filtering systems have been developed to suggest items and goods, including news, photos, people,
and books [20]. Collaborative filtering systems try to predict the utility of items for a particular user based on the items
previously rated by other users. Many collaborative recommender systems have been developed in academia and
in industry. Among others, let us recall the Grundy system [21], GroupLens [22], Video Recommender [23], and Ringo [24],
which were the first systems to use collaborative filtering algorithms to automate recommendation. Other
examples of collaborative recommender systems are the book recommender system from Amazon.com [25] and the
PHOAKS system, which helps people find relevant information on the Web [26]. In particular, collaborative recommender
systems are the most suitable when the items to be recommended are multimedia with scarce descriptions but rated by a
community of users [27].

Let us note that, analogously to the Web advertising problem, the recommendation problem can be formulated as follows:
let U be the set of all users and let I be the set of all possible items that can be recommended, e.g., books, movies, and
restaurants (this set can be very large, ranging from hundreds of thousands to millions of items in some applications). Let f be
a utility function that measures the usefulness of item i for user u, i.e., f : U × I → R, where R is a totally ordered set (e.g.,
non-negative integers or real numbers within a certain range). Then, for each user u ∈ U we want to choose the item i' ∈ I
that maximizes the user's utility function. More formally:

$$\forall u \in U : \; i' = \arg\max_{i \in I} f(u, i) \qquad (4)$$

in which the utility function is typically represented by ratings and is initially defined only on items previously rated by the
users. For example, in a book recommendation application (e.g., Amazon.com), users initially rate some subsets of books
they have read.
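
To make Equation (4) concrete, the sketch below estimates the utility f(u, i) of unrated items with a user-based neighbourhood scheme (cosine similarity between users, then a similarity-weighted average of their ratings). This is one common way to realize collaborative filtering and is an assumption made for illustration, not necessarily the algorithm used by the systems cited above; the ratings and the names RATINGS, cosine, predict, and recommend are hypothetical.

```python
from math import sqrt

# Hypothetical ratings: user -> {item: rating}; absent items are unrated.
RATINGS = {
    "ann": {"book_a": 5, "book_b": 3, "book_c": 4},
    "bob": {"book_a": 5, "book_b": 3, "book_d": 5},
    "cat": {"book_a": 1, "book_b": 5, "book_d": 2, "book_e": 3},
}


def cosine(u, v):
    """Cosine similarity computed over the items rated by both users."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    norm_u = sqrt(sum(u[i] ** 2 for i in common))
    norm_v = sqrt(sum(v[i] ** 2 for i in common))
    return dot / (norm_u * norm_v)


def predict(ratings, user, item):
    """Estimate f(u, i) as a similarity-weighted average of other users' ratings."""
    num = den = 0.0
    for other, their in ratings.items():
        if other == user or item not in their:
            continue
        sim = cosine(ratings[user], their)
        num += sim * their[item]
        den += abs(sim)
    return num / den if den else 0.0


def recommend(ratings, user):
    """Equation (4): the unrated item with the highest estimated utility."""
    all_items = {i for r in ratings.values() for i in r}
    candidates = sorted(all_items - set(ratings[user]))
    if not candidates:
        return None
    return max(candidates, key=lambda i: predict(ratings, user, i))


suggestion = recommend(RATINGS, "ann")
```

In this toy data, ann agrees closely with bob on the books they both rated, so bob's high rating of book_d dominates the prediction for ann.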

According to the analogy in those definitions, a few works that use collaborative filtering to perform Web advertising have
been proposed. As for sponsored search, Anastasakos et al. [28] proposed a technique to determine the relevance of an ad
document for a search query using click-through data. In their work, collaborative filtering is exploited to discover new ads
related to a query using a click graph. As for contextual advertising, Armano & Vargiu [4] proposed to study the problem of
contextual advertising as a recommendation problem, and vice versa.

2.3 Web scraping
Nowadays, many researchers are working on extracting information about types of events, entities, or relationships
from textual data. Information extraction is used for search engines, news libraries, manuals, domain-specific text, or
dictionaries. A form of information extraction is text mining, an information retrieval task aimed at discovering new,
previously unknown information by automatically extracting it from different text resources [29]. In information
extraction, text mining is used to scrape relevant information out of text files by relying on linguistic and statistical
algorithms.

Web search and information extraction are typically performed by Web crawlers. A Web crawler is a program or automated
script that browses the WWW in a methodical, automated manner [30]. A more recent variant of Web crawlers are Web
scrapers, which look for certain kinds of information (such as prices of particular goods from various
online stores), extract it, and aggregate it into new Web pages [31].

Scrapers are basically adopted to transform unstructured data and save it in structured databases. In screen scraping, a
special form of scraping, a program extracts information from the display output of another program [32]. The difference
from a normal scraper is that the scraped output was created for the end user and not for other programs. In this
paper, we focus on Web scrapers that extract textual information from Web pages. There are many methods to scrape
information from the Web [33]. Since barriers to prevent machine automation are not effective against humans, the most
effective method is human copy-paste. Although sometimes this is the only way to export information from a Web page,
it is not feasible in practice, especially for big company projects, being too expensive. Another method is text grepping,
in which regular expressions are used to find information that matches some patterns. Further Web scraping techniques are
HTTP programming, DOM parsing, and HTML parsers. Finally, a Web scraping method consists of making scraper sites
that are automatically generated from other Web pages by scraping their content [34].
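
As a small illustration of text grepping, the following Python sketch uses a regular expression to pull prices out of an HTML fragment. The fragment, the pattern, and the function name grep_prices are hypothetical and chosen only for the example; real pages usually need more robust patterns or a proper HTML parser.

```python
import re

# Hypothetical HTML fragment listing two priced items.
HTML = """
<ul>
  <li class="item">Concert ticket <span class="price">EUR 25.00</span></li>
  <li class="item">Theatre ticket <span class="price">EUR 18.50</span></li>
</ul>
"""

# Matches the numeric part of a price inside <span class="price">...</span>.
PRICE_PATTERN = re.compile(
    r'<span class="price">\s*EUR\s*(\d+(?:\.\d{2})?)\s*</span>'
)


def grep_prices(html):
    """Return every price found in the HTML text as a float."""
    return [float(match) for match in PRICE_PATTERN.findall(html)]


prices = grep_prices(HTML)
```

The pattern depends on the exact markup of the page, which is why grepping tends to break when a site changes its layout.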

It is worth noting that Web scraping may be against the terms of use of some websites. Being interested in the scientific
issues concerned with the adoption of Web scraping to perform Web advertising, in this paper we do not take into account
legal issues on adopting and implementing Web scraping techniques.

3 The proposed approach to web advertising

Our proposal is to exploit Web scraping to suggest suitable ads to a given Web page. To this end, we address Web
advertising as a collaborative filtering task. In Web advertising, given a Web page p, the relevant information consists of the ads related to
p (i.e., to its content), and the third party is p itself, since it will display the filtered ads. Thus, we propose a Web advertising
system that relies on collaborative filtering and exploits scraping techniques to analyze the page content. In particular, we
decided to apply collaborative filtering first and to subsequently rely on Web scraping to perform a content-based
analysis. In so doing, given a Web page p, the collaborative filtering module exploits the collaboration of p by retrieving a
subset of its peer pages. The content of the retrieved peer pages is then analyzed by adopting Web scraping techniques.

For the sake of rapid prototyping, the system has been implemented in Python. The implemented modules and their
connections are depicted in Figure 1. As shown, the system is composed of three modules: (i) inlink extractor, (ii) ad
extractor, and (iii) ad selector.

Figure 1. The architecture of the proposed system.

Inlink extractor. This module is devoted to finding, given a page p, the peer pages. Suitable peer pages appear to be all the
inlinks of p (also called backlinks), i.e., all pages that link to p. The inlink extractor collects the first 10 inlinks of a given
page by relying on the Yahoo! Site Explorer. The adopted collaborative filtering approach is illustrated in Section 3.1.

The inlink extractor gives as output the list of the 10 extracted peer pages that will be scraped by the ad extractor in order
to identify the most related ads.

Ad extractor. This module is aimed at extracting banner ads from the peer pages. To this end, we rely on Web scraping,
i.e., a set of techniques used to automatically get information from a website instead of manually copying it. In
particular, we adopt scraping techniques to: (i) access tags as object members; (ii) find tags whose name, contents, or
attributes match some selection criteria; and (iii) access tag attributes by using a dictionary-like syntax. Scraping is
performed by using two Python libraries, HTMLParser and Beautiful Soup. The adopted Web scraping techniques
are explained in detail in Section 3.2.

The ad extractor module gives as output an ordered set of the extracted banners, together with the corresponding URLs and
their descriptions. This set is analyzed by the ad selector, which selects the three banners to be inserted in the original
webpage.

Ad selector. This module is aimed at selecting suitable banner ads from the set extracted by the ad extractor. To this end,
several policies might be applied. In the current version of the system, we decided to adopt a random mechanism that selects
and provides three ads.
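
The random policy of the ad selector can be sketched in a few lines of Python. This is only an illustrative sketch: the function name select_ads, the seed parameter (added to make runs reproducible), and the removal of duplicate banners before sampling are assumptions, not details given for the actual system.

```python
import random


def select_ads(extracted_ads, how_many=3, seed=None):
    """Randomly pick up to how_many distinct ads from the extracted set."""
    rng = random.Random(seed)
    # Drop duplicates while keeping the extraction order (an assumption).
    unique = list(dict.fromkeys(extracted_ads))
    return rng.sample(unique, min(how_many, len(unique)))


# Hypothetical (brand URL, banner URL) pairs produced by the ad extractor.
extracted = [
    ("http://brand-a.example/", "http://ads.example/a.png"),
    ("http://brand-b.example/", "http://ads.example/b.png"),
    ("http://brand-c.example/", "http://ads.example/c.png"),
    ("http://brand-d.example/", "http://ads.example/d.png"),
    ("http://brand-a.example/", "http://ads.example/a.png"),
]
chosen = select_ads(extracted, seed=0)
```

Other policies, for example ranking the banners by how often they occur across the peer pages, could replace the call to rng.sample without changing the interface.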

Let us note that how to insert the banner in the given page is outside the scope of this paper, since it depends on the server
that provides that page.

3.1 Collaborative filtering to extract similar pages

We exploit collaborative filtering to select the most relevant pages related to a given page p. The underlying idea is that a
page p1 links to p (i.e., it is an inlink of p) if the topics of p1 are related to the topics of p [35]. Vice versa, p links to a page p2 (i.e.,
p2 is an outlink of p) if their topics are related in some way.

Figure 2 gives a view of all the possible kinds of related links. Two kinds of inlinks and outlinks may exist: those that link
to an external domain (i.e., from A and to B in the Figure) and those that link to the same domain as the target webpage
(i.e., from T1 and to T2 in the Figure). In this work, we consider only inlinks belonging to different domains. In other
words, we disregard outlinks and inlinks that come from the same Web domain, since, statistically speaking, inlinks are
more informative than outlinks [36].
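
The restriction to inlinks from different domains can be sketched as follows. In this minimal Python example, two URLs are considered to belong to the same domain when their host names coincide after removing a leading "www."; this is a simplifying assumption (it does not handle other subdomains), and the function names and the example inlinks are hypothetical.

```python
from urllib.parse import urlparse


def host(url):
    """Lower-cased host name of a URL, without a leading 'www.'."""
    hostname = urlparse(url).hostname or ""
    return hostname[4:] if hostname.startswith("www.") else hostname


def external_inlinks(target_url, inlinks, limit=10):
    """Keep at most `limit` inlinks whose host differs from the target's host."""
    target = host(target_url)
    kept = [url for url in inlinks if host(url) and host(url) != target]
    return kept[:limit]


# Hypothetical inlinks of the target page.
inlinks = [
    "http://crastulo.it/eventi",         # same domain: discarded
    "http://www.example.org/links",      # external: kept
    "http://blog.example.com/post?id=1"  # external: kept
]
peers = external_inlinks("http://www.crastulo.it/", inlinks)
```

The limit of 10 mirrors the number of inlinks collected by the inlink extractor.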

Figure 2. A graphical view of related links.

3.2 Web scraping to extract banner ads
The ad extractor module takes as input all the extracted inlinks and analyzes them to extract the information related to all
the embedded ads. In particular, in this work we are interested in extracting banner ads. Thus, the module looks for HTML
anchor tags (<a>) and selects those that refer to an image:

<a href="brand_url"><img src="banner_ad_url" /></a>

where brand_url is the URL of a company or a service to be advertised and banner_ad_url is the URL of the image containing
the ad (i.e., the banner).

To extract the HTML code, the ad extractor relies on Web scraping through two specialized libraries: HTMLParser and
Beautiful Soup. HTMLParser defines a class HTMLParser that serves as the basis for parsing text files formatted in
HTML and XHTML. The class is instantiated without arguments; its instance is fed HTML data and calls handler
functions when tags begin and end. The class is meant to be subclassed by the user, who overrides the handler methods to
provide the desired behavior.
BeautifulSoup is a Python library that parses broken HTML. BeautifulSoup is not a real HTML parser but uses regular
expressions to dive through tag soup. The main features of BeautifulSoup are: it yields a parse tree that makes
approximately as much sense as the original document, even when the programmer gives it bad markup; it provides a few
simple methods and Pythonic idioms for navigating, searching, and modifying a parse tree; and it automatically converts
incoming documents to Unicode and outgoing documents to UTF-8.
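
The handler-based style of HTMLParser can be illustrated with a short, self-contained sketch. It uses the Python 3 standard-library module html.parser (the successor of the HTMLParser module named above) and collects the anchors that enclose an image, i.e., candidate banners. The class name BannerParser and the sample markup are hypothetical, and the sketch is not the code of the actual ad extractor.

```python
from html.parser import HTMLParser


class BannerParser(HTMLParser):
    """Collects (brand_url, banner_ad_url) pairs from <a><img></a> patterns."""

    def __init__(self):
        super().__init__()
        self.current_href = None  # href of the anchor currently open, if any
        self.banners = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "a":
            self.current_href = attributes.get("href")
        elif tag == "img" and self.current_href and attributes.get("src"):
            # An image inside an anchor is treated as a banner ad.
            self.banners.append((self.current_href, attributes["src"]))

    def handle_endtag(self, tag):
        if tag == "a":
            self.current_href = None


parser = BannerParser()
parser.feed(
    '<p><a href="http://brand.example/">'
    '<img src="http://ads.example/banner.png" /></a>'
    '<a href="http://other.example/">plain text link</a>'
    '<img src="logo.png"></p>'
)
banners = parser.banners
```

Only the first anchor is reported: the second one contains no image, and the final image is not enclosed in an anchor.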

First, the ad extractor retrieves all the links embedded in the Web page:

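from bs4 import BeautifulSoup  # Beautiful Soup 4 API

# page is assumed to hold the HTML of the inlink results page
bs = BeautifulSoup(page, 'html.parser')
# each result is a <span class="result"> element
divs = bs.find_all('span', attrs={'class': 'result'})
results = []
for div in divs[:10]:
    # keep the target of the first link found in each result
    anchor = div.find('a', href=True)
    if anchor is not None:
        results.append(anchor['href'])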

After its creation, bs, an instance of class BeautifulSoup, contains a well-formed representation of the HTML code of the
selected page. From bs, the HTML code that contains the results of the query is extracted and saved in divs. Finally, the
first 10 inlinks are extracted and appended to the results array.

Then, the module scrapes each inlink:

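from urllib.request import urlopen

# BeautifulSoup and results come from the previous snippet
links = []
for url in results[:10]:
    # download the source of the inlink page and parse it
    p = urlopen(url).read()
    b = BeautifulSoup(p, 'html.parser')
    for a in b.find_all('a', href=True):
        # keep only the anchors that enclose an image, i.e. candidate banners
        img = a.find('img')
        if img is not None:
            links.append((a['href'], img.get('src')))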

From each inlink, the source code is downloaded, saved in p, and then parsed into a well-formed tree (b). All the links are
then extracted and those that contain images are saved in the links array.

Figure 3. The two ways of representing an ad.

Finally, the extracted ads are collected in an ads repository and, once selected by the ad selector, they can be put in the
original Web page in two ways: (i) as banners, by simply presenting the retrieved images, or (ii) as textual ads, by
composing the corresponding URL, its title, and its snippet, retrieved by querying the Yahoo! search engine. Figure 3 shows
the two ways of representing the same ad.

4 A case study
Let us note that, due to the unsupervised nature of the proposed approach, a fair comparison with a classical Web
advertising system is not feasible. In fact, Web advertising systems use a pre-indexed or pre-classified set of ads, whereas
in the current approach the system gathers ads directly from the webpages without any further information. Thus, to show
how the proposed system works, we propose a suitable case study.

Let us consider the task of suggesting ads to a given Web page, i.e., the home page of the portal Crastulo
(http://www.crastulo.it/). Crastulo is a local portal that collects information on events in Cagliari, such as concerts, shows,
and openings. A fragment of the home page is depicted in Figure 4.

Figure 4. A fragment of the webpage adopted in the case study.

First, the proposed system queries Yahoo! asking for the first 10 inlinks belonging to a domain different from that of the given page
(see Figure 5). Subsequently, the ad extractor performs scraping to capture all the banner ads contained in each inlink.
Finally, three ads are randomly selected to be displayed in the original Web page. Figure 6 shows the three selected ads
represented by their banners.

Figure 5. The inlinks of the webpage at hand returned by Yahoo! Site Explorer.

Figure 6. The suggested retrieved ads.

5 Conclusions and future work

In this paper, we presented a novel collaborative filtering-based Web advertising system that exploits Web scraping
techniques to suggest suitable ads to a given Web page. Considering Web advertising as an information filtering task,
we devised a Web advertising system that adopts collaborative filtering features. The proposed system first relies on
collaborative filtering by exploiting peer pages and subsequently resorts to Web scraping to perform the page content
analysis. To show how the system works in practice, we presented a suitable case study, i.e., how to suggest banner ads to
the home page of the Italian portal Crastulo. To the best of our knowledge, this is the first attempt to exploit Web scraping
techniques to perform Web advertising.

As for future work, we are setting up experiments aimed at calculating the performance of the proposed system in
terms of precision at k, i.e., the ability of the system to suggest k relevant ads, with k varying from 1 to 5. In particular, we are
interested in selecting a set of users and asking them to give a degree of relevance to each retrieved ad, e.g., relevant,
somewhat relevant, or irrelevant. Moreover, further research directions could be concerned with adapting the proposed
system to social networks (such as Facebook, Google+, or Twitter) to suggest ads according to users' preferences and
tastes.
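
For reference, precision at k can be computed as in the following minimal Python sketch. The judgments are hypothetical, and counting only the label "relevant" as a hit (so that "somewhat relevant" does not count) is an assumption made for the example, not a choice stated for the planned experiments.

```python
def precision_at_k(judgments, k):
    """Fraction of the first k suggested ads that were judged relevant."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    hits = sum(1 for label in judgments[:k] if label == "relevant")
    return hits / k


# Hypothetical user judgments for five suggested ads, in ranking order.
judgments = ["relevant", "somewhat relevant", "irrelevant",
             "relevant", "relevant"]
scores = {k: precision_at_k(judgments, k) for k in range(1, 6)}
```

A graded variant could instead assign partial credit to "somewhat relevant" ads.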

References
[1] Schrenk, M. Webbots, spiders, and screen scrapers: a guide to developing Internet agents with PHP/CURL. No Starch Press, 2007.
[2] Bolin, M., Webber, M., Rha, P., Wilson, T. & Miller, R.C. Automation and customization of rendered web pages. Proceedings of
the 18th annual ACM symposium on User interface software and technology, UIST '05, pp. 163-172. ACM, New York, NY, USA,
2005.
[3] Goldberg, D., Nichols, D., Oki, B.M. & Terry, D. Using collaborative filtering to weave an information tapestry. Commun. ACM.
1992; 35(12): 61-70. http://dx.doi.org/10.1145/138859.138867
[4] Armano, G. & Vargiu, E. A unifying view of contextual advertising and recommender systems. Proceedings of International
Conference on Knowledge Discovery and Information Retrieval (KDIR 2010). 2010: 463-466.
[5] Armano, G., Giuliani, A. & Vargiu, E. Intelligent Techniques in Recommendation Systems: Contextual Advancements and New
Methods, chap. Intelligent Techniques in Recommender Systems and Contextual Advertising: Novel Approaches and Case Studies.
S. Dehuri, M.R. Patra, B.B. Misra, A.K. Jagadev (eds.), IGI Global. 2012: 105-128.
[6] Manning, C.D., Raghavan, P. & Schütze, H. Introduction to Information Retrieval. Cambridge University Press, New York, NY,
USA, 2008. http://dx.doi.org/10.1017/CBO9780511809071
[7] Reid, R.H. Architects of the Web: 1,000 Days that Built the Future of Business. Wiley, 1997.
[8] Carrasco, J., Fain, D., Lang, K. & Zhukov, L. Clustering of bipartite advertiser-keyword graph. Proc. International Conference on
Data Mining (ICDM'03). Melbourne, Florida, 2003.

[9] Joachims, T. Optimizing search engines using clickthrough data. ACM SIGKDD Conference on Knowledge Discovery and Data
Mining (KDD). 2002: 133-142.
[10] Ciaramita, M., Murdock, V. & Plachouras, V. Online learning from click data for sponsored search. Proceeding of the 17th
international conference on World Wide Web, WWW '08. ACM, New York, NY, USA. 2008; 227-236.
[11] Ribeiro-Neto, B., Cristo, M., Golgher, P.B. & Silva de Moura, E. Impedance coupling in content-targeted advertising. SIGIR '05:
Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval. ACM,
New York, NY, USA. 2005; 496-503.
[12] Lacerda, A., Cristo, M., Gonçalves, M.A., Fan, W., Ziviani, N. & Ribeiro-Neto, B. Learning to advertise. SIGIR '06: Proceedings
of the 29th annual international ACM SIGIR conference on Research and development in information retrieval. ACM, New York,
NY, USA. 2006; 549-556.
[13] Broder, A., Fontoura, M., Josifovski, V. & Riedel, L. A semantic approach to contextual advertising. SIGIR '07: Proceedings of the
30th annual international ACM SIGIR conference on Research and development in information retrieval. ACM, New York, NY,
USA. 2007; 559-566. http://dx.doi.org/10.1145/1277741.1277837
[14] Armano, G., Giuliani, A. & Vargiu, E. Semantic enrichment of contextual advertising by using concepts. International Conference
on Knowledge Discovery and Information Retrieval, 2011.
[15] Anagnostopoulos, A., Broder, A.Z., Gabrilovich, E., Josifovski, V. & Riedel, L. Just-in-time contextual advertising. CIKM '07:
Proceedings of the sixteenth ACM conference on Conference on information and knowledge management. ACM, New York, NY,
USA, 2007: 331-340. http://dx.doi.org/10.1145/1321440.1321488
[16] Armano, G., Giuliani, A. & Vargiu, E. Studying the impact of text summarization on contextual advertising. 8th International
Workshop on Text-based Information Retrieval, 2011.
[17] Armano, G., Giuliani, A. & Vargiu, E. Using snippets in text summarization: a comparative study and an application. Italian
Workshop on Information Retrieval (IIR 2012), 2012.
[18] Murdock, V., Ciaramita, M., Plachouras, V. A noisy-channel approach to contextual advertising. Proceedings of the 1st
international workshop on Data mining and audience intelligence for advertising, ADKDD '07. ACM, New York, NY, USA, 2007:
21-27.
[19] Ciaramita, M., Murdock, V. & Plachouras, V. Semantic associations for contextual advertising. Journal of Electronic Commerce
Research. 2008; 9(1): 1-15.
[20] Adomavicius, G. & Tuzhilin, A. Toward the next generation of recommender systems: A survey of the state-of-the-art and possible
extensions. IEEE Transactions on Knowledge and Data Engineering. 2005; 17(6): 734-749.
http://dx.doi.org/10.1109/TKDE.2005.99
[21] Rich, E. User modeling via stereotypes, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1998.
[22] Resnick, P., Iacovou, N., Suchak, M., Bergstrom, P. & Riedl, J. Grouplens: an open architecture for collaborative filtering of
netnews. CSCW '94: Proceedings of the 1994 ACM conference on Computer supported cooperative work. ACM, New York, NY,
USA. 1994; 175-186. http://dx.doi.org/10.1145/192844.192905
[23] Hill, W., Stead, L., Rosenstein, M. & Furnas, G. Recommending and evaluating choices in a virtual community of use. CHI '95:
Proceedings of the SIGCHI conference on Human factors in computing systems. ACM Press/Addison-Wesley Publishing Co.,
New York, NY, USA. 1995; 194-201.
[24] Shardanand, U. & Maes, P. Social information filtering: algorithms for automating “word of mouth”. CHI '95: Proceedings of the
SIGCHI conference on Human factors in computing systems. ACM Press/Addison-Wesley Publishing Co., New York, NY, USA,
1995; 210-217.
[25] Linden, G., Smith, B. & York, J. Amazon.com recommendations. IEEE Internet Computing. 2003; 7(1): 76-80.
http://dx.doi.org/10.1109/MIC.2003.1167344
[26] Terveen, L., Hill, W., Amento, B., McDonald, D. & Creter, J. Phoaks: a system for sharing recommendations. Communications of the
ACM. 1997; 40(3): 59-62. http://dx.doi.org/10.1145/245108.245122
[27] Sarwar, B., Karypis, G., Konstan, J. & Riedl, J. Item-based collaborative filtering recommendation algorithms. WWW '01:
Proceedings of the 10th international conference on World Wide Web. ACM, New York, NY, USA. 2001; 285-295.
http://dx.doi.org/10.1145/371920.372071
[28] Anastasakos, T., Hillard, D., Kshetramade, S. & Raghavan, H. Collaborative filtering approach to ad recommendation using the
query ad click graph. Proceedings of CIKM 2009, 2009. http://dx.doi.org/10.1145/1645953.1646267
[29] Berry, M.W. Survey of Text Mining. Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2003.
[30] Kobayashi, M. & Takeda, K. Information retrieval on the web. ACM Comput. Surv. 2000; 32: 144-173.
http://dx.doi.org/10.1145/358923.358934
[31] Adams, A. & McCrindle, R. Pandora's box: social and professional issues of the information age. John Wiley & Sons, 2008.

[32] Lewerenz, E. An example of website screen scraping. Proceedings of MWSUG 2009, 2009.
[33] Mehlfuehrer, A. Web scraping - a tool evaluation. Master's thesis, Wien University, 2009.
[34] Penman, R.B. Web scraping made simple with sitescraper. Text, 2009.
[35] Koolen, M. & Kamps, J. Are semantically related links more effective for retrieval? Proceedings of the 33rd European Conference
on Advances in information retrieval, ECIR'11. Springer-Verlag, Berlin, Heidelberg. 2011: 92-103.
[36] Armano, G., Giuliani, A. & Vargiu, E. Are related links effective for contextual advertising? a preliminary study. International
Conference on Knowledge Discovery and Information Retrieval, 2012.


