
IR Module 3

A web search engine is a software system that enables users to search for information on the internet by entering queries, which are processed through crawling, indexing, and ranking. Web crawling involves automated programs that collect data from web pages, while indexing organizes this data for efficient retrieval. Various types of search engines exist, including crawler-based, directories, hybrid, and meta search engines, each employing different methods to present relevant search results to users.


WEB SEARCH ENGINE

 A web search engine is a software system designed to search for
information on the World Wide Web.
 It allows users to enter queries (keywords or phrases) and
retrieves relevant web pages, documents, images, videos, or other
types of content from the web based on those queries.

Here's an overview of how a typical web search engine works:

 Crawling:
 The search engine sends out automated programs called web
crawlers or spiders to systematically browse the web,
following links from one page to another.
 These crawlers collect information about the content and
structure of web pages, indexing the content they find.
 Indexing:
 The collected information is then processed and stored in an
index, which is essentially a large database.
 This index contains information about the words or phrases
found on each web page, along with their location (URL),
metadata, and other relevant data.
 Query Processing:
 When a user enters a search query, the search engine
analyzes the query and retrieves relevant documents from its
index.
 This process involves matching the query terms with the
indexed content and ranking the results based on various
factors such as relevance, authority, and freshness.
 Ranking:
 Once the relevant documents are retrieved, the search
engine ranks them based on their perceived relevance to the
user's query.
 This ranking is typically determined by complex algorithms
that take into account factors such as keyword frequency,
inbound links, page quality, user engagement metrics, and
more.
 Presentation of Results:
 Finally, the search engine displays the search results to the
user, usually in a list format on a search engine results page
(SERP).
 Each result typically includes a title, snippet (a brief summary
of the page content), and URL, along with other optional
features like images, videos, featured snippets, and ads.

Popular web search engines include Google, Bing, Yahoo, Baidu, and
DuckDuckGo, among others. These search engines continuously update
and refine their algorithms to provide users with the most relevant and
useful search results possible. Additionally, they offer various features and
tools to enhance the search experience, such as filters, advanced search
operators, and personalized recommendations.
CRAWLING

Introduction to Web Crawling

 Web crawling, also known as spidering (and closely related to web
scraping), is the process by which automated programs, called web
crawlers or spiders, systematically browse the World Wide Web to
collect information from web pages.
 Crawling is a fundamental component of web search engines and
various other applications that require access to web data.

Purpose of Web Crawling

 The primary purpose of web crawling is to gather data from across
the web to build indexes that can be searched and queried by users.
 Search engines like Google, Bing, and others rely heavily on web
crawlers to discover and index web pages, enabling users to find
relevant information through search queries.

How Web Crawling Works

 Seed URLs:
 The crawling process typically begins with a set of seed URLs,
which are the starting points for the crawler. These URLs
could be provided manually or generated programmatically.

 Fetching:
 The crawler starts by fetching the content of the seed URLs.
 It makes HTTP requests to the web servers hosting the
pages and retrieves the HTML, CSS, JavaScript, and other
resources associated with each URL.

 Parsing:
 Once the content is fetched, the crawler parses the HTML to
extract links to other pages within the same website (internal
links) and links to external websites (external links).
 It also extracts other relevant information, such as text
content, metadata, and structural elements.

 URL Frontier:
 The crawler maintains a queue of URLs known as the URL
frontier or crawl frontier.
 This queue contains URLs discovered during the crawling
process but not yet visited.
 The crawler prioritizes URLs based on factors such as
relevance, freshness, and popularity.

 Exploration:
 The crawler continues to explore the web by recursively
following links from the URL frontier to new pages.
 It may also revisit previously crawled pages to check for
updates or changes.
 Robots Exclusion Protocol:
 During the crawling process, crawlers respect the rules
specified in the Robots Exclusion Protocol (robots.txt).
 This standard allows website owners to control which parts
of their site are accessible to crawlers and which are not.
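
The fetch/parse/frontier loop described above can be made concrete with a small
sketch. The Python code below is a minimal illustration using only the standard
library (urllib, html.parser); the seed URL, crawl delay, page limit, and
user-agent name are arbitrary choices for the example, not values prescribed by
these notes, and real crawlers add far more error handling and politeness logic.

# Minimal illustrative crawler: seed URLs -> fetch -> parse links -> URL frontier.
import time
import urllib.request
import urllib.robotparser
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags while a fetched page is parsed."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def allowed_by_robots(url, agent="example-crawler"):
    """Check robots.txt on the URL's host (Robots Exclusion Protocol)."""
    parts = urlparse(url)
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    try:
        rp.read()
    except OSError:
        return True            # if robots.txt cannot be fetched, assume allowed
    return rp.can_fetch(agent, url)

def crawl(seed_urls, max_pages=20, delay=1.0):
    frontier = deque(seed_urls)    # URL frontier (FIFO here; real crawlers prioritise)
    seen = set(seed_urls)          # simple duplicate-URL elimination
    pages = {}
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        if not allowed_by_robots(url):
            continue
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                html = resp.read().decode("utf-8", errors="ignore")
        except OSError:
            continue               # skip unreachable pages
        pages[url] = html          # in a real system this text goes to the indexer
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
        time.sleep(delay)          # politeness: rate-limit requests to each server
    return pages

# pages = crawl(["https://example.com/"])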

Challenges and Considerations

Web crawling poses several challenges and considerations, including:

 Politeness:

 Crawlers must be polite and respect the resources of web
servers by adhering to crawl rate limits and avoiding
overloading servers with requests.

 Dynamic Content:

 Crawlers need to handle dynamic content generated by
client-side scripting languages like JavaScript, as well as
content behind forms or login screens.

 Duplicate Content:
 Crawlers must identify and handle duplicate content
effectively to ensure that each page is indexed only once.

In summary, web crawling is a crucial process for collecting data from the
web and is essential for the functioning of search engines and various
other web-based applications.
Types of search engines

1. Crawler based
 Crawler-based search engines use automated software
programs to survey and categorize web pages.
 The programs used by the search engines to access the web
pages are called spiders, crawlers, robots, or bots.
 Examples:
a) Google (www.google.com) b) Ask Jeeves (www.ask.com)
2. Directories
 A directory uses human editors who decide what category
the site belongs to; they place websites within specific
categories in the directory's database.
 The human editors comprehensively check the website and
rank it, based on the information they find, using a pre-
defined set of rules.
 Examples:
a) Yahoo Directory (www.yahoo.com) b) Open Directory
(www.dmoz.org)
3. Hybrid Search Engines
 Hybrid search engines use a combination of both crawler-
based results and directory results.
 More and more search engines these days are moving to a
hybrid-based model.
 Examples:
a) Yahoo (www.yahoo.com) b) Google (www.google.com)
4. Meta Search Engines
 Meta search engines take the results from all the other search
engines results and combine them into one large listing.
 Examples:
a)Meta crawler (www.metacrawler.com) b) Dogpile
(www.dogpile.com)

Web Search Overview


Introduction to Web Search
 Web search refers to the process of finding relevant information on
the World Wide Web using search engines.
 It's one of the most common and essential activities performed by
internet users every day.
 Web search engines enable users to discover a vast array of
content, including web pages, images, videos, news articles, and
more.
Components of Web Search
 User Query:
 A search query is a set of words or phrases entered by the
user into the search engine's query box.
 The search engine uses this query to retrieve relevant
information from its index.

 Indexing:
 Search engines maintain massive indexes of web pages and
other content crawled from the web.
 These indexes are organized and structured databases that
enable efficient retrieval of relevant information in response
to user queries.

 Crawling:
 Crawling is the process by which search engines
systematically browse the web to discover and collect
information from web pages.
 Web crawlers, also known as spiders or bots, navigate
through hyperlinks on web pages to find new content and
update the search engine's index.

 Ranking Algorithm:
 Search engines use complex ranking algorithms to determine
the relevance and importance of web pages in response to a
user query.
 Factors such as keyword relevance, content quality,
authority, and user engagement metrics are considered when
ranking search results.

 User Interface:
 The search engine's user interface presents search results to
the user in a user-friendly format, typically as a list of links on
a search engine results page (SERP).
 The SERP may also include additional features such as
snippets, images, videos, knowledge panels, and
advertisements.

Popular Web Search Engines


 Google: Google is the most widely used web search engine, known
for its comprehensive index, relevance, and advanced search
capabilities.
 Bing: Bing is Microsoft's search engine, offering similar features to
Google and serving as its primary competitor.
 Yahoo: Yahoo Search is another popular search engine that
provides web search, along with news, email, and other services.
 DuckDuckGo: DuckDuckGo is a privacy-focused search engine
that emphasizes user privacy and does not track user activity.
 Baidu: Baidu is the dominant search engine in China, offering web
search, maps, and other services tailored to the Chinese market.

Web structure

The bow tie structure of the web is a model that describes the
World Wide Web (WWW) as a giant connected component with
three main sectors: the IN set, the OUT set, and the Strongly
Connected Component (SCC). The structure is often depicted as
a bow tie with finger-like projections.

Components of the bow tie structure


 SCC:
 The central core of the Web (the knot of the bow-tie) is the
strongly connected component (SCC), which means that
for any two pages in the SCC, a user can navigate from one of
them to the other and back by clicking on links embedded in
the pages encountered.
 In other words, a user browsing a page in the SCC can always
reach any other page in the SCC by traversing some path of
links.
 The core of the bow tie, made up of a single strongly
connected component.
 This is the largest subgraph where every node is reachable
from any other node.
 IN set:
 The left bow, called IN, contains pages that have a directed
path of links leading to the SCC.
 Contains nodes that can reach the SCC but cannot be
reached by it.
 This set often includes new pages that have not yet been
linked to.
 OUT set:
 The right bow, called OUT, contains pages that can be
reached from the SCC by following a directed path of links
 Contains nodes that can be reached by the SCC but
cannot reach it. This set often includes corporate websites
that only contain internal links.
 Tendrils:
 Contain nodes that can reach the IN set or the OUT set, but
not the SCC.
 A page in Tendrils can either be reached from IN or leads
into OUT.
 Tubes:
 Contain nodes that travel from the IN set to the OUT set,
but not the SCC.
 A web page in Tubes has a directed path from IN to OUT
bypassing the SCC
 Disconnected Set:
 The pages in the Disconnected set are not even weakly connected
to the SCC; that is, even if we ignored the fact that hyperlinks
only allow forward navigation, allowing them to be traversed
backwards as well as forwards, we still could not reach the
SCC from them.
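
The bow-tie sectors can be computed on a small link graph with simple
reachability tests. The sketch below is an illustration only, assuming a toy
graph given as an adjacency dictionary; it takes the largest strongly connected
component as the SCC and classifies the remaining nodes as IN, OUT, or OTHER
(tendrils, tubes, and disconnected pages lumped together).

# Classify pages of a toy web graph into the bow-tie sectors SCC, IN, OUT, OTHER.
from collections import defaultdict

def reachable(graph, start):
    """All nodes reachable from `start` by following directed links."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(graph.get(node, []))
    return seen

def bow_tie(graph):
    nodes = set(graph) | {v for targets in graph.values() for v in targets}
    reverse = defaultdict(list)
    for u, targets in graph.items():
        for v in targets:
            reverse[v].append(u)
    # SCC containing each node = forward reachability ∩ backward reachability.
    sccs = {n: frozenset(reachable(graph, n) & reachable(reverse, n)) for n in nodes}
    core = max(sccs.values(), key=len)        # treat the largest SCC as the bow-tie core
    any_core = next(iter(core))
    from_core = reachable(graph, any_core)    # everything the SCC can reach
    to_core = reachable(reverse, any_core)    # everything that can reach the SCC
    sectors = {}
    for n in nodes:
        if n in core:
            sectors[n] = "SCC"
        elif n in to_core:
            sectors[n] = "IN"     # reaches the SCC but is not reached by it
        elif n in from_core:
            sectors[n] = "OUT"    # reached by the SCC but cannot get back
        else:
            sectors[n] = "OTHER"  # tendrils, tubes, or disconnected pages
    return sectors

toy_web = {"a": ["s1"], "s1": ["s2"], "s2": ["s1", "b"], "b": [], "x": ["y"]}
print(bow_tie(toy_web))
# {'a': 'IN', 's1': 'SCC', 's2': 'SCC', 'b': 'OUT', 'x': 'OTHER', 'y': 'OTHER'}
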
User Problems
There are some problems when users use the interface of a search
engine.
 The users do not exactly understand how to provide a
sequence of words for the search.
 The users may get unexpected answers because they are not
aware of the input requirements of the search engine. For
example, some search engines are case sensitive.
 The users have problems understanding Boolean logic:
therefore, the user cannot perform advanced searching.
 Novice users do not know how to start using a search
engine.
 The users do not care about advertisements, so the search
engine lacks funding.
 Around 85% of users only look at the first page of results,
so relevant answers might be skipped.
In order to solve the problems above, the search engine must be
easy to use and provide relevant answers to the query.

Sponsored Search and Paid Placement


 An effective and profitable form of advertising for search
engines is called paid placement also known as sponsored
search.
Paid placement
 Online advertising in which links to advertisers' products
appear next to keyword search results has emerged as a
predominant form of Internet advertising.
 Under paid placement advertising, sellers (advertisers) bid
payments to a search engine to be placed on its
recommended list for a keyword search.
 A group of advertisers who bid more than the rest are
selected for placement, and their positions of placement
reflect their order of bids, with the highest bidder placed at
the top position.
 The rapid growth of paid placement advertising has made it
one of the most important Internet institutions, and has led
to enormous commercial successes for search engines.
 In this scheme the search engine separates its query results list into
two parts:
(i) an organic list, which contains the free unbiased
results, displayed according to the search engine's
ranking algorithm
(ii) a sponsored list, which is paid for by
advertising. This method of payment is called pay
per click (PPC), also known as cost per click (CPC),
since payment is made by the advertiser each time
a user clicks on the link in the sponsored listing.
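
A hedged sketch of the rank-by-bid model described above: advertisers are
ordered by their bids, the top slots fill the sponsored list, and each click
costs the advertiser its bid (PPC/CPC). The advertiser names, bid values (in
cents), and slot count are invented for the example; real engines also factor
in quality scores and usually charge second-price amounts rather than the bid.

# Simple paid-placement model: order advertisers by bid, fill the sponsored slots,
# and charge an advertiser its own bid for every click (pay per click).
def sponsored_list(bids, slots=3):
    """bids: {advertiser: bid per click}. Returns the top `slots` advertisers, highest bid first."""
    ranked = sorted(bids.items(), key=lambda item: item[1], reverse=True)
    return ranked[:slots]

def charge_for_clicks(placements, clicks):
    """clicks: {advertiser: number of clicks}. Under PPC each click costs the advertiser its bid."""
    return {adv: bid * clicks.get(adv, 0) for adv, bid in placements}

bids = {"hotelA": 80, "hotelB": 120, "hotelC": 50, "hotelD": 95}   # bids in cents per click
placements = sponsored_list(bids)     # [('hotelB', 120), ('hotelD', 95), ('hotelA', 80)]
print(charge_for_clicks(placements, {"hotelB": 10, "hotelA": 4}))
# {'hotelB': 1200, 'hotelD': 0, 'hotelA': 320}
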
Organic Search vs Paid Search

 Whenever you type a question into Google, or any other search
engine, the list of links that appear below the ads are known as
"organic results."
 These appear purely based on the quality and content of the page.
 Traffic that comes from people finding your links among these
results is classified as "organic search" traffic or just organic traffic.
 Paid search results are those that companies have paid for so that
they appear at the top of search results.

What is SEO
 Search Engine Optimization refers to a set of activities that are
performed to increase the number of desirable visitors who come to
our site via search engines.
 These activities may include things we do to our site itself, such as
making changes to our text and HTML code, or formatting text and
documents to communicate directly to the search engine.
Define SEO
 Search Engine Optimization is the process of improving the visibility
of a website on organic ("natural" or un-paid) search engine result
pages (SERPs), by incorporating search engine friendly elements into
a website.
 Search engine optimization is broken down into two basic areas:
on-page, and off-page optimization.
 On-page optimization refers to website elements which comprise a
web page, such as HTML code, textual content, and images. Off-page
optimization refers, predominantly, to backlinks (links pointing to
the site which is being optimized, from other relevant websites).
On-Page Factors
1. Title tags<title>
2. Header tags<h1>
3. ALT image tags
4. Content, Content(Body text)<body>
5. Hyperlink text
6. Keyword frequency and density
Off-Page Factors
1. Anchor text
2. Link Popularity ("votes" for your site) – adds credibility

//Anchor text is the clickable text in a hyperlink, also known as link text
or link label. It's the part of a link that users see and click to navigate to
another page
Backlinks are links from other websites that point to a page on your
website.
Link popularity is a way to measure how many backlinks a website has
and how high-quality they are.//
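
The on-page factors listed above can be checked programmatically. Below is a
minimal, illustrative Python sketch (standard library only) that parses an HTML
page and reports a few of these signals: the title tag, heading text, images
missing alt attributes, and a rough keyword density figure. The sample page and
keyword are invented; this is a teaching example, not a real SEO audit tool.

# Rough on-page SEO check: title, <h1> text, image alt attributes, keyword density.
import re
from html.parser import HTMLParser

class OnPageAudit(HTMLParser):
    def __init__(self):
        super().__init__()
        self.title = ""
        self.h1 = []
        self.images_missing_alt = 0
        self.body_text = []
        self._current = None
    def handle_starttag(self, tag, attrs):
        self._current = tag
        if tag == "img" and not dict(attrs).get("alt"):
            self.images_missing_alt += 1
    def handle_endtag(self, tag):
        self._current = None
    def handle_data(self, data):
        if self._current == "title":
            self.title += data
        else:
            if self._current == "h1":
                self.h1.append(data.strip())
            self.body_text.append(data)

def keyword_density(text, keyword):
    """Occurrences of the keyword divided by the total number of words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return words.count(keyword.lower()) / len(words) if words else 0.0

html = """<html><head><title>Star Logos - Free Star Logo Designs</title></head>
<body><h1>Star logo gallery</h1><p>Browse star logo designs and star icons.</p>
<img src="star.gif" alt="star logo"><img src="banner.gif"></body></html>"""

audit = OnPageAudit()
audit.feed(html)
print("title:", audit.title)
print("h1:", audit.h1)
print("images missing alt:", audit.images_missing_alt)
print("density of 'star': %.2f" % keyword_density(" ".join(audit.body_text), "star"))
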
The Search Engine Optimization Techniques
1.Domain name Strategies
 Domain naming is important to our overall foundation; use sub-
directory root domains (example.com/awesome) versus sub-
domains (awesome.example.com).
 All inbound links pointing to a website's main domain
(www.example.com) as well as its internal pages contribute to the
domain authority.
 Getting a back link from a high domain authority site is always
valuable.

2. Linking strategies
 the text in the links should include keywords
 the more inbound links the higher the SE ranking
 if the site linking to us is already indexed, spiders will also discover
our site
3. Keywords
 the most important in optimizing rankings
 keywords are words that appear the most in a page
 the spider chooses the appropriate keywords for each page, then
sends them back to its SE
 our web site will then be indexed based on our keywords
 can be key phrases or a single keyword
 do not use common words, e.g. 'the' and 'of': spiders ignore them

4. Title tags
 The title tag on pages of your website tells search engines what
the page is about.
 It should be 70 characters or less and include your business or
brand name and keywords that relate to that specific page only.
 This tag is placed between the <HEAD> </HEAD> tags near the
top of the HTML code for the page.
5.Meta tag and description tags
 displayed below the title in search results
 use dynamic, promotional language
 use keywords
<meta name="description" content="Free Web tutorials
on HTML, CSS, XML, and XHTML">

6. Alt tags
- include keywords in your alt tags
<img src="star.gif" alt="star logo">
7.Submit your website to SEs for indexing
-submit your site to search engine directories, directory sites and portal
sites
- indexing takes time

SEO techniques are classified into two broad categories:


White Hat SEO - Techniques that search engines recommend as part
of a good design.
Black Hat SEO - Techniques that search engines do not approve and
attempt to minimize the effect of. These techniques are also known as
spamdexing.

White Hat SEO


An SEO tactic is considered White Hat if it has the following features:

 It conforms to the search engine's guidelines.
 It does not involve any deception.
 It ensures that the content a search engine indexes, and
subsequently ranks, is the same content a user will see.
 It ensures that web page content has been created for
the users and not just for the search engines.
 It ensures good quality of the web pages.
 It ensures availability of useful content on the web pages.

Always follow a White Hat SEO tactic and do not try to fool your site
visitors. Be honest and you will definitely get something more.

Black Hat or Spamdexing

An SEO tactic is considered Black Hat or Spamdexing if it has the
following features:

 Attempting ranking improvements that are disapproved by the
search engines and/or involve deception.
 Redirecting users from a page that is built for search engines to
one that is more human friendly.
 Redirecting users to a page that was different from the page the
search engine ranked.
 Serving one version of a page to search engine spiders/bots and
another version to human visitors. This is called Cloaking SEO
tactic.
 Using hidden or invisible text, for example text in the same color
as the page background, in a tiny font size, or hidden within the
HTML code such as "noframes" sections.
 Repeating keywords in the meta-tags, and using keywords that
are unrelated to the website content. This is called meta-tag
stuffing.
 Calculated placement of keywords within a page to raise the
keyword count, variety, and density of the page. This is called
keyword stuffing.
 Creating low-quality web pages that contain very little content
but are instead stuffed with very similar keywords and phrases.
These pages are called Doorway or Gateway Pages.
 Mirror websites by hosting multiple websites - all with
conceptually similar content but using different URLs.
 Creating a rogue copy of a popular website which shows
contents similar to the original to a web crawler, but redirects
web surfers to unrelated or malicious websites. This is called
page hijacking.

Always stay away from any of the above Black Hat tactics when trying
to improve the rank of our site. Search engines are smart enough to
identify all the above properties of our site, and ultimately we are not
going to gain anything.
SEO Tools
 Keyword Discovery - Find popular keywords that your site
should be targeting.
 Keyword Volume - Estimate how much search traffic a specific
keyword receives each month.
 Keyword Density - Analyze how well a specific webpage is
targeting a keyword.
 Back link Trackers - Keep track of how many sites are linking to
you.
 Site Popularity - Determine how popular your site is on the
internet.
 Keyword Rankings - Track your site's rankings for a keyword
in the search engines.
 Firefox Add-ons - Install these add-ons to turn your browser
into an SEO research powerhouse. These extensions allow you
to directly analyze website SEO metrics like keyword density,
page authority, backlink profile, and more right within your
browser while browsing different pages.

Search Engine Spam:


Search engine spam, on the other hand, refers to manipulative
and deceptive practices aimed at tricking search engines into
ranking websites higher than they deserve. Search engine spam
can take various forms, including:
 Keyword Stuffing: Overloading web pages with excessive
keywords or irrelevant keywords in an attempt to
manipulate search engine rankings.
 Cloaking: Presenting different content to search engine
crawlers than what is shown to human users, with the
intent of deceiving search engines into ranking the page
higher.
 Link Schemes: Engaging in artificial or manipulative link-
building tactics, such as buying links, participating in link
farms, or exchanging links with unrelated websites.
 Doorway Pages: Creating low-quality, keyword-optimized
pages that serve as entry points to a website but provide
little value to users.
 Duplicate Content: Publishing duplicate or plagiarized
content across multiple web pages or websites in an
attempt to increase visibility in search results.

Search engines continuously update their algorithms to detect


and penalize spammy tactics, aiming to provide users with high-
quality, relevant search results.
//Refer IR notes also(page: 91)
Websites that engage in search engine spam risk being penalized
or banned from search engine results altogether, damaging their
online reputation and visibility.
In summary, while SEO aims to improve website visibility through
legitimate means, search engine spam seeks to manipulate search
engine rankings through deceptive practices. It's crucial for
website owners and marketers to understand the difference and
adhere to ethical SEO practices to achieve sustainable results in
the long run.

Web Size Measurement


Measuring the size of the web is a challenging task due to its
dynamic and decentralized nature. However, various methods
and tools exist to estimate the size of the web, including:
 Crawling and Indexing: Search engines like Google and Bing
continuously crawl the web to discover and index new web
pages. The number of pages indexed by these search engines
can provide an estimate of the size of the visible web.
 Sampling: Researchers may sample a subset of web pages and
extrapolate their findings to estimate the total size of the
web.
 Domain Counts: Counting the number of registered domain
names can provide an estimate of the size of the web,
although many domains may be inactive or contain only a few
pages.
 Content Management Systems (CMS): Analyzing the usage
statistics of popular CMS platforms like WordPress, Joomla, or
Drupal can provide insights into the distribution of web
content across different platforms.

Overall, estimating the size of the web is an ongoing area of
research, and different methodologies may yield varying results.

Web Search Architecture


The architecture of web search in Information Retrieval (IR)
systems involves several layers of technologies and processes
to enable users to find relevant content on the internet. At a
high level, it consists of the following components:

1. Crawling

 Crawlers/Spiders: These are automated programs that
systematically browse the web, following hyperlinks to
discover and fetch web pages.
 The crawlers start from a set of seed URLs and recursively
visit linked pages to build a vast index of the web.
 The crawlers need to consider the freshness and frequency
of content updates on the pages.

2. Indexing
 After crawling the web, the next step is indexing,
where the collected data is structured for efficient
searching.
 Inverted Index: A key data structure used in indexing that
maps keywords to the list of documents (web pages) in
which they appear. This allows for fast lookup of terms
during query processing.
 Document Parsing: Web pages often contain rich media
(text, images, videos, links). The document's content is
parsed to extract text, metadata, and other relevant
features.
 Metadata Indexing: In addition to the body text,
metadata like titles, headings, meta descriptions, and anchor
text are also indexed to improve search relevance.
 URL Normalization: To avoid indexing duplicates, URL
normalization ensures that URLs pointing to the same
content are treated as identical.
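
A minimal sketch of the inverted index described above, assuming documents have
already been fetched and reduced to plain text. Each term maps to a postings
list of (document ID, term frequency) pairs; the tiny corpus is invented, and
real systems add positions, skip pointers, and compression on top of this.

# Build a toy inverted index: term -> postings list of (doc_id, term frequency).
import re
from collections import defaultdict, Counter

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def build_inverted_index(docs):
    """docs: {doc_id: text}. Returns {term: [(doc_id, tf), ...]} with postings sorted by doc_id."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        counts = Counter(tokenize(docs[doc_id]))
        for term, tf in counts.items():
            index[term].append((doc_id, tf))
    return index

docs = {
    1: "Web crawlers fetch web pages",
    2: "The index maps terms to pages",
    3: "Crawlers feed the indexer",
}
index = build_inverted_index(docs)
print(index["pages"])     # [(1, 1), (2, 1)]
print(index["crawlers"])  # [(1, 1), (3, 1)]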

3. Query Processing

 When a user submits a query, the system processes the
query to understand what the user is searching for.
 Tokenization: Breaking the search query into smaller
units (typically words or terms).
 Stemming/Lemmatization: Reducing words to their
base or root form to match variations (e.g., "running" →
"run").
 Query Expansion: Expanding the query with related
terms or synonyms to improve recall.
 Stop Words Removal: Common words like "the," "is," or
"in" are often ignored as they carry little meaning in
retrieval.
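
A hedged illustration of the query-processing steps listed above: tokenization,
stop-word removal, a crude suffix-stripping "stemmer" (far simpler than real
stemmers such as Porter's), and a toy synonym-based query expansion. The
stop-word list and synonym table are made up for the example.

# Toy query processing: tokenize -> remove stop words -> crude stemming -> expansion.
import re

STOP_WORDS = {"the", "is", "in", "a", "an", "of", "to"}        # illustrative list only
SYNONYMS = {"run": ["jog"], "car": ["automobile"]}             # illustrative table only

def tokenize(query):
    return re.findall(r"[a-z0-9]+", query.lower())

def crude_stem(word):
    """Very rough suffix stripping, e.g. 'running' -> 'run' (not a real stemmer)."""
    for suffix in ("ning", "ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def process_query(query):
    terms = [crude_stem(t) for t in tokenize(query) if t not in STOP_WORDS]
    expanded = list(terms)
    for term in terms:
        expanded.extend(SYNONYMS.get(term, []))                # query expansion
    return expanded

print(process_query("The running of the cars"))   # ['run', 'car', 'jog', 'automobile']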

4. Ranking

 The retrieval system must decide how to rank documents
based on their relevance to the user's query. Several
ranking algorithms are employed:
 TF-IDF (Term Frequency-Inverse Document
Frequency): Measures the importance of a term in a
document relative to its frequency across the entire
corpus.
 PageRank: Used by search engines like Google to rank
pages based on their link structure, under the assumption
that pages with more links pointing to them are more
important.
 Vector Space Model: Documents and queries are
represented as vectors in a multi-dimensional space, with
similarity measured (typically using cosine similarity).
 Machine Learning Models: Many modern search engines
apply machine learning to improve ranking, using features
like click-through rates, relevance feedback, and user
behavior.
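
The TF-IDF and vector space ideas above can be sketched in a few lines. The
example below computes TF-IDF weights (one common log-scaled variant among
several) and ranks documents by cosine similarity to the query; the three-document
corpus and the query are invented for illustration.

# Rank documents against a query with TF-IDF weights and cosine similarity.
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def tfidf_vector(tokens, idf):
    counts = Counter(tokens)
    return {t: (1 + math.log(tf)) * idf.get(t, 0.0) for t, tf in counts.items()}

def cosine(u, v):
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

docs = {
    "d1": "web search engines rank web pages",
    "d2": "crawlers gather pages from the web",
    "d3": "cooking recipes for pasta",
}
tokenized = {d: tokenize(text) for d, text in docs.items()}
N = len(docs)
df = Counter(t for tokens in tokenized.values() for t in set(tokens))
idf = {t: math.log(N / df[t]) for t in df}          # inverse document frequency

doc_vectors = {d: tfidf_vector(tokens, idf) for d, tokens in tokenized.items()}
query_vec = tfidf_vector(tokenize("rank web pages"), idf)
ranking = sorted(docs, key=lambda d: cosine(query_vec, doc_vectors[d]), reverse=True)
print(ranking)   # d1 should come first for this query
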
5. Retrieval
 After ranking, the most relevant documents (or web pages)
are retrieved and presented to the user.
 The system needs to return the most relevant documents
quickly, often from a large-scale distributed index.
 Caching: Frequently accessed pages or search results
might be cached to reduce retrieval time and enhance
performance.

6. User Interface and Result Presentation

 The retrieved documents are presented to the user in a
readable, organized way. This often includes:
o Snippets: Short descriptions or extracts from
documents that show how the page matches the
query.
o Related Search Suggestions: Additional queries
that are related to the original search.
o Multimedia Results: Image, video, and news results
may also be presented in addition to traditional text-
based results.

7. Feedback Loop and Refinement

 Click-through Data: Search engines analyze user
interaction data (e.g., which links are clicked) to refine their
ranking models.
 Relevance Feedback: Users can mark documents as
relevant or irrelevant, contributing to the training of
machine learning models.
 Personalization: Search results may be personalized
based on a user’s search history, location, or preferences to
improve the user experience.

8. Distributed Systems and Scalability

 Given the vast scale of the web, search engines are typically
built on highly distributed systems to handle the massive
amounts of data and traffic.
 Techniques such as sharding, replication, and load
balancing are used to scale the system and ensure
availability and fault tolerance.
// Sharding is a technique for splitting large databases into
smaller parts, or shards, to improve performance and
scalability.
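
As a small illustration of the sharding idea in the note above, the sketch
below assigns documents to index shards by hashing their IDs so each shard
holds roughly an equal slice of the collection; the shard count and the use of
SHA-1 are arbitrary choices for the example, not how any particular engine does it.

# Assign documents to index shards by hashing the document ID (document-partitioned index).
import hashlib

NUM_SHARDS = 4     # illustrative; real systems choose this from data volume and hardware

def shard_for(doc_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Stable shard assignment: hash the doc ID and take it modulo the shard count."""
    digest = hashlib.sha1(doc_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

shards = {}
for doc in ["https://example.com/a", "https://example.com/b", "https://example.org/c"]:
    shards.setdefault(shard_for(doc), []).append(doc)
print(shards)   # documents spread across shards 0..3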

9. Ranking Factors in Web Search (e.g., for Google)

 Content quality: Well-written, original, and relevant
content tends to rank higher.
 User signals: Data like time spent on a page, bounce rate,
and click patterns inform search rankings.
 Mobile-friendliness: Websites optimized for mobile
devices are ranked higher.
 Page Load Speed: Faster-loading pages are prioritized.
 Security: HTTPS websites are given preference.

10. AI and Deep Learning in Web Search

 Advanced AI techniques, including natural language
processing (NLP) and deep learning models, are
increasingly used to improve the accuracy of search results.

Web search architecture in information retrieval is a complex,
multi-layered process that involves data collection, processing,
ranking, and retrieval, with continuous improvements based
on user feedback and AI advancements.

CRAWLING
 Web crawling is the process by which we gather pages from the
Web, in order to index them and support a search engine.
 The objective of crawling is to quickly and efficiently gather as many
useful web pages as possible, together with the link structure that
interconnects them.

Features of Crawler:

 Robustness: Ability to handle spider traps. The Web contains
servers that create spider traps, which are generators of web pages
that mislead crawlers into getting stuck fetching an infinite number
of pages in a particular domain. Crawlers must be designed to be
resilient to such traps.
 Politeness: Web servers have both implicit and explicit policies
regulating the rate at which a crawler can visit them. These
politeness policies must be respected.
 Distributed: The crawler should have the ability to execute in a
distributed fashion across multiple machines.
 Scalable: The crawler architecture should permit scaling up the
crawl rate by adding extra machines and bandwidth.
 Performance and efficiency: The crawl system should make
efficient use of various system resources including processor,
storage and network bandwidth.
 Quality: the crawler should be biased towards fetching “useful”
pages first.
 Freshness: In many applications, the crawler should operate in
continuous mode: it should obtain fresh copies of previously
fetched pages.
 Extensible: Crawlers should be designed to be extensible in many
ways to cope with new data formats, new fetch protocols, and so
on. This demands that the crawler architecture be modular.

Basic Operation:

 The crawler begins with one or more URLs that constitute a seed
set.
 It picks a URL from this seed set, and then fetches the web page at
that URL.
 The fetched page is then parsed, to extract both the text and the links
from the page (each of which points to another URL).
 The extracted text is fed to a text indexer.
 The extracted links (URLs) are then added to a URL frontier,
which at all times consists of URLs whose corresponding pages
have yet to be fetched by the crawler.

 Initially, the URL frontier contains the seed set; as pages are fetched,
the corresponding URLs are deleted from the URL frontier. The
entire process may be viewed as traversing the web graph.

 In continuous crawling, the URL of a fetched page is added back to
the frontier for fetching again in the future.

Web Crawler Architecture:


1) The URL frontier, containing URLs yet to be fetched in the
current crawl (in the case of continuous crawling, a URL may have
been fetched previously but is back in the frontier for re-fetching).

2) A DNS resolution module that determines the web server from
which to fetch the page specified by a URL.

3) A fetch module that uses the http protocol to retrieve the web
page at a URL.

4) A parsing module that extracts the text and set of links from a
fetched web page.

5) A duplicate elimination module that determines whether an
extracted link is already in the URL frontier or has recently been
fetched.

We begin by assuming that the URL frontier is in place and non-empty.
We follow the progress of a single URL through the cycle of being
fetched, passing through various checks and filters, then finally (for
continuous crawling) being returned to the URL frontier.
Meta-crawlers

 Metasearchers / Meta Crawlers are Web servers that send a given
query to several search engines, Web directories and other
databases, collect the answers and unify them.
 For example, a decision about which hotel room to book for a trip is
usually not based on price alone. In fact, the average consumer
visits five or six websites before booking a room. Meta-search
engines consolidate results from multiple sites and categorize
results so consumers can view results based on characteristics
other than price.
How Meta Crawlers Work:

1. Query Submission: A user submits a query to the meta crawler.

2. Forwarding Queries: The meta crawler forwards this query to
multiple search engines (such as Google, Bing, Yahoo, etc.) or
databases.

3. Gathering Results: Each search engine returns a list of results
based on its individual indexing and ranking algorithm.

4. Merging Results: The meta crawler aggregates the results into a
single set of results. It may apply additional filtering or ranking
strategies to combine results from different sources.

5. Ranking and Presentation: The merged results are then ranked,
often based on relevance to the user's query. This ranking can
consider factors like result quality, diversity, and relevance. Finally,
these results are presented to the user in a consolidated format.
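
The five steps above can be expressed as a short sketch. The engine "backends"
here are stand-in functions returning canned result lists (a real meta crawler
would call search APIs or scrape result pages); merging simply deduplicates by
URL and scores each URL by how high it appears on how many lists.

# Toy meta crawler: fan a query out to several (fake) engines, merge, dedupe, re-rank.
def engine_a(query):                      # stand-ins for real search-engine backends
    return ["http://site1.example/", "http://site2.example/", "http://site3.example/"]

def engine_b(query):
    return ["http://site2.example/", "http://site4.example/", "http://site1.example/"]

def meta_search(query, engines):
    scores = {}
    for engine in engines:                            # 2. forward the query to each engine
        results = engine(query)                       # 3. gather that engine's ranked list
        for rank, url in enumerate(results):
            # 4./5. merge: a URL earns more credit for appearing high on more lists
            scores[url] = scores.get(url, 0.0) + 1.0 / (rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

print(meta_search("budget hotels", [engine_a, engine_b]))
# site2 ranks first (1.0 + 0.5), then site1 (1.0 + 1/3), then site4, then site3
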
Key Features of Meta Crawlers:

 Query Forwarding: They do not index content themselves but
instead forward queries to other search engines.

 Multiple Sources: They gather information from various search
engines, offering users a broader set of results.

 Results Aggregation: The key feature is the merging of data from
different sources and eliminating redundancy.

 Rank Optimization: Some meta crawlers rank the results to
provide more relevant or diverse answers.

 Customization: Some allow users to select which search engines
they prefer to include in the results.

Advantages of Meta Crawlers:

1. Broad Coverage: They allow users to access results from multiple
sources simultaneously, increasing the chances of finding relevant
information.

2. Reduced Redundancy: Since results are aggregated from multiple
search engines, it can avoid duplicating the same result from
different sources.

3. Improved Relevance: By combining results, a meta crawler might
offer a more comprehensive set of relevant answers.

4. Convenience: Users don't have to visit multiple search engines
manually to compare results.
Disadvantages of Meta Crawlers:

1. Slower Response Time: Since queries must be sent to multiple
engines and results need to be aggregated, this can lead to slower
retrieval times.

2. Limited Control over Search Engines: The meta crawler does
not control the ranking or indexing algorithm of the search engines
it uses.

3. Data Duplication: If the meta crawler is not well-designed, it may
return duplicate results across different search engines.

Examples: MetaCrawler, Kayak

Focused Crawling
 Focused Crawling in information retrieval is a technique used to
selectively crawl the web for content that is relevant to a
particular topic, rather than indiscriminately downloading all
available web pages. This method is used to target specific areas
of interest and gather information efficiently
Key Aspects of Focused Crawling:
1. Targeted Information Collection: The goal is to focus on
crawling pages that contain specific information related to a
predefined subject area or topic. This reduces the time and
resources spent on irrelevant or low-quality pages.
2. Topic Modeling: Focused crawlers use topic modeling techniques
to identify relevant content. This can involve analyzing web pages for
keywords, metadata, or structure that indicates a page is about a
particular subject of interest.
3. Resource Efficiency: By narrowing the scope of what is crawled,
focused crawling helps conserve computational resources (e.g.,
bandwidth and processing power), making it more efficient than
general-purpose crawlers.
4. Prioritization of Relevant Links: Instead of blindly following all
links from a page, a focused crawler prioritizes links that are more
likely to lead to pages relevant to the topic. This often involves a
ranking mechanism that assesses the relevance of a link before
following it.
5. Relevance Feedback: Some focused crawlers incorporate
relevance feedback mechanisms, where the system learns which
pages are relevant or not as it crawls. This feedback loop helps
refine the crawling process over time.
6. Depth vs. Breadth: Unlike traditional crawlers that might try to
visit a broad range of domains or URLs, focused crawlers go deep
into specific parts of the web, exploring more pages from a small
set of highly relevant domains or sections of the internet.
Applications of Focused Crawling:
 Topic-Specific Search Engines: To create specialized search
engines, such as ones focused on legal documents, medical papers,
or academic research.
 Web Mining and Knowledge Discovery: For gathering data on
specific subjects or trends.
 Building Domain-Specific Databases: Such as news aggregation
systems, scientific paper repositories, or e-commerce comparison
tools.
 Monitoring and Alerting Systems: For tracking changes or
updates in a specific field, like news, weather, or academic papers.
How It Works:
1. Initial Seed URLs: A set of starting URLs is chosen, usually based
on the topic of interest.
2. Relevance Function: The crawler evaluates the content of each
page using a relevance function (e.g., based on keyword matching,
metadata, or other criteria).
3. Link Following: The crawler follows links from relevant pages and
repeats the process, expanding its crawl to other pages that are
likely to be relevant.
4. Stop Criteria: The crawling process continues until a predefined
stopping condition is met, such as when the crawler has collected
enough data or when there are no more relevant pages to crawl.
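
A hedged sketch of the loop just described: seed URLs, a keyword-based
relevance function, and a priority queue so that links found on more relevant
pages are fetched first. Fetching and parsing are stubbed out with a toy
in-memory "web" so the control flow stays visible; the topic keywords,
threshold, and page texts are invented, and a real focused crawler would plug
in HTTP fetching and a stronger topic classifier.

# Focused crawling skeleton: priority queue ordered by a simple keyword relevance score.
import heapq

TOY_WEB = {   # url -> (page text, outgoing links); stands in for real fetching/parsing
    "seed": ("deep learning for information retrieval", ["p1", "p2"]),
    "p1":   ("neural ranking models and retrieval benchmarks", ["p3"]),
    "p2":   ("celebrity gossip and recipes", ["p4"]),
    "p3":   ("dense retrieval with learned embeddings", []),
    "p4":   ("more gossip", []),
}
TOPIC_KEYWORDS = {"retrieval", "ranking", "learning", "embeddings"}

def relevance(text):
    """Fraction of topic keywords present in the page text (a crude relevance function)."""
    words = set(text.lower().split())
    return len(words & TOPIC_KEYWORDS) / len(TOPIC_KEYWORDS)

def focused_crawl(seed, threshold=0.25, max_pages=10):
    frontier = [(-1.0, seed)]            # max-priority queue via negated scores
    seen, collected = {seed}, []
    while frontier and len(collected) < max_pages:
        _, url = heapq.heappop(frontier)
        text, links = TOY_WEB.get(url, ("", []))
        score = relevance(text)
        if score < threshold:            # stop expanding pages that look off-topic
            continue
        collected.append((url, score))
        for link in links:
            if link not in seen:
                seen.add(link)
                # prioritise links discovered on relevant pages by the parent's score
                heapq.heappush(frontier, (-score, link))
    return collected

print(focused_crawl("seed"))   # visits seed, p1, p3; prunes the off-topic branch at p2
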
Challenges:
 Noise and Irrelevance: Filtering out irrelevant pages while
crawling can be difficult, especially with the vast amount of content
available online.
 Scalability: Focusing on a particular topic might require
sophisticated techniques to scale, especially when the topic spans a
large part of the web.
 Dynamic Nature of the Web: Websites and pages change
frequently, so maintaining an up-to-date index of relevant content is
a challenge.

Near-duplicate detection
 Near-duplicate detection in information retrieval (IR) refers to
the process of identifying documents or pieces of content that
are very similar to one another, though they may not be exactly
the same.
 This concept plays an essential role in improving search engine
results, data quality, and user experience, especially in scenarios
where large amounts of content, such as news articles, research
papers, or user-generated content, need to be processed and
displayed.
Importance
1. Efficient Storage and Retrieval: Identifying near-duplicates
allows for more efficient storage by avoiding the duplication of
content in databases.
2. Improved Search Results: Users are less likely to encounter
redundant information, leading to more varied and relevant search
results.
3. Data Quality and Integrity: In some cases, near-duplicates may
indicate low-quality or spammy content, which should be filtered
out to improve the overall quality of the dataset.
4. Plagiarism Detection: Near-duplicate detection is also widely
used to identify cases of plagiarism in academic writing, online
content, etc.
Techniques for Near-Duplicate Detection
Several methods can be used to detect near-duplicates, and they can be
broadly categorized into exact matching and approximate
matching.
1. Exact Matching:
 Hashing: Documents can be hashed and compared directly for
matches. If two documents have the same hash, they are
considered identical or near-duplicate. However, this method does
not work well for documents that have slight variations.
// Hashing is a process that creates a unique string of characters
that represents the document.
2. Approximate Matching:
This approach focuses on detecting documents that are similar but not
exactly the same.
 Fingerprinting: This technique involves creating a "fingerprint" for
each document (usually using hash functions) and comparing these
fingerprints. A good example is SimHash (used in Google's search
algorithms), which provides a compact representation of
documents while tolerating minor changes.
 Cosine Similarity: This is a popular vector-based approach. Each
document is converted into a vector, and the similarity between
documents is computed using cosine similarity. If the cosine
similarity is above a certain threshold, the documents are
considered similar.
 Jaccard Similarity: Measures the similarity between two sets by
dividing the size of their intersection by the size of their union. This
can be useful for detecting near-duplicates in documents
represented by sets of words or n-grams.
 Minhashing: This technique is an efficient way of approximating
Jaccard similarity and is particularly useful for large datasets. It helps
in detecting similar sets of documents by comparing their hash
values, reducing the dimensionality of the problem.
 Edit Distance (Levenshtein Distance): This measures the
number of insertions, deletions, or substitutions needed to convert
one document into another. It's often used for text comparison,
although it may be computationally expensive for long documents.
 Latent Semantic Analysis (LSA): LSA is used to identify
patterns in the relationships between terms in a set of documents.
It uses singular value decomposition (SVD) to reduce the
dimensionality of the term-document matrix, helping to identify
latent semantic structures that can be used to determine document
similarity.
 Word Embeddings: Modern approaches involve using word
embeddings (like Word2Vec, GloVe, or BERT) to represent words
as vectors in a continuous vector space. These embeddings capture
the semantic meaning of words, allowing for the comparison of
documents based on their semantic content rather than exact
word matches.
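
As a concrete illustration of the Jaccard and Minhashing techniques above, the
sketch below shingles documents into word 3-grams, computes exact Jaccard
similarity, and approximates it with a small MinHash signature. The shingle
length, signature size, salt-based hash construction, and sample documents are
illustrative choices, not a production design.

# Near-duplicate detection: word shingles, exact Jaccard, and a MinHash approximation.
import random
import re

def shingles(text, k=3):
    """Set of word k-grams (shingles) for a document."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def minhash_signature(shingle_set, num_hashes=64, seed=42):
    """One minimum per salted hash function; matching positions estimate Jaccard similarity."""
    rng = random.Random(seed)
    salts = [rng.getrandbits(32) for _ in range(num_hashes)]
    return [min(hash((salt, s)) for s in shingle_set) for salt in salts]

def minhash_similarity(sig_a, sig_b):
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

doc1 = "Web search engines use crawlers to gather pages from the web."
doc2 = "Web search engines use spiders to gather pages from the web."
doc3 = "A recipe for tomato pasta with fresh basil."

s1, s2, s3 = shingles(doc1), shingles(doc2), shingles(doc3)
print("exact Jaccard(doc1, doc2): %.2f" % jaccard(s1, s2))   # high: near-duplicates
print("exact Jaccard(doc1, doc3): %.2f" % jaccard(s1, s3))   # near zero: unrelated
print("MinHash estimate(doc1, doc2): %.2f" %
      minhash_similarity(minhash_signature(s1), minhash_signature(s2)))
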
3. Machine Learning Approaches:
 Supervised Models: Train a classifier (e.g., logistic regression,
support vector machines) to detect near-duplicates based on
features like cosine similarity, edit distance, or other domain-
specific features.
 Deep Learning: With the advancement of deep learning
techniques, models like Siamese networks can be trained to
detect near-duplicates by learning a similarity function directly from
the data.
Applications
 Web Search Engines: Search engines like Google and Bing use
near-duplicate detection to ensure that search results are diverse
and not cluttered with redundant information.
 Document Deduplication: In large-scale information retrieval
tasks, identifying near-duplicates helps in keeping datasets clean,
such as in news aggregators or academic databases.
 Plagiarism Detection: Systems like Turnitin or Copyscape use
near-duplicate detection to flag copied content across academic
papers, articles, and websites.
 Content Recommendation: In recommendation systems,
identifying similar items or content (e.g., articles, videos, etc.) helps
in providing relevant suggestions to users.

Index Compression
Index compression refers to techniques used to reduce the size of
inverted indexes, which store mappings from terms (words or phrases)
to the documents they appear in. Since large-scale IR systems, such as
search engines, often work with massive datasets, reducing the storage
footprint of indexes without sacrificing retrieval performance is crucial.
Why Index Compression is Important
 Efficiency: Compressed indexes require less disk space, which
makes them faster to load into memory and reduces the number of
disk accesses, improving retrieval time.
 Cost-effective: Storing compressed indexes is cheaper, especially
for large-scale systems.
 Scalability: As data grows, compressed indexes help the system
scale more efficiently without needing disproportionately large
storage systems.
Types of Index Compression
1. Dictionary Compression:
o Goal: Compress the list of terms (the dictionary).
o Common Techniques:
 Huffman Coding: Assigns shorter codes to frequent
terms and longer codes to infrequent terms.
 Front Coding: Stores the common prefixes of terms
efficiently, reducing redundancy.
 Delta Encoding: Instead of storing the entire term,
stores only the differences (deltas) between consecutive
terms in a sorted order. This is effective when terms
are lexicographically ordered.
 Variable-Length Encoding: Allocates different bit
lengths for terms depending on their frequency or
length.
2. Postings List Compression:
o Goal: Compress the lists of document IDs (postings) for each
term.
o Common Techniques:
 Gap Encoding (Delta Encoding): Instead of storing
absolute document IDs, stores the differences between
consecutive document IDs. This leads to smaller values
when the document IDs are close to each other.
 Gamma Coding: Encodes integers using a
combination of binary and unary codes, suitable for
small integers (like document IDs).
 Run-Length Encoding (RLE): If there are long
stretches of consecutive document IDs, RLE stores the
number of consecutive occurrences, reducing
redundancy.
 Simple9 and PForDelta: Special encoding techniques
optimized for storing postings lists that achieve high
compression ratios for certain types of data.
//Simple9 and PForDelta are both integer compression
algorithms that are used to pack numbers into words or
compress groups of numbers
3. Combined Compression:
o Goal: Integrate multiple compression techniques to compress
the entire inverted index effectively.
o Common Techniques:
 Block-based compression: Postings lists are divided
into blocks, and each block is compressed separately
using various techniques like PForDelta or Gamma
coding.
 Static vs. Dynamic Compression: Static
compression methods assume the index does not
change over time, while dynamic methods handle
updates and deletions efficiently.
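
The gap (delta) encoding and gamma coding mentioned above can be shown end to
end on a small postings list. The sketch below converts document IDs to gaps,
Elias-gamma-encodes each gap into a bit string, and decodes it back; it works
on Python strings of '0'/'1' for readability rather than packed bytes, and the
postings list itself is invented.

# Compress a postings list: doc IDs -> gaps -> Elias gamma codes (as a bit string).
def to_gaps(doc_ids):
    """[3, 7, 11, 12] -> [3, 4, 4, 1]; small gaps compress better than raw IDs."""
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

def gamma_encode(n):
    """Elias gamma code of n >= 1: (len-1) zeros, then n in binary (which starts with 1)."""
    binary = bin(n)[2:]
    return "0" * (len(binary) - 1) + binary

def gamma_decode_all(bits):
    values, i = [], 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":          # unary part: number of leading zeros
            zeros += 1
            i += 1
        values.append(int(bits[i:i + zeros + 1], 2))
        i += zeros + 1
    return values

def from_gaps(gaps):
    doc_ids, total = [], 0
    for g in gaps:
        total += g
        doc_ids.append(total)
    return doc_ids

postings = [3, 7, 11, 12, 20]
gaps = to_gaps(postings)                       # [3, 4, 4, 1, 8]
encoded = "".join(gamma_encode(g) for g in gaps)
print(encoded)                                 # 011 00100 00100 1 0001000 (without spaces)
decoded = from_gaps(gamma_decode_all(encoded))
assert decoded == postings
print(decoded)                                 # [3, 7, 11, 12, 20]
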
Benefits of Index Compression
 Reduced Storage: Compressed indexes take up significantly less
space, which is important for systems that manage huge amounts of
data.
 Improved Speed: Although compression involves some overhead
for decompression, the reduced index size allows for faster
retrieval operations due to reduced disk I/O and memory
consumption.
 Faster Search: Smaller index sizes mean less data needs to be
processed during query evaluation, speeding up the search process.
