IR Module 3
Crawling:
The search engine sends out automated programs called web
crawlers or spiders to systematically browse the web,
following links from one page to another.
These crawlers collect information about the content and
structure of web pages, which is then passed on for indexing.
Indexing:
The collected information is then processed and stored in an
index, which is essentially a large database.
This index contains information about the words or phrases
found on each web page, along with their location (URL),
metadata, and other relevant data.
Query Processing:
When a user enters a search query, the search engine
analyzes the query and retrieves relevant documents from its
index.
This process involves matching the query terms with the
indexed content and ranking the results based on various
factors such as relevance, authority, and freshness.
Ranking:
Once the relevant documents are retrieved, the search
engine ranks them based on their perceived relevance to the
user's query.
This ranking is typically determined by complex algorithms
that take into account factors such as keyword frequency,
inbound links, page quality, user engagement metrics, and
more.
Presentation of Results:
Finally, the search engine displays the search results to the
user, usually in a list format on a search engine results page
(SERP).
Each result typically includes a title, snippet (a brief summary
of the page content), and URL, along with other optional
features like images, videos, featured snippets, and ads.
Popular web search engines include Google, Bing, Yahoo, Baidu, and
DuckDuckGo, among others. These search engines continuously update
and refine their algorithms to provide users with the most relevant and
useful search results possible. Additionally, they offer various features and
tools to enhance the search experience, such as filters, advanced search
operators, and personalized recommendations.
CRAWLING
Seed URLs:
The crawling process typically begins with a set of seed URLs,
which are the starting points for the crawler. These URLs
could be provided manually or generated programmatically.
Fetching:
The crawler starts by fetching the content of the seed URLs.
It makes HTTP requests to the web servers hosting the
pages and retrieves the HTML, CSS, JavaScript, and other
resources associated with each URL.
Parsing:
Once the content is fetched, the crawler parses the HTML to
extract links to other pages within the same website (internal
links) and links to external websites (external links).
It also extracts other relevant information, such as text
content, metadata, and structural elements.
URL Frontier:
The crawler maintains a queue of URLs known as the URL
frontier or crawl frontier.
This queue contains URLs discovered during the crawling
process but not yet visited.
The crawler prioritizes URLs based on factors such as
relevance, freshness, and popularity.
Exploration:
The crawler continues to explore the web by recursively
following links from the URL frontier to new pages.
It may also revisit previously crawled pages to check for
updates or changes.
Robots Exclusion Protocol:
During the crawling process, crawlers respect the rules
specified in the Robots Exclusion Protocol (robots.txt).
This standard allows website owners to control which parts
of their site are accessible to crawlers and which are not.
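As a concrete illustration, here is a minimal sketch of how a crawler might consult robots.txt before fetching a page, using Python's standard urllib.robotparser module; the site URL and user-agent string are placeholders for illustration, not values from these notes.

from urllib import robotparser
import time

USER_AGENT = "ExampleBot"                          # hypothetical crawler identity
rp = robotparser.RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")   # placeholder site
rp.read()                                          # fetch and parse robots.txt

url = "https://www.example.com/private/page.html"
if rp.can_fetch(USER_AGENT, url):
    print("Allowed to crawl:", url)
else:
    print("Disallowed by robots.txt:", url)

delay = rp.crawl_delay(USER_AGENT)                 # None if the site sets no delay
if delay:
    time.sleep(delay)                              # simple politeness pause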
Politeness:
Crawlers avoid overloading web servers by spacing out successive
requests to the same host and honoring any crawl delay a site
specifies.
Dynamic Content:
Pages generated dynamically (for example through JavaScript) are
harder to crawl, and crawlers may need to render or otherwise
process such pages to capture their content.
Duplicate Content:
Crawlers must identify and handle duplicate content
effectively to ensure that each page is indexed only once.
In summary, web crawling is a crucial process for collecting data from the
web and is essential for the functioning of search engines and various
other web-based applications.
Types of search engines
1. Crawler based
Crawler-based search engines use automated software
programs to survey and categorize web pages.
The programs used by the search engines to access the web
pages are called spiders, crawlers, robots or bots.
Examples:
a) Google (www.google.com) b) Ask Jeeves (www.ask.com)
2. Directories
A directory uses human editors who decide what category
the site belongs to; they place websites within specific
categories in the directory's database.
The human editors comprehensively check the website and
rank it, based on the information they find, using a pre-
defined set of rules.
Examples:
a) Yahoo Directory (www.yahoo.com) b) Open Directory
(www.dmoz.org)
3. Hybrid Search Engines
Hybrid search engines use a combination of both crawler-
based results and directory results.
More and more search engines these days are moving to a
hybrid-based model.
Examples:
a) Yahoo (www.yahoo.com) b) Google (www.google.com)
4. Meta Search Engines
Meta search engines take the results from other search
engines and combine them into one large listing.
Examples:
a) Metacrawler (www.metacrawler.com) b) Dogpile
(www.dogpile.com)
Indexing:
Search engines maintain massive indexes of web pages and
other content crawled from the web.
These indexes are organized and structured databases that
enable efficient retrieval of relevant information in response
to user queries.
Crawling:
Crawling is the process by which search engines
systematically browse the web to discover and collect
information from web pages.
Web crawlers, also known as spiders or bots, navigate
through hyperlinks on web pages to find new content and
update the search engine's index.
Ranking Algorithm:
Search engines use complex ranking algorithms to determine
the relevance and importance of web pages in response to a
user query.
Factors such as keyword relevance, content quality,
authority, and user engagement metrics are considered when
ranking search results.
User Interface:
The search engine's user interface presents search results to
the user in a user-friendly format, typically as a list of links on
a search engine results page (SERP).
The SERP may also include additional features such as
snippets, images, videos, knowledge panels, and
advertisements.
Web structure
The bow-tie structure of the web is a model that describes the
World Wide Web (WWW) in terms of three main sectors: the
Strongly Connected Component (SCC), a core in which every page
can reach every other page by following links; the IN set, pages
that can reach the SCC but cannot be reached from it; and the
OUT set, pages reachable from the SCC that do not link back into
it. The structure is often depicted as a bow tie with finger-like
projections (tendrils and tubes) attached to the IN and OUT sets.
What is SEO
Search Engine Optimization refers to the set of activities
performed to increase the number of desirable visitors who come
to our site via a search engine.
These activities may include things we do to the site itself, such
as making changes to our text and HTML code, or formatting text
or documents to communicate directly with the search engine.
Define SEO
Search Engine Optimization is the process of improving the visibility
of a website on organic ("natural" or un-paid) search engine result
pages (SERPs), by incorporating search engine friendly elements into
a website.
Search engine optimization is broken down into two basic areas:
on-page, and off-page optimization.
On-page optimization refers to website elements which comprise a
web page, such as HTML code, textual content, and images. Off-page
optimization refers, predominantly, to backlinks (links pointing to
the site being optimized from other relevant websites).
On-Page Factors
1. Title tags <title>
2. Header tags <h1>
3. ALT image tags
4. Content (body text) <body>
5. Hyperlink text
6. Keyword frequency and density
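As a rough illustration of how keyword density (factor 6 above) is commonly computed, the sketch below divides keyword occurrences by the total word count; the sample text and keyword are invented for the example.

import re

def keyword_density(text, keyword):
    # Tokenize on word characters and compare case-insensitively.
    words = re.findall(r"[a-z0-9']+", text.lower())
    if not words:
        return 0.0
    hits = sum(1 for w in words if w == keyword.lower())
    return 100.0 * hits / len(words)   # density as a percentage

page_text = "Fresh coffee beans. Buy coffee online and get coffee delivered."
print(round(keyword_density(page_text, "coffee"), 1))   # 30.0 (3 of 10 words)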
Off-Page Factors
1. Anchor text
2. Link Popularity ("votes" for your site) – adds credibility
//Anchor text is the clickable text in a hyperlink, also known as link text
or link label. It's the part of a link that users see and click to navigate to
another page
Backlinks are links from other websites that point to a page on your
website.
Link popularity is a way to measure how many backlinks a website has
and how high-quality they are.//
The Search Engine Optimization Techniques
1. Domain name strategies
Domain naming is important to our overall foundation; prefer
sub-directory root domains (example.com/awesome) over
sub-domains (awesome.example.com).
All inbound links pointing to a website's main domain
(www.example.com), as well as to its internal pages, contribute to
the domain authority.
Getting a backlink from a high domain authority site is always
valuable.
2. Linking strategies
the text in the links should include keywords
the more inbound links, the higher the SE ranking
if the site linking to us is already indexed, spiders will also reach
our site
3. Keywords
the most important factor in optimizing rankings
keywords are the words that appear most often on a page
the spider chooses the appropriate keywords for each page, then
sends them back to its SE
our web site will then be indexed based on our keywords
keywords can be key phrases or a single keyword
do not use common words, e.g. 'the' and 'of': spiders ignore them
4. Title tags
The title tag on pages of your website tells search engines what
the page is about.
It should be 70 characters or less and include your business or
brand name and keywords that relate to that specific page only.
This tag is placed between the <HEAD> </HEAD> tags near the
top of the HTML code for the page.
5. Meta and description tags
displayed below the title in search results
use dynamic, promotional language
use keywords
<meta name="description" content="Free Web tutorials
on HTML, CSS, XML, and XHTML">
6. Alt tags
- include keywords in your alt tags
<img src="star.gif" alt="star logo">
7. Submit your website to SEs for indexing
-submit your site to search engine directories, directory sites and portal
sites
- indexing takes time
Always follow White Hat SEO tactics and do not try to fool your site
visitors. Be honest and you will definitely get something more.
Always stay away from Black Hat tactics when trying to improve the
rank of our site. Search engines are smart enough to identify such
practices, and ultimately we will not gain anything from them.
SEO Tools
Keyword Discovery - Find popular keywords that your site
should be targeting.
Keyword Volume - Estimate how much search traffic a specific
keyword receives each month.
Keyword Density - Analyze how well a specific webpage is
targeting a keyword.
Back link Trackers - Keep track of how many sites are linking to
you.
Site Popularity - Determine how popular your site is on the
internet.
Keyword Rankings - Track your site's rankings for a keyword
in the search engines.
Firefox Add-ons - Install these add-ons to turn your browser
into an SEO research powerhouse. These extensions allow you
to directly analyze website SEO metrics like keyword density,
page authority, backlink profile, and more right within your
browser while browsing different pages.
1. Crawling
Crawling, covered in the CRAWLING sections of this module,
discovers and fetches web pages and passes their content on for
indexing.
2. Indexing
After crawling the web, the next step is indexing, where the
collected data is structured for efficient searching.
Inverted Index: A key data structure used in indexing that
maps keywords to the list of documents (web pages) in
which they appear. This allows for fast lookup of terms
during query processing (a minimal sketch appears after this list).
Document Parsing: Web pages often contain rich media
(text, images, videos, links). The document's content is
parsed to extract text, metadata, and other relevant
features.
Metadata Indexing: In addition to the body text,
metadata like titles, headings, meta descriptions, and anchor
text are also indexed to improve search relevance.
URL Normalization: To avoid indexing duplicates, URL
normalization ensures that URLs pointing to the same
content are treated as identical.
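As noted in the Inverted Index bullet above, here is a minimal Python sketch of the idea: each term is mapped to the set of documents containing it, and a simple AND query intersects those postings. The toy documents are invented for illustration.

from collections import defaultdict

docs = {
    1: "web crawling gathers pages from the web",
    2: "indexing builds an inverted index of terms",
    3: "the inverted index maps terms to pages",
}

# Build the inverted index: term -> set of document IDs.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        index[term].add(doc_id)

def and_query(*terms):
    # Intersect the postings of every query term.
    postings = [index.get(t, set()) for t in terms]
    return set.intersection(*postings) if postings else set()

print(sorted(index["inverted"]))                # [2, 3]
print(sorted(and_query("inverted", "pages")))   # [3]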
3. Query Processing
The user's query is analyzed and matched against the inverted
index to retrieve candidate documents, as described earlier.
4. Ranking
The retrieved documents are ordered by relevance using signals
such as keyword relevance, content quality, authority, and user
engagement.
5. Scalability
Given the vast scale of the web, search engines are typically
built on highly distributed systems to handle the massive
amounts of data and traffic.
Techniques such as sharding, replication, and load
balancing are used to scale the system and ensure
availability and fault tolerance.
// Sharding is a technique for splitting large databases into
smaller parts, or shards, to improve performance and
scalability.
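To illustrate the sharding idea from the note above, the toy sketch below assigns document IDs to shards with a simple modulo hash; the shard count and document IDs are arbitrary assumptions, not a real search-engine configuration.

NUM_SHARDS = 4   # assumed number of index shards

def shard_for(doc_id):
    # Hash-based partitioning: each document lands on exactly one shard.
    return doc_id % NUM_SHARDS

for doc_id in (101, 102, 103, 104, 105):
    print(f"document {doc_id} -> shard {shard_for(doc_id)}")
# A query is then broadcast to all shards and the partial results are merged.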
CRAWLING
Web crawling is the process by which we gather pages from the
Web, in order to index them and support a search engine.
The objective of crawling is to quickly and efficiently gather as many
useful web pages as possible, together with the link structure that
interconnects them.
Features of a Crawler:
A crawler must be robust (able to cope with spider traps and
malformed pages) and polite (respecting robots.txt and not hitting
any one server too often); ideally it should also be distributed,
scalable, efficient, biased towards high-quality and fresh pages,
and extensible to new data formats and protocols.
Basic Operation:
The crawler begins with one or more URLs that constitute a seed
set.
It picks a URL from this seed set, and then fetches the web page at
that URL.
The fetched page is then parsed, to extract both the text and the links
from the page (each of which points to another URL).
The extracted text is fed to a text indexer.
The extracted links (URLs) are then added to a URL frontier,
which at all times consists of URLs whose corresponding pages
have yet to be fetched by the crawler.
Initially, the URL frontier contains the seed set; as pages are fetched,
the corresponding URLs are deleted from the URL frontier. The
entire process may be viewed as traversing the web graph.
The architecture of a basic crawler includes the following modules:
1) The URL frontier, containing URLs yet to be fetched in the
current crawl.
2) A DNS resolution module that determines the web server from
which to fetch the page specified by a URL.
3) A fetch module that uses the HTTP protocol to retrieve the web
page at a URL.
4) A parsing module that extracts the text and set of links from a
fetched web page.
A minimal sketch of this basic crawl loop is given below.
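The following minimal Python sketch ties together the basic operation just described (seed set, URL frontier, fetch, parse, link extraction). It is only an illustration: it omits politeness and robots.txt handling, uses a crude seen-set for duplicate elimination, and the seed URL is a placeholder.

from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen
import re

frontier = deque(["https://www.example.com/"])   # URL frontier, seeded (placeholder)
seen = set(frontier)                             # crude duplicate elimination
pages_fetched = 0
MAX_PAGES = 20                                   # stop condition for the sketch

while frontier and pages_fetched < MAX_PAGES:
    url = frontier.popleft()                     # pick a URL from the frontier
    try:
        html = urlopen(url, timeout=10).read().decode("utf-8", "ignore")
    except Exception:
        continue                                 # fetch module: skip failed fetches
    pages_fetched += 1
    # (the page text would be fed to the text indexer here)
    # Parsing module: a crude regex link extractor, for illustration only.
    for href in re.findall(r'href="([^"]+)"', html):
        link = urljoin(url, href)
        if link.startswith("http") and link not in seen:
            seen.add(link)
            frontier.append(link)                # extracted URLs join the frontier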
Focused Crawling
Focused Crawling in information retrieval is a technique used to
selectively crawl the web for content that is relevant to a
particular topic, rather than indiscriminately downloading all
available web pages. This method is used to target specific areas
of interest and gather information efficiently.
Key Aspects of Focused Crawling:
1. Targeted Information Collection: The goal is to focus on
crawling pages that contain specific information related to a
predefined subject area or topic. This reduces the time and
resources spent on irrelevant or low-quality pages.
2. Topic Modeling: Focused crawlers use topic modeling techniques
to identify relevant content. This can involve analyzing web pages for
keywords, metadata, or structure that indicates a page is about a
particular subject of interest.
3. Resource Efficiency: By narrowing the scope of what is crawled,
focused crawling helps conserve computational resources (e.g.,
bandwidth and processing power), making it more efficient than
general-purpose crawlers.
4. Prioritization of Relevant Links: Instead of blindly following all
links from a page, a focused crawler prioritizes links that are more
likely to lead to pages relevant to the topic. This often involves a
ranking mechanism that assesses the relevance of a link before
following it.
5. Relevance Feedback: Some focused crawlers incorporate
relevance feedback mechanisms, where the system learns which
pages are relevant or not as it crawls. This feedback loop helps
refine the crawling process over time.
6. Depth vs. Breadth: Unlike traditional crawlers that might try to
visit a broad range of domains or URLs, focused crawlers go deep
into specific parts of the web, exploring more pages from a small
set of highly relevant domains or sections of the internet.
Applications of Focused Crawling:
Topic-Specific Search Engines: To create specialized search
engines, such as ones focused on legal documents, medical papers,
or academic research.
Web Mining and Knowledge Discovery: For gathering data on
specific subjects or trends.
Building Domain-Specific Databases: Such as news aggregation
systems, scientific paper repositories, or e-commerce comparison
tools.
Monitoring and Alerting Systems: For tracking changes or
updates in a specific field, like news, weather, or academic papers.
How It Works:
1. Initial Seed URLs: A set of starting URLs is chosen, usually based
on the topic of interest.
2. Relevance Function: The crawler evaluates the content of each
page using a relevance function (e.g., based on keyword matching,
metadata, or other criteria), as sketched after this list.
3. Link Following: The crawler follows links from relevant pages and
repeats the process, expanding its crawl to other pages that are
likely to be relevant.
4. Stop Criteria: The crawling process continues until a predefined
stopping condition is met, such as when the crawler has collected
enough data or when there are no more relevant pages to crawl.
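A minimal sketch of the focused-crawl loop described above: URLs sit in a priority queue ordered by a simple keyword-overlap relevance function, so links discovered on relevant pages are explored first. The fetch step is stubbed out, and the topic keywords, threshold, and seed URL are assumptions made for illustration.

import heapq

TOPIC_KEYWORDS = {"medical", "clinical", "disease", "treatment"}   # assumed topic
THRESHOLD = 0.25                                                    # assumed cut-off

def relevance(text):
    # Toy relevance function: fraction of topic keywords present in the page.
    words = set(text.lower().split())
    return len(words & TOPIC_KEYWORDS) / len(TOPIC_KEYWORDS)

def fetch(url):
    # Stub: a real crawler would download the page and extract its links.
    return "clinical treatment guidelines ...", ["https://example.org/page2"]

frontier = [(-1.0, "https://example.org/seed")]   # max-priority via negated score
visited = set()

while frontier:
    neg_score, url = heapq.heappop(frontier)      # most promising URL first
    if url in visited:
        continue
    visited.add(url)
    text, links = fetch(url)
    score = relevance(text)
    if score < THRESHOLD:
        continue                                  # irrelevant page: do not expand it
    for link in links:
        if link not in visited:
            heapq.heappush(frontier, (-score, link))   # child inherits parent's score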
Challenges:
Noise and Irrelevance: Filtering out irrelevant pages while
crawling can be difficult, especially with the vast amount of content
available online.
Scalability: Focusing on a particular topic might require
sophisticated techniques to scale, especially when the topic spans a
large part of the web.
Dynamic Nature of the Web: Websites and pages change
frequently, so maintaining an up-to-date index of relevant content is
a challenge.
Near-duplicate detection
Near-duplicate detection in information retrieval (IR) refers to
the process of identifying documents or pieces of content that
are very similar to one another, though they may not be exactly
the same.
This concept plays an essential role in improving search engine
results, data quality, and user experience, especially in scenarios
where large amounts of content, such as news articles, research
papers, or user-generated content, need to be processed and
displayed.
Importance
1. Efficient Storage and Retrieval: Identifying near-duplicates
allows for more efficient storage by avoiding the duplication of
content in databases.
2. Improved Search Results: Users are less likely to encounter
redundant information, leading to more varied and relevant search
results.
3. Data Quality and Integrity: In some cases, near-duplicates may
indicate low-quality or spammy content, which should be filtered
out to improve the overall quality of the dataset.
4. Plagiarism Detection: Near-duplicate detection is also widely
used to identify cases of plagiarism in academic writing, online
content, etc.
Techniques for Near-Duplicate Detection
Several methods can be used to detect near-duplicates, and they can be
broadly categorized into exact matching and approximate
matching.
1. Exact Matching:
Hashing: Documents can be hashed and compared directly for
matches. If two documents have the same hash, they are
considered identical (exact duplicates). However, this method does
not work well for documents that have slight variations.
// Hashing is a process that creates a unique string of characters
that represents the document.
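As a quick illustration of hash-based exact matching, the sketch below fingerprints documents with SHA-256 from Python's hashlib; a single-character change yields a completely different hash, which is exactly why this approach misses near-duplicates.

import hashlib

def doc_hash(text):
    # Normalize whitespace so trivially identical documents hash the same.
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

a = "The quick brown fox jumps over the lazy dog."
b = "The quick brown fox jumps over the lazy dog."
c = "The quick brown fox jumps over the lazy dogs."   # one character added

print(doc_hash(a) == doc_hash(b))   # True  -> exact duplicates
print(doc_hash(a) == doc_hash(c))   # False -> the near-duplicate is missed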
2. Approximate Matching:
This approach focuses on detecting documents that are similar but not
exactly the same.
Fingerprinting: This technique involves creating a "fingerprint" for
each document (usually using hash functions) and comparing these
fingerprints. A good example is SimHash (used in Google's search
algorithms), which provides a compact representation of
documents while tolerating minor changes.
Cosine Similarity: This is a popular vector-based approach. Each
document is converted into a vector, and the similarity between
documents is computed using cosine similarity. If the cosine
similarity is above a certain threshold, the documents are
considered similar.
Jaccard Similarity: Measures the similarity between two sets by
dividing the size of their intersection by the size of their union. This
can be useful for detecting near-duplicates in documents
represented by sets of words or n-grams (see the sketch after this
list of techniques).
Minhashing: This technique is an efficient way of approximating
Jaccard similarity and is particularly useful for large datasets. It helps
in detecting similar sets of documents by comparing their hash
values, reducing the dimensionality of the problem.
Edit Distance (Levenshtein Distance): This measures the
number of insertions, deletions, or substitutions needed to convert
one document into another. It's often used for text comparison,
although it may be computationally expensive for long documents.
Latent Semantic Analysis (LSA): LSA is used to identify
patterns in the relationships between terms in a set of documents.
It uses singular value decomposition (SVD) to reduce the
dimensionality of the term-document matrix, helping to identify
latent semantic structures that can be used to determine document
similarity.
Word Embeddings: Modern approaches involve using word
embeddings (like Word2Vec, GloVe, or BERT) to represent words
as vectors in a continuous vector space. These embeddings capture
the semantic meaning of words, allowing for the comparison of
documents based on their semantic content rather than exact
word matches.
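To illustrate the set-based measures above (see the Jaccard item in this list), here is a small sketch that shingles two documents into word 3-grams and computes their Jaccard similarity; the example sentences and the 0.5 threshold are arbitrary choices for illustration. MinHash would approximate the same value without materialising the full shingle sets.

def shingles(text, k=3):
    # k-word shingles (n-grams) of a document.
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    # |intersection| / |union| of the two shingle sets.
    return len(a & b) / len(a | b) if (a or b) else 1.0

d1 = "the cat sat on the mat and purred quietly"
d2 = "the cat sat on the mat and purred loudly"

sim = jaccard(shingles(d1), shingles(d2))
print(round(sim, 2))                              # 0.75: only the last shingles differ
print("near-duplicate" if sim >= 0.5 else "distinct")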
3. Machine Learning Approaches:
Supervised Models: Train a classifier (e.g., logistic regression,
support vector machines) to detect near-duplicates based on
features like cosine similarity, edit distance, or other domain-
specific features.
Deep Learning: With the advancement of deep learning
techniques, models like Siamese networks can be trained to
detect near-duplicates by learning a similarity function directly from
the data.
Applications
Web Search Engines: Search engines like Google and Bing use
near-duplicate detection to ensure that search results are diverse
and not cluttered with redundant information.
Document Deduplication: In large-scale information retrieval
tasks, identifying near-duplicates helps in keeping datasets clean,
such as in news aggregators or academic databases.
Plagiarism Detection: Systems like Turnitin or Copyscape use
near-duplicate detection to flag copied content across academic
papers, articles, and websites.
Content Recommendation: In recommendation systems,
identifying similar items or content (e.g., articles, videos, etc.) helps
in providing relevant suggestions to users.
Index Compression
Index compression refers to techniques used to reduce the size of
inverted indexes, which store mappings from terms (words or phrases)
to the documents they appear in. Since large-scale IR systems, such as
search engines, often work with massive datasets, reducing the storage
footprint of indexes without sacrificing retrieval performance is crucial.
Why Index Compression is Important
Efficiency: Compressed indexes require less disk space, which
makes them faster to load into memory and reduces the number of
disk accesses, improving retrieval time.
Cost-effective: Storing compressed indexes is cheaper, especially
for large-scale systems.
Scalability: As data grows, compressed indexes help the system
scale more efficiently without needing disproportionately large
storage systems.
Types of Index Compression
1. Dictionary Compression:
o Goal: Compress the list of terms (the dictionary).
o Common Techniques:
Huffman Coding: Assigns shorter codes to frequent
terms and longer codes to infrequent terms.
Front Coding: Stores the common prefixes of terms
efficiently, reducing redundancy (see the sketch after this list).
Delta Encoding: Instead of storing the entire term,
stores only the differences (deltas) between consecutive
terms in a sorted order. This is effective when terms
are lexicographically ordered.
Variable-Length Encoding: Allocates different bit
lengths for terms depending on their frequency or
length.
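A minimal sketch of the front-coding idea flagged above: each term in the sorted dictionary is stored as the length of the prefix it shares with the previous term plus its remaining suffix. The term list is a toy example.

def front_encode(sorted_terms):
    # Each entry: (length of shared prefix with previous term, remaining suffix).
    encoded, prev = [], ""
    for term in sorted_terms:
        common = 0
        while common < min(len(prev), len(term)) and prev[common] == term[common]:
            common += 1
        encoded.append((common, term[common:]))
        prev = term
    return encoded

def front_decode(encoded):
    terms, prev = [], ""
    for common, suffix in encoded:
        term = prev[:common] + suffix
        terms.append(term)
        prev = term
    return terms

terms = ["automata", "automate", "automatic", "automation"]
enc = front_encode(terms)
print(enc)                          # [(0, 'automata'), (7, 'e'), (7, 'ic'), (8, 'on')]
print(front_decode(enc) == terms)   # True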
2. Postings List Compression:
o Goal: Compress the lists of document IDs (postings) for each
term.
o Common Techniques:
Gap Encoding (Delta Encoding): Instead of storing
absolute document IDs, stores the differences between
consecutive document IDs. This leads to smaller values
when the document IDs are close to each other (a small
sketch of gap and gamma encoding follows this list).
Gamma Coding: Encodes integers using a
combination of binary and unary codes, suitable for
small integers (like document IDs).
Run-Length Encoding (RLE): If there are long
stretches of consecutive document IDs, RLE stores the
number of consecutive occurrences, reducing
redundancy.
Simple9 and PForDelta: Special encoding techniques
optimized for storing postings lists that achieve high
compression ratios for certain types of data.
//Simple9 and PForDelta are both integer compression
algorithms that are used to pack numbers into words or
compress groups of numbers
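As flagged in the gap-encoding item above, here is a small sketch combining gap encoding with gamma coding: document IDs are first turned into gaps, and each gap is written as a unary length prefix followed by its binary offset. The postings list is a toy example.

def gaps(doc_ids):
    # Store the first ID as-is and every later ID as the difference (gap).
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

def gamma_encode(n):
    # Gamma code of n >= 1: unary code for the length of the offset, then the
    # offset itself (binary representation of n without its leading 1 bit).
    offset = bin(n)[3:]                 # drop '0b1'
    return "1" * len(offset) + "0" + offset

postings = [3, 7, 11, 21, 22]
g = gaps(postings)                      # [3, 4, 4, 10, 1]
code = "".join(gamma_encode(x) for x in g)
print(g)
print(code)   # 101 11000 11000 1110010 0  (spaces added here for readability)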
3. Combined Compression:
o Goal: Integrate multiple compression techniques to compress
the entire inverted index effectively.
o Common Techniques:
Block-based compression: Postings lists are divided
into blocks, and each block is compressed separately
using various techniques like PForDelta or Gamma
coding.
Static vs. Dynamic Compression: Static
compression methods assume the index does not
change over time, while dynamic methods handle
updates and deletions efficiently.
Benefits of Index Compression
Reduced Storage: Compressed indexes take up significantly less
space, which is important for systems that manage huge amounts of
data.
Improved Speed: Although compression involves some overhead
for decompression, the reduced index size allows for faster
retrieval operations due to reduced disk I/O and memory
consumption.
Faster Search: Smaller index sizes mean less data needs to be
processed during query evaluation, speeding up the search process.