IR Module 3
Crawling:
The search engine sends out automated programs called web
crawlers or spiders to systematically browse the web,
following links from one page to another.
These crawlers collect information about the content and
structure of web pages, which is then passed on for indexing.
Indexing:
The collected information is then processed and stored in an
index, which is essentially a large database.
This index contains information about the words or phrases
found on each web page, along with their location (URL),
metadata, and other relevant data.
Query Processing:
When a user enters a search query, the search engine
analyzes the query and retrieves relevant documents from its
index.
This process involves matching the query terms with the
indexed content and ranking the results based on various
factors such as relevance, authority, and freshness.
Ranking:
Once the relevant documents are retrieved, the search
engine ranks them based on their perceived relevance to the
user's query.
This ranking is typically determined by complex algorithms
that take into account factors such as keyword frequency,
inbound links, page quality, user engagement metrics, and
more.
Presentation of Results:
Finally, the search engine displays the search results to the
user, usually in a list format on a search engine results page
(SERP).
Each result typically includes a title, snippet (a brief summary
of the page content), and URL, along with other optional
features like images, videos, featured snippets, and ads.
Popular web search engines include Google, Bing, Yahoo, Baidu, and
DuckDuckGo, among others. These search engines continuously update
and refine their algorithms to provide users with the most relevant and
useful search results possible. Additionally, they offer various features and
tools to enhance the search experience, such as filters, advanced search
operators, and personalized recommendations.
CRAWLING
Seed URLs:
The crawling process typically begins with a set of seed URLs,
which are the starting points for the crawler. These URLs
could be provided manually or generated programmatically.
Fetching:
The crawler starts by fetching the content of the seed URLs.
It makes HTTP requests to the web servers hosting the
pages and retrieves the HTML, CSS, JavaScript, and other
resources associated with each URL.
Parsing:
Once the content is fetched, the crawler parses the HTML to
extract links to other pages within the same website (internal
links) and links to external websites (external links).
It also extracts other relevant information, such as text
content, metadata, and structural elements.
URL Frontier:
The crawler maintains a queue of URLs known as the URL
frontier or crawl frontier.
This queue contains URLs discovered during the crawling
process but not yet visited.
The crawler prioritizes URLs based on factors such as
relevance, freshness, and popularity.
Exploration:
The crawler continues to explore the web by recursively
following links from the URL frontier to new pages.
It may also revisit previously crawled pages to check for
updates or changes.
Robots Exclusion Protocol:
During the crawling process, crawlers respect the rules
specified in the Robots Exclusion Protocol (robots.txt).
This standard allows website owners to control which parts
of their site are accessible to crawlers and which are not.
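As a concrete illustration, here is a minimal sketch of how a crawler might consult robots.txt before fetching a page, using Python's standard urllib.robotparser module; the site URL and user-agent string are placeholders for illustration, not values from these notes.

from urllib import robotparser
import time

USER_AGENT = "ExampleBot"                          # hypothetical crawler identity
rp = robotparser.RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")   # placeholder site
rp.read()                                          # fetch and parse robots.txt

url = "https://www.example.com/private/page.html"
if rp.can_fetch(USER_AGENT, url):
    print("Allowed to crawl:", url)
else:
    print("Disallowed by robots.txt:", url)

delay = rp.crawl_delay(USER_AGENT)                 # None if the site sets no delay
if delay:
    time.sleep(delay)                              # simple politeness pause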
Politeness:
Crawlers avoid overloading web servers by spacing out successive
requests to the same host and honoring any crawl delay a site
specifies.
Dynamic Content:
Pages generated dynamically (for example through JavaScript) are
harder to crawl, and crawlers may need to render or otherwise
process such pages to capture their content.
Duplicate Content:
Crawlers must identify and handle duplicate content
effectively to ensure that each page is indexed only once.
In summary, web crawling is a crucial process for collecting data from the
web and is essential for the functioning of search engines and various
other web-based applications.
Types of search engines
1. Crawler based
Crawler-based search engines use automated software
programs to survey and categorize web pages.
The programs used by the search engines to access the web
pages are called spiders, crawlers, robots or bots.
Examples:
a) Google (www.google.com) b) Ask Jeeves (www.ask.com)
2. Directories
A directory uses human editors who decide what category
the site belongs to; they place websites within specific
categories in the directory's database.
The human editors comprehensively check the website and
rank it, based on the information they find, using a pre-
defined set of rules.
Examples:
a) Yahoo Directory (www.yahoo.com) b) Open Directory
(www.dmoz.org)
3. Hybrid Search Engines
Hybrid search engines use a combination of both crawler-
based results and directory results.
More and more search engines these days are moving to a
hybrid-based model.
Examples:
a) Yahoo (www.yahoo.com) b) Google (www.google.com)
4. Meta Search Engines
Meta search engines take the results from other search
engines and combine them into one large listing.
Examples:
a) Metacrawler (www.metacrawler.com) b) Dogpile
(www.dogpile.com)
Indexing:
Search engines maintain massive indexes of web pages and
other content crawled from the web.
These indexes are organized and structured databases that
enable efficient retrieval of relevant information in response
to user queries.
Crawling:
Crawling is the process by which search engines
systematically browse the web to discover and collect
information from web pages.
Web crawlers, also known as spiders or bots, navigate
through hyperlinks on web pages to find new content and
update the search engine's index.
Ranking Algorithm:
Search engines use complex ranking algorithms to determine
the relevance and importance of web pages in response to a
user query.
Factors such as keyword relevance, content quality,
authority, and user engagement metrics are considered when
ranking search results.
User Interface:
The search engine's user interface presents search results to
the user in a user-friendly format, typically as a list of links on
a search engine results page (SERP).
The SERP may also include additional features such as
snippets, images, videos, knowledge panels, and
advertisements.
Web structure
The bow-tie structure of the web is a model that describes the
World Wide Web (WWW) in terms of three main sectors: the
Strongly Connected Component (SCC), a core in which every page
can reach every other page by following links; the IN set, pages
that can reach the SCC but cannot be reached from it; and the
OUT set, pages reachable from the SCC that do not link back into
it. The structure is often depicted as a bow tie with finger-like
projections (tendrils and tubes) attached to the IN and OUT sets.
What is SEO
Search Engine Optimization refers to the set of activities
performed to increase the number of desirable visitors who come
to our site via a search engine.
These activities may include things we do to the site itself, such
as making changes to our text and HTML code, or formatting text
or documents to communicate directly with the search engine.
Define SEO
Search Engine Optimization is the process of improving the visibility
of a website on organic ("natural" or un-paid) search engine result
pages (SERPs), by incorporating search engine friendly elements into
a website.
Search engine optimization is broken down into two basic areas:
on-page, and off-page optimization.
On-page optimization refers to website elements which comprise a
web page, such as HTML code, textual content, and images. Off-page
optimization refers, predominantly, to backlinks (links pointing to
the site being optimized from other relevant websites).
On-Page Factors
1. Title tags <title>
2. Header tags <h1>
3. ALT image tags
4. Content (body text) <body>
5. Hyperlink text
6. Keyword frequency and density
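As a rough illustration of how keyword density (factor 6 above) is commonly computed, the sketch below divides keyword occurrences by the total word count; the sample text and keyword are invented for the example.

import re

def keyword_density(text, keyword):
    # Tokenize on word characters and compare case-insensitively.
    words = re.findall(r"[a-z0-9']+", text.lower())
    if not words:
        return 0.0
    hits = sum(1 for w in words if w == keyword.lower())
    return 100.0 * hits / len(words)   # density as a percentage

page_text = "Fresh coffee beans. Buy coffee online and get coffee delivered."
print(round(keyword_density(page_text, "coffee"), 1))   # 30.0 (3 of 10 words)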
Off-Page Factors
1. Anchor text
2. Link Popularity ("votes" for your site) – adds credibility
//Anchor text is the clickable text in a hyperlink, also known as link text
or link label. It's the part of a link that users see and click to navigate to
another page
Backlinks are links from other websites that point to a page on your
website.
Link popularity is a way to measure how many backlinks a website has
and how high-quality they are.//
The Search Engine Optimization Techniques
1. Domain name strategies
Domain naming is important to our overall foundation; prefer
sub-directory root domains (example.com/awesome) over
sub-domains (awesome.example.com).
All inbound links pointing to a website's main domain
(www.example.com), as well as to its internal pages, contribute to
the domain authority.
Getting a backlink from a high domain authority site is always
valuable.
2. Linking strategies
the text in the links should include keywords
the more inbound links, the higher the SE ranking
if the site linking to us is already indexed, spiders will also reach
our site
3. Keywords
the most important factor in optimizing rankings
keywords are the words that appear most often on a page
the spider chooses the appropriate keywords for each page, then
sends them back to its SE
our web site will then be indexed based on our keywords
keywords can be key phrases or a single keyword
do not use common words, e.g. 'the' and 'of': spiders ignore them
4. Title tags
The title tag on pages of your website tells search engines what
the page is about.
It should be 70 characters or less and include your business or
brand name and keywords that relate to that specific page only.
This tag is placed between the <HEAD> </HEAD> tags near the
top of the HTML code for the page.
5. Meta and description tags
displayed below the title in search results
use dynamic, promotional language
use keywords
<meta name="description" content="Free Web tutorials
on HTML, CSS, XML, and XHTML">
6. Alt tags
- include keywords in your alt tags
<img src="star.gif" alt="star logo">
7. Submit your website to SEs for indexing
-submit your site to search engine directories, directory sites and portal
sites
- indexing takes time
Always follow White Hat SEO tactics and do not try to fool your site
visitors. Be honest and you will definitely get something more.
Always stay away from Black Hat tactics when trying to improve the
rank of our site. Search engines are smart enough to identify such
practices, and ultimately we will not gain anything from them.
SEO Tools
Keyword Discovery - Find popular keywords that your site
should be targeting.
Keyword Volume - Estimate how much search traffic a specific
keyword receives each month.
Keyword Density - Analyze how well a specific webpage is
targeting a keyword.
Back link Trackers - Keep track of how many sites are linking to
you.
Site Popularity - Determine how popular your site is on the
internet.
Keyword Rankings - Track your site's rankings for a keyword
in the search engines.
Firefox Add-ons - Install these add-ons to turn your browser
into an SEO research powerhouse. These extensions allow you
to directly analyze website SEO metrics like keyword density,
page authority, backlink profile, and more right within your
browser while browsing different pages.
1. Crawling
Crawling, covered in the CRAWLING sections of this module,
discovers and fetches web pages and passes their content on for
indexing.
2. Indexing
After crawling the web, the next step is indexing, where the
collected data is structured for efficient searching.
Inverted Index: A key data structure used in indexing that
maps keywords to the list of documents (web pages) in
which they appear. This allows for fast lookup of terms
during query processing (a minimal sketch appears after this list).
Document Parsing: Web pages often contain rich media
(text, images, videos, links). The document's content is
parsed to extract text, metadata, and other relevant
features.
Metadata Indexing: In addition to the body text,
metadata like titles, headings, meta descriptions, and anchor
text are also indexed to improve search relevance.
URL Normalization: To avoid indexing duplicates, URL
normalization ensures that URLs pointing to the same
content are treated as identical.
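As noted in the Inverted Index bullet above, here is a minimal Python sketch of the idea: each term is mapped to the set of documents containing it, and a simple AND query intersects those postings. The toy documents are invented for illustration.

from collections import defaultdict

docs = {
    1: "web crawling gathers pages from the web",
    2: "indexing builds an inverted index of terms",
    3: "the inverted index maps terms to pages",
}

# Build the inverted index: term -> set of document IDs.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        index[term].add(doc_id)

def and_query(*terms):
    # Intersect the postings of every query term.
    postings = [index.get(t, set()) for t in terms]
    return set.intersection(*postings) if postings else set()

print(sorted(index["inverted"]))                # [2, 3]
print(sorted(and_query("inverted", "pages")))   # [3]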
3. Query Processing
The user's query is analyzed and matched against the inverted
index to retrieve candidate documents, as described earlier.
4. Ranking
The retrieved documents are ordered by relevance using signals
such as keyword relevance, content quality, authority, and user
engagement.
5. Scalability
Given the vast scale of the web, search engines are typically
built on highly distributed systems to handle the massive
amounts of data and traffic.
Techniques such as sharding, replication, and load
balancing are used to scale the system and ensure
availability and fault tolerance.
// Sharding is a technique for splitting large databases into
smaller parts, or shards, to improve performance and
scalability.
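To illustrate the sharding idea from the note above, the toy sketch below assigns document IDs to shards with a simple modulo hash; the shard count and document IDs are arbitrary assumptions, not a real search-engine configuration.

NUM_SHARDS = 4   # assumed number of index shards

def shard_for(doc_id):
    # Hash-based partitioning: each document lands on exactly one shard.
    return doc_id % NUM_SHARDS

for doc_id in (101, 102, 103, 104, 105):
    print(f"document {doc_id} -> shard {shard_for(doc_id)}")
# A query is then broadcast to all shards and the partial results are merged.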
CRAWLING
Web crawling is the process by which we gather pages from the
Web, in order to index them and support a search engine.
The objective of crawling is to quickly and efficiently gather as many
useful web pages as possible, together with the link structure that
interconnects them.
Features of a Crawler:
A crawler must be robust (able to cope with spider traps and
malformed pages) and polite (respecting robots.txt and not hitting
any one server too often); ideally it should also be distributed,
scalable, efficient, biased towards high-quality and fresh pages,
and extensible to new data formats and protocols.
Basic Operation:
The crawler begins with one or more URLs that constitute a seed
set.
It picks a URL from this seed set, and then fetches the web page at
that URL.
The fetched page is then parsed, to extract both the text and the links
from the page (each of which points to another URL).
The extracted text is fed to a text indexer.
The extracted links (URLs) are then added to a URL frontier,
which at all times consists of URLs whose corresponding pages
have yet to be fetched by the crawler.
Initially, the URL frontier contains the seed set; as pages are fetched,
the corresponding URLs are deleted from the URL frontier. The
entire process may be viewed as traversing the web graph.
The architecture of a basic crawler includes the following modules:
1) The URL frontier, containing URLs yet to be fetched in the
current crawl.
2) A DNS resolution module that determines the web server from
which to fetch the page specified by a URL.
3) A fetch module that uses the HTTP protocol to retrieve the web
page at a URL.
4) A parsing module that extracts the text and set of links from a
fetched web page.
A minimal sketch of this basic crawl loop is given below.
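The following minimal Python sketch ties together the basic operation just described (seed set, URL frontier, fetch, parse, link extraction). It is only an illustration: it omits politeness and robots.txt handling, uses a crude seen-set for duplicate elimination, and the seed URL is a placeholder.

from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen
import re

frontier = deque(["https://www.example.com/"])   # URL frontier, seeded (placeholder)
seen = set(frontier)                             # crude duplicate elimination
pages_fetched = 0
MAX_PAGES = 20                                   # stop condition for the sketch

while frontier and pages_fetched < MAX_PAGES:
    url = frontier.popleft()                     # pick a URL from the frontier
    try:
        html = urlopen(url, timeout=10).read().decode("utf-8", "ignore")
    except Exception:
        continue                                 # fetch module: skip failed fetches
    pages_fetched += 1
    # (the page text would be fed to the text indexer here)
    # Parsing module: a crude regex link extractor, for illustration only.
    for href in re.findall(r'href="([^"]+)"', html):
        link = urljoin(url, href)
        if link.startswith("http") and link not in seen:
            seen.add(link)
            frontier.append(link)                # extracted URLs join the frontier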
Focused Crawling
Focused Crawling in information retrieval is a technique used to
selectively crawl the web for content that is relevant to a
particular topic, rather than indiscriminately downloading all
available web pages. This method is used to target specific areas
of interest and gather information efficiently.
Key Aspects of Focused Crawling:
1. Targeted Information Collection: The goal is to focus on
crawling pages that contain specific information related to a
predefined subject area or topic. This reduces the time and
resources spent on irrelevant or low-quality pages.
2. Topic Modeling: Focused crawlers use topic modeling techniques
to identify relevant content. This can involve analyzing web pages for
keywords, metadata, or structure that indicates a page is about a
particular subject of interest.
3. Resource Efficiency: By narrowing the scope of what is crawled,
focused crawling helps conserve computational resources (e.g.,
bandwidth and processing power), making it more efficient than
general-purpose crawlers.
4. Prioritization of Relevant Links: Instead of blindly following all
links from a page, a focused crawler prioritizes links that are more
likely to lead to pages relevant to the topic. This often involves a
ranking mechanism that assesses the relevance of a link before
following it.
5. Relevance Feedback: Some focused crawlers incorporate
relevance feedback mechanisms, where the system learns which
pages are relevant or not as it crawls. This feedback loop helps
refine the crawling process over time.
6. Depth vs. Breadth: Unlike traditional crawlers that might try to
visit a broad range of domains or URLs, focused crawlers go deep
into specific parts of the web, exploring more pages from a small
set of highly relevant domains or sections of the internet.
Applications of Focused Crawling:
Topic-Specific Search Engines: To create specialized search
engines, such as ones focused on legal documents, medical papers,
or academic research.
Web Mining and Knowledge Discovery: For gathering data on
specific subjects or trends.
Building Domain-Specific Databases: Such as news aggregation
systems, scientific paper repositories, or e-commerce comparison
tools.
Monitoring and Alerting Systems: For tracking changes or
updates in a specific field, like news, weather, or academic papers.
How It Works:
1. Initial Seed URLs: A set of starting URLs is chosen, usually based
on the topic of interest.
2. Relevance Function: The crawler evaluates the content of each
page using a relevance function (e.g., based on keyword matching,
metadata, or other criteria), as sketched after this list.
3. Link Following: The crawler follows links from relevant pages and
repeats the process, expanding its crawl to other pages that are
likely to be relevant.
4. Stop Criteria: The crawling process continues until a predefined
stopping condition is met, such as when the crawler has collected
enough data or when there are no more relevant pages to crawl.
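A minimal sketch of the focused-crawl loop described above: URLs sit in a priority queue ordered by a simple keyword-overlap relevance function, so links discovered on relevant pages are explored first. The fetch step is stubbed out, and the topic keywords, threshold, and seed URL are assumptions made for illustration.

import heapq

TOPIC_KEYWORDS = {"medical", "clinical", "disease", "treatment"}   # assumed topic
THRESHOLD = 0.25                                                    # assumed cut-off

def relevance(text):
    # Toy relevance function: fraction of topic keywords present in the page.
    words = set(text.lower().split())
    return len(words & TOPIC_KEYWORDS) / len(TOPIC_KEYWORDS)

def fetch(url):
    # Stub: a real crawler would download the page and extract its links.
    return "clinical treatment guidelines ...", ["https://example.org/page2"]

frontier = [(-1.0, "https://example.org/seed")]   # max-priority via negated score
visited = set()

while frontier:
    neg_score, url = heapq.heappop(frontier)      # most promising URL first
    if url in visited:
        continue
    visited.add(url)
    text, links = fetch(url)
    score = relevance(text)
    if score < THRESHOLD:
        continue                                  # irrelevant page: do not expand it
    for link in links:
        if link not in visited:
            heapq.heappush(frontier, (-score, link))   # child inherits parent's score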
Challenges:
Noise and Irrelevance: Filtering out irrelevant pages while
crawling can be difficult, especially with the vast amount of content
available online.
Scalability: Focusing on a particular topic might require
sophisticated techniques to scale, especially when the topic spans a
large part of the web.
Dynamic Nature of the Web: Websites and pages change
frequently, so maintaining an up-to-date index of relevant content is
a challenge.
Near-duplicate detection
Near-duplicate detection in information retrieval (IR) refers to
the process of identifying documents or pieces of content that
are very similar to one another, though they may not be exactly
the same.
This concept plays an essential role in improving search engine
results, data quality, and user experience, especially in scenarios
where large amounts of content, such as news articles, research
papers, or user-generated content, need to be processed and
displayed.
Importance
1. Efficient Storage and Retrieval: Identifying near-duplicates
allows for more efficient storage by avoiding the duplication of
content in databases.
2. Improved Search Results: Users are less likely to encounter
redundant information, leading to more varied and relevant search
results.
3. Data Quality and Integrity: In some cases, near-duplicates may
indicate low-quality or spammy content, which should be filtered
out to improve the overall quality of the dataset.
4. Plagiarism Detection: Near-duplicate detection is also widely
used to identify cases of plagiarism in academic writing, online
content, etc.
Techniques for Near-Duplicate Detection
Several methods can be used to detect near-duplicates, and they can be
broadly categorized into exact matching and approximate
matching.
1. Exact Matching:
Hashing: Documents can be hashed and compared directly for
matches. If two documents have the same hash, they are
considered identical (exact duplicates). However, this method does
not work well for documents that have slight variations.
// Hashing is a process that creates a unique string of characters
that represents the document.
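As a quick illustration of hash-based exact matching, the sketch below fingerprints documents with SHA-256 from Python's hashlib; a single-character change yields a completely different hash, which is exactly why this approach misses near-duplicates.

import hashlib

def doc_hash(text):
    # Normalize whitespace so trivially identical documents hash the same.
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

a = "The quick brown fox jumps over the lazy dog."
b = "The quick brown fox jumps over the lazy dog."
c = "The quick brown fox jumps over the lazy dogs."   # one character added

print(doc_hash(a) == doc_hash(b))   # True  -> exact duplicates
print(doc_hash(a) == doc_hash(c))   # False -> the near-duplicate is missed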
2. Approximate Matching:
This approach focuses on detecting documents that are similar but not
exactly the same.
Fingerprinting: This technique involves creating a "fingerprint" for
each document (usually using hash functions) and comparing these
fingerprints. A good example is SimHash (used in Google's search
algorithms), which provides a compact representation of
documents while tolerating minor changes.
Cosine Similarity: This is a popular vector-based approach. Each
document is converted into a vector, and the similarity between
documents is computed using cosine similarity. If the cosine
similarity is above a certain threshold, the documents are
considered similar.
Jaccard Similarity: Measures the similarity between two sets by
dividing the size of their intersection by the size of their union. This
can be useful for detecting near-duplicates in documents
represented by sets of words or n-grams (see the sketch after this
list of techniques).
Minhashing: This technique is an efficient way of approximating
Jaccard similarity and is particularly useful for large datasets. It helps
in detecting similar sets of documents by comparing their hash
values, reducing the dimensionality of the problem.
Edit Distance (Levenshtein Distance): This measures the
number of insertions, deletions, or substitutions needed to convert
one document into another. It's often used for text comparison,
although it may be computationally expensive for long documents.
Latent Semantic Analysis (LSA): LSA is used to identify
patterns in the relationships between terms in a set of documents.
It uses singular value decomposition (SVD) to reduce the
dimensionality of the term-document matrix, helping to identify
latent semantic structures that can be used to determine document
similarity.
Word Embeddings: Modern approaches involve using word
embeddings (like Word2Vec, GloVe, or BERT) to represent words
as vectors in a continuous vector space. These embeddings capture
the semantic meaning of words, allowing for the comparison of
documents based on their semantic content rather than exact
word matches.
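To illustrate the set-based measures above (see the Jaccard item in this list), here is a small sketch that shingles two documents into word 3-grams and computes their Jaccard similarity; the example sentences and the 0.5 threshold are arbitrary choices for illustration. MinHash would approximate the same value without materialising the full shingle sets.

def shingles(text, k=3):
    # k-word shingles (n-grams) of a document.
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    # |intersection| / |union| of the two shingle sets.
    return len(a & b) / len(a | b) if (a or b) else 1.0

d1 = "the cat sat on the mat and purred quietly"
d2 = "the cat sat on the mat and purred loudly"

sim = jaccard(shingles(d1), shingles(d2))
print(round(sim, 2))                              # 0.75: only the last shingles differ
print("near-duplicate" if sim >= 0.5 else "distinct")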
3. Machine Learning Approaches:
Supervised Models: Train a classifier (e.g., logistic regression,
support vector machines) to detect near-duplicates based on
features like cosine similarity, edit distance, or other domain-
specific features.
Deep Learning: With the advancement of deep learning
techniques, models like Siamese networks can be trained to
detect near-duplicates by learning a similarity function directly from
the data.
Applications
Web Search Engines: Search engines like Google and Bing use
near-duplicate detection to ensure that search results are diverse
and not cluttered with redundant information.
Document Deduplication: In large-scale information retrieval
tasks, identifying near-duplicates helps in keeping datasets clean,
such as in news aggregators or academic databases.
Plagiarism Detection: Systems like Turnitin or Copyscape use
near-duplicate detection to flag copied content across academic
papers, articles, and websites.
Content Recommendation: In recommendation systems,
identifying similar items or content (e.g., articles, videos, etc.) helps
in providing relevant suggestions to users.
Index Compression
Index compression refers to techniques used to reduce the size of
inverted indexes, which store mappings from terms (words or phrases)
to the documents they appear in. Since large-scale IR systems, such as
search engines, often work with massive datasets, reducing the storage
footprint of indexes without sacrificing retrieval performance is crucial.
Why Index Compression is Important
Efficiency: Compressed indexes require less disk space, which
makes them faster to load into memory and reduces the number of
disk accesses, improving retrieval time.
Cost-effective: Storing compressed indexes is cheaper, especially
for large-scale systems.
Scalability: As data grows, compressed indexes help the system
scale more efficiently without needing disproportionately large
storage systems.
Types of Index Compression
1. Dictionary Compression:
o Goal: Compress the list of terms (the dictionary).
o Common Techniques:
Huffman Coding: Assigns shorter codes to frequent
terms and longer codes to infrequent terms.
Front Coding: Stores the common prefixes of terms
efficiently, reducing redundancy (see the sketch after this list).
Delta Encoding: Instead of storing the entire term,
stores only the differences (deltas) between consecutive
terms in a sorted order. This is effective when terms
are lexicographically ordered.
Variable-Length Encoding: Allocates different bit
lengths for terms depending on their frequency or
length.
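A minimal sketch of the front-coding idea flagged above: each term in the sorted dictionary is stored as the length of the prefix it shares with the previous term plus its remaining suffix. The term list is a toy example.

def front_encode(sorted_terms):
    # Each entry: (length of shared prefix with previous term, remaining suffix).
    encoded, prev = [], ""
    for term in sorted_terms:
        common = 0
        while common < min(len(prev), len(term)) and prev[common] == term[common]:
            common += 1
        encoded.append((common, term[common:]))
        prev = term
    return encoded

def front_decode(encoded):
    terms, prev = [], ""
    for common, suffix in encoded:
        term = prev[:common] + suffix
        terms.append(term)
        prev = term
    return terms

terms = ["automata", "automate", "automatic", "automation"]
enc = front_encode(terms)
print(enc)                          # [(0, 'automata'), (7, 'e'), (7, 'ic'), (8, 'on')]
print(front_decode(enc) == terms)   # True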
2. Postings List Compression:
o Goal: Compress the lists of document IDs (postings) for each
term.
o Common Techniques:
Gap Encoding (Delta Encoding): Instead of storing
absolute document IDs, stores the differences between
consecutive document IDs. This leads to smaller values
when the document IDs are close to each other (a small
sketch of gap and gamma encoding follows this list).
Gamma Coding: Encodes integers using a
combination of binary and unary codes, suitable for
small integers (like document IDs).
Run-Length Encoding (RLE): If there are long
stretches of consecutive document IDs, RLE stores the
number of consecutive occurrences, reducing
redundancy.
Simple9 and PForDelta: Special encoding techniques
optimized for storing postings lists that achieve high
compression ratios for certain types of data.
//Simple9 and PForDelta are both integer compression
algorithms that are used to pack numbers into words or
compress groups of numbers
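As flagged in the gap-encoding item above, here is a small sketch combining gap encoding with gamma coding: document IDs are first turned into gaps, and each gap is written as a unary length prefix followed by its binary offset. The postings list is a toy example.

def gaps(doc_ids):
    # Store the first ID as-is and every later ID as the difference (gap).
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

def gamma_encode(n):
    # Gamma code of n >= 1: unary code for the length of the offset, then the
    # offset itself (binary representation of n without its leading 1 bit).
    offset = bin(n)[3:]                 # drop '0b1'
    return "1" * len(offset) + "0" + offset

postings = [3, 7, 11, 21, 22]
g = gaps(postings)                      # [3, 4, 4, 10, 1]
code = "".join(gamma_encode(x) for x in g)
print(g)
print(code)   # 101 11000 11000 1110010 0  (spaces added here for readability)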
3. Combined Compression:
o Goal: Integrate multiple compression techniques to compress
the entire inverted index effectively.
o Common Techniques:
Block-based compression: Postings lists are divided
into blocks, and each block is compressed separately
using various techniques like PForDelta or Gamma
coding.
Static vs. Dynamic Compression: Static
compression methods assume the index does not
change over time, while dynamic methods handle
updates and deletions efficiently.
Benefits of Index Compression
Reduced Storage: Compressed indexes take up significantly less
space, which is important for systems that manage huge amounts of
data.
Improved Speed: Although compression involves some overhead
for decompression, the reduced index size allows for faster
retrieval operations due to reduced disk I/O and memory
consumption.
Faster Search: Smaller index sizes mean less data needs to be
processed during query evaluation, speeding up the search process.