10-Searching The Web
Adama Science and Technology University
School of Electrical Engineering and Computing
Department of CSE
Dr. Mesfin Abebe Haile (2024)
Web
Besides the growth in the number of pages, the pages are also
continuously updated or removed.
About 23% of all pages are modified daily.
In the .com domain, this percentage rises to 40%.
On average, after 10 days, half of the new pages are removed:
their URLs are no longer valid.
Web Search
A search engine finds information by looking through web pages for the
text the user wants to find.
The search engine follows the links these pages contain and
adds information about them to the search engine's database.
The program that follows the links is also called a "web crawler".
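The link-following idea above can be illustrated with a minimal sketch. It assumes a hypothetical in-memory "web" (the `TOY_WEB` dictionary) instead of real HTTP requests, and the names `crawl` and `search` are illustrative only, not part of any real search engine.

```python
from collections import deque

# Hypothetical toy web: URL -> (page text, outgoing links). Not real sites.
TOY_WEB = {
    "a.example/": ("welcome to the home page", ["a.example/about", "b.example/"]),
    "a.example/about": ("about this site", ["a.example/"]),
    "b.example/": ("another site about search", ["a.example/about", "c.example/missing"]),
}

def crawl(seed, web, max_pages=100):
    """Follow links breadth-first from `seed`; return {url: text} for fetched pages."""
    database = {}               # stands in for the search engine's database
    frontier = deque([seed])    # URLs waiting to be visited
    seen = {seed}               # URLs already queued, to avoid revisiting
    while frontier and len(database) < max_pages:
        url = frontier.popleft()
        page = web.get(url)
        if page is None:        # dead link: the URL is no longer valid
            continue
        text, links = page
        database[url] = text
        for link in links:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return database

def search(database, word):
    """Return the URLs whose stored text contains `word` as a whole term."""
    return sorted(url for url, text in database.items() if word in text.split())
```

For example, `crawl("a.example/", TOY_WEB)` stores the three reachable pages and skips the dead link, after which `search(db, "about")` returns the two pages that mention that word.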
How do the web search engines get all of the items they index?
Types of Search Engines
Keyword Search:
Uses keywords to perform search.
Multimedia Search Engines:
Used to find graphics, video clips, animation, and music files.
Different Search Engines
Web Search as a Huge IR System
Anatomy of a Modern Web Search Engine
Crawler
Given a page P, define how “good” that page is on the basis of
several metrics (or a combination of them):
Popularity driven: incoming-link counts (or PageRank)
Location driven: depth of the page within a site
Usage driven: click counts of the pages (feedback)
Interest driven: derived from a query, based on the similarity with page
contents (focused crawling)
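One way to combine the four metrics above is a weighted sum that orders the crawl frontier. The sketch below is only an illustration: the `PageStats` fields, the normalisation formulas, and the default weights are all assumptions, not values given in the slides.

```python
import heapq
from dataclasses import dataclass

@dataclass
class PageStats:
    in_links: int       # popularity: number of incoming links
    depth: int          # location: how deep the page sits within its site
    clicks: int         # usage: click count from user feedback
    similarity: float   # interest: similarity to the driving query, in [0, 1]

def quality(stats, weights=(0.4, 0.2, 0.2, 0.2)):
    """Combine the four metrics into one score in [0, 1] (weights are assumed)."""
    w_pop, w_loc, w_use, w_int = weights
    popularity = stats.in_links / (1 + stats.in_links)  # grows towards 1
    location = 1 / (1 + stats.depth)                    # shallow pages score higher
    usage = stats.clicks / (1 + stats.clicks)
    return (w_pop * popularity + w_loc * location
            + w_use * usage + w_int * stats.similarity)

def crawl_order(candidates):
    """Return candidate URLs from best to worst score using a heap."""
    heap = [(-quality(stats), url) for url, stats in candidates.items()]
    heapq.heapify(heap)  # min-heap on negated score, so the best page pops first
    return [heapq.heappop(heap)[1] for _ in range(len(heap))]
```

With these assumed weights, a shallow, well-linked page is crawled before a deep page that has no incoming links.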
Indexer and Page Repository
Storage: Page Repository
Designing a Distributed Page Repository
Repository designed to work over a cluster of interconnected
nodes.
Page distribution across nodes:
Uniform distribution – any page can be sent to any node.
Hash distribution policy – hash page ID space into node ID space.
Physical organization within a node.
Update strategy:
Batch (Periodically executed)
Steady (Run all the time)
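The hash distribution policy can be sketched as follows. The function names and the choice of SHA-256 are assumptions made for illustration; the slides do not prescribe a particular hash function.

```python
import hashlib

def node_for_page(page_id: str, num_nodes: int) -> int:
    """Map a page ID to a node ID in [0, num_nodes) by hashing the page ID."""
    digest = hashlib.sha256(page_id.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an integer, then reduce to a node ID.
    return int.from_bytes(digest[:8], "big") % num_nodes

def distribute(page_ids, num_nodes):
    """Group page IDs by the node each one is assigned to."""
    placement = {node: [] for node in range(num_nodes)}
    for page_id in page_ids:
        placement[node_for_page(page_id, num_nodes)].append(page_id)
    return placement
```

Because the node is computed from the page ID alone, any node can locate a page without a lookup table. Under the uniform policy, by contrast, the placement of each page would have to be recorded somewhere.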
Indexer and Collection Analysis Modules
The Indexer module creates two indexes:
Text (content) index: uses traditional indexing methods such as
inverted indexing.
Structure (link) index: uses a directed graph of pages and links.
Sometimes an inverted graph is also created, in order to answer queries
that ask for all the pages that have hyperlinks pointing to a given
page.
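Both indexes can be illustrated with a small sketch. The function names and the whitespace tokenisation are simplifying assumptions; a production indexer would also handle stemming, stop words, and term positions.

```python
from collections import defaultdict

def build_text_index(pages):
    """Inverted index: term -> set of URLs whose text contains the term."""
    index = defaultdict(set)
    for url, text in pages.items():
        for term in text.lower().split():
            index[term].add(url)
    return index

def invert_links(links):
    """Inverted link graph: target URL -> set of URLs that link to it."""
    incoming = defaultdict(set)
    for source, targets in links.items():
        for target in targets:
            incoming[target].add(source)
    return incoming
```

Given the forward graph `{"a": ["c"], "b": ["c"]}`, `invert_links` maps `"c"` to `{"a", "b"}`, which directly answers the query "which pages have hyperlinks pointing to c?".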
Indexer Partitioning
Query Engine
Ranking:
Not only based on traditional IR content-based approaches.
Terms may be of poor quality or not relevant.
Insufficient self-description of user intent.
Combat spam:
Link analysis, e.g. PageRank, which exploits incoming links from
“important” pages to raise the rank of a page.
Exploit proximity of query terms in the pages.
Learning to rank.
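The link-analysis idea can be sketched with a simple power-iteration version of PageRank. The damping factor of 0.85, the fixed iteration count, and the treatment of pages without outgoing links are common conventions assumed here, not details taken from the slides.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank for a graph given as {page: [pages it links to]}."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}   # start from a uniform distribution
    for _ in range(iterations):
        new_rank = {p: (1 - damping) / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                # A page passes its rank on equally to the pages it links to.
                share = damping * rank[p] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Assumed convention: a page with no out-links spreads its rank evenly.
                share = damping * rank[p] / n
                for q in pages:
                    new_rank[q] += share
        rank = new_rank
    return rank
```

In the graph `{"a": ["c"], "b": ["c"], "c": ["a"]}`, page `c` receives links from both `a` and `b` and ends up with the highest rank, while `b`, which has no incoming links, ends up with the lowest.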
Web Browsing
06/20/24