10-Searching The Web

Information Storage and Retrieval course material chapter 10.

Uploaded by

Samuel Ketema

Chapter 10: Searching the Web
Adama Science and Technology University
School of Electrical Engineering and Computing
Department of CSE
Dr. Mesfin Abebe Haile (2024)

Content

• Searching the Web:
  • Search engines
  • Browsing

Web

• The Web is a huge, widely distributed, highly heterogeneous, semistructured, interconnected, evolving hypertext/hypermedia information repository.
• Main issues on the Web:
  • Abundance of information: 99% of the information is of no interest to 99% of the users.
  • The static Web is only a very small part of the whole Web; the rest consists of dynamic websites.
  • To access the Web, users need to exploit Search Engines (SE).
  • SEs must be improved:
    • To help people better formulate their information needs.
    • To offer more personalization.

Web: Trends and Features

• Besides the growth in the number of pages, pages are also continuously updated or removed.
  • About 23% of all pages are modified daily.
  • In the .com domain, this percentage rises to 40%.
  • On average, half of the new pages are removed after 10 days.
    • Their URLs are no longer valid.

Web Search

• The Internet has become the largest source of information.
• A web search engine is a type of website that helps computer users find information on the Internet.
• It does this by looking through other web pages for the text the user wants to find.
• The software that does this is known as a search engine.
• Search engines deploy a program called a spider or robot, designed to track down web pages.

Web Search

• The spider follows the links these pages contain and adds the information it finds to the search engine's database.
  • It is also called a "web crawler".
• Index: a database containing a copy of each Web page gathered by the spider.
• Search engine software: technology that enables users to query the index and that returns the results in ranked order.

Web Search

• How do web search engines get all of the items they index?
• Main idea behind web search engines:
  • Start with known sites,
  • Record information for these sites,
  • Follow the links from each site,
  • Record information found at new sites,
  • Repeat.
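
The loop described above can be sketched as a breadth-first traversal. This is a minimal illustration, not how any particular search engine crawls: the `TOY_WEB` dictionary, its URLs, and the `crawl` function are hypothetical stand-ins, and an in-memory lookup replaces real HTTP requests.

```python
from collections import deque

# Hypothetical toy "web": URL -> (page text, outgoing links). Not real sites.
TOY_WEB = {
    "a.example": ("home page", ["b.example", "c.example"]),
    "b.example": ("about search", ["c.example"]),
    "c.example": ("about crawling", ["a.example", "d.example"]),
    "d.example": ("contact page", []),
}

def crawl(seeds, fetch, max_pages=100):
    """Breadth-first crawl: record each page, then follow its links."""
    frontier = deque(seeds)      # start with known sites
    seen = set(seeds)            # URLs already queued, to avoid revisiting
    recorded = {}                # URL -> recorded information (here: the text)
    while frontier and len(recorded) < max_pages:
        url = frontier.popleft()
        page = fetch(url)
        if page is None:         # dead link: nothing to record
            continue
        text, links = page
        recorded[url] = text     # record information for this site
        for link in links:       # follow the links from the site
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return recorded              # the while loop is the "repeat" step

pages = crawl(["a.example"], TOY_WEB.get)
print(list(pages))
```

The `seen` set is what keeps the crawl from looping forever on pages that link back to each other, and `max_pages` is an assumed budget that stops the crawl early.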

Types of Search Engines

• Keyword Search:
  • Uses keywords to perform the search.
• Multimedia Search Engines:
  • Used to find graphics, video clips, animation, and music files.
• Meta Search Engines:
  • Search several major search engines at one time.
• Subject Directories:
  • Organized by subject categories and displayed in a series of menus.

Different Search Engines

• The real differences between search engines are:
  • Their index weighting schemes, including the context in which terms appear (e.g., title, body, emphasized words).
  • Their query processing methods (e.g., query classification, query expansion).
  • Their ranking algorithms.
• Few of these are published by the search engine companies; they are tightly guarded secrets.

Web Search as a Huge IR System

Anatomy of a modern Web Search Engine

Crawler

• A crawler is a program that navigates the Web by following hyperlinks and stores the pages it downloads in a page repository.
• Design issues of the Crawl module:
  • Which pages to download,
  • When to refresh them,
  • How to minimize the load on web sites,
  • How to parallelize the process.

Crawler

• Page selection during crawling relies on an importance metric:
  • Given a page P, define how “good” that page is on the basis of several metrics (or a combination of them):
    • Popularity driven: incoming-link counts (or PageRank).
    • Location driven: depth of the page within a site.
    • Usage driven: click counts of the pages (feedback).
    • Interest driven: derived from a query, based on its similarity with the page contents (focused crawling).
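
One way to combine the four metrics is a weighted sum. The sketch below is only an illustration: the `PageStats` fields, the `WEIGHTS` values, and the normalisation choices are all assumptions, not taken from any real crawler.

```python
from dataclasses import dataclass

@dataclass
class PageStats:
    url: str
    inlinks: int        # popularity: number of incoming links
    depth: int          # location: depth of the page within its site
    clicks: int         # usage: observed click count
    similarity: float   # interest: similarity to the driving query, in [0, 1]

# Hypothetical weights, chosen only for illustration.
WEIGHTS = {"popularity": 0.4, "location": 0.2, "usage": 0.2, "interest": 0.2}

def importance(p: PageStats, max_inlinks: int, max_clicks: int) -> float:
    """Weighted sum of the four signals, each normalised to [0, 1]."""
    popularity = p.inlinks / max_inlinks if max_inlinks else 0.0
    location = 1.0 / (1 + p.depth)          # shallower pages score higher
    usage = p.clicks / max_clicks if max_clicks else 0.0
    return (WEIGHTS["popularity"] * popularity
            + WEIGHTS["location"] * location
            + WEIGHTS["usage"] * usage
            + WEIGHTS["interest"] * p.similarity)

def order_frontier(pages):
    """Return the pages sorted so the most important one is crawled first."""
    max_in = max((p.inlinks for p in pages), default=0)
    max_cl = max((p.clicks for p in pages), default=0)
    return sorted(pages, key=lambda p: importance(p, max_in, max_cl), reverse=True)

frontier = [
    PageStats("site.example/a/b/c", inlinks=2, depth=3, clicks=5, similarity=0.9),
    PageStats("site.example/", inlinks=50, depth=0, clicks=200, similarity=0.3),
]
print([p.url for p in order_frontier(frontier)])
```

With these assumed weights the popular, shallow home page outranks the deep page even though the deep page is more similar to the query; shifting weight toward `interest` would move the crawler toward focused crawling.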

Indexer and Page Repository

Storage: Page Repository

• The Page Repository is a scalable storage system for web pages:
  • It allows the Crawler to store pages.
  • It allows the Indexer and Collection Analysis modules to retrieve them.
  • It is similar to other data storage systems, such as databases or file systems.
  • It does not have to provide some of those systems’ features: transactions, logging, directory.

Designing a Distributed Page Repository

• The repository is designed to work over a cluster of interconnected nodes.
• Page distribution across nodes:
  • Uniform distribution: any page can be sent to any node.
  • Hash distribution policy: hash the page ID space into the node ID space.
• Physical organization within a node.
• Update strategy:
  • Batch (executed periodically)
  • Steady (runs all the time)
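
The hash distribution policy can be illustrated with a few lines of Python. This is a sketch under assumptions: the `node_for_page` helper, the use of SHA-256, and the cluster size are illustrative choices, not part of the course material.

```python
import hashlib

def node_for_page(page_id: str, num_nodes: int) -> int:
    """Map a page ID to a node ID in range(num_nodes) using a stable hash."""
    digest = hashlib.sha256(page_id.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an integer, then fold into node space.
    return int.from_bytes(digest[:8], "big") % num_nodes

NUM_NODES = 4  # hypothetical cluster size
for page_id in ["a.example/", "a.example/about", "b.example/"]:
    print(page_id, "->", node_for_page(page_id, NUM_NODES))
```

A cryptographic digest is used here instead of Python's built-in `hash()` because string hashes are randomized per process by default, so they would not give the same node on every machine. Note that a plain modulo mapping reassigns most pages when the number of nodes changes.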

Indexer and Collection Analysis Modules

• The Indexer module creates two indexes:
  • Text (content) index: uses “traditional” indexing methods such as inverted indexing.
  • Structure (links) index: uses a directed graph of pages and links. Sometimes it also creates an inverted graph, in order to answer queries that ask for all the pages that have hyperlinks pointing to a given page.
• The Collection Analysis module uses the two basic indexes created by the Indexer module to assemble “utility indexes”.
  • e.g., a site index.

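
The two indexes can be sketched on a toy collection. The `PAGES` data and the function names below are hypothetical, and whitespace splitting stands in for real text processing.

```python
from collections import defaultdict

# Hypothetical toy collection: page ID -> (text, outgoing links).
PAGES = {
    "p1": ("web search engines", ["p2", "p3"]),
    "p2": ("web crawler", ["p3"]),
    "p3": ("search index", []),
}

def build_text_index(pages):
    """Inverted index: term -> sorted list of page IDs containing it."""
    index = defaultdict(set)
    for page_id, (text, _links) in pages.items():
        for term in text.lower().split():
            index[term].add(page_id)
    return {term: sorted(ids) for term, ids in index.items()}

def build_link_indexes(pages):
    """Directed link graph plus its inverse (which pages link to a page)."""
    forward = {page_id: list(links) for page_id, (_text, links) in pages.items()}
    inverted = defaultdict(list)
    for source, targets in forward.items():
        for target in targets:
            inverted[target].append(source)
    return forward, dict(inverted)

text_index = build_text_index(PAGES)
forward, backward = build_link_indexes(PAGES)
print(text_index["search"])   # pages containing the term "search"
print(backward["p3"])         # pages with hyperlinks pointing to p3
```

The inverted graph answers "who links to this page?" with a single lookup, which the forward graph alone could only answer by scanning every page.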

Indexer: Design Issues and Challenges

• The index build must be:
  • Fast,
  • Economical (unlike traditional index builds).
• Incremental indexing must be supported.
• Personalization.
• Storage: compression vs. speed.

Indexer Partitioning

• Partitioning inverted files:
  • Local inverted file: each node contains the index of a disjoint partition of the document collection. The query is broadcast to all nodes, and answers are obtained by merging the local results.
  • Global inverted file: each node is responsible only for a subset of the terms in the collection. The query is sent selectively to the appropriate nodes only.
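
The difference between the two partitioning schemes can be shown on a toy collection. Everything in this sketch is an assumption made for illustration: the `DOCS` data, the two-node cluster, and the simple rules that assign documents and terms to nodes.

```python
from collections import defaultdict

DOCS = {1: "web search", 2: "web crawler", 3: "search index", 4: "crawler index"}
NUM_NODES = 2  # hypothetical cluster size

def invert(docs):
    """Build term -> set of document IDs for the given documents."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.split():
            index[term].add(doc_id)
    return index

# Local inverted files: each node indexes its own disjoint slice of documents.
local_nodes = [invert({d: t for d, t in DOCS.items() if d % NUM_NODES == n})
               for n in range(NUM_NODES)]

def query_local(term):
    """Broadcast the term to every node and merge the local answers."""
    hits = set()
    for node in local_nodes:
        hits |= node.get(term, set())
    return sorted(hits)

# Global inverted file: each node owns the full posting list of some terms.
def owner(term):
    # Toy, deterministic term-to-node assignment (an assumption of this sketch).
    return sum(term.encode("utf-8")) % NUM_NODES

global_nodes = [dict() for _ in range(NUM_NODES)]
for term, postings in invert(DOCS).items():
    global_nodes[owner(term)][term] = postings

def query_global(term):
    """Send the term only to the node that owns it."""
    return sorted(global_nodes[owner(term)].get(term, set()))

print(query_local("search"), query_global("search"))
```

Both functions return the same documents; they differ in how many nodes are contacted per query term (all of them for the local scheme, one for the global scheme).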

Query Engine

• The query engine module accepts queries from multitudes of users and returns the results.
  • It exploits the partitioned index to quickly find the relevant pages.
  • It uses the Page Repository to prepare the page of (10) results; snippet construction is query-based.
  • Since the number of possible results is huge, the ranking module has to order the results according to their relevance.
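
Query-based snippet construction can be approximated by picking the window of text that contains the most query-term occurrences. The `make_snippet` function and its `width` parameter are hypothetical; real engines use more elaborate heuristics that the slides do not describe.

```python
def make_snippet(text: str, query: str, width: int = 8) -> str:
    """Return the `width`-word window of `text` with the most query-term occurrences."""
    words = text.split()
    terms = {t.lower() for t in query.split()}
    best_start, best_hits = 0, -1
    for start in range(max(1, len(words) - width + 1)):
        window = words[start:start + width]
        # Compare case-insensitively and ignore trailing punctuation.
        hits = sum(1 for w in window if w.lower().strip(".,;:!?") in terms)
        if hits > best_hits:
            best_start, best_hits = start, hits
    snippet = " ".join(words[best_start:best_start + width])
    prefix = "... " if best_start > 0 else ""
    suffix = " ..." if best_start + width < len(words) else ""
    return prefix + snippet + suffix

page = ("The crawler stores pages in a repository. "
        "The indexer builds an inverted index over the stored pages.")
print(make_snippet(page, "inverted index", width=5))
```

Because the window depends on the query, the same stored page yields different snippets for different queries, which is why the query engine needs the page text from the Page Repository at query time.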

Query Engine

• Ranking:
  • Not based only on traditional IR content-based approaches:
    • Terms may be of poor quality or not relevant.
    • Insufficient self-description of user intent.
  • Combat spam:
    • Link analysis, e.g., PageRank, which exploits incoming links from “important” pages to raise the rank of pages.
    • Exploit the proximity of query terms in the pages.
    • Learning to rank.
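
The idea behind link analysis can be sketched with a simplified PageRank computed by power iteration. This is a textbook-style illustration, not the algorithm of any production engine: the damping value 0.85, the iteration count, and the `GRAPH` data are assumptions, and every link target is assumed to appear as a key of `links`.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration.

    `links` maps each page to the list of pages it links to.
    Pages without outgoing links spread their rank evenly over all pages.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share of rank.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, targets in links.items():
            if targets:
                # A page passes its rank, in equal parts, to the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical four-page graph: every other page links to "hub".
GRAPH = {"hub": ["a"], "a": ["hub"], "b": ["hub"], "c": ["hub", "a"]}
ranks = pagerank(GRAPH)
print(max(ranks, key=ranks.get))
```

A page gains rank both from the number of incoming links and from the rank of the pages that link to it, so here "a" ends up far above "b" and "c" because it is linked from the highly ranked "hub".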

Web Browsing

• Web browsing, or surfing, usually refers to simply using a web browser to look through pages on the Internet.
  • Browsing does not have a specific goal.
  • Example: looking at any website, such as Amazon, through any web browser.
• Web searching is the use of search engines to do research or planned information seeking on the Internet.
  • Searching has a specific goal.
  • Example: you want to find the difference between web searching and browsing using any SE.

Question & Answer

Thank You!

06/20/24
