10-Searching The Web

Information Storage and Retrieval course material chapter 10.

Uploaded by

Samuel Ketema

Chapter 10: Searching the Web
Adama Science and Technology University
School of Electrical Engineering and Computing
Department of CSE
Dr. Mesfin Abebe Haile (2024)
Content

 Searching the Web:

 Search engines
 Browsing

Web

 The Web is a huge, widely distributed, highly heterogeneous, semi-structured, interconnected, evolving hypertext/hypermedia information repository.
 Main issues on the Web:
 Abundance of information: 99% of the information is of no interest to 99% of users.
 The static Web is only a very small part of the whole Web:
 Dynamic websites
 To access the Web, users need to rely on search engines (SE).
 SEs must be improved:
 To help people better formulate their information needs.
 More personalization is needed.
Web: Trends and Features

 Besides the growth in the number of pages, pages are also continuously updated or removed.
 About 23% of all pages are modified daily.
 In the .com domain, this percentage rises to 40%.
 On average, after 10 days, half of the new pages have been removed.
 Their URLs are no longer valid.

Web Search

 The Internet has become the largest source of information.

 A web search engine is a type of website that helps users find information on the Internet.

 It does this by looking through other web pages for the text the user wants to find.

 The software that does this is known as a search engine.

 Search engines deploy a program called a spider or robot, designed to track down web pages.
Web Search

 The spider follows the links these pages contain and adds information to the search engine's database.
 It is also called a "web crawler".

 Index: a database containing a copy of each web page gathered by the spider.
 Search engine software: technology that enables users to query the index and that returns the results in a ranked order.

Web Search

 How do the web search engines get all of the items they index?

 Main idea behind the web search engines:

 Start with known sites,
 Record information for these sites,
 Follow the links from each site,
 Record information found at new sites,
 Repeat.
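
The loop above can be sketched as a breadth-first traversal. This is a minimal illustration, not how any particular engine works: `TOY_WEB`, `crawl` and the `.example` site names are hypothetical, and an in-memory dictionary stands in for real page downloads.

```python
from collections import deque

# Hypothetical in-memory "web": each site maps to the links found on it.
TOY_WEB = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["c.example", "d.example"],
    "c.example": ["a.example"],
    "d.example": [],
}

def crawl(seeds, fetch_links, max_pages=100):
    """Start with known sites, record them, follow their links, repeat."""
    frontier = deque(seeds)   # sites waiting to be visited
    seen = set(seeds)         # sites already queued, so none is visited twice
    recorded = []             # stand-in for "record information"
    while frontier and len(recorded) < max_pages:
        url = frontier.popleft()
        recorded.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return recorded

visit_order = crawl(["a.example"], lambda url: TOY_WEB.get(url, []))
print(visit_order)
```

The `seen` set is what keeps the loop from running forever when pages link back to each other.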

Types of Search Engines

 Keyword Search:
 Uses keywords to perform search.
 Multimedia Search Engines:
 Used to find graphics, video clips, animation, and music files.

 Meta Search Engines:

 Search several major search engines at one time.
 Subject Directories:
 Organized by subject categories and displayed in a series of menus.
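
As a rough illustration of the meta search idea, the sketch below merges the ranked result lists of several engines. The round-robin merge is only one possible choice and an assumption made here; `meta_search` and the result lists are hypothetical.

```python
from itertools import zip_longest

def meta_search(result_lists):
    """Interleave ranked lists round-robin, keeping the first copy of each URL."""
    merged, seen = [], set()
    for tier in zip_longest(*result_lists):   # tier = the i-th result of every engine
        for url in tier:
            if url is not None and url not in seen:
                seen.add(url)
                merged.append(url)
    return merged

engine_a = ["u1", "u2", "u3"]   # hypothetical results from one engine
engine_b = ["u2", "u4"]         # hypothetical results from another
merged_results = meta_search([engine_a, engine_b])
print(merged_results)
```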

Different Search Engines

 The real differences between search engines are:

 Their index weighting schemes, including the context in which terms appear (e.g., title, body, emphasized words).
 Their query processing methods (e.g., query classification, query expansion).
 Their ranking algorithms.
 Few of these are published by the search engine companies; they are tightly guarded secrets.
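
Since the real weighting schemes are not published, the following is only a sketch of what "context-aware" weighting could look like. The weights in `FIELD_WEIGHTS` and the function `weighted_term_score` are assumptions chosen for illustration.

```python
import re

# Assumed weights: a match in the title counts more than one in the body.
FIELD_WEIGHTS = {"title": 3.0, "emphasized": 2.0, "body": 1.0}

def weighted_term_score(term, page):
    """Sum the term's occurrences in each field, scaled by the field weight."""
    score = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        tokens = re.findall(r"\w+", page.get(field, "").lower())
        score += weight * tokens.count(term.lower())
    return score

sample_page = {
    "title": "Web crawler basics",
    "emphasized": "crawler",
    "body": "A crawler follows links. The crawler stores pages.",
}
crawler_score = weighted_term_score("crawler", sample_page)
print(crawler_score)
```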

Web Search as a Huge IR System

Anatomy of a modern Web Search Engine

Crawler

 A crawler is a program that navigates the Web by following hyperlinks and stores the pages it visits in a page repository.

 Design issues of the crawl module:

 Which pages to download,
 When to refresh them,
 How to minimize the load on web sites,
 How to parallelize the process.
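
One common way to limit the load on a site is to space out requests to the same host. The sketch below only simulates such a schedule (no network access, no sleeping); `schedule_fetches`, the 2-second `min_delay` and the URLs are assumptions for illustration.

```python
from urllib.parse import urlparse

def schedule_fetches(urls, min_delay=2.0):
    """Give each URL the earliest start time that keeps requests to the
    same host at least `min_delay` seconds apart (simulated clock)."""
    next_free = {}   # host -> earliest time it may be contacted again
    schedule = []
    for url in urls:
        host = urlparse(url).netloc
        start = next_free.get(host, 0.0)
        schedule.append((start, url))
        next_free[host] = start + min_delay
    return sorted(schedule)

plan = schedule_fetches([
    "http://a.example/1",
    "http://a.example/2",
    "http://b.example/1",
])
for start, url in plan:
    print(start, url)
```

Requests to different hosts can start at the same time, which is also where parallel crawling gets its speed-up.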

Crawler

 Page selection during crawling uses an importance metric:

 Given a page P, define how “good” that page is on the basis of one or more metrics (possibly combined):
 Popularity driven: incoming-link count (or PageRank).
 Location driven: depth of the page within its site.
 Usage driven: click counts of the page (feedback).
 Interest driven: derived from a query, based on its similarity with the page contents (focused crawling).
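
As a small sketch of popularity-driven selection, the crawler can keep its frontier in a priority queue ordered by the number of known incoming links. Only this one metric is shown; `crawl_order` and the counts are hypothetical.

```python
import heapq

def crawl_order(inlink_counts):
    """Return URLs from most to fewest known incoming links."""
    # heapq is a min-heap, so counts are negated to pop the largest first.
    heap = [(-count, url) for url, count in inlink_counts.items()]
    heapq.heapify(heap)
    ordered = []
    while heap:
        _, url = heapq.heappop(heap)
        ordered.append(url)
    return ordered

priority_order = crawl_order({"p1": 2, "p2": 7, "p3": 4})
print(priority_order)
```

A combined metric could replace the single count with a weighted sum of several scores, at the cost of having to choose the weights.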

Indexer and Page Repository

Storage: Page Repository

 The Page Repository is a scalable storage system for web pages:
 Allows the Crawler to store pages.
 Allows the Indexer and Collection Analysis modules to retrieve them.
 Similar to other data storage systems, such as databases or file systems.
 Does not have to provide some of the features of those systems: transactions, logging, directories.

Designing a Distributed Page Repository
 The repository is designed to work over a cluster of interconnected nodes.
 Page distribution across nodes:
 Uniform distribution: any page can be sent to any node.
 Hash distribution policy: hash the page ID space into the node ID space.
 Physical organization within a node.
 Update strategy:
 Batch (executed periodically)
 Steady (running all the time)
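
The hash distribution policy can be sketched as follows. Using the URL as the page ID and SHA-256 as the hash are assumptions made for this example; `node_for_page` is a hypothetical helper.

```python
import hashlib

def node_for_page(page_id, num_nodes):
    """Map a page ID to a node index using a stable hash."""
    digest = hashlib.sha256(page_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_nodes

for url in ["http://a.example/", "http://b.example/x", "http://c.example/y"]:
    print(url, "-> node", node_for_page(url, 4))
```

Because the mapping depends only on the page ID, any node can work out where a page is stored without a central lookup table. With plain modulo hashing, changing the number of nodes moves most pages.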

Indexer and Collection Analysis Modules
 The Indexer module creates two indexes:
 Text (content) index: uses “traditional” indexing methods such as inverted indexing.
 Structure (links) index: uses a directed graph of pages and links. Sometimes an inverted graph is also created, in order to answer queries that ask for all the pages with hyperlinks pointing to a given page.

 The Collection Analysis module uses the two basic indexes created by the Indexer module to assemble “utility indexes”.
 e.g., a site index.
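
A minimal sketch of the two basic indexes, assuming whitespace tokenization and integer page IDs. `build_text_index`, `invert_links` and the toy data are hypothetical.

```python
from collections import defaultdict

def build_text_index(pages):
    """Inverted index: term -> sorted list of page IDs containing it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for term in text.lower().split():
            index[term].add(page_id)
    return {term: sorted(ids) for term, ids in index.items()}

def invert_links(links):
    """Inverted link graph: page -> sorted list of pages that link to it."""
    incoming = defaultdict(set)
    for source, targets in links.items():
        for target in targets:
            incoming[target].add(source)
    return {page: sorted(sources) for page, sources in incoming.items()}

toy_pages = {1: "web search engine", 2: "web crawler", 3: "search index"}
toy_graph = {1: [2, 3], 2: [3], 3: [1]}   # page -> pages it links to

text_index = build_text_index(toy_pages)
backlinks = invert_links(toy_graph)
print(text_index["web"])   # pages containing "web"
print(backlinks[3])        # pages with hyperlinks pointing to page 3
```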
Indexer: Design Issues and Challenges

 The index build must be:

 Fast,
 Economical (unlike traditional index builds).

 Incremental indexing must be supported.

 Personalization.
 Storage: compression vs. speed.
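
To make the compression vs. speed trade-off concrete, here is a sketch of one widely used posting-list compression scheme: gap encoding followed by variable-byte coding. The slides do not name a specific scheme, so this choice is an assumption; `encode_postings` and `decode_postings` are hypothetical names.

```python
def encode_postings(doc_ids):
    """Gap-encode a sorted posting list, then variable-byte encode each gap."""
    out = bytearray()
    previous = 0
    for doc_id in doc_ids:
        gap = doc_id - previous
        previous = doc_id
        chunk = [gap & 0x7F]          # lowest 7 bits first
        gap >>= 7
        while gap:
            chunk.append(gap & 0x7F)
            gap >>= 7
        chunk[0] |= 0x80              # flag the final (lowest-order) byte
        out.extend(reversed(chunk))   # emit high-order bytes first
    return bytes(out)

def decode_postings(data):
    """Rebuild the document IDs from the variable-byte encoded gaps."""
    doc_ids, value, previous = [], 0, 0
    for byte in data:
        value = (value << 7) | (byte & 0x7F)
        if byte & 0x80:               # last byte of the current gap
            previous += value
            doc_ids.append(previous)
            value = 0
    return doc_ids

postings = [3, 8, 308, 100000]
encoded = encode_postings(postings)
print(len(encoded), "bytes instead of", 4 * len(postings))
```

The list takes less space, but every lookup now pays for decoding, which is the trade-off the slide refers to.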

Indexer Partitioning

 Partitioning inverted files

 Local inverted file:
Each node contains the index of a disjoint partition of the document collection.
The query is broadcast to all nodes, and the answer is obtained by merging the local results.
 Global inverted file:
Each node is responsible for only a subset of the terms in the collection.
The query is sent selectively, only to the appropriate nodes.
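
The two partitioning schemes can be compared on a toy collection. Everything here is a simplified sketch: the two-node setup, the modulo document split and the `owner` rule for assigning terms to nodes are assumptions for illustration.

```python
from collections import defaultdict

def build_index(docs):
    """Plain inverted index: term -> set of document IDs."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

DOCS = {1: "web search", 2: "web crawler", 3: "search index", 4: "crawler index"}
NUM_NODES = 2

# Local inverted files: each node indexes its own disjoint slice of the documents.
local_nodes = [
    build_index({d: t for d, t in DOCS.items() if d % NUM_NODES == n})
    for n in range(NUM_NODES)
]

def query_local(term):
    """Broadcast the term to every node and merge the partial answers."""
    hits = set()
    for node_index in local_nodes:
        hits |= node_index.get(term, set())
    return sorted(hits)

# Global inverted file: each node owns the full posting lists of some terms.
def owner(term):
    return sum(term.encode("utf-8")) % NUM_NODES   # toy term-to-node rule

global_nodes = [dict() for _ in range(NUM_NODES)]
for term, term_postings in build_index(DOCS).items():
    global_nodes[owner(term)][term] = term_postings

def query_global(term):
    """Send the term only to the node that owns it."""
    return sorted(global_nodes[owner(term)].get(term, set()))

print(query_local("search"), query_global("search"))
```

Both functions return the same answer; they differ in how many nodes each query has to contact.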
Query Engine

 The query engine module accepts queries from large numbers of users and returns the results.
 It exploits the partitioned index to quickly find the relevant pages.
 It uses the Page Repository to prepare the page of (10) results. Snippet construction is query-based.
 Since the number of possible results is huge, the ranking module has to order the results according to their relevance.
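
Query-based snippet construction can be sketched as picking a window of words around the first query-term match. `make_snippet` and its `window` parameter are hypothetical; real engines use more elaborate selection.

```python
def make_snippet(text, query, window=3):
    """Return the words around the first query-term hit, or the opening words."""
    words = text.split()
    terms = {t.lower() for t in query.split()}
    for i, word in enumerate(words):
        if word.lower().strip(".,;:!?") in terms:
            start = max(0, i - window)
            return " ".join(words[start:i + window + 1])
    return " ".join(words[:2 * window + 1])   # no hit: fall back to the start

stored_page = "A crawler navigates the Web and stores pages in a repository for the indexer."
snippet = make_snippet(stored_page, "repository")
print(snippet)
```

The same stored page yields different snippets for different queries, which is why the snippet is built at query time rather than at indexing time.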

Query Engine

 Ranking:
 Not based only on traditional content-based IR approaches.
 Terms may be of poor quality or not relevant.
 Insufficient self-description of user intent.
 Combat spam:
Link analysis, e.g., PageRank, which exploits incoming links from “important” pages to raise the rank of a page.
Exploit the proximity of query terms in the pages.
Learning to rank.
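
The link-analysis idea can be illustrated with a basic power-iteration form of PageRank. This is a textbook-style sketch, not the implementation of any search engine; the damping factor 0.85, the iteration count and the toy graph are assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict: page -> list of pages it links to.
    Every link target must also appear as a key of `links`."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, targets in links.items():
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page without out-links spreads its rank over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

toy_links = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(toy_links)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(score, 3))
```

Page `c` receives links from three pages and comes out on top. Page `a` has only one incoming link, but it comes from the "important" page `c`, so `a` ranks well above `b` and `d`.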

Web Browsing

 Web browsing, or surfing, usually refers to simply exploring the Internet through a web browser.
 Browsing does not have a specific goal.
 Example: looking at any website, such as Amazon, through any web browser.

 Web searching is the use of search engines to do research or planned information seeking on the Internet.
 Searching has a specific goal.
 Example: you want to find the difference between web searching and browsing using any SE.
Question & Answer

06/20/24
Thank You !!!

