Information Retrieval and Web Search

The document introduces information retrieval (IR): the indexing and retrieval of textual documents relevant to a user query, done efficiently over large corpora. It outlines the key components of an IR system, including indexing documents, processing user queries, searching for relevant documents, ranking results, and evaluating relevance.

Uploaded by

aymancva

Information Retrieval and Web Search
Introduction

Information Retrieval
(IR)
• The indexing and retrieval of textual
documents.
• Searching for pages on the World Wide Web
is the most recent “killer app.”
• Concerned firstly with retrieving relevant
documents to a query.
• Concerned secondly with retrieving from
large sets of documents efficiently.

Typical IR Task
• Given:
– A corpus of textual natural-language
documents.
– A user query in the form of a textual string.
• Find:
– A ranked set of documents that are relevant to
the query.

IR System
[Diagram: a document corpus and a query string are the inputs to the IR system, which outputs ranked documents (1. Doc1, 2. Doc2, 3. Doc3, ...).]

Relevance
• Relevance is a subjective judgment and
may include:
– Being on the proper subject.
– Being timely (recent information).
– Being authoritative (from a trusted source).
– Satisfying the goals of the user and his/her
intended use of the information (information
need).

Keyword Search
• Simplest notion of relevance is that the
query string appears verbatim in the
document.
• Slightly less strict notion is that the words
in the query appear frequently in the
document, in any order (bag of words).
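The two notions above can be contrasted in a few lines. This is a minimal sketch, not a production matcher: the helper names (`tokenize`, `verbatim_match`, `bag_of_words_score`) and the two sample documents are assumptions made for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def verbatim_match(query, document):
    """Strict notion: the query string appears verbatim (ignoring case)."""
    return query.lower() in document.lower()


def bag_of_words_score(query, document):
    """Looser notion: sum how often each distinct query word occurs, in any order."""
    counts = Counter(tokenize(document))
    return sum(counts[word] for word in set(tokenize(query)))


# Hypothetical sample documents.
docs = [
    "Information retrieval is the indexing and retrieval of textual documents.",
    "Retrieval of information from the web is a popular application.",
]
query = "information retrieval"
verbatim = [verbatim_match(query, d) for d in docs]
scores = [bag_of_words_score(query, d) for d in docs]
print(verbatim)  # [True, False]
print(scores)    # [3, 2]
```

The second document is missed by the verbatim test because its words appear in a different order, but it still receives a bag-of-words score.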

Problems with Keywords
• May not retrieve relevant documents that
include synonymous terms.
– “restaurant” vs. “café”
– “PRC” vs. “China”
• May retrieve irrelevant documents that
include ambiguous terms.
– “bat” (baseball vs. mammal)
– “Apple” (company vs. fruit)
– “bit” (unit of data vs. act of eating)

Beyond Keywords
• We will cover the basics of keyword-based
IR, but…
• We will focus on extensions and recent
developments that go beyond keywords.
• We will cover the basics of building an
efficient IR system, but…
• We will focus on basic capabilities and
algorithms rather than systems issues that
allow scaling to industrial-size databases.

Intelligent IR
• Taking into account the meaning of the
words used.
• Taking into account the order of words in
the query.
• Adapting to the user based on direct or
indirect feedback.
• Taking into account the authority of the
source.

IR System Architecture
[Diagram: IR system architecture. Labeled components: User Interface, Text Operations, Query Operations, Indexing, Searching, Ranking, Database Manager, Index (inverted file), and Text Database. Labeled flows: user need, text, logical view, user feedback, query, retrieved docs, ranked docs.]

IR System Components
• Text Operations forms index words (tokens).
– Stopword removal
– Stemming
• Indexing constructs an inverted index of
word-to-document pointers.
• Searching retrieves documents that contain a
given query token from the inverted index.
• Ranking scores all retrieved documents
according to a relevance metric.
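The four components above can be sketched end to end on a toy corpus. Everything here is an illustrative assumption rather than a reference implementation: the stopword list is tiny, `stem` is a crude suffix stripper standing in for a real stemmer such as Porter's, and the relevance metric is plain summed term frequency.

```python
import re
from collections import Counter, defaultdict

STOPWORDS = {"a", "an", "and", "in", "is", "of", "on", "the", "to"}  # toy list


def stem(token):
    """Crude suffix stripping; a stand-in for a real stemmer such as Porter's."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def text_operations(text):
    """Tokenize, drop stopwords, and stem the remaining tokens."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [stem(t) for t in tokens if t not in STOPWORDS]


def build_inverted_index(corpus):
    """Map each index word to {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in corpus.items():
        for term, freq in Counter(text_operations(text)).items():
            index[term][doc_id] = freq
    return index


def search(index, query):
    """Retrieve the ids of documents containing at least one query token."""
    hits = set()
    for term in text_operations(query):
        hits.update(index.get(term, {}))
    return hits


def rank(index, query, doc_ids):
    """Score retrieved documents by summed query-term frequency, best first."""
    terms = set(text_operations(query))
    scores = {d: sum(index.get(t, {}).get(d, 0) for t in terms) for d in doc_ids}
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical three-document corpus.
corpus = {
    "Doc1": "Indexing the documents in a corpus",
    "Doc2": "Ranking documents and indexing documents",
    "Doc3": "Spidering the web",
}
index = build_inverted_index(corpus)
query = "indexing documents"
ranked = rank(index, query, search(index, query))
print(ranked)  # ['Doc2', 'Doc1']
```

Doc2 ranks first because the stemmed term "document" occurs twice in it; Doc3 is never retrieved because it shares no index word with the query.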

IR System Components (continued)
• User Interface manages interaction with the
user:
– Query input and document output.
– Relevance feedback.
– Visualization of results.
• Query Operations transform the query to
improve retrieval:
– Query expansion using a thesaurus.
– Query transformation using relevance feedback.
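Query expansion with a thesaurus can be sketched as follows. The thesaurus here is a hypothetical hand-written dictionary, and `expand_query` and `matches` are illustrative names; a real system would draw synonyms from a curated resource and weight them.

```python
import re

# Hypothetical toy thesaurus; a real system would load a curated resource.
THESAURUS = {
    "restaurant": ["café", "diner"],
    "prc": ["china"],
}


def expand_query(query, thesaurus=THESAURUS):
    """Return the query tokens followed by any thesaurus synonyms, without duplicates."""
    tokens = re.findall(r"\w+", query.lower())
    expanded = list(tokens)
    for token in tokens:
        for synonym in thesaurus.get(token, []):
            if synonym not in expanded:
                expanded.append(synonym)
    return expanded


def matches(query_terms, document):
    """True if the document contains at least one of the query terms."""
    doc_tokens = set(re.findall(r"\w+", document.lower()))
    return any(term in doc_tokens for term in query_terms)


doc = "Trade statistics for China"
print(matches(["prc"], doc))              # False: plain keyword match misses the synonym
print(matches(expand_query("PRC"), doc))  # True: the expanded query also contains "china"
```

Expansion trades precision for recall: every added synonym can also pull in documents that use the word in an unrelated sense.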

Web Search
• Application of IR to HTML documents on
the World Wide Web.
• Differences:
– Must assemble document corpus by spidering
the web.
– Can exploit the structural layout information
in HTML (XML).
– Documents change uncontrollably.
– Can exploit the link structure of the web.
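One way to exploit HTML structure is to treat words in the title as stronger evidence than words in the body, and to collect the outgoing links for later link analysis. The sketch below uses the standard-library `html.parser`; the `PageParser` class, the `score` function, and the title weight of 3 are assumptions chosen for illustration.

```python
from html.parser import HTMLParser


class PageParser(HTMLParser):
    """Collect the title words, body words, and outgoing links of one HTML page."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title_words = []
        self.body_words = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        words = data.lower().split()
        (self.title_words if self.in_title else self.body_words).extend(words)


def parse_page(page_html):
    """Run the parser over one page and return it with its collected fields."""
    parser = PageParser()
    parser.feed(page_html)
    parser.close()
    return parser


def score(page_html, term, title_weight=3):
    """Count occurrences of term, weighting title matches by title_weight (assumed value)."""
    parsed = parse_page(page_html)
    term = term.lower()
    return title_weight * parsed.title_words.count(term) + parsed.body_words.count(term)


# Hypothetical sample page.
page = (
    "<html><head><title>Apple pie recipes</title></head>"
    "<body><p>Bake an apple pie.</p><a href='/fruit'>More fruit</a></body></html>"
)
print(parse_page(page).links)  # ['/fruit']
print(score(page, "apple"))    # 4: one title match (weight 3) plus one body match
```

The splitting is deliberately naive (whitespace only, punctuation kept), so a token such as "pie." in the body does not match the term "pie".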

Web Search System
[Diagram: a web spider assembles the document corpus; the corpus and a query string are the inputs to the IR system, which outputs ranked documents (1. Page1, 2. Page2, 3. Page3, ...).]
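The spider's job of assembling the corpus can be sketched as a breadth-first crawl. To keep the example runnable without network access, the "web" below is a hypothetical in-memory dictionary and `fetch` is a stand-in for an HTTP request; the in-link count it collects is only a crude illustration of a link-structure signal, not any particular search engine's method.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
TOY_WEB = {
    "a.example/": ("home page", ["a.example/ir", "b.example/"]),
    "a.example/ir": ("information retrieval notes", ["b.example/"]),
    "b.example/": ("web search notes", ["a.example/"]),
    "c.example/": ("unlinked page", []),
}


def fetch(url):
    """Stand-in for an HTTP request; returns (text, links) or None if missing."""
    return TOY_WEB.get(url)


def spider(seed, max_pages=100):
    """Breadth-first crawl from seed, returning the corpus and in-link counts."""
    corpus, inlinks = {}, {}
    queue, seen = deque([seed]), {seed}
    while queue and len(corpus) < max_pages:
        url = queue.popleft()
        page = fetch(url)
        if page is None:
            continue
        text, links = page
        corpus[url] = text
        for target in links:
            inlinks[target] = inlinks.get(target, 0) + 1
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return corpus, inlinks


corpus, inlinks = spider("a.example/")
print(sorted(corpus))  # ['a.example/', 'a.example/ir', 'b.example/']
print(inlinks["b.example/"])  # 2
```

The page at "c.example/" never enters the corpus because no crawled page links to it, which is one reason a spidered corpus is always an incomplete snapshot of the web.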

Other IR-Related Tasks
• Automated document categorization
• Information filtering (spam filtering)
• Information routing
• Automated document clustering
• Recommending information or products
• Information extraction
• Information integration
• Question answering

History of IR
• 1960-70’s:
– Initial exploration of text retrieval systems for
“small” corpora of scientific abstracts, and law
and business documents.
– Development of the basic Boolean and
vector-space models of retrieval.
– Prof. Salton and his students at Cornell
University are the leading researchers in the
area.

IR History Continued
• 1980’s:
– Large document database systems, many run by
companies:
• Lexis-Nexis
• Dialog
• MEDLINE

IR History Continued
• 1990’s:
– Searching FTPable documents on the Internet
• Archie
• WAIS
– Searching the World Wide Web
• Lycos
• Yahoo
• Altavista

IR History Continued
• 1990’s continued:
– Organized Competitions
• NIST TREC
– Recommender Systems
• Ringo
• Amazon
• NetPerceptions
– Automated Text Categorization & Clustering

Recent IR History
• 2000’s
– Link analysis for Web Search
• Google
– Automated Information Extraction
• Whizbang
• Fetch
• Burning Glass
– Question Answering
• TREC Q/A track

Recent IR History
• 2000’s continued:
– Multimedia IR
• Image
• Video
• Audio and music
– Cross-Language IR
• DARPA Tides
– Document Summarization

Related Areas
• Database Management
• Library and Information Science
• Artificial Intelligence
• Natural Language Processing
• Machine Learning

Database Management
• Focused on structured data stored in
relational tables rather than free-form text.
• Focused on efficient processing of well-defined
queries in a formal language (SQL).
• Clearer semantics for both data and queries.
• Recent move towards semi-structured data
(XML) brings it closer to IR.

Library and Information Science
• Focused on the human user aspects of
information retrieval (human-computer
interaction, user interface, visualization).
• Concerned with effective categorization of
human knowledge.
• Concerned with citation analysis and
bibliometrics (structure of information).
• Recent work on digital libraries brings it
closer to CS & IR.

Artificial Intelligence
• Focused on the representation of knowledge,
reasoning, and intelligent action.
• Formalisms for representing knowledge and
queries:
– First-order Predicate Logic
– Bayesian Networks
• Recent work on web ontologies and
intelligent information agents brings it
closer to IR.

Natural Language Processing
• Focused on the syntactic, semantic, and
pragmatic analysis of natural language text
and discourse.
• Ability to analyze syntax (phrase structure)
and semantics could allow retrieval based
on meaning rather than keywords.

Natural Language Processing:
IR Directions
• Methods for determining the sense of an
ambiguous word based on context (word
sense disambiguation).
• Methods for identifying specific pieces of
information in a document (information
extraction).
• Methods for answering specific NL
questions from document corpora.
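Word sense disambiguation from context can be illustrated with a simplified overlap method in the spirit of the Lesk algorithm: pick the sense whose associated words share the most tokens with the surrounding text. The sense signatures below are hypothetical and hand-written for the "bat" example; real systems derive them from dictionaries or learn them from data.

```python
import re

# Hypothetical sense signatures for the ambiguous word "bat".
SENSES = {
    "baseball": {"ball", "hit", "pitcher", "swing", "team"},
    "mammal": {"cave", "fly", "nocturnal", "wings", "insects"},
}


def disambiguate(context, senses=SENSES):
    """Pick the sense whose signature words overlap most with the context."""
    words = set(re.findall(r"[a-z]+", context.lower()))
    overlaps = {sense: len(words & signature) for sense, signature in senses.items()}
    best = max(overlaps, key=overlaps.get)
    # Return None when the context gives no evidence for any sense.
    return best if overlaps[best] > 0 else None


print(disambiguate("The pitcher watched him swing the bat"))  # baseball
print(disambiguate("A bat left the cave to hunt insects"))    # mammal
```

A retrieval system could use the chosen sense to filter out documents about the other meaning of the query term.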

Machine Learning
• Focused on the development of
computational systems that improve their
performance with experience.
• Automated classification of examples
based on learning concepts from labeled
training examples (supervised learning).
• Automated methods for clustering
unlabeled examples into meaningful
groups (unsupervised learning).

Machine Learning:
IR Directions
• Text Categorization
– Automatic hierarchical classification (Yahoo).
– Adaptive filtering/routing/recommending.
– Automated spam filtering.
• Text Clustering
– Clustering of IR query results.
– Automatic formation of hierarchies (Yahoo).
• Learning for Information Extraction
• Text Mining
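As one concrete instance of text categorization, a spam filter can be learned from labeled examples. The slides do not name a specific learner; the sketch below swaps in multinomial naive Bayes with add-one smoothing, a common baseline, and trains it on four hypothetical messages. The class and label names are assumptions.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayes:
    """Multinomial naive Bayes with add-one smoothing."""

    def fit(self, examples):
        self.word_counts = defaultdict(Counter)
        self.label_counts = Counter()
        self.vocab = set()
        for text, label in examples:
            self.label_counts[label] += 1
            for word in tokenize(text):
                self.word_counts[label][word] += 1
                self.vocab.add(word)
        return self

    def predict(self, text):
        total_docs = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, doc_count in self.label_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(doc_count / total_docs)  # log prior
            for word in tokenize(text):
                if word in self.vocab:  # ignore words never seen in training
                    score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical labeled training messages.
training = [
    ("win cash prize now", "spam"),
    ("cheap prize offer", "spam"),
    ("meeting agenda for monday", "ham"),
    ("project report and agenda", "ham"),
]
model = NaiveBayes().fit(training)
print(model.predict("claim your cash prize"))           # spam
print(model.predict("agenda for the project meeting"))  # ham
```

The same classifier applies unchanged to other categorization tasks, such as routing documents to topics, by replacing the two labels with a larger label set.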