Information Retrieval and Web Search
Information Retrieval (IR)
• The indexing and retrieval of textual
documents.
• Searching for pages on the World Wide Web
is the most recent “killer app.”
• Concerned first with retrieving documents
relevant to a query.
• Concerned second with retrieving efficiently
from large sets of documents.
Typical IR Task
• Given:
– A corpus of textual natural-language
documents.
– A user query in the form of a textual string.
• Find:
– A ranked set of documents that are relevant to
the query.
IR System
[Figure: a query string and a document corpus are the inputs to the IR system, which outputs a ranked list of documents (1. Doc1, 2. Doc2, 3. Doc3, ...).]
Relevance
• Relevance is a subjective judgment and
may include:
– Being on the proper subject.
– Being timely (recent information).
– Being authoritative (from a trusted source).
– Satisfying the goals of the user and their
intended use of the information (information
need).
Keyword Search
• Simplest notion of relevance is that the
query string appears verbatim in the
document.
• Slightly less strict notion is that the words
in the query appear frequently in the
document, in any order (bag of words).
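The two notions of relevance above can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation: the function names, the case-insensitive matching, and the alphanumeric tokenizer are all assumptions made for the example.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens (assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def verbatim_match(query, document):
    """Strictest notion: the query string appears verbatim (case-insensitive here)."""
    return query.lower() in document.lower()

def bag_of_words_score(query, document):
    """Looser notion: total occurrences of the query words, ignoring word order."""
    counts = Counter(tokenize(document))
    return sum(counts[word] for word in set(tokenize(query)))

doc = "Web search engines index pages; a search engine ranks pages for each query."
print(verbatim_match("search engine", doc))      # True
print(verbatim_match("engine search", doc))      # False: word order matters
print(bag_of_words_score("engine search", doc))  # 3: "search" twice, "engine" once
```

Note that the bag-of-words score treats "engines" and "engine" as different words, which is one reason stemming is applied in later stages.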
Problems with Keywords
IR System Architecture
[Figure: block diagram of an IR system. Components: User Interface, Text Operations, Query Operations, Indexing, Database Manager, Searching, Ranking, the Index (inverted file), and the Text Database. Labeled flows: user need, user feedback, text, logical view, query, retrieved docs, ranked docs.]
IR System Components
• Text Operations forms index words (tokens).
– Stopword removal
– Stemming
• Indexing constructs an inverted index of
word-to-document pointers.
• Searching retrieves documents that contain a
given query token from the inverted index.
• Ranking scores all retrieved documents
according to a relevance metric.
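The four components above can be sketched end to end as follows. Everything here is an assumption chosen for illustration: the tiny stopword list, the crude suffix-stripping stemmer (a stand-in for a real stemmer such as Porter's), the example corpus, and the use of summed term counts as the relevance metric.

```python
import re
from collections import Counter, defaultdict

STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "for"}  # assumed, tiny

def stem(token):
    """Crude suffix stripping; a placeholder for a real stemming algorithm."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def text_operations(text):
    """Form index words: tokenize, remove stopwords, stem."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [stem(t) for t in tokens if t not in STOPWORDS]

def build_index(corpus):
    """Indexing: map each index word to {doc_id: term count}."""
    index = defaultdict(Counter)
    for doc_id, text in corpus.items():
        for token in text_operations(text):
            index[token][doc_id] += 1
    return index

def search(index, query_tokens):
    """Searching: ids of documents containing at least one query token."""
    found = set()
    for token in query_tokens:
        found.update(index.get(token, {}))
    return found

def rank(index, query_tokens, doc_ids):
    """Ranking: order documents by summed query-term counts (assumed metric)."""
    def score(doc_id):
        return sum(index.get(t, {}).get(doc_id, 0) for t in query_tokens)
    return sorted(doc_ids, key=lambda d: (-score(d), d))

corpus = {
    "doc1": "Indexing the web for searching",
    "doc2": "Ranking documents in search",
    "doc3": "Cooking pasta",
}
index = build_index(corpus)
query_tokens = text_operations("web searches")
print(rank(index, query_tokens, search(index, query_tokens)))  # ['doc1', 'doc2']
```

Because the query and the documents pass through the same text operations, "searches" in the query matches "searching" and "search" in the documents.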
IR System Components (continued)
• User Interface manages interaction with the
user:
– Query input and document output.
– Relevance feedback.
– Visualization of results.
• Query Operations transform the query to
improve retrieval:
– Query expansion using a thesaurus.
– Query transformation using relevance feedback.
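Query expansion using a thesaurus can be sketched as below. The thesaurus entries and the function name `expand_query` are hypothetical, chosen only for illustration; a real system would draw synonyms from a curated or automatically built thesaurus.

```python
# Hypothetical hand-built thesaurus: word -> list of synonyms.
THESAURUS = {
    "car": ["automobile", "auto"],
    "film": ["movie"],
}

def expand_query(query, thesaurus=THESAURUS):
    """Return the query words plus their thesaurus synonyms, without duplicates."""
    expanded = []
    for word in query.lower().split():
        for term in [word] + thesaurus.get(word, []):
            if term not in expanded:
                expanded.append(term)
    return expanded

print(expand_query("car rental"))  # ['car', 'automobile', 'auto', 'rental']
```

The expanded term list would then be passed to the search step in place of the original query words, so that a document mentioning only "automobile" can still be retrieved for the query "car".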
Web Search
Web Search System
[Figure: a query string is the input to the IR system, which outputs ranked documents (1. Page1, 2. Page2, 3. Page3, ...).]
Other IR-Related Tasks
IR History Continued
• 1980’s:
– Large document database systems, many run by
companies:
• Lexis-Nexis
• Dialog
• MEDLINE
IR History Continued
• 1990’s:
– Searching FTPable documents on the Internet
• Archie
• WAIS
– Searching the World Wide Web
• Lycos
• Yahoo
• AltaVista
IR History Continued
• 1990’s continued:
– Organized Competitions
• NIST TREC
– Recommender Systems
• Ringo
• Amazon
• NetPerceptions
– Automated Text Categorization & Clustering
Recent IR History
• 2000’s:
– Link analysis for Web Search
• Google
– Automated Information Extraction
• Whizbang
• Fetch
• Burning Glass
– Question Answering
• TREC Q/A track
Recent IR History
• 2000’s continued:
– Multimedia IR
• Image
• Video
• Audio and music
– Cross-Language IR
• DARPA TIDES
– Document Summarization
Related Areas
• Database Management
• Library and Information Science
• Artificial Intelligence
• Natural Language Processing
• Machine Learning
Database Management
Library and Information Science
Natural Language Processing: IR Directions
• Methods for determining the sense of an
ambiguous word based on context (word
sense disambiguation).
• Methods for identifying specific pieces of
information in a document (information
extraction).
• Methods for answering specific NL
questions from document corpora.
Machine Learning
• Text Categorization
– Automatic hierarchical classification (Yahoo).
– Adaptive filtering/routing/recommending.
– Automated spam filtering.
• Text Clustering
– Clustering of IR query results.
– Automatic formation of hierarchies (Yahoo).
• Learning for Information Extraction
• Text Mining