Intro Notes
Information Retrieval (IR)
• The indexing and retrieval of textual
documents.
• Searching for pages on the World Wide
Web is the “killer app.”
• Concerned firstly with retrieving documents
that are relevant to a query.
• Concerned secondly with retrieving from
large sets of documents efficiently.
Typical IR Task
• Given:
– A corpus of textual natural-language
documents.
– A user query in the form of a textual string.
• Find:
– A ranked set of documents that are relevant to
the query.
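The task above can be sketched as a single function from a corpus and a query string to a ranked list. This is only a minimal illustration: the name `rank_documents`, the toy corpus, and the scoring rule (counting occurrences of query words) are assumptions made for the example, not a method given in these notes.

```python
from collections import Counter


def rank_documents(corpus, query):
    """Return (doc_id, score) pairs for documents that share a word with the query."""
    query_words = set(query.lower().split())
    scored = []
    for doc_id, text in corpus.items():
        counts = Counter(text.lower().split())
        # Assumed relevance score: total occurrences of query words in the document.
        score = sum(counts[w] for w in query_words)
        if score > 0:
            scored.append((doc_id, score))
    # Highest score first; ties are broken by document id.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical three-document corpus.
corpus = {
    "Doc1": "information retrieval finds relevant documents",
    "Doc2": "databases store structured records",
    "Doc3": "retrieval of documents from a large corpus of documents",
}
ranking = rank_documents(corpus, "documents retrieval")
print(ranking)
```

This version scans every document for every query, which is exactly the inefficiency that matters once the corpus is large.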
IR System
[Figure: a document corpus and a query string are the inputs to the IR system, which outputs ranked documents (1. Doc1, 2. Doc2, 3. Doc3, ...).]
Relevance
Keyword Search
Problems with Keywords
Beyond Keywords
Intelligent IR
IR System Architecture
[Figure: architecture diagram. Components: User Interface, Text Operations, Query Operations, Indexing, Searching, Ranking, Database Manager, Index (inverted file), Text Database. Data labels: User Need, Text, Logical View, Query, User Feedback, Retrieved Docs, Ranked Docs.]
IR System Components
• Text Operations forms index words (tokens).
– Stopword removal
– Stemming
• Indexing constructs an inverted index of
word to document pointers.
• Searching retrieves documents that contain a
given query token from the inverted index.
• Ranking scores all retrieved documents
according to a relevance metric.
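The four components above can be sketched end to end on a toy corpus. Everything named here is an assumption made for illustration: the tiny `STOPWORDS` set, the crude suffix-stripping `stem` function (a stand-in for a real stemmer), and the use of summed term frequency as the relevance metric.

```python
from collections import defaultdict

STOPWORDS = {"a", "an", "and", "of", "the", "to", "in", "is"}  # tiny illustrative list


def stem(word):
    """Crude suffix stripping; a stand-in for a real stemming algorithm."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def text_operations(text):
    """Text operations: lowercase, split on whitespace, drop stopwords, stem."""
    return [stem(w) for w in text.lower().split() if w not in STOPWORDS]


def build_index(corpus):
    """Indexing: map each token to {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in corpus.items():
        for token in text_operations(text):
            index[token][doc_id] = index[token].get(doc_id, 0) + 1
    return index


def search(index, query):
    """Searching and ranking: look up each query token, sum term frequencies."""
    scores = defaultdict(int)
    for token in text_operations(query):
        for doc_id, tf in index.get(token, {}).items():
            scores[doc_id] += tf
    # Highest score first; ties are broken by document id.
    return sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))


# Hypothetical three-document corpus.
corpus = {
    "Doc1": "indexing the documents",
    "Doc2": "searching an index of documents and ranking documents",
    "Doc3": "the history of databases",
}
index = build_index(corpus)
results = search(index, "ranking documents")
print(results)
```

Because the index maps tokens directly to the documents that contain them, a query touches only the postings for its own tokens rather than scanning the whole corpus.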
Web Search
[Figure: a query string is submitted to the IR system, which returns ranked documents (1. Page1, 2. Page2, 3. Page3, ...).]
History of IR
• 1960-70’s:
– Initial exploration of text retrieval systems for
“small” corpora of scientific abstracts, and law
and business documents.
– Development of the basic Boolean and vector-
space models of retrieval.
– Prof. Salton and his students at Cornell
University are the leading researchers in the
area.
IR History Continued
• 1980’s:
– Large document database systems, many run by
companies:
• Lexis-Nexis
• Dialog
• MEDLINE
IR History Continued
• 1990’s:
– Searching FTPable documents on the Internet
• Archie
• WAIS
– Searching the World Wide Web
• Lycos
• Yahoo
• Altavista
IR History Continued
• 1990’s continued:
– Organized Competitions
• NIST TREC
– Recommender Systems
• Ringo
• Amazon
• NetPerceptions
– Automated Text Categorization & Clustering
IR History Continued
• 2000’s
– Link analysis for Web Search
• Google
– Automated Information Extraction
– Parallel Processing
• Map/Reduce
– Question Answering
• TREC Q/A track
IR History Continued
• 2000’s continued:
– Multimedia IR
• Image
• Video
• Audio and music
– Cross-Language IR
• DARPA Tides
– Document Summarization
– Learning to Rank
IR History Continued
• 2010’s
– Intelligent Personal Assistants
• Siri
• Cortana
• Google Now
• Alexa
– Complex Question Answering
• IBM Watson
– Distributional Semantics
– Deep Learning
Recent IR History
• 2020’s
– Large Language Models (LLMs)
• ELMo
• BERT
• GPT 1, 2, 3
– ChatBots
• ChatGPT, GPT 4
• Reinforcement Learning from Human Feedback
(RLHF)
Related Areas
• Database Management
• Library and Information Science
• Artificial Intelligence
• Natural Language Processing
• Machine Learning
Database Management
Artificial Intelligence
Natural Language Processing
Machine Learning
Machine Learning: IR Directions
• Text Categorization
– Automatic hierarchical classification (Yahoo).
– Adaptive filtering/routing/recommending.
– Automated spam filtering.
• Text Clustering
– Clustering of IR query results.
– Automatic formation of hierarchies (Yahoo).
• Learning for Information Extraction
• Text Mining
• Learning to Rank