
CSCI 7000

Modern Information Retrieval

Lecture 1: Introduction

Information Retrieval

Information retrieval is the science of searching for information in documents, searching for documents themselves, searching for metadata which describe documents, or searching within databases, whether relational stand-alone databases or hypertextually-networked databases such as the World Wide Web.
Wikipedia

Finding material of an unstructured nature that satisfies an information need from within large collections.
Manning et al., 2008

The study of methods and structures used to represent and access information.
Witten et al.

IR deals with the representation, storage, organization of, and access to information items.
Salton (the classic definition, found in his book)

Information retrieval is the term conventionally, though somewhat inaccurately, applied to the type of activity discussed in this volume.
van Rijsbergen
IR is now largely what Google does…

 Ad hoc retrieval is the core task that modern IR systems need to start from.
 One-shot information seeking attempts by ignorant users
   Ignorant about the structure of the collection
   Ignorant about how the system works
   Ignorant about how to formulate queries
 Typically textual documents, but video and audio are becoming more prevalent.
 Collections are heterogeneous in nature.

But...

 The real action right now lies in Web 2.0 issues...
 Dealing with User Generated Content
   Discussion forums
   Blogs
   Microblogs
 To deal with
   Sentiment, opinions, etc.
   Social networks
   Tribes, influencers
Other Hot Topics

 Image search
   How to index images
   With and without additional information like captions
 Multilingual issues
   Cross-language search and indexing
 Spoken language issues
   ASR for indexing videos

Manning…

 Most of today's slides were stolen/adapted from Chris Manning…
Unstructured (text) vs. structured (database) data in 1996
[chart]

Unstructured (text) vs. structured (database) data in 2006
[chart]
Boulder players
Course Plan

 Cover the basics of IR technology in the first part of the course
 Read papers/investigate newer topics in the latter part
 Use case studies of real companies throughout the semester
 Project presentations and discussions for the last section of the class
 I expect informed participation.

Last year...

 We followed 1 company in the tech news quite a bit...
   Powerset
   NLP-based search technology
   Most of us were pretty skeptical. It seemed like a lot of hype and little sensible work.
   Acquired by MS for $100M last month...
   Shows you what I know
 This year
   Cuil (pronounced "cool")
Go to the web


Unstructured Data Scenario

 Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia?
 One could grep all of Shakespeare's plays for Brutus and Caesar, then strip out lines containing Calpurnia (sketched below). This is problematic:
   Slow (for large corpora)
   NOT Calpurnia is non-trivial
   Other operations (e.g., find the word Romans near countrymen) not feasible
   Ranked retrieval (best documents to return) is not supported
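To make the grep-style baseline concrete, a minimal sketch with toy stand-ins for the plays (the snippets are hypothetical, chosen to agree with the term-document matrix on the next slide):

```python
# Toy stand-in for the full text of each play (hypothetical snippets).
plays = {
    "Antony and Cleopatra": "... Brutus ... Caesar ...",
    "Julius Caesar":        "... Brutus ... Caesar ... Calpurnia ...",
    "Hamlet":               "... Brutus ... Caesar ...",
}

# grep-style linear scan: every play is read in full for every query,
# and the NOT condition already needs logic beyond plain pattern matching.
hits = [name for name, text in plays.items()
        if "Brutus" in text and "Caesar" in text and "Calpurnia" not in text]
print(hits)  # ['Antony and Cleopatra', 'Hamlet']
```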

Term-Document Matrix

             Antony and  Julius   The
             Cleopatra   Caesar   Tempest  Hamlet  Othello  Macbeth
Antony           1          1        0        0       0        1
Brutus           1          1        0        1       0        0
Caesar           1          1        0        1       1        1
Calpurnia        0          1        0        0       0        0
Cleopatra        1          0        0        0       0        0
mercy            1          0        1        1       1        1
worser           1          0        1        1       1        0

Entry = 1 if the play contains the word, 0 otherwise.
Query: Brutus AND Caesar but NOT Calpurnia.

Incidence vectors

 So we have a 0/1 vector for each term.
 To answer the query: take the vectors for Brutus, Caesar and Calpurnia (complemented) ➨ bitwise AND (sketched below).
 110100 AND 110111 AND 101111 = 100100.
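The same query as executable arithmetic, a minimal sketch using the vectors read off the matrix above (column order: Antony and Cleopatra, Julius Caesar, The Tempest, Hamlet, Othello, Macbeth):

```python
# 0/1 incidence vectors per term, one position per play.
brutus    = [1, 1, 0, 1, 0, 0]
caesar    = [1, 1, 0, 1, 1, 1]
calpurnia = [0, 1, 0, 0, 0, 0]

# Brutus AND Caesar AND NOT Calpurnia: complement Calpurnia, then
# bitwise-AND the three vectors position by position.
result = [b & c & (1 - k) for b, c, k in zip(brutus, caesar, calpurnia)]
print(result)  # [1, 0, 0, 1, 0, 0], i.e. 110100 AND 110111 AND 101111 = 100100
```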

Answers to query

 Antony and Cleopatra, Act III, Scene ii
   Agrippa [Aside to DOMITIUS ENOBARBUS]: Why, Enobarbus,
   When Antony found Julius Caesar dead,
   He cried almost to roaring; and he wept
   When at Philippi he found Brutus slain.

 Hamlet, Act III, Scene ii
   Lord Polonius: I did enact Julius Caesar: I was killed i' the Capitol; Brutus killed me.

Bigger corpora

 Consider N = 1M documents, each with about 1K terms.
 Avg 6 bytes/term including spaces and punctuation
   ➨ 6GB of data in the documents (worked out below).
 Say there are m = 500K distinct terms among these.
 Types vs. tokens
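Spelled out, with the slide's assumed numbers:

```python
docs = 1_000_000        # N = 1M documents
terms_per_doc = 1_000   # ~1K terms per document
bytes_per_term = 6      # average, including spaces and punctuation

total_tokens = docs * terms_per_doc           # 10**9 tokens in the collection
total_bytes = total_tokens * bytes_per_term   # 6 * 10**9 bytes
print(total_tokens, total_bytes / 10**9, "GB")  # 1000000000 6.0 GB
```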

Can’t build the matrix explicitly

 A 500K x 1M matrix has half a trillion 0's and 1's.
 But it has no more than one billion 1's. Why? With 1M documents of about 1K terms each, the collection contains at most 10^9 term occurrences, so at most 10^9 cells can be 1.
   ➨ the matrix is extremely sparse.
 What's a better representation?
   We only record the 1 positions.

Inverted index

 For each term T, we must store a list of all documents that contain T.
 Use an array or a list for this?

Brutus    → 2 4 8 16 32 64 128
Calpurnia → 1 2 3 5 8 13 21 34
Caesar    → 13 16

 What happens if the word Caesar is added to document 14? (See the sketch below.)
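A sketch of the array-vs-list tradeoff the question is after, using Python lists (contiguous arrays) for postings; the numbers are the slide's example lists:

```python
import bisect

# Postings as sorted arrays of docIDs (the slide's example lists).
postings = {
    "Brutus":    [2, 4, 8, 16, 32, 64, 128],
    "Calpurnia": [1, 2, 3, 5, 8, 13, 21, 34],
    "Caesar":    [13, 16],
}

# If "Caesar" is added to document 14, an array must shift elements to
# keep sorted order (O(n) insert); a linked list avoids the shifting but
# pays pointer overhead and gives up random access.
bisect.insort(postings["Caesar"], 14)
print(postings["Caesar"])  # [13, 14, 16]
```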

Inverted index

 Linked lists generally preferred to arrays
   Dynamic space allocation
   Insertion of terms into documents easy
   Space overhead of pointers

Dictionary     Postings lists
Brutus    → 2 4 8 16 32 64 128
Calpurnia → 1 2 3 5 8 13 21 34
Caesar    → 13 16

Each docID entry in a postings list is a posting. Postings are sorted by docID (more later on why; see the intersection sketch below).
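One payoff of docID-sorted postings, previewed here as a sketch (this is the standard linear merge, not necessarily how the course will present it later): two sorted lists can be intersected in time linear in their combined length.

```python
def intersect(p1, p2):
    """Linear merge of two docID-sorted postings lists."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:          # docID in both lists: a hit
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:         # advance whichever list is behind
            i += 1
        else:
            j += 1
    return answer

print(intersect([2, 4, 8, 16, 32, 64, 128], [1, 2, 3, 5, 8, 13, 21, 34]))
# -> [2, 8]
```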

Inverted index construction

Documents to be indexed:   Friends, Romans, countrymen.
        ↓  Tokenizer
Token stream:              Friends  Romans  Countrymen
        ↓  Linguistic modules (more on these later)
Modified tokens:           friend  roman  countryman
        ↓  Indexer
Inverted index:            friend     → 2 4
                           roman      → 1 2
                           countryman → 13 16
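A minimal sketch of that pipeline, with deliberately crude stand-ins for the tokenizer and linguistic modules (case-folding only; no real stemming, so "friends" does not become "friend" as in the figure):

```python
import re

def tokenize(text):
    """Tokenizer: split raw text into a token stream."""
    return re.findall(r"[\w']+", text)

def normalize(token):
    """Stand-in for the linguistic modules: just case-fold here."""
    return token.lower()

def index_pairs(doc_id, text):
    """Indexer input: a sequence of (modified token, docID) pairs."""
    return [(normalize(tok), doc_id) for tok in tokenize(text)]

print(index_pairs(1, "Friends, Romans, countrymen."))
# -> [('friends', 1), ('romans', 1), ('countrymen', 1)]
```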

Indexer steps

 Sequence of (modified token, document ID) pairs.

Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious.

Resulting pairs, in document order:
I 1, did 1, enact 1, julius 1, caesar 1, I 1, was 1, killed 1, i' 1, the 1, capitol 1, brutus 1, killed 1, me 1, so 2, let 2, it 2, be 2, with 2, caesar 2, the 2, noble 2, brutus 2, hath 2, told 2, you 2, caesar 2, was 2, ambitious 2

 Sort by terms: the core indexing step.

Before (document order): I 1, did 1, enact 1, julius 1, caesar 1, I 1, was 1, killed 1, i' 1, the 1, capitol 1, brutus 1, killed 1, me 1, so 2, let 2, it 2, be 2, with 2, caesar 2, the 2, noble 2, brutus 2, hath 2, told 2, you 2, caesar 2, was 2, ambitious 2

After (sorted by term, then doc #): ambitious 2, be 2, brutus 1, brutus 2, capitol 1, caesar 1, caesar 2, caesar 2, did 1, enact 1, hath 2, I 1, I 1, i' 1, it 2, julius 1, killed 1, killed 1, let 2, me 1, noble 2, so 2, the 1, the 2, told 2, you 2, was 1, was 2, with 2

 Multiple term entries in a single document are merged (sketched below).
 Frequency information is added. (We'll see why frequency matters later.)

Term       Doc #   Term freq
ambitious    2       1
be           2       1
brutus       1       1
brutus       2       1
capitol      1       1
caesar       1       1
caesar       2       2
did          1       1
enact        1       1
hath         2       1
I            1       2
i'           1       1
it           2       1
julius       1       1
killed       1       2
let          2       1
me           1       1
noble        2       1
so           2       1
the          1       1
the          2       1
told         2       1
you          2       1
was          1       1
was          2       1
with         2       1
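A sketch of the merge-and-count step, assuming (term, docID) pairs like those above (abbreviated toy data, not the full two-document list):

```python
from collections import Counter

# Sorted (term, docID) pairs (abbreviated subset of the toy collection).
pairs = [
    ("caesar", 1), ("caesar", 2), ("caesar", 2), ("did", 1), ("enact", 1),
    ("i", 1), ("i", 1), ("julius", 1), ("killed", 1), ("was", 1),
]

# Multiple entries of a term in one document collapse to a single
# (term, docID) entry carrying a term frequency.
freqs = Counter(pairs)
for (term, doc_id), tf in sorted(freqs.items()):
    print(term, doc_id, tf)
# caesar 1 1 / caesar 2 2 / did 1 1 / enact 1 1 / i 1 2 / ...
```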

 The result is split into a Dictionary file and a Postings file.

Dictionary (term, # docs, collection freq)   Postings (doc #, freq)
ambitious   1   1                            (2, 1)
be          1   1                            (2, 1)
brutus      2   2                            (1, 1) (2, 1)
capitol     1   1                            (1, 1)
caesar      2   3                            (1, 1) (2, 2)
did         1   1                            (1, 1)
enact       1   1                            (1, 1)
hath        1   1                            (2, 1)
I           1   2                            (1, 2)
i'          1   1                            (1, 1)
it          1   1                            (2, 1)
julius      1   1                            (1, 1)
killed      1   2                            (1, 2)
let         1   1                            (2, 1)
me          1   1                            (1, 1)
noble       1   1                            (2, 1)
so          1   1                            (2, 1)
the         2   2                            (1, 1) (2, 1)
told        1   1                            (2, 1)
you         1   1                            (2, 1)
was         2   2                            (1, 1) (2, 1)
with        1   1                            (2, 1)
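One way to picture the split in code; a sketch only, since real systems keep the postings file on disk (often compressed) with the dictionary holding offsets into it. The layout below is an illustrative assumption, not a prescribed format:

```python
# Postings file: one flat sequence of (docID, freq) entries, term by term.
postings = [(2, 1),            # ambitious
            (2, 1),            # be
            (1, 1), (2, 1),    # brutus
            (1, 1), (2, 2)]    # caesar (collection freq 3)

# Dictionary file: term -> (# docs, collection freq, offset into postings).
dictionary = {
    "ambitious": (1, 1, 0),
    "be":        (1, 1, 1),
    "brutus":    (2, 2, 2),
    "caesar":    (2, 3, 4),
}

def lookup(term):
    n_docs, _, offset = dictionary[term]
    return postings[offset:offset + n_docs]

print(lookup("caesar"))  # [(1, 1), (2, 2)]
```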

 Storage costs?

[Same dictionary and postings tables as above: the dictionary side holds the Terms, the postings side holds the Pointers.]

Distributed Systems

 How would you duplicate/partition/distribute this if you were operating a large parallel, distributed, high-availability system?
 I.e., what would Google do?

[Same dictionary and postings tables as above.]
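One common answer, sketched here as an assumption about standard practice rather than as the slide's own answer, is document partitioning: each shard indexes a disjoint subset of the documents, a query fans out to every shard, and each shard is replicated for availability and extra query throughput.

```python
def shard_of(doc_id, n_shards):
    """Assign each document to a shard (here: simple modulo)."""
    return doc_id % n_shards

N_SHARDS = 4
# One small inverted index per shard: term -> list of docIDs.
shards = [dict() for _ in range(N_SHARDS)]

def add(term, doc_id):
    shards[shard_of(doc_id, N_SHARDS)].setdefault(term, []).append(doc_id)

for doc_id in [2, 4, 8, 13, 16]:
    add("caesar", doc_id)

def search(term):
    """Fan the query out to every shard and merge the answers."""
    return sorted(d for s in shards for d in s.get(term, []))

print(search("caesar"))  # [2, 4, 8, 13, 16]
```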

Administrivia

 Work/Grading:
   Problem sets and programming exercises: 50%
   Quizzes: 20%
   Group project: 30%
 Textbooks:
   Introduction to Information Retrieval --- Manning, Raghavan and Schutze
   Collective Intelligence --- Toby Segaran

Administrivia

 The exercises (and group project) will use Lucene (lucene.apache.org)
   Open-source full-text indexing system
 Guest lectures from local industry
   Umbria (JD Powers)
   Google
   Lijit
   Collective Intellect

Administrivia

 Professor: Jim Martin
   James.martin@colorado.edu
   ECOT 735
   Office hours TBA
   www.cs.colorado.edu/~martin/csci7000/

Next time

 Read Chapter 1 of both texts for next time.
