
CSCI 7000

Modern Information Retrieval

Lecture 1: Introduction

Information Retrieval

Information retrieval is the science of searching for information in documents, searching for documents themselves, searching for metadata which describe documents, or searching within databases, whether relational stand-alone databases or hypertextually-networked databases such as the World Wide Web.
Wikipedia

Finding material of an unstructured nature that satisfies an information need from within large collections.
Manning et al., 2008

The study of methods and structures used to represent and access information.
Witten et al.

IR deals with the representation, storage, organization of, and access to information items.
Salton (the classic definition, found in his book)

Information retrieval is the term conventionally, though somewhat inaccurately, applied to the type of activity discussed in this volume.
van Rijsbergen
IR is now largely what Google does…

 Ad hoc retrieval is the core task that modern IR systems need to start from.
 One-shot information seeking attempts by ignorant users
   Ignorant about the structure of the collection
   Ignorant about how the system works
   Ignorant about how to formulate queries
 Typically textual documents, but video and audio are becoming more prevalent.
 Collections are heterogeneous in nature.

But...

 The real action right now lies in Web 2.0 issues...
 Dealing with User Generated Content
   Discussion forums
   Blogs
   Microblogs
 To deal with
   Sentiment, opinions, etc.
   Social networks
   Tribes, influencers
Other Hot Topics

 Image search
   How to index images
   With and without additional information like captions
 Multilingual issues
   Cross-language search and indexing
 Spoken language issues
   ASR for indexing videos

Manning…

 Most of today's slides were stolen/adapted from Chris Manning…
Unstructured (text) vs. structured (database) data in 1996
[chart]

Unstructured (text) vs. structured (database) data in 2006
[chart]
Boulder players
Course Plan

 Cover the basics of IR technology in the first part of the course
 Read papers/investigate newer topics in the latter part
 Use case studies of real companies throughout the semester
 Project presentations and discussions for the last section of the class
 I expect informed participation.

Last year...

 We followed 1 company in the tech news quite a bit...
   Powerset
   NLP-based search technology
   Most of us were pretty skeptical. It seemed like a lot of hype and little sensible work.
   Acquired by MS for $100M last month...
   Shows you what I know
 This year
   Cuil (pronounced "cool")
Go to the web


Unstructured Data Scenario

 Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia?
 One could grep all of Shakespeare's plays for Brutus and Caesar, then strip out lines containing Calpurnia (sketched below). This is problematic:
   Slow (for large corpora)
   NOT Calpurnia is non-trivial
   Other operations (e.g., find the word Romans near countrymen) not feasible
   Ranked retrieval (best documents to return) is not supported
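To make the grep-style baseline concrete, a minimal sketch with toy stand-ins for the plays (the snippets are hypothetical, chosen to agree with the term-document matrix on the next slide):

```python
# Toy stand-in for the full text of each play (hypothetical snippets).
plays = {
    "Antony and Cleopatra": "... Brutus ... Caesar ...",
    "Julius Caesar":        "... Brutus ... Caesar ... Calpurnia ...",
    "Hamlet":               "... Brutus ... Caesar ...",
}

# grep-style linear scan: every play is read in full for every query,
# and the NOT condition already needs logic beyond plain pattern matching.
hits = [name for name, text in plays.items()
        if "Brutus" in text and "Caesar" in text and "Calpurnia" not in text]
print(hits)  # ['Antony and Cleopatra', 'Hamlet']
```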

Term-Document Matrix

             Antony and  Julius   The
             Cleopatra   Caesar   Tempest  Hamlet  Othello  Macbeth
Antony           1          1        0        0       0        1
Brutus           1          1        0        1       0        0
Caesar           1          1        0        1       1        1
Calpurnia        0          1        0        0       0        0
Cleopatra        1          0        0        0       0        0
mercy            1          0        1        1       1        1
worser           1          0        1        1       1        0

Entry = 1 if the play contains the word, 0 otherwise.
Query: Brutus AND Caesar but NOT Calpurnia.

Incidence vectors

 So we have a 0/1 vector for each term.
 To answer the query: take the vectors for Brutus, Caesar and Calpurnia (complemented) ➨ bitwise AND (sketched below).
 110100 AND 110111 AND 101111 = 100100.
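The same query as executable arithmetic, a minimal sketch using the vectors read off the matrix above (column order: Antony and Cleopatra, Julius Caesar, The Tempest, Hamlet, Othello, Macbeth):

```python
# 0/1 incidence vectors per term, one position per play.
brutus    = [1, 1, 0, 1, 0, 0]
caesar    = [1, 1, 0, 1, 1, 1]
calpurnia = [0, 1, 0, 0, 0, 0]

# Brutus AND Caesar AND NOT Calpurnia: complement Calpurnia, then
# bitwise-AND the three vectors position by position.
result = [b & c & (1 - k) for b, c, k in zip(brutus, caesar, calpurnia)]
print(result)  # [1, 0, 0, 1, 0, 0], i.e. 110100 AND 110111 AND 101111 = 100100
```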

Answers to query

 Antony and Cleopatra, Act III, Scene ii
   Agrippa [Aside to DOMITIUS ENOBARBUS]: Why, Enobarbus,
   When Antony found Julius Caesar dead,
   He cried almost to roaring; and he wept
   When at Philippi he found Brutus slain.

 Hamlet, Act III, Scene ii
   Lord Polonius: I did enact Julius Caesar: I was killed i' the Capitol; Brutus killed me.

Bigger corpora

 Consider N = 1M documents, each with about 1K terms.
 Avg 6 bytes/term including spaces and punctuation
   ➨ 6GB of data in the documents (worked out below).
 Say there are m = 500K distinct terms among these.
 Types vs. tokens
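Spelled out, with the slide's assumed numbers:

```python
docs = 1_000_000        # N = 1M documents
terms_per_doc = 1_000   # ~1K terms per document
bytes_per_term = 6      # average, including spaces and punctuation

total_tokens = docs * terms_per_doc           # 10**9 tokens in the collection
total_bytes = total_tokens * bytes_per_term   # 6 * 10**9 bytes
print(total_tokens, total_bytes / 10**9, "GB")  # 1000000000 6.0 GB
```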

Can’t build the matrix explicitly

 A 500K x 1M matrix has half a trillion 0's and 1's.
 But it has no more than one billion 1's. Why? With 1M documents of about 1K terms each, the collection contains at most 10^9 term occurrences, so at most 10^9 cells can be 1.
   ➨ the matrix is extremely sparse.
 What's a better representation?
   We only record the 1 positions.

Inverted index

 For each term T, we must store a list of all documents that contain T.
 Use an array or a list for this?

Brutus    → 2 4 8 16 32 64 128
Calpurnia → 1 2 3 5 8 13 21 34
Caesar    → 13 16

 What happens if the word Caesar is added to document 14? (See the sketch below.)
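A sketch of the array-vs-list tradeoff the question is after, using Python lists (contiguous arrays) for postings; the numbers are the slide's example lists:

```python
import bisect

# Postings as sorted arrays of docIDs (the slide's example lists).
postings = {
    "Brutus":    [2, 4, 8, 16, 32, 64, 128],
    "Calpurnia": [1, 2, 3, 5, 8, 13, 21, 34],
    "Caesar":    [13, 16],
}

# If "Caesar" is added to document 14, an array must shift elements to
# keep sorted order (O(n) insert); a linked list avoids the shifting but
# pays pointer overhead and gives up random access.
bisect.insort(postings["Caesar"], 14)
print(postings["Caesar"])  # [13, 14, 16]
```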

Inverted index

 Linked lists generally preferred to arrays
   Dynamic space allocation
   Insertion of terms into documents easy
   Space overhead of pointers

Dictionary     Postings lists
Brutus    → 2 4 8 16 32 64 128
Calpurnia → 1 2 3 5 8 13 21 34
Caesar    → 13 16

Each docID entry in a postings list is a posting. Postings are sorted by docID (more later on why; see the intersection sketch below).
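One payoff of docID-sorted postings, previewed here as a sketch (this is the standard linear merge, not necessarily how the course will present it later): two sorted lists can be intersected in time linear in their combined length.

```python
def intersect(p1, p2):
    """Linear merge of two docID-sorted postings lists."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:          # docID in both lists: a hit
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:         # advance whichever list is behind
            i += 1
        else:
            j += 1
    return answer

print(intersect([2, 4, 8, 16, 32, 64, 128], [1, 2, 3, 5, 8, 13, 21, 34]))
# -> [2, 8]
```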

Inverted index construction

Documents to be indexed:   Friends, Romans, countrymen.
        ↓  Tokenizer
Token stream:              Friends  Romans  Countrymen
        ↓  Linguistic modules (more on these later)
Modified tokens:           friend  roman  countryman
        ↓  Indexer
Inverted index:            friend     → 2 4
                           roman      → 1 2
                           countryman → 13 16
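A minimal sketch of that pipeline, with deliberately crude stand-ins for the tokenizer and linguistic modules (case-folding only; no real stemming, so "friends" does not become "friend" as in the figure):

```python
import re

def tokenize(text):
    """Tokenizer: split raw text into a token stream."""
    return re.findall(r"[\w']+", text)

def normalize(token):
    """Stand-in for the linguistic modules: just case-fold here."""
    return token.lower()

def index_pairs(doc_id, text):
    """Indexer input: a sequence of (modified token, docID) pairs."""
    return [(normalize(tok), doc_id) for tok in tokenize(text)]

print(index_pairs(1, "Friends, Romans, countrymen."))
# -> [('friends', 1), ('romans', 1), ('countrymen', 1)]
```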

Indexer steps

 Sequence of (modified token, document ID) pairs.

Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious.

Resulting pairs, in document order:
I 1, did 1, enact 1, julius 1, caesar 1, I 1, was 1, killed 1, i' 1, the 1, capitol 1, brutus 1, killed 1, me 1, so 2, let 2, it 2, be 2, with 2, caesar 2, the 2, noble 2, brutus 2, hath 2, told 2, you 2, caesar 2, was 2, ambitious 2

 Sort by terms: the core indexing step.

Before (document order): I 1, did 1, enact 1, julius 1, caesar 1, I 1, was 1, killed 1, i' 1, the 1, capitol 1, brutus 1, killed 1, me 1, so 2, let 2, it 2, be 2, with 2, caesar 2, the 2, noble 2, brutus 2, hath 2, told 2, you 2, caesar 2, was 2, ambitious 2

After (sorted by term, then doc #): ambitious 2, be 2, brutus 1, brutus 2, capitol 1, caesar 1, caesar 2, caesar 2, did 1, enact 1, hath 2, I 1, I 1, i' 1, it 2, julius 1, killed 1, killed 1, let 2, me 1, noble 2, so 2, the 1, the 2, told 2, you 2, was 1, was 2, with 2

 Multiple term entries in a single document are merged (sketched below).
 Frequency information is added. (We'll see why frequency matters later.)

Term       Doc #   Term freq
ambitious    2       1
be           2       1
brutus       1       1
brutus       2       1
capitol      1       1
caesar       1       1
caesar       2       2
did          1       1
enact        1       1
hath         2       1
I            1       2
i'           1       1
it           2       1
julius       1       1
killed       1       2
let          2       1
me           1       1
noble        2       1
so           2       1
the          1       1
the          2       1
told         2       1
you          2       1
was          1       1
was          2       1
with         2       1
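A sketch of the merge-and-count step, assuming (term, docID) pairs like those above (abbreviated toy data, not the full two-document list):

```python
from collections import Counter

# Sorted (term, docID) pairs (abbreviated subset of the toy collection).
pairs = [
    ("caesar", 1), ("caesar", 2), ("caesar", 2), ("did", 1), ("enact", 1),
    ("i", 1), ("i", 1), ("julius", 1), ("killed", 1), ("was", 1),
]

# Multiple entries of a term in one document collapse to a single
# (term, docID) entry carrying a term frequency.
freqs = Counter(pairs)
for (term, doc_id), tf in sorted(freqs.items()):
    print(term, doc_id, tf)
# caesar 1 1 / caesar 2 2 / did 1 1 / enact 1 1 / i 1 2 / ...
```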

 The result is split into a Dictionary file and a Postings file.

Dictionary (term, # docs, collection freq)   Postings (doc #, freq)
ambitious   1   1                            (2, 1)
be          1   1                            (2, 1)
brutus      2   2                            (1, 1) (2, 1)
capitol     1   1                            (1, 1)
caesar      2   3                            (1, 1) (2, 2)
did         1   1                            (1, 1)
enact       1   1                            (1, 1)
hath        1   1                            (2, 1)
I           1   2                            (1, 2)
i'          1   1                            (1, 1)
it          1   1                            (2, 1)
julius      1   1                            (1, 1)
killed      1   2                            (1, 2)
let         1   1                            (2, 1)
me          1   1                            (1, 1)
noble       1   1                            (2, 1)
so          1   1                            (2, 1)
the         2   2                            (1, 1) (2, 1)
told        1   1                            (2, 1)
you         1   1                            (2, 1)
was         2   2                            (1, 1) (2, 1)
with        1   1                            (2, 1)
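One way to picture the split in code; a sketch only, since real systems keep the postings file on disk (often compressed) with the dictionary holding offsets into it. The layout below is an illustrative assumption, not a prescribed format:

```python
# Postings file: one flat sequence of (docID, freq) entries, term by term.
postings = [(2, 1),            # ambitious
            (2, 1),            # be
            (1, 1), (2, 1),    # brutus
            (1, 1), (2, 2)]    # caesar (collection freq 3)

# Dictionary file: term -> (# docs, collection freq, offset into postings).
dictionary = {
    "ambitious": (1, 1, 0),
    "be":        (1, 1, 1),
    "brutus":    (2, 2, 2),
    "caesar":    (2, 3, 4),
}

def lookup(term):
    n_docs, _, offset = dictionary[term]
    return postings[offset:offset + n_docs]

print(lookup("caesar"))  # [(1, 1), (2, 2)]
```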

 Storage costs?

[Same dictionary and postings tables as above: the dictionary side holds the Terms, the postings side holds the Pointers.]

Distributed Systems

 How would you duplicate/partition/distribute this if you were operating a large parallel, distributed, high-availability system?
 I.e., what would Google do?

[Same dictionary and postings tables as above.]
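One common answer, sketched here as an assumption about standard practice rather than as the slide's own answer, is document partitioning: each shard indexes a disjoint subset of the documents, a query fans out to every shard, and each shard is replicated for availability and extra query throughput.

```python
def shard_of(doc_id, n_shards):
    """Assign each document to a shard (here: simple modulo)."""
    return doc_id % n_shards

N_SHARDS = 4
# One small inverted index per shard: term -> list of docIDs.
shards = [dict() for _ in range(N_SHARDS)]

def add(term, doc_id):
    shards[shard_of(doc_id, N_SHARDS)].setdefault(term, []).append(doc_id)

for doc_id in [2, 4, 8, 13, 16]:
    add("caesar", doc_id)

def search(term):
    """Fan the query out to every shard and merge the answers."""
    return sorted(d for s in shards for d in s.get(term, []))

print(search("caesar"))  # [2, 4, 8, 13, 16]
```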

Administrivia

 Work/Grading:
   Problem sets and programming exercises: 50%
   Quizzes: 20%
   Group project: 30%
 Textbooks:
   Introduction to Information Retrieval --- Manning, Raghavan and Schutze
   Collective Intelligence --- Toby Segaran

Administrivia

 The exercises (and group project) will use Lucene (lucene.apache.org)
   Open-source full-text indexing system
 Guest lectures from local industry
   Umbria (JD Powers)
   Google
   Lijit
   Collective Intellect

Administrivia

 Professor: Jim Martin
   James.martin@colorado.edu
   ECOT 735
   Office hours TBA
   www.cs.colorado.edu/~martin/csci7000/

Next time

 Read Chapter 1 of both texts for next time.
