2. Text Operations
[Diagram: the retrieval process, showing text operations, document representation, retrieved documents, and evaluation]
Document representation
• An IR system (search engine) does not scan every document to check whether it satisfies the query
• It uses an index to quickly locate the relevant
documents
• Index: a list of concepts with pointers to
documents that discuss them
– What goes in the index is important
• Document representation: deciding what
concepts should go in the index
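A minimal Python sketch of this idea, assuming a toy in-memory collection (the documents and names here are illustrative assumptions): the mapping from concepts to document IDs plays the role of the index's "pointers to documents", so lookup is a dictionary access rather than a scan.

```python
# Minimal sketch of an index: concepts -> pointers to documents.
# The collection and the concept assignments are toy data for illustration.
from collections import defaultdict

documents = {
    1: "information retrieval systems use an index",
    2: "search engines rank retrieved documents",
    3: "an index maps concepts to documents",
}

index = defaultdict(list)                # concept -> list of document IDs (the "pointers")
for doc_id, text in documents.items():
    for concept in set(text.split()):    # here: concepts are simply the document's terms
        index[concept].append(doc_id)

# Lookup is a dictionary access, not a scan over every document.
print(index["index"])                    # -> [1, 3]
```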
Document representation
• Two options:
• Controlled vocabulary – a set of manually
constructed concepts that describe the major
topics covered in the collection
• Free-text indexing – the set of individual
terms that occur in the collection
Document representation
• Controlled Vocabulary: a set of well-defined
concepts
– Assigned to documents by humans (or
automatically)
• E.g., subject headings, keywords, etc.
– May include parent-child relations between concepts
• E.g. computers: software: Information Retrieval
– Facilitate non-query-based browsing and
exploration
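A minimal sketch of how such a vocabulary with parent-child relations could be stored; the concept names, the toy assignment, and the broader_terms helper are illustrative assumptions rather than any standard scheme.

```python
# Toy controlled vocabulary: each concept points to its parent (None = top level).
# Parent-child relations support browsing from broad to narrow topics.
vocabulary = {
    "computers": None,
    "software": "computers",
    "information retrieval": "software",
}

# Concepts are assigned to documents by indexers (humans or an automatic classifier).
assigned = {"doc_42": ["information retrieval"]}

def broader_terms(concept):
    """Walk up the hierarchy, e.g. for non-query-based browsing."""
    chain = []
    while vocabulary.get(concept) is not None:
        concept = vocabulary[concept]
        chain.append(concept)
    return chain

print(broader_terms("information retrieval"))   # -> ['software', 'computers']
```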
Controlled Vocabulary: Advantages
• Concepts do not need to appear explicitly in
the text
• Relationships between concepts facilitate non-query-based navigation and exploration
• Developed by experts who know the data and
the user
• Represent the concepts/relationships that
users (presumably) care the most about
• Describe the concepts that are most central to
the document
• Concepts are unambiguous and recognizable
Controlled Vocabulary: Disadvantages
• Time consuming
• Users must know the concepts in the index
• Labor intensive
Free Text Indexing
• Represent documents using terms within the document
• Which terms? Only the most descriptive terms? Only
the unambiguous ones? All of them?
– Usually, all of them (a.k.a. full-text indexing)
• The user will use term combinations to express higher-level concepts
• Query terms will hopefully disambiguate each other
(e.g., “volkswagen golf”)
• The search engine will determine which terms are
important
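A minimal sketch of how term combinations narrow meaning under full-text indexing, assuming toy posting lists; the answer function here is a simple Boolean AND, an illustrative assumption rather than how any particular engine ranks results.

```python
# Toy full-text index: every term in the collection is indexed.
postings = {
    "volkswagen": {1, 4},
    "golf":       {1, 2, 3},   # 'golf' alone is ambiguous (car model vs. sport)
    "course":     {2, 3},
}

def answer(query):
    """Intersect the posting sets of all query terms (simple Boolean AND)."""
    terms = query.lower().split()
    result = postings.get(terms[0], set()).copy()
    for term in terms[1:]:
        result &= postings.get(term, set())
    return result

print(answer("volkswagen golf"))   # -> {1}: the car documents, not the sport
```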
How are the texts handled?
• What happens if you take the words exactly as
they appear in the original text?
• What about punctuation, capitalization, etc.?
• What about spelling errors?
• What about plural vs. singular forms of
words?
• What about cases and declensions in non-English languages?
• What about non-Roman alphabets?
Free Text Indexing: Steps
1. Mark-up removal
2. Normalization – e.g. downcasing
– Information and information
– Retrieval and RETRIEVAL
– US and us – can change the meaning of words
3. Tokenization - splitting text into words (based on
sequences of non-alphanumeric characters)
– Problematic cases: ph.d. = ph d, isn’t = isn t
4. Stopword removal
5. Apply steps 1-4 to every document in the collection
and create an index using the union of all remaining
terms (a minimal sketch of this pipeline is shown below)
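A minimal Python sketch of steps 1-4 over a toy collection; the tag-stripping pattern, the tokenizer, and the tiny stopword list are simplifying assumptions, not a full implementation.

```python
import re

STOPWORDS = {"the", "a", "an", "and", "is", "of", "to"}   # tiny illustrative list

def preprocess(raw):
    text = re.sub(r"<[^>]+>", " ", raw)           # 1. mark-up removal (strip tags)
    text = text.lower()                           # 2. normalization (downcasing)
    tokens = re.split(r"[^a-z0-9]+", text)        # 3. tokenization on non-alphanumerics
    return [t for t in tokens if t and t not in STOPWORDS]   # 4. stopword removal

collection = {
    1: "<p>Information Retrieval and the index</p>",
    2: "<p>RETRIEVAL of documents</p>",
}

# 5. apply steps 1-4 to every document; the index vocabulary is the union of terms
vocabulary = set()
for doc_id, raw in collection.items():
    vocabulary |= set(preprocess(raw))

print(sorted(vocabulary))   # -> ['documents', 'index', 'information', 'retrieval']
```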
Controlled vs. Free-Text Indexing
[Comparison chart: controlled vocabulary vs. free-text indexing along three dimensions: cost of assigning index terms, ambiguity of index terms, and detail of representation]
Word distribution: Zipf's Law
• Zipf's Law states that when the distinct words in a text are
arranged in decreasing order of their frequency of occurrence (most
frequent words first), the occurrence characteristics of the vocabulary
can be characterized by the constant rank-frequency law of Zipf.
• If the words w in a collection are ranked by their frequency f, with rank r (most frequent word first), they roughly fit the relation:
r * f = c (equivalently, f is proportional to 1/r)
– Different collections have different constants c
[Plot: frequency f against rank r, showing the characteristic 1/r curve]
• The table shows the most frequently occurring words from a corpus of 336,310 documents
containing 125,720,891 total word occurrences, of which 508,209 are unique words
Word distribution: Zipf's Law
• The product of the frequency of words (f) and their
rank (r) is approximately constant
– Rank = order of words’ frequency of occurrence
f = C * 1/r (equivalently, r * f = C)
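A minimal sketch of checking the rank-frequency relation on a text: count term frequencies, rank them in decreasing order, and inspect the product r * f, which Zipf's law predicts to be roughly constant. The file name corpus.txt and the simple tokenizer are assumptions; any large plain-text corpus will do.

```python
from collections import Counter
import re

# Read any large plain-text corpus (assumed to be in corpus.txt).
text = open("corpus.txt", encoding="utf-8").read().lower()
tokens = re.findall(r"[a-z]+", text)
freqs = Counter(tokens)

# Rank terms by decreasing frequency and print rank, frequency, and r * f.
ranked = sorted(freqs.items(), key=lambda item: -item[1])
for rank, (word, f) in enumerate(ranked[:10], start=1):
    print(f"{rank:>4}  {word:<12}  f={f:<10}  r*f={rank * f}")
```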