
Text Operations

The Basic Retrieval Process


[Figure: the basic retrieval process. A document (doc) is turned into a document representation, which is stored in the index; an information need is turned into a query representation. Comparison of the query against the index produces the retrieved documents, which are then evaluated against the information need.]
Document representation
• IR system/Search engine does not scan each
document to see if it satisfies the query
• It uses an index to quickly locate the relevant
documents
• Index: a list of concepts with pointers to
documents that discuss them
– What goes in the index is important
• Document representation: deciding what
concepts should go in the index
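To make the idea of an index concrete, here is a minimal sketch in Python of an index as a mapping from terms to pointers to the documents that discuss them. The names (build_index, docs) and the whitespace tokenization are illustrative assumptions, not part of the slides.

    # Minimal sketch: an index maps each term to the documents containing it
    from collections import defaultdict

    def build_index(docs):
        """docs: dict mapping a document id to its text."""
        index = defaultdict(set)
        for doc_id, text in docs.items():
            for term in text.lower().split():
                index[term].add(doc_id)        # pointer from term to document
        return index

    docs = {1: "information retrieval systems",
            2: "database systems",
            3: "retrieval of text documents"}
    index = build_index(docs)
    print(sorted(index["retrieval"]))          # [1, 3] -- documents that discuss the term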
Document representation
• Two options:
• Controlled vocabulary – a set of manually
constructed concepts that describe the major
topics covered in the collection
• Free-text indexing – the set of individual
terms that occur in the collection
Document representation
• Controlled Vocabulary: a set of well-defined
concepts
– Assigned to documents by humans (or
automatically)
• E.g. Subject headings, key words, etc.
– May include parent-child relations between concepts
• E.g. computers: software: Information Retrieval
– Facilitate non-query-based browsing and
exploration
Controlled Vocabulary: Advantages
• Concepts do not need to appear explicitly in
the text
• Relationships between concepts facilitate non-query-based navigation and exploration
• Developed by experts who know the data and
the user
• Represent the concepts/relationships that
users (presumably) care the most about
• Describe the concepts that are most central to
the document
• Concepts are unambiguous and recognizable
Controlled Vocabulary: Disadvantages
• Time consuming
• Users must know the concepts in the index
• Labor intensive
Free Text Indexing
• Represent documents using terms within the document
• Which terms? Only the most descriptive terms? Only
the unambiguous ones? All of them?
– Usually, all of them (a.k.a. full-text indexing)
• The user will use term-combinations to express higher
level concepts
• Query terms will hopefully disambiguate each other
(e.g., “volkswagen golf”)
• The search engine will determine which terms are
important
How are the texts handled?
• What happens if you take the words exactly as
they appear in the original text?
• What about punctuation, capitalization, etc.?
• What about spelling errors?
• What about plural vs. singular forms of
words?
• What about case and declension in non-English languages?
• What about non-roman alphabets?
Free Text Indexing: Steps
1. Mark-up removal
2. Normalization – e.g. downcasing
– Information and information
– Retrieval and RETRIEVAL
– US and us – can change the meaning of words
3. Tokenization - splitting text into words (based on
sequences of non-alpha-numeric characters)
– Problematic cases: ph.d. → ph d, isn’t → isn t
4. Stopword removal
5. Do steps 1-4 to every document in the collection
and create an index using the union of all remaining
terms
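A minimal sketch of steps 1–4 in Python. The naive markup-stripping regex, the split on non-alphanumeric characters, and the tiny stopword list are simplifying assumptions for illustration, not the rules of any particular system.

    import re

    STOPWORDS = {"the", "a", "an", "and", "of", "in", "to", "is"}   # tiny illustrative list

    def preprocess(raw_text):
        text = re.sub(r"<[^>]+>", " ", raw_text)           # 1. mark-up removal (naive)
        text = text.lower()                                 # 2. normalization (downcasing)
        tokens = re.split(r"[^a-z0-9]+", text)              # 3. tokenization on non-alphanumerics
        return [t for t in tokens if t and t not in STOPWORDS]   # 4. stopword removal

    print(preprocess("<p>Information Retrieval isn't easy.</p>"))
    # ['information', 'retrieval', 'isn', 't', 'easy']  -- note the isn't -> "isn", "t" problem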
Controlled Vs Free Text Indexing
                         Cost of assigning   Ambiguity of       Detail of
                         index terms         index terms        representation
Controlled vocabulary    High                Not ambiguous      Can’t represent arbitrary detail
Free-text indexing       Low                 Can be ambiguous   Any level of detail
• Both are effective and used often


• We will focus on free-text indexing in this
course
– cheap and easy
– most search engines use it
Free/Full Text Indexing
• Our goal is to describe content using content
• Are all words equally descriptive?
• What are the most descriptive words?
• How might a computer identify these?
• We know that language use is varied
– There are many ways to convey the same
information (which makes IR difficult)
• But, are there statistical properties of word
usage that are predictable? Across languages?
Across modalities? Across genres?
Statistical Properties of Words in a Text
• How is the frequency of different words distributed?
• How fast does vocabulary size grow with the size of a corpus?
• Such properties of a text collection greatly affect the performance of an IR system and can be used to select suitable term weights and other aspects of the system
• Three well-known results, each named after the researcher who described it, characterize the statistical properties of words in text:
– Zipf’s Law: models word distribution in a text corpus
– Luhn’s idea: measures word significance
– Heaps’ Law: shows how vocabulary size grows with the growth of the corpus size
Word Distribution/Frequency
• A few words are very common: the two most frequent words (e.g. “the”, “of”) can account for about 10% of word occurrences.
• Most words are very rare: half the words in a corpus appear only once; such words are called hapax legomena (Greek for “read only once”).
Word distribution: Zipf's Law
• Zipf's Law, named after the Harvard linguistics professor George Kingsley Zipf (1902-1950), attempts to capture the distribution of the frequencies (i.e., number of occurrences) of the words within a text
• For all the words in a collection of documents, for each word w:
f : the frequency of w
r : the rank of w in order of frequency (the most commonly occurring word has rank 1, etc.)
[Figure: Zipf’s distribution of sorted word frequencies — frequency f plotted against rank r; a word w with rank r has frequency f according to Zipf’s law.]
Word distribution: Zipf's Law
• Zipf's Law states that when the distinct words in a text are arranged in decreasing order of their frequency of occurrence (most frequent words first), the occurrence characteristics of the vocabulary can be characterized by the constant rank-frequency law of Zipf.
• If the words, w, in a collection are ranked by their frequency, f, with rank r, they roughly fit the relation:
    r × f = c    (equivalently, f ∝ 1/r)
– Different collections have different constants c.
• [Table: the most frequently occurring words from a 336,310-document corpus containing 125,720,891 total words, of which 508,209 are unique words.]
Word distribution: Zipf's Law
• The product of the frequency of words (f) and their
rank (r) is approximately constant
– Rank = order of words’ frequency of occurrence

f = C 1 / r

• Another way to state this is with an approximately correct


rule of thumb:
– Say the most common term occurs C times
– The second most common occurs C/2 times
– The third most common occurs C/3 times
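A short sketch that checks the rank × frequency ≈ constant rule of thumb on any plain-text corpus. The file name corpus.txt and the simple whitespace tokenization are assumptions for illustration.

    from collections import Counter

    with open("corpus.txt", encoding="utf-8") as f:      # any large plain-text file
        words = f.read().lower().split()

    freqs = Counter(words).most_common()                 # sorted by decreasing frequency
    for rank in (1, 2, 3, 10, 100, 1000):
        if rank <= len(freqs):
            word, f_r = freqs[rank - 1]
            # Under Zipf's law, rank * frequency stays roughly constant (= C)
            print(f"rank {rank:>5}  word {word:<12}  f = {f_r:>8}  r*f = {rank * f_r}")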
Explanations for Zipf’s Law
• The law has been explained by “principle of least effort”
which makes it easier for a speaker or writer of a
language to repeat certain words instead of coining new
and different words.
– Zipf’s explanation was his “principle of least effort”, which balances the speaker’s desire for a small vocabulary against the hearer’s desire for a large one.
Methods that Build on Zipf's Law
• Stop lists: Ignore the most frequent words (upper
cut-off). Used by almost all systems.
• Significant words: Take words in between the most
frequent (upper cut-off) and least frequent words
(lower cut-off)
• Term weighting: Give differing weights to terms
based on their frequency, with most frequent words
weighed less. Used by almost all ranking methods.
Zipf's Law Impact
• Zipf’s Law Impact on IR
– Good News: Stopwords will account for a large fraction
of text so eliminating them greatly reduces inverted-
index storage costs.
– Bad News: For most words, gathering sufficient data for
meaningful statistical analysis (e.g. for correlation
analysis for query expansion) is difficult since they are
extremely rare.
Word significance: Luhn’s Ideas
• Luhn’s idea (1958): the frequency of word occurrence in a text furnishes a useful measurement of word significance
• Luhn suggested that both extremely common and extremely uncommon words were not very useful for indexing
• For this, Luhn specifies two cutoff points, an upper and a lower cutoff, based on which non-significant words are excluded
– The words exceeding the upper cutoff were considered to be common
– The words below the lower cutoff were considered to be rare
– Hence they do not contribute significantly to the content of the text
– The ability of words to discriminate content reaches a peak at a rank-order position halfway between the two cutoffs
• Let f be the frequency of occurrence of words in a text, and r their
rank in decreasing order of word frequency, then a plot relating f & r
yields the following curve
Luhn’s Ideas

[Figure: Luhn’s word-frequency curve with upper and lower cutoffs; significant words lie between the two cutoffs.]
• Luhn (1958) suggested that both extremely common and extremely uncommon words were not very useful for document representation and indexing
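A hedged sketch of Luhn’s idea in Python: rank words by frequency and keep only those between an upper and a lower cutoff. The cutoff values (upper_rank, min_freq) below are arbitrary illustrations; Luhn did not prescribe specific numbers.

    from collections import Counter

    def significant_words(tokens, upper_rank=100, min_freq=3):
        """Keep words below the upper cutoff (not among the `upper_rank` most
        frequent) and above the lower cutoff (appearing at least `min_freq` times)."""
        freqs = Counter(tokens)
        too_common = {w for w, _ in freqs.most_common(upper_rank)}   # upper cutoff
        return {w for w, f in freqs.items()
                if w not in too_common and f >= min_freq}            # lower cutoff

    # `tokens` would come from the tokenization step described earlier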
Heaps’ Law
• As the corpus grows, the number of new
terms will increase dramatically at first, but
then will increase at a slower rate
• Nevertheless, as the corpus grows, new terms
will always be found (even if the corpus
becomes huge)
– there is no end to vocabulary growth
– invented words, proper nouns (people, products),
misspellings, email addresses, etc.
Vocabulary Growth: Heaps’ Law
• How does the size of the overall vocabulary (number of
unique words) grow with the size of the corpus?
– This determines how the size of the inverted index will scale with
the size of the corpus.
• Heaps’ law estimates the vocabulary size (number of distinct terms) of a given corpus
– The vocabulary size grows as K·n^β, where β is a constant between 0 and 1.
– If V is the size of the vocabulary and n is the length of the corpus in words, Heaps’ law gives the following equation:

    V = K·n^β

• Where the constants typically are:
– K ≈ 10–100
– β ≈ 0.4–0.6 (approximately square-root growth)
Heaps’ distribution
• [Figure: distribution of the vocabulary size vs. the total number of terms extracted from a text corpus.]
Example: Heaps’ Law
• We want to estimate the size of the vocabulary for a corpus of 1,000,000 words
• Assume that, based on statistical analysis of smaller corpora:
– A corpus with 100,000 words contains 50,000 unique words; and
– A corpus with 500,000 words contains 150,000 unique words
• Estimate the vocabulary size for the 1,000,000-word corpus
– What about for a corpus of 1,000,000,000 words?
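A sketch of how the two sample points determine K and β in V = K·n^β, and hence the requested estimates. The function names (fit_heaps, estimate_vocab) are illustrative; the arithmetic follows directly from Heaps’ law.

    import math

    def fit_heaps(n1, v1, n2, v2):
        """Solve V = K * n**beta from two (corpus size, vocabulary size) points."""
        beta = math.log(v2 / v1) / math.log(n2 / n1)
        k = v1 / (n1 ** beta)
        return k, beta

    def estimate_vocab(k, beta, n):
        return k * (n ** beta)

    k, beta = fit_heaps(100_000, 50_000, 500_000, 150_000)
    print(round(beta, 3), round(k, 1))                      # beta ~ 0.683, K ~ 19.3
    print(round(estimate_vocab(k, beta, 1_000_000)))        # ~ 241,000 unique words
    print(round(estimate_vocab(k, beta, 1_000_000_000)))    # ~ 27 million unique words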
Implication of Heaps’ law
• Given a corpus and a new set of data, the
number of new index terms will depend on
the size of the corpus
• Given more data, new index terms will always
be required
• This may also be true for controlled
vocabularies
– Given a corpus and a new set of data, the
requirement for new concepts will depend on the
size of the corpus
– Given more data, new concepts will always be
required
Text Operations
• Not all words in a document are equally significant for representing the contents/meaning of a document
– Some words carry more meaning than others
– Nouns are the most representative of a document’s content
• Therefore, we need to preprocess the text of the documents in a collection to be used as a source of index terms
• Using the set of all words in a collection to index documents creates too much noise for the retrieval task
– Reducing noise means reducing the number of words that can be used to refer to a document
• Preprocessing is the process of controlling the size of the
vocabulary or the number of distinct words used as index terms
– Preprocessing will lead to an improvement in the information retrieval
performance
• However, some search engines on the Web omit preprocessing
– Every word in the document is an index term
Text Operations
• Text operations are the process of transforming text into logical representations
• 5 main operations for selecting index terms, i.e. to
choose words/stems (or groups of words) to be used as
indexing terms:
– Tokenization of the text: generate a set of words from text
collection
– Elimination of stop words - filter out words which are not useful
in the retrieval process
– Normalization – bringing to one form – e.g. downcasing
– Stemming words - remove affixes (prefixes and suffixes) and
group together word variants with similar meaning
– Construction of term categorization structures such as
thesaurus, to capture relationship for allowing the expansion of
the original query with related terms
Tokenization
• Tokenization is one of the steps used to convert the text of the documents into a sequence of words, w1, w2, …, wn, to be adopted as index terms
– It is the process of demarcating and possibly classifying sections of a string of input characters into words
– For example,
• The quick brown fox jumps over the lazy dog
• Objective: identify the words in the text
– What does “a word” mean?
• Is it a sequence of alphabetic characters, numbers, or alphanumeric ones?
– How do we identify the set of words that exist in a text document?
• Tokenization issues
– numbers, hyphens, punctuation marks, apostrophes, …
Issues in Tokenization
• One word or multiple: how to handle special cases involving hyphens, apostrophes, punctuation marks etc.? C++, C#, URLs, e-mail, …
– Sometimes punctuation (e-mail), numbers (1999), and case (Republican vs. republican) can be a meaningful part of a token.
– However, frequently they are not
• Two words may be connected by hyphens
– Should two words connected by hyphens be taken as one word or two? Break up the hyphenated sequence into two tokens?
• In most cases a hyphen is broken up into separate words (e.g. state-of-the-art → state of the art), but some words, e.g. MS-DOS, are unique words which require hyphens
• Two words may be connected by punctuation marks
– Punctuation marks: remove totally unless significant, e.g. program code:
x.exe and xexe. What about Kebede’s, www.command.com?
• Two words (phrase) may be separated by space
– E.g. Addis Ababa, San Francisco, Los Angeles
• Two words may be written in different ways
– lowercase, lower-case, lower case? data base, database, data-base?
Issues in Tokenization
• Numbers: are numbers/digits words and used as index terms?
– dates (3/12/91 vs. Mar. 12, 1991);
– phone numbers (+251923415005)
– IP addresses (100.2.86.144)
– Numbers are not good index terms (like 1910, 1999); but 510 B.C. is unique.
Generally, don’t index numbers as text, though very useful.
• What about the case of letters (e.g. Data or data or DATA)?
– Case is usually not important, so there is a need to convert everything to upper or lower case. Which one is mostly used by human beings?
• The simplest approach is to ignore all numbers and punctuation marks (period, colon, comma, brackets, semicolon, apostrophe, …) and use only case-insensitive, unbroken strings of alphabetic characters as words.
– Will often index “meta-data”, including creation date, format, etc. separately
• Issues of tokenization are language specific
– Requires the language to be known
Tokenization
• Analyze text into a sequence of discrete tokens (words)
• Input: “Friends, Romans and Countrymen”
• Output: Tokens (an instance of a sequence of characters that
are grouped together as a useful semantic unit for processing)
– Friends
– Romans
– and
– Countrymen
• Each such token is now a candidate for an index entry,
after further processing
• But what are valid tokens to emit?
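A small sketch showing how different tokenization rules emit different tokens for the same input. The two regular expressions are illustrative choices, not a standard.

    import re

    text = "Friends, Romans and Countrymen aren't amused."

    # Rule A: split on any non-alphanumeric character (apostrophes break words)
    print(re.split(r"[^A-Za-z0-9]+", text))
    # ['Friends', 'Romans', 'and', 'Countrymen', 'aren', 't', 'amused', '']

    # Rule B: keep internal apostrophes as part of a token
    print(re.findall(r"[A-Za-z0-9]+(?:'[A-Za-z0-9]+)?", text))
    # ['Friends', 'Romans', 'and', 'Countrymen', "aren't", 'amused']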
Exercise: Tokenization
• The cat slept peacefully in the living room. It’s a very old cat.
• The instructor (Dr. O’Neill) thinks that the boys’ stories about Chile’s capital aren’t amusing.
Stopword Removal
• A stopword is a term that is discarded from the document
representation
• Stopwords: words that we ignore because we expect them not to
be useful in distinguishing between relevant/non-relevant
documents for any query
• Stopwords are extremely common words across document
collections that have no discriminatory power
• Assumption: stopwords are unimportant because they are frequent
in every document
– They may occur in 80% of the documents in a collection.
– They would appear to be of little value in helping select documents matching a user need and need to be filtered out as potential index terms
• Stopwords are typically function words:
– Examples of stopwords are articles, prepositions, conjunctions, etc.:
• articles (a, an, the); pronouns: (I, he, she, it, their, his)
• Some prepositions (on, of, in, about, besides, against); conjunctions/ connectors (and,
but, for, nor, or, so, yet), verbs (is, are, was, were), adverbs (here, there, out, because,
soon, after) and adjectives (all, any, each, every, few, many, some) can also be treated as
stopwords
• Stopwords are language dependent
Why Stopword Removal?
• Intuition:
–Stopwords have little semantic content; It is typical to
remove such high-frequency words
–Stopwords take up 50% of the text. Hence, document size
reduces by 30-50%
• Smaller indices for information retrieval
–Good compression techniques for indices: The 30
most common words account for 30% of the tokens in
written text
• With the removal of stopwords, we can get a better approximation of term importance for text classification, text categorization, text summarization, etc.
How to detect a stopword?
• One method: Sort terms (in decreasing order) by
document frequency (DF) and take the most frequent
ones based on the cutoff point
–In a collection about insurance practices, “insurance”
would be a stop word

• Another method: Build a stop word list that contains a set of articles, pronouns, etc.
–Why do we need stop lists: With a stop list, we can
compare and exclude from index terms entirely the
commonest words.
–Can you identify common words in Amharic and build
stop list?
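A sketch of the first method: rank terms by document frequency (DF) and treat the top terms as stopword candidates. The cutoff of 50 terms is an arbitrary illustration.

    from collections import Counter

    def stopword_candidates(docs_tokens, cutoff=50):
        """docs_tokens: list of token lists, one per document.
        Returns the `cutoff` terms with the highest document frequency."""
        df = Counter()
        for tokens in docs_tokens:
            df.update(set(tokens))            # count each term once per document
        return [term for term, _ in df.most_common(cutoff)]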
Trends in Stopwords
• Stopword elimination used to be standard in older IR
systems. But the trend is away from doing this nowadays.
• Most web search engines index stopwords:
– Good query optimization techniques mean you pay little at query time for including stopwords.
– You need stopwords for:
• Phrase queries: “King of Denmark”
• Various song titles, etc.: “Let it be”, “To be or not to be”
• “Relational” queries: “flights to London”
– Elimination of stopwords might reduce recall (e.g. “To be or not
to be” – all eliminated except “be” – no or irrelevant retrieval)
Normalization
• Normalization is canonicalizing tokens so that matches occur despite superficial differences in the character sequences of the tokens
– Need to “normalize” terms in indexed text as well as query
terms into the same form
– Example: We want to match U.S.A. and USA, by deleting
periods in a term
• Case Folding: Often best to lower case everything,
since users will use lowercase regardless of ‘correct’
capitalization…
– Republican vs. republican
– Fasil vs. fasil vs. FASIL
– Anti-discriminatory vs. anti-discriminatory
– Car vs. Automobile?
Normalization issues
• Good for
– Allow instances of Automobile at the beginning of a
sentence to match with a query of automobile
– Helps a search engine when most users type ferrari
while they are interested in a Ferrari car
• Bad for
– Proper names vs. common nouns
• E.g. General Motors, Associated Press, Kebede…
• Solution:
– lowercase only words at the beginning of the sentence
• In IR, lowercasing is most practical because of the way
users issue their queries
Stemming/morphological analysis
• Basic question: words occur in different forms.
Do we want to treat different forms as different
index terms?
• Conflation: treating different (inflectional and
derivational) variants as the same index term
• What are we trying to achieve by conflating
morphological variants?
• Goal: help the system ignore unimportant
variations of language usage
Stemming/Morphological analysis
• Stemming reduces tokens to their “root” form of words to
recognize morphological variation
– The process involves removal of affixes (i.e. prefixes and suffixes)
with the aim of reducing variants to the same stem

– Often removes the inflectional and derivational morphology of a word
• Inflectional morphology: varies the form of words in order to express grammatical features, such as singular/plural or past/present tense. E.g. boy → boys, cut → cutting.
• Derivational morphology: makes new words from old ones. E.g. creation is formed from create, but they are two separate words. Also, destruction → destroy.
• Compounding: combining words to form new ones, e.g. beefsteak
• Stemming is language dependent
– Correct stemming is language specific and can be complex
– E.g. “for example compressed and compression are both accepted” stems to “for example compress and compress are both accept”
Stemming
• The final output from a conflation algorithm is a set of classes,
one for each stem detected
–A Stem: the portion of a word which is left after the removal of its
affixes (i.e., prefixes and/or suffixes).
–Example: ‘connect’ is the stem for {connected, connecting
connection, connections}
–Thus, [automate, automatic, automation] all reduce to 
automat
• A class name is assigned to a document if and only if one of its
members occurs as a significant word in the text of the
document
–A document representative then becomes a list of class names,
which are often referred as the documents index terms/keywords
• Queries : Queries are handled in the same way
Ways to implement stemming
There are basically two ways to implement stemming
–The first approach is to create a big dictionary that maps words
to their stems
• The advantage of this approach is that it works perfectly (insofar
as the stem of a word can be defined perfectly); the disadvantages
are the space required by the dictionary and the investment
required to maintain the dictionary as new words appear
–The second approach is to use a set of rules that extract stems
from words
• Techniques widely used include: rule-based, statistical, machine
learning or hybrid
• The advantages of this approach are that the code is typically
small, & it can gracefully handle new words; the disadvantage is
that it occasionally makes mistakes
–But, since stemming is imperfectly defined, anyway, occasional
mistakes are tolerable, & the rule-based approach is the one
that is generally chosen
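A toy illustration of the second (rule-based) approach: a few ordered suffix-stripping rules. This is not the Porter algorithm (covered next), just a sketch of the idea under simplified assumptions, and it will make mistakes.

    # Ordered (suffix, replacement) rules; only the first matching rule fires
    RULES = [("sses", "ss"), ("ies", "i"), ("ing", ""), ("ed", ""), ("s", "")]

    def crude_stem(word):
        for suffix, replacement in RULES:
            # Guard: don't strip if the remaining stem would be too short
            if word.endswith(suffix) and len(word) - len(suffix) > 2:
                return word[: -len(suffix)] + replacement
        return word

    print([crude_stem(w) for w in ["caresses", "ponies", "connecting", "connected", "cats"]])
    # ['caress', 'poni', 'connect', 'connect', 'cat']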
Porter Stemmer
• Stemming is the operation of stripping the suffixes from a word, leaving its stem
– Google, for instance, uses stemming to search for web pages
containing the words connected, connecting, connection and
connections when users ask for a web page that contains the word
connect.
• In 1979, Martin Porter developed a stemming algorithm that
uses a set of rules to extract stems from words, and though it
makes some mistakes, most common words seem to work out
right
– Porter describes his algorithm and provides a reference
implementation in C at
http://tartarus.org/~martin/PorterStemmer/index.html
Porter stemmer
• Most common algorithm for stemming English words to
their common grammatical root
• It is a simple procedure for removing known affixes in English without using a dictionary. To get rid of plurals, the following rules are used:
– SSES → SS    caresses → caress
– IES → I      ponies → poni
– SS → SS      caress → caress
– S →          cats → cat

– EMENT →      (delete the final “ement” if what remains is longer than 1 character)
               replacement → replac
               cement → cement
Porter stemmer
• Porter stemmer works in steps.
– While step 1a gets rid of plurals –s and -es, step 1b removes
-ed or -ing.
– e.g.
agreed → agree        disabled → disable
matting → mat         mating → mate
meeting → meet        milling → mill
messing → mess        meetings → meet
feed → feed
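If the third-party NLTK library is available (an assumption; the slides point to Porter’s own reference implementation in C), its Porter stemmer can be used to reproduce examples like these.

    # Assumes NLTK is installed (pip install nltk)
    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()
    for word in ["connected", "connecting", "connection", "connections",
                 "caresses", "ponies", "cats", "replacement"]:
        print(word, "->", stemmer.stem(word))
    # Expected per Porter's rules: connected -> connect, caresses -> caress,
    # ponies -> poni, replacement -> replac, etc.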
Stemming: challenges
• May produce unusual stems that are not English
words:
– Removing ‘UAL’ from FACTUAL and EQUAL

• May conflate (reduce to the same token) words that are actually distinct
– “computer”, “computational”, “computation” all reduced to the same token “comput”
• May not recognize all morphological derivations
Thesauri
• Full-text searching alone often cannot be accurate, since different authors may select different words to represent the same concept
– Problem: The same meaning can be expressed using different
terms that are synonyms, and related terms
– How can it be achieved such that for the same meaning the
identical terms are used in the index and the query?
• Thesaurus: The vocabulary of a controlled indexing language,
formally organized so that a priori relationships between
concepts (for example as "broader" and “related") are made
explicit
• A thesaurus contains terms and relationships between terms
– IR thesauri rely typically upon the use of symbols such as
USE/UF (UF=used for), BT, and RT to demonstrate inter-term
relationships
– e.g., car = automobile, truck, bus, taxi, motor vehicle
-color = colour, paint
Aim of Thesaurus
• Thesaurus tries to control the use of the vocabulary by showing a
set of related words to handle synonyms
• The aim of thesaurus is therefore:
– to provide a standard vocabulary for indexing and searching
• A thesaurus rewrites terms to form equivalence classes, and we index such equivalence classes
• When the document contains automobile, index it under car as well
(usually, also vice-versa)
– to assist users with locating terms for proper query formulation:
When the query contains automobile, look under car as well for
expanding query
– to provide classified hierarchies that allow the broadening and
narrowing of the current request according to user needs
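A minimal sketch of using a thesaurus for query expansion: each query term is replaced by its equivalence class. The tiny hand-built thesaurus here is only an illustration.

    # Illustrative hand-built thesaurus: term -> set of equivalent/related terms
    THESAURUS = {
        "car": {"car", "automobile", "motor vehicle"},
        "automobile": {"car", "automobile", "motor vehicle"},
        "colour": {"colour", "color"},
    }

    def expand_query(query_terms):
        expanded = set()
        for term in query_terms:
            expanded |= THESAURUS.get(term, {term})   # fall back to the term itself
        return expanded

    print(expand_query(["automobile", "paint"]))
    # {'car', 'automobile', 'motor vehicle', 'paint'}  (set order may vary)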
Thesaurus Construction
Example: thesaurus built to assist IR for searching
cars and vehicles :
Term: Motor vehicles
UF : Automobiles
Cars
Trucks
BT: Vehicles
RT: Road Engineering
Road Transport
More Example
Example: thesaurus built to assist IR in the fields of
computer science:
TERM: natural languages
– UF natural language processing (UF=used for NLP)
– BT languages (BT=broader term is languages)
– TT languages (TT = top term is languages)
– RT artificial intelligence (RT=related term/s)
computational linguistic
formal languages
query languages
speech recognition
Language-specificity
• Many of the above features embody
transformations that are
– Language-specific and
– Often, application-specific
• These are “plug-in” addenda to the indexing
process
• Both open source and commercial plug-ins are
available for handling these
Index Term Selection
• Index language is the language used to describe
documents and requests
• Elements of the index language are index terms which
may be derived from the text of the document to be
described, or may be arrived at independently
– If a full text representation of the text is adopted, then all
words in the text are used as index terms = full text
indexing
– Otherwise, need to select the words to be used as index
terms for reducing the size of the index file which is basic
to design an efficient searching IR system
