
Modern Information Storage and Retrieval: Document/Text Operations

Modern Information Storage and Retrieval discusses document/text operations for information retrieval systems including tokenization, handling HTML tokens, removing stopwords, and stemming tokens. Tokenization breaks text into discrete tokens while sometimes preserving punctuation and numbers. Stopwords like "a", "the", "in" are typically excluded. Stemming reduces tokens to their root form like reducing "computer", "computational", "computation" to the token "comput".

Uploaded by teddy demissie

Modern Information Storage and Retrieval

Document/Text Operations
Tokenization
• Analyze text into a sequence of discrete tokens.
• Sometimes punctuation (e-mail), numbers (1999), and case (God vs. god) can be a meaningful part of a token.
• However, frequently they are not.
• The simplest approach is to ignore all numbers and punctuation and use only case-insensitive unbroken strings of alphabetic characters as tokens.
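
The simplest approach above can be sketched in a few lines of Python. This is an illustration only: the function name `tokenize` is hypothetical, and the pattern assumes ASCII letters, so accented or non-Latin text would need a different character class.

```python
import re

def tokenize(text: str) -> list[str]:
    """Return lowercased, unbroken runs of ASCII letters.

    Digits and punctuation are not kept; they only act as separators.
    """
    return re.findall(r"[a-z]+", text.lower())

print(tokenize("E-mail God in 1999!"))  # ['e', 'mail', 'god', 'in']
```

The output shows what this approach gives up: "E-mail" splits into two tokens, "1999" disappears, and "God" can no longer be told apart from "god".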
Tokenizing HTML

• Should text in HTML commands not typically seen by the user be included as tokens?
– Words appearing in URLs.
– Words appearing in “meta text” of images.
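
One way to make this choice explicit is a switch on the tokenizer, sketched below with Python's standard `html.parser` module. The class `TokenCollector`, the function `html_tokens`, and the flag `include_hidden` are hypothetical names. Reading "URLs" as the `href` and `src` attributes and "meta text" of images as the `alt` attribute is an assumption, not something the slides specify.

```python
import re
from html.parser import HTMLParser

# Assumption: these attributes stand in for "URLs" and image "meta text".
HIDDEN_ATTRS = ("href", "src", "alt")

class TokenCollector(HTMLParser):
    """Collect alphabetic tokens from page text and, optionally, from tag attributes."""

    def __init__(self, include_hidden: bool = False):
        super().__init__()
        self.include_hidden = include_hidden
        self.tokens: list[str] = []

    def _add(self, text: str) -> None:
        self.tokens.extend(re.findall(r"[a-z]+", text.lower()))

    def handle_starttag(self, tag, attrs):
        if not self.include_hidden:
            return
        for name, value in attrs:
            if value and name in HIDDEN_ATTRS:
                self._add(value)

    def handle_data(self, data):
        # Text between tags. Note that this sketch does not filter out
        # the contents of <script> or <style> elements.
        self._add(data)

def html_tokens(html: str, include_hidden: bool = False) -> list[str]:
    parser = TokenCollector(include_hidden)
    parser.feed(html)
    parser.close()
    return parser.tokens

page = '<a href="http://example.com/stemming">Read more</a> <img src="cat.png" alt="sleeping cat">'
print(html_tokens(page))                       # ['read', 'more']
print(html_tokens(page, include_hidden=True))
```

With `include_hidden=True` the URL and image words join the token stream, which may help recall but also admits noise such as "http" and "png".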
Stopwords

• It is typical to exclude high-frequency words (e.g. function words: “a”, “the”, “in”, “to”; pronouns: “I”, “he”, “she”, “it”).
• Stopwords are language dependent.
• For efficiency, store the stopword strings in a hash table so that each one can be recognized in expected constant time.
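
A minimal sketch of the hash-table lookup, using Python's built-in `frozenset` (which is hash-based) as the table. The list holds only the example words from this slide, not a complete English stopword list, and the function name `remove_stopwords` is hypothetical. Tokens are assumed to be lowercased already.

```python
# Example stopwords taken from the slide; a real list would be much longer
# and would differ per language.
STOPWORDS = frozenset({"a", "the", "in", "to", "i", "he", "she", "it"})

def remove_stopwords(tokens: list[str]) -> list[str]:
    """Drop tokens found in the stopword set (average O(1) membership test)."""
    return [t for t in tokens if t not in STOPWORDS]

print(remove_stopwords(["she", "put", "it", "in", "the", "index"]))  # ['put', 'index']
```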
Stemming

• Reduce tokens to the “root” form of words to recognize morphological variation.
• “computer”, “computational”, and “computation” are all reduced to the same token, “comput”.
• Correct morphological analysis is language specific and can be complex.
• Stemming “blindly” strips off known affixes (prefixes and suffixes) in an iterative fashion.
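
The iterative stripping idea can be illustrated with a toy stemmer. This is not the Porter algorithm or any other published stemmer: the suffix list and the minimum stem length are made-up assumptions chosen to reproduce the “comput” example, and prefixes are not handled at all.

```python
# Hypothetical suffix list for illustration only, tried in this order.
SUFFIXES = ("ation", "ing", "er", "al", "s")
MIN_STEM_LENGTH = 3  # assumed guard so short words are not stripped to nothing

def toy_stem(token: str) -> str:
    """Repeatedly strip the first matching suffix until none applies."""
    stripped = True
    while stripped:
        stripped = False
        for suffix in SUFFIXES:
            stem = token[: -len(suffix)]
            if token.endswith(suffix) and len(stem) >= MIN_STEM_LENGTH:
                token = stem
                stripped = True
                break  # restart from the top of the suffix list
    return token

for word in ("computer", "computational", "computation"):
    print(word, "->", toy_stem(word))  # each maps to 'comput'
```

Because the stripping is blind, a word such as "national" would also be cut (to "nation"), which shows why the slide calls correct morphological analysis complex.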
