IRS Unit 2
INDEXING:
Indexing is the transformation of a received item into a searchable data structure.
• The process can be manual or automatic.
• It supports either a direct search of the document database or an indirect search through index files.
• Concept-based representation: instead of transforming the input into a searchable
format, some systems transform the input into a different representation that is concept
based. The search then returns items according to this representation of the incoming items.
• History of indexing: shows how information processing capabilities depended first on
manual and then on automatic processing systems.
• Indexing was originally called cataloguing: the oldest technique for identifying the contents of
items to assist in retrieval.
• Items overlap between full item indexing and public and private indexing of files.
Objectives:
The public file indexer needs to consider the information needs of all users of the library
system.
• Users may use public index files as part of their search criteria to increase recall.
• They can constrain their search with private index files.
• The primary objective of representing the concepts within an item is to help users
find relevant information.
1. Direct Search: This involves creating a searchable format directly in the document
database, meaning users can search through the actual content of the documents.
2. Indirect Search: Here, users search through index files, which are created as a separate,
condensed representation of the original items.
While indexing was initially used for full item representation, it has since developed to
support both public and private indexing. This is important for managing the retrieval of
various types of content based on who is accessing it.
2. Catering to Diverse User Needs: A public file indexer, for instance, must consider the
broad range of information needs among users of a library system. This includes designing
the indexing system to accommodate different types of searches, ensuring both general and
specialized users can find the content they need.
3. Balancing Recall and Precision: Users may rely on public index files to increase recall,
that is, to retrieve a wider set of results relevant to a search. However, they can also constrain
their search through private index files, which allow for more focused and specific retrieval.
This flexibility helps balance retrieving a broad set of results (recall) against homing in
on specific, relevant items (precision).
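The trade-off between recall and precision can be illustrated with a small calculation. The sketch below assumes the usual definitions (precision = relevant retrieved / retrieved, recall = relevant retrieved / relevant); the item ids and the two result sets are made-up values for illustration only.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for sets of retrieved and relevant item ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical data: four items are truly relevant to the user's need.
relevant = {1, 2, 3, 4}
public_hits = {1, 2, 3, 5, 6, 7}   # broad search over a public index
private_hits = {1, 2}              # search constrained by a private index

print(precision_recall(public_hits, relevant))    # broader: higher recall
print(precision_recall(private_hits, relevant))   # narrower: higher precision
```

In this toy case the broad search finds more of the relevant items but also returns more irrelevant ones, while the constrained search returns only relevant items but misses half of them.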
Conclusion
Indexing is vital to modern information systems, offering multiple approaches to search and
retrieval, from manual cataloging to concept-based automated systems. It supports both
public and private retrieval needs, enhancing users' ability to discover and access relevant
information efficiently.
Text Processing:
1.Document Parsing: Documents come in all sorts of languages, character sets, and
formats; often, the same document may contain multiple languages or formats, e.g., a French
email with Portuguese PDF attachments. Document parsing deals with the recognition and
“breaking down” of the document structure into individual components. In this
preprocessing phase, unit documents are created; e.g., emails with attachments are split into one
document representing the email and as many documents as there are attachments.
2. Lexical Analysis: After parsing, lexical analysis tokenizes a document, seen as an input
stream, into words. Issues related to lexical analysis include the correct identification of
accents, abbreviations, dates, and cases. The difficulty of this operation depends much on the
language at hand: for example, the English language has neither diacritics nor cases, French
has diacritics but no cases, German has both diacritics and cases. The recognition of
abbreviations and, in particular, of time expressions would deserve a separate chapter due to
its complexity and the extensive literature in the field.
3. Stop-Word Removal: A subsequent step optionally applied to the results of lexical analysis is
stop-word removal, i.e., the removal of high-frequency words. For example, given the sentence “search
engines are the most visible information retrieval applications” and a classic stop-word set
such as the one adopted by the Snowball stemmer, the effect of stop-word removal would
be: “search engines most visible information retrieval applications”.
4. Phrase Detection: This step captures text meaning beyond what is possible with pure
bag-of-words approaches, thanks to the identification of noun groups and other phrases. Phrase
detection may be approached in several ways, including rules (e.g., retaining terms that are
not separated by punctuation marks), morphological analysis, syntactic analysis, and
combinations thereof. For example, scanning our example sentence “search engines are the
most visible information retrieval applications” for noun phrases would probably result in
identifying “search engines” and “information retrieval”.
5. Weighting: The final phase of text preprocessing deals with term weighting. As previously
mentioned, words in a text have different descriptive power; hence, index terms can be
weighted differently to account for their significance within a document and/or a document
collection. Such a weighting can be binary, e.g., assigning 0 for term absence and 1 for
presence.
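The lexical analysis, stop-word removal, and binary weighting steps described above can be sketched in a few lines. This is a minimal illustration, not a production pipeline: the stop-word set is a tiny hand-picked assumption (not the full Snowball list), the tokenizer only handles ASCII letters and digits, and the function names are hypothetical.

```python
import re

# Illustrative stop-word set (an assumption, far smaller than a real list).
STOP_WORDS = {"a", "an", "and", "are", "is", "of", "the", "to"}


def tokenize(text):
    """Lexical analysis: lowercase the text and split it into word tokens.

    Simplification: only ASCII letters and digits are kept, so accented
    characters would need extra handling.
    """
    return re.findall(r"[a-z0-9]+", text.lower())


def remove_stop_words(tokens):
    """Drop high-frequency words that carry little descriptive power."""
    return [t for t in tokens if t not in STOP_WORDS]


def binary_weights(tokens, vocabulary):
    """Binary weighting: 1 if the term occurs in the document, else 0."""
    present = set(tokens)
    return {term: int(term in present) for term in vocabulary}


sentence = "Search engines are the most visible information retrieval applications"
tokens = remove_stop_words(tokenize(sentence))
weights = binary_weights(tokens, ["search", "index", "retrieval"])
print(tokens)
print(weights)
```

A real system would typically replace the binary weights with frequency-based weights, but the 0/1 scheme is enough to show where weighting sits in the pipeline.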
Data structure:
Knowledge of the data structures gives insight into the capabilities available to the system.
1. One structure, called the document manager, stores and manages received items in their
normalized form.
2. The other data structure contains processing tokens and associated data to support search.
The results of a search are references to the items that satisfy the search statement; these are
passed to the document manager for retrieval.
Stemming: a transformation often applied to data before placing it in the searchable data
structure. Stemming reduces a concept (word) to a canonical (authorized; recognized;
accepted) morphological (relating to the patterns of word formation in a particular language)
representation.
Risk with stemming: concept discrimination information may be lost in the process, causing
a decrease in performance.
STEMMING ALGORITHMS
• A stemming algorithm is used to improve the efficiency of the IRS and to improve recall.
• Conflation (the process or result of fusing items into one entity; fusion; amalgamation) is the
term used for mapping multiple morphological variants to a single
representation (stem).
• The stem carries the meaning of the concept associated with the word, and the affixes (endings)
introduce subtle (slight) modifications of the concept.
• Terms with a common stem will usually have similar meanings, for example:
• CONNECT
• CONNECTED
• CONNECTING
• CONNECTION
• CONNECTIONS
• Frequently, the performance of an IR system will be improved if term groups such as this
are conflated into a single term. This may be done by removing the various suffixes -ED,
-ING, -ION, -IONS to leave the single term CONNECT.
• In addition, the suffix stripping process will reduce the total number of terms in the IR
system, and hence reduce the size and complexity of the data in the system, which is always
advantageous.
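The suffix removal described for the CONNECT family can be sketched as follows. This is a deliberately naive illustration, not the Porter algorithm: the suffix list covers only the four endings named above, and the minimum-stem-length guard is an assumption added to avoid stripping very short words.

```python
# Longest suffixes first, so that -IONS is tried before -ION.
SUFFIXES = ("ions", "ion", "ing", "ed")


def strip_suffix(word):
    """Remove one known suffix, keeping a stem of at least three letters."""
    w = word.lower()
    for suffix in SUFFIXES:
        if w.endswith(suffix) and len(w) > len(suffix) + 2:
            return w[: -len(suffix)]
    return w


family = ["CONNECT", "CONNECTED", "CONNECTING", "CONNECTION", "CONNECTIONS"]
stems = {strip_suffix(term) for term in family}
print(stems)  # all five variants conflate to a single stem
```

Blind stripping like this also mangles words where the ending is not really a suffix, which is why practical stemmers attach conditions to each rule.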
It is important for a system to categorize a word before deciding whether to stem it.
Proper names and acronyms (words formed from the initial letters of a name, e.g.
IARE) should not have stemming applied.
Stemming can also cause problems for natural language processing (NLP) systems by
causing loss of information.
PORTER STEMMING ALGORITHM
• Based on a set of conditions on the stem.
• A consonant in a word is a letter other than A, E, I, O or U, and other than Y preceded by a
consonant. Some important stem conditions are:
1. The measure m of a stem is defined over its alternating sequences of consonants (C)
and vowels (V).
2. Every stem has the form [C](VC)^m[V], where m is the number of times VC repeats. The case
m = 0 covers the null word.
3. *<X> - stem ends with the letter X.
4. *v* - stem contains a vowel.
5. *d - stem ends in a double consonant (e.g. -TT, -SS).
6. *o - stem ends in a consonant-vowel-consonant sequence where the final consonant is not W, X
or Y (e.g. -WIL, -HOP).
Suffix conditions take the form current_suffix == pattern.
Actions take the form old_suffix -> new_suffix.
Rules are divided into steps that define the order in which the rules are applied.
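The stem conditions above can be checked mechanically. The following sketch computes the measure m and the *v*, *d and *o conditions; it is not a full Porter stemmer (no rules or steps are applied). It assumes Porter's original treatment of Y, which counts as a vowel when preceded by a consonant, and the function names are hypothetical. All functions expect lowercase input except `measure`, which lowercases for convenience.

```python
def is_consonant(word, i):
    """True if word[i] is a consonant (word must be lowercase)."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        # Y is a consonant at the start or after a vowel, a vowel after a consonant.
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    """Number of VC repeats, m, in the form [C](VC)^m[V]."""
    stem = stem.lower()
    pattern = ["c" if is_consonant(stem, i) else "v" for i in range(len(stem))]
    # Each vowel-to-consonant transition closes one VC group.
    return sum(1 for prev, cur in zip(pattern, pattern[1:]) if prev == "v" and cur == "c")


def contains_vowel(stem):
    """Condition *v*: the stem contains a vowel."""
    return any(not is_consonant(stem, i) for i in range(len(stem)))


def ends_double_consonant(stem):
    """Condition *d: the stem ends in a double consonant."""
    return len(stem) >= 2 and stem[-1] == stem[-2] and is_consonant(stem, len(stem) - 1)


def ends_cvc(stem):
    """Condition *o: ends consonant-vowel-consonant, last letter not w, x or y."""
    n = len(stem)
    return (
        n >= 3
        and is_consonant(stem, n - 3)
        and not is_consonant(stem, n - 2)
        and is_consonant(stem, n - 1)
        and stem[-1] not in "wxy"
    )


for word in ["tree", "trouble", "troubles", "private"]:
    print(word, measure(word))
```

A rule such as "(m > 0) EED -> EE" would call `measure` on the part of the word before the suffix and only rewrite the suffix when the condition holds.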
Inverted file structure
The inverted file structure is the most common data structure in information retrieval systems.
Inverted file structures are composed of three files:
1. Document file
2. Dictionary
3. Inversion lists
For each word, a list of the documents in which the word is found is stored (an inversion of the
documents).
Each document is given a unique numerical identifier, and that identifier is what is stored in the
inversion list.
The dictionary is used to locate the inversion list for a particular word. It is a sorted list of the
processing tokens in the system, each with a pointer to the location of its inversion list. The dictionary
can also store other information used in query optimization, such as the length of the inversion lists;
additional information may be stored to increase precision.
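A minimal in-memory sketch of the dictionary and inversion lists follows. It assumes whitespace tokenization and uses Python dictionaries in place of files, so the token itself plays the role of the pointer to its inversion list; the sample documents and function names are made up for illustration.

```python
from collections import defaultdict


def build_inverted_file(documents):
    """documents: dict mapping a numeric document id to its text."""
    postings = defaultdict(set)
    for doc_id, text in documents.items():
        for token in text.lower().split():
            postings[token].add(doc_id)
    # Inversion lists: one sorted list of document ids per processing token.
    inversion_lists = {tok: sorted(ids) for tok, ids in postings.items()}
    # Dictionary: sorted tokens, each with the length of its inversion list
    # (the kind of extra information a query optimizer could use).
    dictionary = {tok: len(inversion_lists[tok]) for tok in sorted(inversion_lists)}
    return dictionary, inversion_lists


def search(term, dictionary, inversion_lists):
    """Look the term up in the dictionary, then return its inversion list."""
    term = term.lower()
    if term not in dictionary:
        return []
    return inversion_lists[term]


documents = {
    1: "computer bit byte",
    2: "memory byte",
    3: "computer bit memory",
    4: "byte computer",
}
dictionary, inversion_lists = build_inverted_file(documents)
print(dictionary)
print(search("byte", dictionary, inversion_lists))
```

The ids returned by `search` are references only; in the architecture described earlier they would be passed to the document manager to fetch the items themselves.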
N-grams can be viewed as a special technique for conflation (stemming) and as a unique
data structure in information systems.
Unlike stemming, which
generally tries to determine the stem of a word that represents the semantic meaning of the
word, n-grams do not care about semantics.
Example for the phrase "sea colony":
se ea co ol lo on ny                   Bigrams (no interword symbols)
sea col olo lon ony                    Trigrams (no interword symbols)
#se sea ea# #co col olo lon ony ny#    Trigrams (with interword symbol #)
#sea# #colo colon olony lony#          Pentagrams (with interword symbol #)
The symbol # is used to represent the interword symbol, which is any one of a set of symbols
(e.g., blank, period, semicolon, colon, etc.).
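The n-gram decomposition can be reproduced with a short function. This sketch assumes the input phrase is "sea colony" (consistent with the n-grams listed above), treats only whitespace as the interword separator, and generates n-grams within each word rather than across word boundaries; the function name is hypothetical.

```python
def ngrams(text, n, interword=None):
    """Return the n-grams of each word, optionally padded with an interword symbol."""
    grams = []
    for word in text.lower().split():
        if interword is not None:
            word = interword + word + interword
        # Slide a window of length n across the (possibly padded) word.
        grams.extend(word[i:i + n] for i in range(len(word) - n + 1))
    return grams


phrase = "sea colony"
print(ngrams(phrase, 2))                  # bigrams, no interword symbols
print(ngrams(phrase, 3))                  # trigrams, no interword symbols
print(ngrams(phrase, 3, interword="#"))   # trigrams with interword symbol
print(ngrams(phrase, 5, interword="#"))   # pentagrams with interword symbol
```

Words shorter than n contribute no n-grams unless the interword symbol pads them to sufficient length.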
Uses:
Errors:
Zamora showed that trigram analysis provides a viable data structure for identifying misspellings
and transposed characters, which makes it a possible basis for flagging input errors
for correction as a procedure within the normalization process.
Advantage:
They place a finite limit on the number of searchable tokens: $MaxSeg_n = (\lambda)^n$ is the maximum
number of unique n-grams that can be generated, where n is the length of the n-grams and $\lambda$ is
the number of processable symbols.
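As a rough worked example of this bound, assuming the limit is the number of processable symbols raised to the power n, the values for a 26-letter alphabet (an illustrative choice) are:

```python
def max_segments(num_symbols, n):
    """Upper bound on the number of distinct n-grams over num_symbols symbols."""
    return num_symbols ** n


for n in (2, 3, 5):
    print(n, max_segments(26, n))
```

The bound grows quickly with n, but it is fixed in advance, unlike a vocabulary of whole words, which keeps growing as new items arrive.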
Disadvantage: the longer the n-gram, the larger the inversion lists become. Performance has been
reported at 85% precision.
Signature files and hypertext structures are two methods for information
storage and retrieval, each with its own applications, advantages, and approaches. Both are
described below, along with clustering techniques in data management.
Signature Files
A signature file structure offers an efficient way to store and retrieve data by creating
fixed-length signatures for items based on the presence of specific words. Its main benefits
are:
3. Advantages:
- Efficient Information Retrieval: Signature files provide a fast and space-efficient way to
retrieve data, especially in databases where terms appear infrequently.
- Versatile Application: They are well-suited for medium-sized databases, WORM (write-
once-read-many) devices, and scenarios requiring fast, parallel data access.
- Scalability: Signature files support distributed storage, where each node can store and
process a part of the data independently, making it ideal for distributed environments.
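One common way to build such fixed-length signatures is superimposed coding: each word sets a few bit positions, and the word signatures are OR-ed together into the item signature. The sketch below assumes that approach; the signature width, the number of bits per word, and the use of SHA-256 as the hash are all illustrative choices, not details taken from the notes.

```python
import hashlib

SIGNATURE_BITS = 64   # fixed signature width (assumed value)
BITS_PER_WORD = 3     # bit positions set per word (assumed value)


def word_signature(word):
    """Map a word to a bit pattern with up to BITS_PER_WORD bits set."""
    sig = 0
    for k in range(BITS_PER_WORD):
        digest = hashlib.sha256(f"{k}:{word.lower()}".encode("utf-8")).digest()
        position = int.from_bytes(digest[:4], "big") % SIGNATURE_BITS
        sig |= 1 << position
    return sig


def item_signature(text):
    """Superimpose (OR together) the signatures of all words in the item."""
    sig = 0
    for word in text.split():
        sig |= word_signature(word)
    return sig


def may_contain(item_sig, word):
    """True if every bit of the word's signature is set in the item signature.

    A True result can be a false positive; a False result is definitive.
    """
    w = word_signature(word)
    return (item_sig & w) == w


item = "signature files store fixed length codes"
sig = item_signature(item)
print(may_contain(sig, "files"))
```

Because different words can set overlapping bits, a match only means the item may contain the word, so candidate items still have to be checked against the original text.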
With the rise of the internet, hypertext and XML have introduced new ways of structuring
information that prioritize connectivity and detailed data representation:
1. Hypertext:
- Hypertext structures store information in formats like HTML, where documents contain
embedded pointers (links) that reference other documents or locations within the same
document.
- This web of interconnected items creates a non-linear way to browse information, which
is foundational to the web's structure. Hypertext is well-suited for information that needs
frequent updates and interactive features.
3. Advantages:
- Detailed and Interactive Data Representation: Both HTML and XML allow for intricate
data structures, with XML supporting structured data exchange and HTML providing a
visually formatted user interface.
- Flexible and Extensible: XML is designed to represent various data types and structures,
making it versatile for different applications in web development and data management.
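The embedded pointers in a hypertext document can be read out with Python's standard `html.parser` module. This is a small sketch: the sample page and the `LinkExtractor` class name are made up for illustration, and only `href` attributes on `a` tags are collected.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect the targets of all <a href="..."> links in a document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Hypothetical page with one link to another document and one to a
# location within the same document.
page = '<p>See <a href="unit1.html">Unit 1</a> or jump to <a href="#stemming">stemming</a>.</p>'
parser = LinkExtractor()
parser.feed(page)
print(parser.links)
```

Following these links from item to item is what produces the non-linear browsing structure described above.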
1. Term Clustering:
- Involves grouping index terms into clusters to create a statistical thesaurus. This process
allows search systems to identify and relate terms that often appear together, thus expanding
searches with related terms, which can significantly increase recall.
- For example, a search for "machine learning" could also return documents containing
terms like "artificial intelligence" and "data science" due to their statistical correlation.
2. Document Clustering:
- Organizes documents into clusters based on shared characteristics, which helps users
retrieve groups of similar documents in response to a query. This technique enables users to
find related items even when the query might not exactly match the desired items.
- A search could retrieve items similar to the user’s query, improving the chances of finding
relevant documents that a traditional search might miss due to keyword differences.
3. Considerations in Clustering:
- Clustering can introduce noise by including loosely related terms or documents, so careful
use of clustering methods and tuning is essential to avoid irrelevant results and maximize
search precision.
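Term clustering from co-occurrence statistics can be sketched as follows. This is a simplified illustration, not a complete clustering algorithm: two terms are treated as related when they appear together in at least `threshold` documents, and the sample documents, the threshold value, and the function names are assumptions made for the example.

```python
from collections import Counter
from itertools import combinations


def cooccurrence(documents):
    """Count, for each pair of terms, the number of documents containing both."""
    counts = Counter()
    for text in documents:
        terms = sorted(set(text.lower().split()))
        counts.update(combinations(terms, 2))
    return counts


def related_terms(term, counts, threshold=2):
    """Terms that co-occur with `term` in at least `threshold` documents."""
    related = set()
    for (a, b), c in counts.items():
        if c >= threshold:
            if a == term:
                related.add(b)
            elif b == term:
                related.add(a)
    return sorted(related)


docs = [
    "machine learning model",
    "machine learning data",
    "data index retrieval",
    "index retrieval search",
]
counts = cooccurrence(docs)
print(related_terms("machine", counts))
print(related_terms("retrieval", counts))
```

Lowering the threshold pulls in more loosely related terms, which raises recall but also introduces the noise mentioned above.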
Each of these techniques (signature files, hypertext and XML structures, and clustering)
offers distinct benefits in data management. Signature files support efficient retrieval in
distributed systems, hypertext and XML provide flexible data representation for the web, and
clustering enhances search relevance. Together, these strategies contribute to building robust
information storage and retrieval systems.
Application(s)/Advantage(s)
• Signature files provide a practical solution for storing and locating information in a number
of different situations.
• Signature files have been applied to medium-size databases, databases with low frequency
of terms, WORM devices, parallel processing machines, and distributed environments.