IRS Unit 2

CATALOGING AND INDEXING

INDEXING:

The transformation from a received item to a searchable data structure is called indexing.
• The process can be manual or automatic.
• Indexing creates either a direct search of the document database or an indirect search through index files.
• Concept-based representation: instead of transforming the input into a searchable format, some systems transform the input into a different, concept-based representation; search then returns items that are conceptually relevant to the query.
• History of indexing: shows the dependency of information processing capabilities on manual and, later, automatic processing systems.
• Indexing was originally called cataloguing: the oldest technique for identifying the contents of items to assist in retrieval.
• Items overlap between full item indexing, and public and private indexing of files.

Objectives:
The public file indexer needs to consider the information needs of all users of the library system. Items overlap between full item indexing, and public and private indexing of files.
•Users may use public index files as part of their search criteria to increase recall.
•They can constrain their search by private index files.
•The primary objective is to represent the concepts within an item so as to facilitate users finding relevant information.

Indexing is a crucial aspect of cataloging and information retrieval systems, as it transforms received items (such as documents, books, or other forms of media) into searchable data structures, which allows users to efficiently find relevant information. The process can either be manual, where human indexers analyze the content and assign descriptive terms, or automatic, where systems use algorithms to identify and tag relevant terms.

There are two main types of searches that indexing enables:

1. Direct Search: This involves creating a searchable format directly in the document
database, meaning users can search through the actual content of the documents.
2. Indirect Search: Here, users search through index files, which are created as a separate,
condensed representation of the original items.

A more advanced method of indexing is concept-based representation. Instead of simply transforming the input into a searchable format, the system transforms the input into a representation based on the key concepts or themes. This method enhances retrieval by identifying the core ideas within the data, allowing for more accurate search results. When users search within this system, it doesn't just return items that exactly match the search terms but also those that are conceptually relevant.

History and Evolution of Indexing


Indexing has evolved significantly, starting as a manual process often referred to as
cataloging, one of the earliest methods for identifying and retrieving content. Cataloging was
originally used to organize and manage collections such as libraries, enabling users to find
specific materials based on keywords, subjects, or titles. Over time, the process became more
automated, improving efficiency and allowing for larger datasets to be indexed and retrieved
faster.

While indexing was initially used for full item representation, it has since developed to
support both public and private indexing. This is important for managing the retrieval of
various types of content based on who is accessing it.

Key Objectives of Indexing

1. Facilitating Information Retrieval: The primary goal of indexing is to represent the concepts within a given item in such a way that users can efficiently retrieve relevant information. This involves categorizing or tagging the item based on its core content or subject matter.
2. Catering to Diverse User Needs: A public file indexer, for instance, must consider the
broad range of information needs among users of a library system. This includes designing
the indexing system to accommodate different types of searches, ensuring both general and
specialized users can find the content they need.

3. Balancing Recall and Precision: Users may rely on public index files to increase recall—
that is, retrieving a wider set of results relevant to a search. However, they can also constrain
their search through private index files, which allow for more focused and specific retrieval.
This flexibility helps balance between retrieving a broad set of results (recall) and homing in on specific, relevant items (precision).

Conclusion

Indexing is vital to modern information systems, offering multiple approaches to search and
retrieval, from manual cataloging to concept-based automated systems. It supports both
public and private retrieval needs, enhancing users' ability to discover and access relevant
information efficiently.
Text Processing:

1. Document Parsing: Documents come in all sorts of languages, character sets, and formats; often, the same document may contain multiple languages or formats, e.g., a French email with Portuguese PDF attachments. Document parsing deals with the recognition and "breaking down" of the document structure into individual components. In this preprocessing phase, unit documents are created; e.g., emails with attachments are split into one document representing the email and as many documents as there are attachments.

2. Lexical Analysis: After parsing, lexical analysis tokenizes a document, seen as an input stream, into words. Issues related to lexical analysis include the correct identification of accents, abbreviations, dates, and cases. The difficulty of this operation depends much on the language at hand: for example, the English language has neither diacritics nor cases, French has diacritics but no cases, and German has both diacritics and cases. The recognition of abbreviations and, in particular, of time expressions would deserve a separate chapter due to its complexity and the extensive literature in the field.

3. Stop-Word Removal: A subsequent step optionally applied to the results of lexical analysis is stop-word removal, i.e., the removal of high-frequency words. For example, given the sentence "search engines are the most visible information retrieval applications" and a classic stop-word set such as the one adopted by the Snowball stemmer, the effect of stop-word removal would be: "search engine most visible information retrieval applications".
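The tokenization and stop-word removal steps above can be sketched as follows; the stop-word set here is a tiny illustrative sample, not the actual Snowball list:

```python
import re

# Tiny illustrative stop-word set (the real Snowball list is much larger).
STOP_WORDS = {"are", "the", "a", "an", "of", "is", "to"}

def tokenize(text):
    # Lexical analysis: lowercase and split on runs of non-letters.
    return [t for t in re.split(r"[^a-z]+", text.lower()) if t]

def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]

sentence = "search engines are the most visible information retrieval applications"
print(remove_stop_words(tokenize(sentence)))
# → ['search', 'engines', 'most', 'visible', 'information', 'retrieval', 'applications']
```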

4. Phrase Detection: This step captures text meaning beyond what is possible with pure bag-of-words approaches, thanks to the identification of noun groups and other phrases. Phrase detection may be approached in several ways, including rules (e.g., retaining terms that are not separated by punctuation marks), morphological analysis, syntactic analysis, and combinations thereof. For example, scanning our example sentence "search engines are the most visible information retrieval applications" for noun phrases would probably result in identifying "search engines" and "information retrieval".

5. Stemming and Lemmatization: Following phrase extraction, stemming and lemmatization aim at stripping down word suffixes in order to normalize the word. In particular, stemming is a heuristic process that "chops off" the ends of words in the hope of achieving the goal correctly most of the time; a classic rule-based algorithm for this was devised by Porter [280]. According to the Porter stemmer, our example sentence "Search engines are the most visible information retrieval applications" would result in: "Search engin are the most visibl inform retriev applic".
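A toy suffix-stripping sketch of the idea, applied to the CONNECT family discussed later in this unit. This is not the Porter algorithm (which applies measure-based conditions across several ordered steps); it is a single rule pass, and the suffix list is an illustrative assumption:

```python
# Illustrative suffix stripper: one pass over a short list of suffixes,
# longest first, keeping at least a 3-letter stem. NOT the full Porter
# algorithm -- just the flavor of rule-based stemming.
SUFFIXES = ["ations", "ation", "ions", "ion", "ing", "ed", "es", "s"]

def naive_stem(word):
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

words = ["connect", "connected", "connecting", "connection", "connections"]
print([naive_stem(w) for w in words])
# → ['connect', 'connect', 'connect', 'connect', 'connect']
```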

6. Lemmatization: Lemmatization is a process that typically uses dictionaries and morphological analysis of words in order to return the base or dictionary form of a word, thereby collapsing its inflectional forms (see, e.g., [278]). For example, our sentence would result in "Search engine are the most visible information retrieval application" when lemmatized according to a WordNet-based lemmatizer.

7. Weighting: The final phase of text preprocessing deals with term weighting. As previously mentioned, words in a text have different descriptive power; hence, index terms can be weighted differently to account for their significance within a document and/or a document collection. Such a weighting can be binary, e.g., assigning 0 for term absence and 1 for presence.
Data structure:

The knowledge of data structures gives an insight into the capabilities available to the system.

• Each data structure has a set of associated capabilities.

• Ability to represent concepts and their relationships.

• Supports location of those concepts.

Introduction

Two major data structures in any IRS:

1. One structure stores and manages received items in their normalized form; it is called the document manager.

2. The other data structure contains processing tokens and associated data to support search.

The results of a search are references to the items that satisfy the search statement, which are passed to the document manager for retrieval.

Focus: on data structures that support the search function.

Stemming: a transformation often applied to data before placing it in the searchable data structure. Stemming reduces a concept (word) to a canonical (authorized; recognized; accepted) morphological (the patterns of word formation in a particular language) representation.

Risk with stemming: concept discrimination information may be lost in the process, causing a decrease in performance.

Advantage: has the potential to increase recall.

STEMMING ALGORITHMS

• A stemming algorithm is used to improve the efficiency of an IRS and to improve recall.

• Conflation (the process or result of fusing items into one entity; fusion; amalgamation) is the term used to refer to mapping multiple morphological variants to a single representation (the stem).

• The stem carries the meaning of the concept associated with the word, and the affixes (endings) introduce subtle (slight) modifications of the concept.

• Terms with a common stem will usually have similar meanings, for example:

• CONNECT

• CONNECTED

• CONNECTING

• CONNECTION

• CONNECTIONS

• Frequently, the performance of an IR system will be improved if term groups such as this are conflated into a single term. This may be done by removal of the various suffixes -ED, -ING, -ION, -IONS to leave the single term CONNECT.

• In addition, the suffix-stripping process will reduce the total number of terms in the IR system, and hence reduce the size and complexity of the data in the system, which is always advantageous.

The major usage of stemming is to improve recall.

 It is important for a system to categorize a word prior to making the decision to stem.
 Proper names and acronyms (a word formed from the initial letters of a name, say IARE) should not have stemming applied.
 Stemming can also cause problems for natural language processing (NLP) systems by causing loss of information.
PORTER STEMMING ALGORITHM
• Based on a set of conditions on the stem.
• A consonant in a word is a letter other than A, E, I, O or U. Some important stem conditions are:
1. The measure m of a stem is a function of sequences of vowels (V) followed by sequences of consonants (C): a word has the form [C](VC)^m[V], where m is the number of VC repetitions. The case m = 0 covers the null word.
2. *<X> - the stem ends with the letter X.
3. *v* - the stem contains a vowel.
4. *d - the stem ends in a double consonant (e.g. -TT, -SS).
5. *o - the stem ends in a consonant-vowel-consonant sequence where the final consonant is not w, x or y (e.g. -WIL, -HOP).
Suffix conditions take the form current_suffix == pattern. Actions take the form old_suffix -> new_suffix. Rules are divided into steps to define the order for applying the rules.
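The measure m from condition 1 can be computed with a short sketch. Treating 'y' uniformly as a vowel is a simplification here; Porter's actual rule for 'y' is context dependent:

```python
# Sketch of Porter's measure m: a word is viewed as [C](VC)^m[V], where C is
# a run of consonants and V a run of vowels; m counts the VC repetitions.
def measure(stem):
    # Map each letter to 'v' or 'c' ('y' treated as a vowel for simplicity).
    pattern = "".join("v" if ch in "aeiouy" else "c" for ch in stem.lower())
    # Collapse consecutive runs, then count VC pairs.
    runs = []
    for ch in pattern:
        if not runs or runs[-1] != ch:
            runs.append(ch)
    return "".join(runs).count("vc")

for w in ["tree", "by", "trouble", "oats", "trees", "troubles"]:
    print(w, measure(w))
# tree and by have m=0; trouble, oats, trees have m=1; troubles has m=2
```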

2. Dictionary look-up stemmers

Terms are looked up in a dictionary and replaced by the stem that best represents them. The INQUERY system uses a technique called Kstem, a morphological analyzer that conflates word variants to a root form. Kstem uses the following major data files to control and limit the stemming process:
1. Dictionary of words (lexicon)
2. Supplemental list of words for the dictionary
3. Exceptions list of words that should retain an 'e' at the end (e.g., "suites" to "suite" but "suited" to "suit")
4. Direct_conflation - word pairs that override the stemming algorithm
5. Country_nationality_conflation (e.g., "British" maps to "Britain")
6. Proper nouns - words that should not be stemmed
New words that are not special forms (e.g., dates, phone numbers) are located in the dictionary to determine simpler forms by stripping off suffixes and respelling plurals as defined in the dictionary.
3. Successor stemmers:
Based on the length of prefixes.
The smallest unit of speech that distinguishes one word from another.
The process uses the successor varieties of a word.
It uses this information to divide a word into segments and selects one of the segments as the stem.

The successor varieties of a word are used to segment it by applying one of the following four methods:
1. Cutoff method: a cutoff value is selected to define the stem length.
2. Peak and plateau: a segment break is made after a character whose successor variety exceeds that of the characters immediately preceding and following it.
3. Complete word method: break on boundaries of complete words.
4. Entropy method: uses the distribution of successor variety letters:
1. Let |Dak| be the number of words beginning with the k-length sequence of letters a.
2. Let |Dakj| be the number of words in Dak whose successor is j.
3. The probability that a member of Dak has the successor j is then |Dakj| / |Dak|; the entropy of this distribution over all successors j is used to select segment breaks.
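A sketch of computing successor varieties over a toy corpus (the corpus and target word are illustrative assumptions). A peak such as the one after "read" suggests a segment break under the peak-and-plateau method:

```python
# For each prefix of the target word, count the distinct letters that
# follow that prefix in the corpus: this count is the successor variety.
def successor_varieties(word, corpus):
    varieties = []
    for i in range(1, len(word) + 1):
        prefix = word[:i]
        successors = {w[i] for w in corpus
                      if w.startswith(prefix) and len(w) > i}
        varieties.append((prefix, len(successors)))
    return varieties

corpus = ["read", "reads", "reader", "reading", "red", "rope", "ripe"]
for prefix, v in successor_varieties("readable", corpus):
    print(prefix, v)
# variety drops to 1 at "rea", then peaks at "read" (successors s, e, i),
# hinting at a morpheme boundary after "read"
```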
INVERTED FILE STRUCTURE

The inverted file structure is the most common data structure. Inverted file structures are composed of three files:

1. The document file

2. The dictionary

3. The inversion lists (posting lists)

The inverted file is based on the methodology of storing an inversion of the documents: for each word, a list of the documents in which the word is found is stored (the inversion of the documents). Each document is given a unique numerical identifier that is stored in the inversion list. The dictionary is used to locate the inversion list for a particular word; it is a sorted list of the processing tokens in the system, each with a pointer to the location of its inversion list. The dictionary can also store other information used in query optimization, such as the length of the inversion lists.

 Use zoning to improve precision and restrict entries.
 The inversion list consists of the document identifier for each document in which the word is found.
 Ex: "bit" 1(10), 1(12), 1(18) means the word "bit" occurs at positions 10, 12, and 18 of document #1.
 When a search is performed, the inversion lists for the terms in the query are located and the appropriate logic is applied between the inversion lists.
 Weights can also be stored in the inversion list.
 Inversion lists are used to store concepts and their relationships.
 Words with special characteristics can be stored in their own dictionary. Ex: dates, which require date ranging, and numbers.
 Systems that support ranking re-organize the inversion lists in ranked order.
 B-trees can also be used for the inversion instead of a dictionary.
 The inversion lists may be at the leaf level or referenced in higher-level pointers.
 A B-tree of order m is defined as:
 A root node with between 2 and 2m keys
 All other internal nodes have between m and 2m keys
 All keys are kept in order from smaller to larger
 All leaves are at the same level or differ by at most one level
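The dictionary-plus-inversion-list organization above can be sketched as follows; the sample documents and the Boolean AND helper are illustrative assumptions:

```python
from collections import defaultdict

# Minimal inverted file sketch: a dictionary maps each processing token to
# its inversion (posting) list of document identifiers; Boolean AND between
# query terms is the intersection of their posting lists.
docs = {
    1: "search engines index documents",
    2: "inverted files support search",
    3: "documents are parsed and indexed",
}

inverted = defaultdict(set)          # term -> set of document ids
for doc_id, text in docs.items():
    for term in text.lower().split():
        inverted[term].add(doc_id)

def boolean_and(*terms):
    lists = [inverted[t] for t in terms]
    return sorted(set.intersection(*lists)) if lists else []

print(boolean_and("search", "documents"))   # → [1]
print(boolean_and("documents"))             # → [1, 3]
```

A real system would keep positions (and possibly weights) in each posting, as the zoning and weighting bullets above describe.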
N-GRAM DATA STRUCTURE

N-grams can be viewed as a special technique for conflation (stemming) and as a unique data structure in information systems.

Unlike stemming, which generally tries to determine the stem of a word that represents the semantic meaning of the word, n-grams do not care about semantics.

The text is transformed into overlapping n-grams, which are then used to create the searchable database.

For example, the phrase "sea colony" yields:

se ea co ol lo on ny - Bigrams (no interword symbols)

sea col olo lon ony - Trigrams (no interword symbols)

#se sea ea# #co col olo lon ony ny# - Trigrams (with interword symbol #)

#sea# #colo colon olony lony# - Pentagrams (with interword symbol #)

The symbol # is used to represent the interword symbol, which is any one of a set of symbols (e.g., blank, period, semicolon, colon, etc.).

Each of the n-grams created becomes a separate processing token and is searchable.

The same n-gram can be created multiple times from a single word.
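A minimal sketch of n-gram generation with an interword symbol, reproducing the "sea colony" example above (the function name and the interior-'#' filter are illustrative choices):

```python
# Generate overlapping character n-grams with an interword symbol '#'.
# N-grams with '#' in an interior position are dropped, so '#' only marks
# word starts and ends, as in the "sea colony" example; dropping the
# padding altogether would give the no-interword-symbol variant.
def ngrams(text, n, interword="#"):
    padded = interword + text.replace(" ", interword) + interword
    grams = [padded[i:i + n] for i in range(len(padded) - n + 1)]
    return [g for g in grams if interword not in g[1:-1]]

print(ngrams("sea colony", 3))
# → ['#se', 'sea', 'ea#', '#co', 'col', 'olo', 'lon', 'ony', 'ny#']
print(ngrams("sea colony", 5))
# → ['#sea#', '#colo', 'colon', 'olony', 'lony#']
```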

Uses:

 Bigrams can be used for conflating terms.

 N-grams can be used for detecting potential erroneous words.

Spelling errors:

Zamora showed that trigram analysis provides a viable data structure for identifying misspellings and transposed characters. This is useful for correction as a procedure within the normalization process.

 N-gram patterns can also be used for identifying the language of an item.

 N-grams have also been used in the Selective Dissemination of Information.

Advantage:

They place a finite limit on the number of searchable tokens: MaxSeg_n = (λ)^n is the maximum number of unique n-grams that can be generated, where n is the length of the n-gram and λ is the number of processable symbols.

Disadvantage: the longer the n-gram, the larger the inversion lists become. Performance has been reported at about 85% precision.
Signature files and hypertext structures represent two powerful methods for information
storage and retrieval, each with its own applications, advantages, and approaches. Here's an
elaboration on each concept along with clustering techniques in data management.

Signature Files

A signature file structure offers an efficient way to store and retrieve data by creating fixed-
length signatures for items based on the presence of specific words. Here’s a closer look at
the structure and benefits of this system:

1. Structure and Creation:


- Words within an item are encoded into a binary "word signature" using a hash function.
This word signature has a fixed length and specific bits set to "1", with the number of "1"s
controlled to maintain a manageable density of information.
- These individual word signatures are then combined (using bitwise OR operations) to
create a single item signature. To avoid creating overly dense signatures that contain too
many "1"s, which could make the search process less precise, items are partitioned into
blocks, each with a defined number of words (e.g., a block size of five words).
- Each block is assigned a 16-bit code, allowing search and comparison based on template
matching of bit positions, which is highly efficient for filtering out irrelevant data.
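The word-signature / block-signature scheme described above can be sketched with superimposed coding. The signature width, the bits-per-word count, and the use of MD5 as the hash are illustrative assumptions, not values from the source:

```python
import hashlib

# Superimposed coding sketch: each word hashes to a fixed-width signature
# with a few bits set; word signatures in a block are OR-ed into the block
# signature. A query word MAY be in the block if all its bits are present
# (false drops are possible; a miss is definitive).
SIG_BITS = 16        # illustrative signature width
BITS_PER_WORD = 3    # illustrative bits set per word

def word_signature(word):
    digest = hashlib.md5(word.encode()).digest()
    sig = 0
    for i in range(BITS_PER_WORD):
        sig |= 1 << (digest[i] % SIG_BITS)   # set a bit per hash byte
    return sig

def block_signature(words):
    sig = 0
    for w in words:
        sig |= word_signature(w)             # bitwise OR of word signatures
    return sig

block = block_signature(["search", "engine", "index", "query", "file"])
probe = word_signature("search")
print((block & probe) == probe)   # True: "search" may be in the block
```

Searching compares the query signature against each block signature with a bitwise AND, filtering out blocks that cannot contain the word before any full-text check.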

2. Applications in Parallel Processing and Distributed Environments:


- In parallel processing, signature files allow multiple processors to work on separate blocks
or signatures concurrently, speeding up search and retrieval in large datasets.
- In distributed environments, signature files reduce the amount of data transfer needed for
searching by allowing comparisons based on compact signatures rather than full records,
conserving network bandwidth and improving efficiency.

3. Advantages:
- Efficient Information Retrieval: Signature files provide a fast and space-efficient way to
retrieve data, especially in databases where terms appear infrequently.
- Versatile Application: They are well-suited for medium-sized databases, WORM (write-
once-read-many) devices, and scenarios requiring fast, parallel data access.
- Scalability: Signature files support distributed storage, where each node can store and
process a part of the data independently, making it ideal for distributed environments.

Hypertext and XML Data Structures

With the rise of the internet, hypertext and XML have introduced new ways of structuring
information that prioritize connectivity and detailed data representation:

1. Hypertext:
- Hypertext structures store information in formats like HTML, where documents contain
embedded pointers (links) that reference other documents or locations within the same
document.
- This web of interconnected items creates a non-linear way to browse information, which
is foundational to the web's structure. Hypertext is well-suited for information that needs
frequent updates and interactive features.

2. XML (Extensible Markup Language):


- XML structures data with a more detailed and hierarchical approach compared to HTML,
enabling not just presentation but also data exchange and storage. XML uses schemas like
DTD (Document Type Definition) and DOM (Document Object Model) to define its
structure and enforce standards, making it ideal for complex data structures.
- XML is instrumental in applications requiring structured data interchange, like APIs, and
is widely used in web services due to its adaptability and precision.

3. Advantages:
- Detailed and Interactive Data Representation: Both HTML and XML allow for intricate
data structures, with XML supporting structured data exchange and HTML providing a
visually formatted user interface.
- Flexible and Extensible: XML is designed to represent various data types and structures,
making it versatile for different applications in web development and data management.

Document and Term Clustering

Clustering is a technique used in information retrieval to improve search efficiency and accuracy. It operates in two main forms:

1. Term Clustering:
- Involves grouping index terms into clusters to create a statistical thesaurus. This process
allows search systems to identify and relate terms that often appear together, thus expanding
searches with related terms, which can significantly increase recall.
- For example, a search for "machine learning" could also return documents containing
terms like "artificial intelligence" and "data science" due to their statistical correlation.

2. Document Clustering:
- Organizes documents into clusters based on shared characteristics, which helps users
retrieve groups of similar documents in response to a query. This technique enables users to
find related items even when the query might not exactly match the desired items.
- A search could retrieve items similar to the user’s query, improving the chances of finding
relevant documents that a traditional search might miss due to keyword differences.
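A sketch of document clustering by term-vector similarity. The single-pass strategy, the 0.4 threshold, and the sample documents are illustrative assumptions; real systems use more sophisticated weighting and clustering methods:

```python
import math
from collections import Counter

# Documents become term-frequency vectors; cosine similarity against a
# cluster representative decides membership (single-pass clustering).
def cosine(a, b):
    terms = set(a) | set(b)
    dot = sum(a[t] * b[t] for t in terms)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def single_pass_cluster(docs, threshold=0.4):
    clusters = []                    # each cluster: list of (doc_id, vector)
    for doc_id, text in docs.items():
        vec = Counter(text.lower().split())
        for cluster in clusters:
            # Compare against the cluster's first document as representative.
            if cosine(vec, cluster[0][1]) >= threshold:
                cluster.append((doc_id, vec))
                break
        else:
            clusters.append([(doc_id, vec)])
    return [[doc_id for doc_id, _ in c] for c in clusters]

docs = {
    "d1": "machine learning and data science",
    "d2": "machine learning for data analysis",
    "d3": "library cataloging and indexing",
}
print(single_pass_cluster(docs))   # → [['d1', 'd2'], ['d3']]
```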

3. Considerations in Clustering:
- Clustering can introduce noise by including loosely related terms or documents, so careful
use of clustering methods and tuning is essential to avoid irrelevant results and maximize
search precision.

Each of these techniques—signature files, hypertext and XML structures, and clustering—
offers distinct benefits in data management. Signature files support efficient retrieval in
distributed systems, hypertext and XML provide flexible data representation for the web, and
clustering enhances search relevance. Together, these strategies contribute to building robust
information storage and retrieval systems.
Application(s)/Advantage(s)
• Signature files provide a practical solution for storing and locating information in a number of different situations.
• Signature files have been applied to medium-size databases, databases with a low frequency of terms, WORM devices, parallel processing machines, and distributed environments.

HYPERTEXT AND XML DATA STRUCTURES

The advent of the Internet and its exponential growth and wide acceptance as a new global information network has introduced new mechanisms for representing information.
This structure is called hypertext and differs from traditional information storage data structures in format and use.
Hypertext is stored in HTML and XML formats.
Both of these languages provide detailed descriptions for subsets of text, similar to zoning.
Hypertext allows one item to reference another item via an embedded pointer.
HTML defines the internal structure for information exchange over the WWW on the Internet.
XML: defined by DTD, DOM, XSL, etc.

Document and term clustering

Two types of clustering:


1) clustering index terms to create a statistical thesaurus, and
2) clustering items to create document clusters. In the first case clustering is used to increase recall by expanding searches with related terms. In document clustering the search can retrieve items similar to an item of interest, even if the query would not have retrieved the item. The clustering process is not precise, and care must be taken in the use of clustering techniques to minimize the negative impact misuse can have.
