IRS1part 2

This document outlines the capabilities of Information Retrieval Systems, focusing on search and browse functions that help users locate relevant information. It discusses various search techniques, including Boolean logic, natural language processing, and proximity searches, as well as browse capabilities that assist users in filtering and selecting results. Additionally, it highlights miscellaneous features such as vocabulary browsing and iterative search to enhance user experience.

Information Retrieval System Capabilities

 This chapter discusses the major functions that are available in an Information
Retrieval System.
 Search and Browse capabilities are crucial to assist the user in locating
relevant items.
 The search capabilities address both Boolean and Natural Language queries.
 The algorithms used for searching are called Boolean, natural language
processing and probabilistic.
 The majority of existing commercial systems are based upon Boolean query
and search capabilities.
 The newer systems such as TOPIC, RetrievalWare and INQUERY all allow for
natural language queries.
 Because of the imprecise nature of search algorithms, Browse functions are needed to
assist the user in filtering the search results to find relevant information.
Search Capabilities
 The objective of the search capability is to allow for a mapping between a user’s
specified need and the items in the information database that will answer that need.
 The search query statement is the means that the user employs to communicate a
description of needed information to the system.
 It can consist of natural language text in composition style and/or query terms with
Boolean logic indicators between them.
 One concept that has occasionally been implemented in commercial systems, and
holds significant potential for assisting in the location and ranking of relevant
items, is the weighting of search terms.
 Consider the following natural language query statement, where the importance of a
particular search term is indicated by a value in parentheses between 0.0 and 1.0,
with 1.0 being the most important:
“Find articles that discuss automobile emission(0.9) or sulphur dioxide(0.3) on
the farming industry”
 The following functions define the relationship between the search statement
and the interpretation of a particular word:
1. Boolean Logic
2. Proximity
3. Contiguous Word Phrases
4. Fuzzy Search
5. Term Masking
6. Numeric and Date Ranges
7. Concept/Thesaurus Expansion
8. Natural Language Queries
9. Multimedia Queries
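The weighted query shown earlier (0.9 for automobile emission, 0.3 for sulphur dioxide) can be illustrated with a small sketch. The scoring rule used here (matched weight divided by total weight), the function name `weighted_score` and the sample items are assumptions made for illustration; the slides do not say how a system should combine the weights.

```python
import re

def weighted_score(item_text, term_weights):
    """Score one item as matched weight / total weight (an assumed rule)."""
    words = set(re.findall(r"[a-z]+", item_text.lower()))
    total = sum(term_weights.values())
    # A multi-word term counts as matched when all of its words occur
    # somewhere in the item (not necessarily next to each other).
    matched = sum(
        weight
        for term, weight in term_weights.items()
        if all(word in words for word in term.lower().split())
    )
    return matched / total if total else 0.0

# Weights come from the example query; the items are made up.
query = {"automobile emission": 0.9, "sulphur dioxide": 0.3}
items = {
    "A": "Automobile emission limits and the farming industry",
    "B": "Sulphur dioxide damage to crops",
    "C": "Farming subsidies in 1995",
}
ranked = sorted(items, key=lambda k: weighted_score(items[k], query), reverse=True)
print(ranked)
```

With these assumptions, an item that matches only the heavily weighted term is ranked above an item that matches only the lightly weighted one.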
Boolean Logic
 Boolean Logic allows a user to logically relate multiple concepts together to
define what information is needed.
 Typically the Boolean functions apply to processing tokens identified
anywhere within an item.
 The typical Boolean operators are AND, OR and NOT. These operations are
implemented using set intersection, set union and set difference procedures.
 Parentheses placed around portions of the search statement specify
the order of Boolean operations. If no precedence is specified, the operators
are processed from left to right.
 A special type of Boolean search is called “M of N” logic. The user lists a set
of N possible search terms and identifies, as acceptable, any item that contains
a specified number (M) of those terms.
 Most Information Retrieval Systems support Boolean operations as well as
natural language interfaces.
Use of Boolean Logic

Search Statement                           System Operation
COMPUTER OR PROCESSOR NOT MAINFRAME        Select all items discussing Computers or Processors but not Mainframes
COMPUTER OR (PROCESSOR NOT MAINFRAME)      Select all items discussing Computers or items that discuss Processors and do not discuss Mainframes
COMPUTER AND NOT PROCESSOR OR MAINFRAME    Select all items that discuss Computers and not Processors or Mainframes in the item
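The set-based implementation of AND, OR and NOT described above can be sketched with Python sets. The tiny index below (term mapped to the set of item numbers containing it) is invented for illustration, and reading “M of N” as “at least M of the N terms” is an assumption.

```python
# Hypothetical inverted index: term -> set of item ids containing the term.
index = {
    "computer": {1, 2, 5},
    "processor": {2, 3, 4},
    "mainframe": {4, 5},
}

def postings(term):
    """Return the set of items containing the term (empty set if unknown)."""
    return index.get(term.lower(), set())

# COMPUTER OR (PROCESSOR NOT MAINFRAME): the parentheses are evaluated first.
explicit = postings("computer") | (postings("processor") - postings("mainframe"))

# COMPUTER OR PROCESSOR NOT MAINFRAME, processed strictly left to right.
left_to_right = (postings("computer") | postings("processor")) - postings("mainframe")

def m_of_n(terms, m):
    """Items containing at least m of the listed terms (assumed reading)."""
    counts = {}
    for term in terms:
        for item in postings(term):
            counts[item] = counts.get(item, 0) + 1
    return {item for item, count in counts.items() if count >= m}

print(sorted(explicit))       # OR = union, NOT = difference
print(sorted(left_to_right))
print(sorted(m_of_n(["computer", "processor", "mainframe"], 2)))
```

The two statements return different sets from the same index, which is why parentheses matter when OR and NOT are mixed.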
Proximity
 Proximity is used to restrict the distance allowed within an item between two
search terms.
 The semantic concept is that the closer two terms are found in a text, the
more likely they are related in the description of a particular concept.
 Proximity is used to increase the precision of a search.
 The typical format for proximity is:
TERM1 within “m” “units” of TERM2
where m is an integer number and units are characters, words, sentences
or paragraphs.
 Sometimes the proximity relationship contains a direction operator
indicating whether the second term must be found before or after the first term
within the number of units specified.
 A special case of the Proximity operator is the Adjacent (ADJ) operator, which
normally has a distance of one and a forward direction only.
Use of Proximity

Search Statement                                 System Operation
“Venetian” ADJ “Blind”                           Select all items that mention a Venetian Blind but not items discussing Blind Venetian
“United” within five words of “American”         Select items such as “United States and American interests”, “United Airlines and American Airlines” etc. but not “United States of America”
“Nuclear” within zero paragraphs of “emission”   Select all items that have “nuclear” and “emission” in the same paragraph.
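A word-level version of the proximity operator can be sketched as follows. The function names `within` and `adj`, the simple tokenizer and the choice of words as the only unit are assumptions for illustration; a real system would also support characters, sentences and paragraphs.

```python
import re

def tokens(text):
    """Lower-cased word tokens; punctuation is dropped (a simplification)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def within(text, term1, term2, m, ordered=False):
    """True if term2 occurs within m words of term1.

    With ordered=True, term2 must come after term1 (forward direction only).
    """
    toks = tokens(text)
    first = [i for i, tok in enumerate(toks) if tok == term1.lower()]
    second = [i for i, tok in enumerate(toks) if tok == term2.lower()]
    for i in first:
        for j in second:
            distance = j - i
            if ordered and 0 < distance <= m:
                return True
            if not ordered and 0 < abs(distance) <= m:
                return True
    return False

def adj(text, term1, term2):
    """ADJ: distance of one, forward direction only."""
    return within(text, term1, term2, 1, ordered=True)

print(adj("a Venetian blind in the window", "Venetian", "Blind"))
print(adj("a blind Venetian", "Venetian", "Blind"))
print(within("United States and American interests", "United", "American", 5))
```

Note that “United States of America” fails the second table row simply because the token “american” never occurs in it.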
Contiguous Word Phrases
 A Contiguous Word Phrase (CWP) is both a way of specifying a query and a
special search operator.
 A Contiguous Word Phrase is two or more words that are treated as a single
semantic unit.
 An example of a CWP is “United States of America”. It is four words that specify
a search term representing a single semantic concept (a country) that can be
used with Boolean operators or proximity.
 Thus a query could specify “manufacturing” AND “United States of America”,
which returns any item that contains the word “manufacturing” and the
contiguous words “United States of America”.
 Contiguous Word Phrases are called Literal Strings in WAIS and Exact
Phrases in RetrievalWare. In WAIS, multiple adjacency (ADJ) operators are used
to define a Literal String.
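As a rough sketch, a Contiguous Word Phrase can be checked by looking for the phrase's words as one unbroken run of tokens, which is equivalent to chaining ADJ between each pair of neighbouring words. The helper name `contains_phrase` and the tokenizer are hypothetical.

```python
import re

def words(text):
    """Lower-cased word tokens; punctuation is dropped (a simplification)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def contains_phrase(text, phrase):
    """True if the words of phrase appear contiguously and in order in text."""
    toks, wanted = words(text), words(phrase)
    n = len(wanted)
    if n == 0:
        return False
    return any(toks[i:i + n] == wanted for i in range(len(toks) - n + 1))

def query(text):
    """“manufacturing” AND “United States of America”."""
    return "manufacturing" in words(text) and contains_phrase(
        text, "United States of America"
    )

print(query("Manufacturing output in the United States of America rose."))
print(query("Manufacturing in the States, which are United, of America."))
```

The second item contains all four words but not contiguously, so it is rejected.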
Fuzzy Searches
 Fuzzy Searches provide the capability to locate spellings of words that are
similar to the entered search term.
 This function is primarily used to compensate for errors in the spelling of words.
 Fuzzy searching increases recall at the expense of decreasing precision.
 In the process of expanding a query term, fuzzy searching includes other terms
that have similar spellings, giving more weight to words in the database that
have similar word lengths and character positions to the entered term.
 A Fuzzy Search on the term “computer” would automatically include the
following terms: “computer”, “compiter”, “conputer”, “computter” and
“compute”.
 Fuzzy searching has its maximum utilization in systems that accept items that
have been Optical Character Read (OCRed). The OCR process is a pattern
recognition process that segments the scanned-in image into meaningful
subregions, often treating the area that defines a single character as a segment.
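One common way to find similar spellings is edit distance (the number of single-character insertions, deletions and substitutions). The sketch below uses it as a stand-in; the slides do not name a specific similarity measure, and the vocabulary and the distance threshold of 1 are assumptions.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = cur
    return prev[-1]

def fuzzy_matches(term, vocabulary, max_distance=1):
    """Vocabulary words within max_distance edits of term."""
    return [w for w in vocabulary if edit_distance(term, w) <= max_distance]

# Hypothetical database vocabulary, including OCR-style misspellings.
vocabulary = ["computer", "compiter", "conputer", "computter",
              "compute", "commuter", "printer"]
print(fuzzy_matches("computer", vocabulary))
```

The unrelated word “commuter” is also one edit away from “computer”, which illustrates how the gain in recall can cost precision.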
Term Masking
 Term masking is the ability to expand a query term by masking a portion of
the term and accepting as valid any processing token that maps to the unmasked
portion of the term.
 There are two types of masking: fixed length and variable length.
 Fixed length masking is a single position mask. It masks out any symbol in a
particular position, or the lack of that position, in a word.
 Variable length masking allows masking of any number of characters within a
processing token. The masking may be at the front, at the end, at both front
and end, or embedded. These are characterized as suffix search, prefix search
and embedded character string search.
Use of Term Masking

Search Statement    System Operation
multi$national      Matches “multi-national”, “multiynational”, “multinational” but not “multi national”
*computer           Matches “minicomputer”, “microcomputer”, “computer”
comput*             Matches “computers”, “computing”, “computes”
*comput*            Matches “microcomputers”, “minicomputing”, “compute”
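The mask notation in the table can be sketched by translating each mask into a regular expression. The translation rules are assumptions inferred from the table rows: `$` becomes “zero or one non-space character” and `*` becomes “any run of word characters”. The function names are hypothetical.

```python
import re

def mask_to_regex(mask):
    """Translate a masked term into a compiled, case-insensitive regex."""
    parts = []
    for ch in mask:
        if ch == "$":
            parts.append(r"\S?")   # fixed length: one symbol or none
        elif ch == "*":
            parts.append(r"\w*")   # variable length: any number of characters
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.IGNORECASE)

def matches(mask, token):
    """True if the whole token fits the mask."""
    return mask_to_regex(mask).fullmatch(token) is not None

for token in ["multi-national", "multiynational", "multinational", "multi national"]:
    print(token, matches("multi$national", token))
print(matches("*comput*", "minicomputing"))
```

Because `\S?` cannot match a space, “multi national” is rejected while the hyphenated and run-together forms are accepted.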
Numeric and Date Ranges
 Term masking is useful when applied to words, but does not work for finding
ranges of numbers or numeric dates.
 To find numbers larger than “125”, using the term “125*” will not find any
number except those that begin with “125”.
 As part of their normalization process, systems characterize words as
numbers or dates.
 A token can be interpreted as a date if its numbers are separated by ‘/’ or ‘-’
and fall within predefined ranges.
 A user could enter anything from inclusive ranges to infinite (open-ended)
ranges as part of a query, using <, >, <=, >=, etc.
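A minimal sketch of range matching: each token is first normalized to a number or a date, then compared against the bounds. The date layout (month/day/four-digit year), the helper names and the inclusive treatment of both bounds are assumptions.

```python
from datetime import date, datetime

def parse_token(token):
    """Normalize a token to a date or an int; return None for plain words."""
    for fmt in ("%m/%d/%Y", "%m-%d-%Y"):   # assumed date layouts
        try:
            return datetime.strptime(token, fmt).date()
        except ValueError:
            pass
    try:
        return int(token)
    except ValueError:
        return None

def in_range(token, low=None, high=None):
    """True if the token's value lies in [low, high]; None means unbounded."""
    value = parse_token(token)
    if value is None:
        return False
    for bound in (low, high):
        if bound is not None and type(bound) is not type(value):
            return False   # never compare a number with a date
    return (low is None or value >= low) and (high is None or value <= high)

print(in_range("130", low=125, high=425))                      # inclusive range
print(in_range("5000", low=126))                               # open-ended, >125
print(in_range("4/2/1994", date(1993, 4, 2), date(1995, 5, 2)))
```

Unlike the mask “125*”, the open-ended range accepts “5000” and “130” as well as “1250”.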
Concept/Thesaurus Expansion
 Associated with both Boolean and Natural Language Queries is the ability to
expand the search terms via a Thesaurus or Concept Class database reference
tool.
 A Thesaurus is typically a one-level or two-level expansion of a term to other
terms that are similar in meaning.
 A Thesaurus is either semantic or based upon statistics.
 A Concept Class is a tree structure that expands each meaning of a word into
potential concepts that are related to the initial term.
 Systems such as RetrievalWare and TOPIC provide them as a part of the
search system.
 Examples of a thesaurus and a concept class are shown in the following figures.
 Thesaurus for term “COMPUTER”

[Figure: the term COMPUTER linked to the related terms DATA PROCESSOR, MULTITASKING, MAINFRAME, PC and MINICOMPUTER]

 Concept Class Structure for term “COMPUTER”

COMPUTER
    COMPUTER HARDWARE
        MINICOMPUTER
        PERIPHERAL
    COMPUTER SOFTWARE
        OPERATING SYSTEM
        APPLICATION


• The problem with thesauri is that they are generic to a
language and can introduce many search terms that are
not found in the document database.
• In theory, thesauri and concept trees can be used to
expand a search statement, but expanding the terms
increases recall with a possible decrease in precision.
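Thesaurus expansion can be sketched as a dictionary lookup applied for one or two levels, with the results OR-ed into the query. The entries under “computer” follow the thesaurus figure; the “pc” entry and the function name `expand` are made up to show a second level.

```python
# Hypothetical thesaurus: term -> terms that are similar in meaning.
THESAURUS = {
    "computer": ["data processor", "mainframe", "pc", "minicomputer"],
    "pc": ["desktop"],   # invented entry, used only to show level two
}

def expand(term, levels=1):
    """Return the term plus its thesaurus neighbours up to the given depth."""
    expanded = {term}
    frontier = {term}
    for _ in range(levels):
        frontier = {
            synonym
            for current in frontier
            for synonym in THESAURUS.get(current, [])
        } - expanded
        expanded |= frontier
    return expanded

one_level = expand("computer")
two_level = expand("computer", levels=2)
print(" OR ".join(sorted(one_level)))
print(sorted(two_level - one_level))
```

Every added term is another chance to match an item, which raises recall; it is also another chance to match an item the user did not want.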
Natural Language Queries
 Rather than having the user enter a specific Boolean query by specifying
search terms and logic between them, natural language queries allow a user
to enter a prose statement that describes the information that the user wants
to find.
 The longer the prose, the more accurate the results returned. The most
difficult logic case associated with natural language queries is the ability to
specify negation in the search statement and have the system recognize it
as negation.
 To accommodate the negation function and provide users with a transition to
the natural language systems, most commercial systems have a user interface
that provides both a natural language and a Boolean logic capability.
Multimedia Queries
 The user interface becomes far more complex with the introduction of
multimedia items.
 All of the previous discussions still apply for search of the textual portions of a
multimedia database.
 But in addition, the user has to be able to specify search terms for the other
modalities.
 The correlation between different parts of the query against different
modalities is usually based on time or location.
 For example, if a video news program has been indexed, the user could have
access to the scene changes, the transcribed audio, the closed captioning and
the index terms that a user has assigned while displaying the video.
Browse Capabilities
 Once the search is complete, Browse capabilities provide the user with the capability
to determine which items are of interest and select those to be displayed.
 There are two ways of displaying a summary of the items that are associated with a
query: line item status and data visualization.
 From these summary displays, the user can select the specific items and zones
within the items for display.
 The system also allows for easy transitioning between summary displays and review
of specific items.
 If searches resulted in high precision, then the importance of browse capabilities
would be lessened.
 Since searches return many items that are not relevant to the user’s information
need, browse capabilities can assist the user in focusing on items that meet that
need.
 The following are the browse capabilities:
1. Ranking
2. Zoning
3. Highlighting
Ranking:
 Under Boolean systems the status display is a count of the number of items
found by the query. Every one of the items meets all aspects of the Boolean
query.
 With the introduction of ranking, based upon predicted relevance values, the
status summary displays the relevance score associated with the item along
with a brief descriptor of the item.
 The relevance score is the estimate of the search system on how closely the
item satisfies the search statement.
 Typically relevance scores are normalized to a value between 0.0 and 1.0.
 The highest value of 1.0 is interpreted to mean that the system is sure that the
item is relevant to the search statement. Every item in the system could be returned,
but many of the items will have a relevance value of 0.0.
 In addition to ranking based upon the characteristics of the item and the
database, in many circumstances collaborative filtering provides an option
for selecting and ordering output.
 In this case, users provide feedback to the system, when reviewing items,
on the relative value of the item being accessed.
 The system accumulates the various user rankings and uses this information
to order the output for other user queries that are similar.
 Collaborative filtering has been very successful in sites such as AMAZON.COM,
MOVIEFINDER.COM and CDNow.com in deciding what products to display to
users based upon their queries.
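The status summary with normalized relevance scores can be sketched as below. The raw score (number of query-term occurrences in the item) and the normalization by the best raw score are assumptions chosen for brevity, so a score of 1.0 here only means “best match in this result set”, not certainty of relevance.

```python
import re

def rank(query, items):
    """Return (item_id, score) pairs, highest score first, scores in [0, 1]."""
    query_terms = set(re.findall(r"[a-z0-9]+", query.lower()))
    raw = {}
    for item_id, text in items.items():
        item_tokens = re.findall(r"[a-z0-9]+", text.lower())
        raw[item_id] = sum(1 for tok in item_tokens if tok in query_terms)
    top = max(raw.values(), default=0)
    scores = {i: (r / top if top else 0.0) for i, r in raw.items()}
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)

# Made-up items for illustration.
items = {
    "a": "nuclear emission and nuclear safety",
    "b": "emission rules",
    "c": "farming news",
}
for item_id, score in rank("nuclear emission", items):
    print(f"{score:.2f}  {item_id}")
```

Item “c” is still returned, with a score of 0.0, mirroring the point that every item could appear in a ranked list.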
Zoning:
 When the user displays a particular item, the objective of minimization of
overhead still applies.
 The user wants to see the minimum information needed to determine if the
item is relevant.
 Once the determination is made that an item is possibly relevant, the user wants
to display the complete item for detailed review.
 For example, display of the Title and Abstract may be sufficient information for
a user to predict the potential relevance of an item.
 Limiting the display of each item to these two zones allows multiple items to
be displayed on a single display screen.
 Related to zoning, as a way of minimizing what an end user needs to retrieve from a hit
item, is the idea of locality and passage based search and retrieval.
 In this case the basic search unit is not the complete item, but an algorithmically defined
subdivision of the item. This has been known as passage retrieval, where the item is
divided into uniform-sized passages that are indexed, and locality based retrieval, where
passage boundaries are dynamic.
Highlighting
 Another display aid is an indication of why an item was selected. This indication,
frequently highlighting, lets the user quickly focus on the potentially relevant parts
of the text to scan for item relevance.
 Different strengths of highlighting indicate how strongly the highlighted word
participated in the selection of the item.
 Most systems allow the display of an item to begin with the first highlight
within the item and allow subsequent jumping to the next highlight.
 Another capability, which is gaining strong acceptance, is for the system to
determine the passage in the document most relevant to the query and
position the browse to start at that passage.
 The highlighting may vary by introducing colours and intensities to indicate
the relative importance of a particular word in the item in the decision to
retrieve the item.
 The term being highlighted that caused a particular item to be returned may
not have direct or obvious mapping to any of the search terms entered.
 Information visualization appears to be a better display process to assist in
helping the user formulate his query than highlights in items.
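Highlighting and “start the display at the first highlight” can be sketched with a regular expression over the item text. The `**` marker, the function names and the whole-word matching rule are assumptions; a real display would use colour or intensity instead of text markers.

```python
import re

def _pattern(terms):
    """Regex matching any of the terms as whole words, ignoring case."""
    alternatives = "|".join(re.escape(term) for term in terms)
    return re.compile(r"\b(" + alternatives + r")\b", re.IGNORECASE)

def highlight(text, terms, mark="**"):
    """Wrap every occurrence of a search term in the marker."""
    if not terms:
        return text
    return _pattern(terms).sub(lambda m: f"{mark}{m.group(0)}{mark}", text)

def first_highlight_offset(text, terms):
    """Character offset of the first highlighted term, or None if no hit."""
    if not terms:
        return None
    match = _pattern(terms).search(text)
    return match.start() if match else None

text = "Limits on nuclear emission"
print(highlight(text, ["nuclear", "emission"]))
print(first_highlight_offset(text, ["nuclear", "emission"]))
```

The offset returned by `first_highlight_offset` is where a browser could position the display so the user lands on the first hit.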
Miscellaneous Capabilities

 There are many additional functions that facilitate the user’s ability to input
queries, reducing the time it takes to generate the queries, and reducing a
priori the probability of entering a poor query.
 The following are the miscellaneous capabilities:
• Vocabulary Browse
• Iterative Search and Search History Log
• Canned Query
 Vocabulary Browse
• It provides the capability to display, in alphabetically sorted order, words
from the document database.
• Logically all unique words in the database are kept in sorted order
along with a count of the number of unique items in which the word
is found.

 Iterative Search and Search History Log


• Iterative Search is the process of refining the results of a previous
search to focus on relevant items.
• The search history log is the capability to display all the previous
searches that were executed during the current session.
 Canned Queries
• The capability to name a query and store it to be retrieved and
executed during a later user session is called canned or stored queries.
• Users tend to have areas of interest within which they execute their
searches on a regular basis.
• A canned query allows a user to create and refine a search that
focuses on the user’s general area of interest and then retrieve it to
add additional search criteria to retrieve data that is currently
needed.
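The Vocabulary Browse capability described above (unique words kept in sorted order with a count of the items containing each) can be sketched as follows. The function names, the prefix-based browse and the sample items are assumptions for illustration.

```python
import re
from collections import defaultdict

def vocabulary(items):
    """Return [(word, number_of_items_containing_it)] in alphabetical order."""
    item_ids = defaultdict(set)
    for item_id, text in items.items():
        for word in set(re.findall(r"[a-z]+", text.lower())):
            item_ids[word].add(item_id)
    return [(word, len(ids)) for word, ids in sorted(item_ids.items())]

def browse(vocab, prefix):
    """Show the slice of the vocabulary that starts with the typed prefix."""
    return [(word, count) for word, count in vocab if word.startswith(prefix)]

# Made-up items; "compiter" stands in for a misspelling in the database.
items = {
    1: "computer compute computer",
    2: "computer compiler",
    3: "compiter",
}
vocab = vocabulary(items)
print(browse(vocab, "comp"))
```

Seeing “compiter” listed with a count of 1 next to “computer” lets a user spot a likely misspelling before entering the query.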
Z39.50 and WAIS Standards

 Z39.50
• It is the Information Retrieval Application Service Definition and
Protocol Specification. Z39.50 was first approved in 1988, and a revised version was approved in 1992.
• It is a standard for Information Retrieval Systems approved by the American
National Standards Institute.
• It is a computer-to-computer communications standard for database
searching and record retrieval.
• The standard describes eight operation types: initialization, search,
present, delete, scan, sort, resource-report and extended services.
• There are six types of queries: Types 0, 1, 2, 100, 101 and 102. Type 101
allows for proximity, whereas Type 102 is a ranked list query.
• Z39.50 not only supports retrieval between origin and target systems, it also
structures the semantics of the search query, the sequence of message
exchange and the mechanism for returning records.
 Wide Area Information Servers (WAIS)
• It is the second de facto standard for many search environments on the
Internet.
• WAIS was developed by a project started by three commercial
companies (Apple, Thinking Machines and Dow Jones).
• The original idea was to create a program that would act as a personal
librarian.
• It would act like a personal agent keeping track of significant amounts of
data and filtering it for the information most relevant to the user.
• The interface concept was user-entered natural language statements of
the topics in which the user had an interest. In addition, it provided the capability to
indicate to the computer that a particular item was of interest and have the computer
automatically find relevant items.
