IRS1part 2
This chapter discusses the major functions that are available in an Information
Retrieval System.
Search and Browse capabilities are crucial to assist the user in locating
relevant items.
The search capabilities address both Boolean and Natural Language queries.
The algorithms used for searching are called Boolean, natural language
processing and probabilistic.
The majority of existing commercial systems are based upon Boolean query
and search capabilities.
The newer systems such as TOPIC, RetrievalWare and INQUERY all allow for
natural language queries.
Because of the imprecise nature of search algorithms, Browse functions are needed to
assist the user in filtering the search results to find relevant information.
Search Capabilities
The objective of the search capability is to allow for a mapping between a user’s
specified need and the items in the information database that will answer that need.
The search query statement is the means that the user employs to communicate a
description of the needed information to the system. A capability that holds
significant potential for assisting in the location and ranking of relevant
items is the weighting of search terms.
Consider the following natural language query statement, where the importance of a
particular search term is indicated by a value in parentheses between 0.0 and 1.0,
with 1.0 being the most important:
“Find articles that discuss automobile emission(0.9) or sulphur dioxide(0.3) on
the farming industry”
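One way weighted terms could influence ranking is sketched below. The scoring rule (sum the weights of the query terms found in an item), the sample items and the name `score_item` are assumptions made for illustration only; actual systems apply their own weighting formulas.

```python
# Hypothetical sketch: rank items by the summed weights of the query terms they contain.
query_weights = {"automobile emission": 0.9, "sulphur dioxide": 0.3}

# Made-up items standing in for the information database.
items = {
    "doc1": "automobile emission standards and the farming industry",
    "doc2": "sulphur dioxide levels near farms",
    "doc3": "automobile emission and sulphur dioxide effects on crops",
}

def score_item(text, weights):
    """Sum the weights of every query term that occurs in the text."""
    text = text.lower()
    return sum(weight for term, weight in weights.items() if term in text)

# Highest score first: items matching the heavily weighted term rank higher.
ranked = sorted(items, key=lambda doc: score_item(items[doc], query_weights), reverse=True)
print(ranked)
```

An item that mentions only the low-weight term still appears in the result, but below items that match the term the user marked as more important.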
The following functions define the relationship between the terms in a search
statement and the interpretation of a particular word:
1. Boolean Logic
2. Proximity
3. Contiguous Word Phrases
4. Fuzzy Search
5. Term masking
6. Numeric and Date Ranges
7. Concept/Thesaurus Expansion
8. Natural Language Queries
9. Multimedia Queries
Boolean Logic
Boolean Logic allows a user to logically relate multiple concepts together to
define what information is needed.
Typically the Boolean functions apply to processing tokens identified
anywhere within an item.
The typical Boolean operators are AND, OR and NOT. These operations are
implemented using set intersection, set union and set difference procedures.
Placing portions of the search statement in parentheses specifies
the order of Boolean operations. If no precedence is given to operators, then
the operators are processed from left to right.
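The mapping from Boolean operators to set procedures can be sketched with Python sets. The inverted index below (processing token to set of item numbers) and the helper `postings` are hypothetical and exist only to make the example runnable.

```python
# Hypothetical inverted index: processing token -> set of item identifiers.
index = {
    "computer": {1, 2, 3},
    "sale": {2, 3, 4},
    "mainframe": {3, 5},
}

def postings(term):
    """Return the set of items containing the term (empty set if unknown)."""
    return index.get(term, set())

# computer AND sale -> set intersection
and_result = postings("computer") & postings("sale")
# computer OR mainframe -> set union
or_result = postings("computer") | postings("mainframe")
# computer NOT mainframe -> set difference
not_result = postings("computer") - postings("mainframe")

# Parentheses change the result of: computer OR mainframe AND sale
left_to_right = (postings("computer") | postings("mainframe")) & postings("sale")
and_first = postings("computer") | (postings("mainframe") & postings("sale"))

print(and_result, or_result, not_result, left_to_right, and_first)
```

The last two expressions select different items, which is why the order of operations has to be fixed either by parentheses or by a left-to-right rule.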
A special type of Boolean search is called “M of N” logic. The user lists a set
of possible search terms and identifies, as acceptable, any item that contains
a subset of the terms.
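A minimal sketch of “M of N” logic follows. It assumes that an acceptable subset means at least M of the N listed terms, and it also shows the equivalent expansion into an OR of AND groups; the function names are illustrative.

```python
from itertools import combinations

def m_of_n(item_tokens, terms, m):
    """True if the item contains at least m of the listed terms (assumed reading)."""
    tokens = set(item_tokens)
    return sum(1 for term in set(terms) if term in tokens) >= m

def m_of_n_expanded(item_tokens, terms, m):
    """Same test written as an OR over every AND-group of m terms."""
    tokens = set(item_tokens)
    groups = combinations(sorted(set(terms)), m)
    return any(all(term in tokens for term in group) for group in groups)

terms = ["aa", "bb", "cc"]
print(m_of_n(["aa", "xx", "bb"], terms, 2))   # two of the three terms are present
print(m_of_n(["aa", "xx"], terms, 2))         # only one term is present
```

The expanded form makes clear that “M of N” is a shorthand for an ordinary Boolean expression rather than a new kind of operator.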
Most Information Retrieval Systems allow Boolean operations as well as
natural language interfaces.
[Figure: Use of Boolean Logic]
Proximity examples:
“United” within five words of “American”: selects items such as “United States and
American interests”, “United Airlines and American Airlines”, etc., but not “United
States of America”.
“Nuclear” within zero paragraphs of “emission”: selects all items that have “nuclear”
and “emission” in the same paragraph.
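The two proximity statements above can be sketched as follows. The tokenizer, the definition of distance as the difference between word positions, and the use of a blank line as the paragraph separator are all assumptions of this sketch.

```python
import re

def tokenize(text):
    """Lower-case the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def within_words(text, term_a, term_b, max_distance):
    """True if some occurrence of term_a is within max_distance words of term_b."""
    tokens = tokenize(text)
    pos_a = [i for i, tok in enumerate(tokens) if tok == term_a]
    pos_b = [i for i, tok in enumerate(tokens) if tok == term_b]
    return any(abs(i - j) <= max_distance for i in pos_a for j in pos_b)

def same_paragraph(text, term_a, term_b):
    """True if both terms occur in one paragraph ("within zero paragraphs")."""
    for paragraph in text.split("\n\n"):
        tokens = set(tokenize(paragraph))
        if term_a in tokens and term_b in tokens:
            return True
    return False

print(within_words("United States and American interests", "united", "american", 5))
print(within_words("United States of America", "united", "american", 5))
```

The second call fails to match because the token “america” is not the search term “american”, which is the distinction the example in the text relies on.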
Contiguous Word Phrases
A Contiguous Word Phrase (CWP) is both a way of specifying a query and a
special search operator.
A Contiguous Word Phrase is two or more words that are treated as a single
semantic unit.
An example of CWP is “United States of America”. It is four words that specify
a search term representing a single semantic concept (country) that can be
used with Boolean operators or proximity.
Thus a query could specify “manufacturing” AND “United States of America”,
which returns any item that contains the word “manufacturing” and the
contiguous words “United States of America”.
Contiguous Word Phrases are called Literal Strings in WAIS and Exact
Phrases in RetrievalWare. In WAIS, multiple Adjacency (ADJ) operators are used
to define a Literal String.
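A contiguous word phrase can be treated as a run of adjacent tokens. The sketch below is one simple way to test for it and to combine it with AND; whitespace tokenization and the function names are assumptions, not the mechanism used by WAIS or RetrievalWare.

```python
def contains_phrase(tokens, phrase_tokens):
    """True if phrase_tokens occur as adjacent tokens, in order, within tokens."""
    n = len(phrase_tokens)
    return any(tokens[i:i + n] == phrase_tokens for i in range(len(tokens) - n + 1))

def matches_query(text):
    """“manufacturing” AND “United States of America” as a contiguous phrase."""
    tokens = text.lower().split()
    phrase = "united states of america".split()
    return "manufacturing" in tokens and contains_phrase(tokens, phrase)

print(matches_query("Manufacturing output in the United States of America rose"))
print(matches_query("Manufacturing in America and the United States"))
```

The second item contains all four words of the phrase's vocabulary except in the required adjacent order, so it is not selected.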
Fuzzy Searches
Fuzzy Searches provide the capability to locate spellings of words that are
similar to the entered search term.
This function is primarily used to compensate for errors in spelling of words.
Fuzzy searching increases recall at the expense of decreasing precision.
In the process of expanding a query term, fuzzy searching includes other terms
that have similar spellings, giving more weight to words in the database that
have similar word lengths and positions of the characters as the entered term.
A Fuzzy Search on the term “computer” would automatically include the
following terms: “computer”, “compiter”, “conputer”, “computter” and
“compute”.
Fuzzy searching has its maximum utilization in systems that accept items that
have been Optical Character Read (OCRed). The OCR process is a pattern
recognition process that segments the scanned-in image into meaningful
subregions, often treating the area that defines a single character as a segment.
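Query expansion by spelling similarity can be approximated with edit distance. This is a swapped-in technique chosen for illustration, not the weighting by word length and character position described above; the vocabulary and the distance threshold of 1 are assumptions.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]

def fuzzy_expand(term, vocabulary, max_distance=1):
    """Return vocabulary words whose spelling is within max_distance edits of term."""
    return [word for word in vocabulary if edit_distance(term, word) <= max_distance]

# Hypothetical database vocabulary containing OCR-style misspellings.
vocabulary = ["computer", "compiter", "conputer", "computter", "compute", "database", "printer"]
print(fuzzy_expand("computer", vocabulary))
```

Raising `max_distance` pulls in more candidate spellings, which illustrates the trade-off named in the text: higher recall at the cost of lower precision.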
Term Masking
Term masking is the ability to expand a query term by masking a portion of
the term and accepting as valid any processing token that matches the unmasked
portion of the term.
There are two types of masking: fixed length and variable length.
Fixed length masking is a single position mask. It masks out any symbol in a
particular position, or the lack of that position, in a word.
Variable length masking allows masking of any number of characters within a
processing token. The masking may be at the front, at the end, at both front
and end, or embedded. The first three cases are characterized as suffix search,
prefix search and embedded character string search, respectively.
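Both kinds of masking can be sketched by translating a masked term into a regular expression. The mask symbols are assumptions of this sketch: `*` stands for a variable length mask and `$` for a fixed length (single position) mask that also accepts the absence of that position.

```python
import re

def mask_to_regex(pattern):
    """Translate a masked term into a compiled, case-insensitive regex."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")   # variable length: any number of characters
        elif ch == "$":
            parts.append(".?")   # fixed length: one symbol, or none at that position
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.IGNORECASE)

def expand(pattern, vocabulary):
    """Return the processing tokens that match the unmasked portion of the term."""
    regex = mask_to_regex(pattern)
    return [word for word in vocabulary if regex.fullmatch(word)]

vocabulary = ["computer", "computing", "minicomputer", "microcomputers",
              "multinational", "multi-national", "compute"]
print(expand("*computer", vocabulary))       # mask at the front: suffix search
print(expand("comput*", vocabulary))         # mask at the end: prefix search
print(expand("*computer*", vocabulary))      # mask at both ends: embedded string search
print(expand("multi$national", vocabulary))  # fixed length mask
```

Matching against the whole token with `fullmatch` keeps the unmasked characters anchored, so `comput*` does not select “minicomputer”.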
[Figure: Use of Term Masking]
[Figure: terms related to “COMPUTER”: DATA PROCESSOR, MULTITASKING, MAINFRAME COMPUTER, PC, MINICOMPUTER]
Concept Class Structure for term “COMPUTER”
COMPUTER
  COMPUTER HARDWARE
    MINICOMPUTER
    PERIPHERAL
  COMPUTER SOFTWARE
    OPERATING SYSTEM
    APPLICATION
There are many additional functions that facilitate the user’s ability to input
queries, reducing the time it takes to generate the queries and reducing, a
priori, the probability of entering a poor query.
The following are the miscellaneous capabilities:
• Vocabulary Browse
• Iterative Search and Search History Log
• Canned Query
Vocabulary Browse
• It provides the capability to display, in alphabetically sorted order, words
from the document database.
• Logically all unique words in the database are kept in sorted order
along with a count of the number of unique items in which the word
is found.
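The structure behind Vocabulary Browse can be sketched as a sorted list of (word, item count) pairs. The sample items, whitespace tokenization and the function names are assumptions for illustration.

```python
from bisect import bisect_left
from collections import Counter

def build_vocabulary(items):
    """Return sorted (word, number of unique items containing the word) pairs."""
    counts = Counter()
    for text in items.values():
        counts.update(set(text.lower().split()))  # count each word once per item
    return sorted(counts.items())

def browse(vocabulary, start, size=5):
    """Show up to `size` vocabulary entries beginning at the position of `start`."""
    words = [word for word, _ in vocabulary]
    first = bisect_left(words, start.lower())
    return vocabulary[first:first + size]

# Hypothetical document database.
items = {1: "compute compute computer", 2: "computer database", 3: "database index"}
vocabulary = build_vocabulary(items)
print(browse(vocabulary, "comput", size=3))
```

Seeing the item counts next to neighbouring words lets a user spot misspellings and judge whether a term is too common or too rare before running the query.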
Z39.50
• It is the Information Retrieval Application Service Definition and
Protocol Specification. Z39.50 was first approved in 1988, and Version 2 was
approved in 1992.
• It is a standard for Information Retrieval Systems approved by the American
National Standards Institute (ANSI).
• It is a computer-to-computer communications standard for database
searching and record retrieval.
• The standard describes eight operation types: initialization, search,
present, delete, scan, sort, resource-report and extended services.
• There are six types of queries: Types 0, 1, 2, 100, 101 and 102. Type 101
allows for proximity, whereas Type 102 is a ranked list query.
• Z39.50 not only enables retrieval between origin and target systems, it also
structures the semantics of the search query, the sequence of message
exchange and the mechanism for returning records.
Wide Area Information Service (WAIS)
• It is the second de facto standard for many search environments on the
Internet.
• WAIS was developed by a project started by three commercial
companies (Apple, Thinking Machines and Dow Jones).
• The original idea was to create a program that would act as a personal
librarian.
• It would act like a personal agent keeping track of significant amounts of
data and filtering it for the information most relevant to the user.
• The interface concept was user-entered natural language statements of
the topics in which the user had interest. In addition, it provided the capability
to indicate to a computer that a particular item was of interest and have the
computer automatically find relevant items.