IR Chapter 1&2
Introduction
Definition of IR
Challenges in IR
Assumptions in IR
Goal of IR
==========================================
IR systems
Information Retrieval
Chapter 1:
Information Storage and Retrieval
Introduction
• The practice of archiving written information can be traced
back to around 3000 BC, when the Sumerians designated
special areas to store clay tablets with cuneiform inscriptions
(Amit Singhal, 2001).
• The need to store and retrieve written information became
increasingly important over centuries, especially with
inventions like paper and the printing press.
• After computers were invented, people realized that they
could be used for storing and mechanically retrieving large
amounts of information
Cont……
Approaching the end of the twentieth century, societies all
over the world are changing.
In countries of many different kinds, information now plays
an increasingly important part in economic, social, cultural
and political life.
This phenomenon is taking place regardless of a country’s
size, state of development or political philosophy.
Cont……..
• Clay tablets, computers, cloud
What is Information, storage and retrieval?
• Information:
Data that has been processed so that it carries meaning in itself;
the meaning may be useful, but it does not have to be.
Information is a critical business resource and, like any other
critical resource, must be properly managed.
Storage
?
Example: Google, Web
IR
Information Retrieval (IR)
• The term Information Retrieval was coined by Calvin
Mooers (1950)
Definition: IR is an important sub-discipline of Information Science
that is concerned with developing theories and methods of access
to information
– The focus is on helping users find information that matches their
information need (User Centered View)
• It is also a branch of applied Computer Science that focuses on the
representation, storage, organization of, and access to information
items (System Centered View).
…cont
• A good formal definition of information retrieval is given in
Baeza-Yates and Ribeiro-Neto (1999, p. 1)
“Information retrieval deals with representation, storage,
organization of, and access to information items. The
organization and access of information items should provide
the user with easy access to the information in which he is
interested”
• As a field, IR focuses on advanced applications of
computers
• It is about finding relevant information in large collections of
data
IR can be defined:
• Conceptually, IR is used to cover all related problems in
finding needed information
• Historically, information retrieval is about document retrieval,
emphasizing the document as the basic unit
– Until recently, in the above sense, IR was considered a narrow area
of interest for librarians and information experts
– Today, IR includes modelling, document classification, user interfaces
and visualization, multimedia retrieval, digital libraries, filtering, natural
languages, etc.
Remark
Acts as an assistant
Major functions of an IR system
– Analyze the contents of information items
– Represent the contents of the analyzed sources in a way suitable
for matching with users' queries
– Analyze users' information needs and represent them in a form
that is suitable for matching with the database
– Match the search statement with the stored database
– Retrieve or generate relevant information, presented in a ranking
that reflects relevance
– Make necessary adjustments in the system based on feedback
from users
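The functions above can be sketched in a few lines of Python. This is only an illustrative sketch, not the design of any particular IR system: the function names (analyze, represent, match, retrieve), the bag-of-words representation, the overlap score and the three sample documents are all assumptions made for the example, and the feedback step is left out.

```python
from collections import Counter

PUNCTUATION = ".,;:!?()\"'"


def analyze(text):
    """Analyze content: lowercase the text and split it into word tokens."""
    tokens = (word.strip(PUNCTUATION) for word in text.lower().split())
    return [token for token in tokens if token]


def represent(text):
    """Represent content as a bag of words (term -> count)."""
    return Counter(analyze(text))


def match(query_rep, doc_rep):
    """Score a document by the number of query term occurrences it shares."""
    return sum(min(count, doc_rep[term]) for term, count in query_rep.items())


def retrieve(query, collection):
    """Return (doc_id, score) pairs with a non-zero score, best first."""
    query_rep = represent(query)
    scored = [(doc_id, match(query_rep, represent(text)))
              for doc_id, text in collection.items()]
    # Sort by descending score; ties are broken by document id.
    return sorted((pair for pair in scored if pair[1] > 0),
                  key=lambda pair: (-pair[1], pair[0]))


# Hypothetical three-document collection.
collection = {
    "d1": "Clay tablets were stored in special areas.",
    "d2": "Computers store and retrieve large amounts of information.",
    "d3": "Information retrieval helps users find relevant information.",
}
results = retrieve("information retrieval", collection)
print(results)  # [('d3', 2), ('d2', 1)]
```

Note that d2 is still returned, only lower in the ranking: it shares one query term, and "retrieve" does not match "retrieval" because no stemming is applied here.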
Types of IR Systems
• IR can be structured for ease of discussion as:
– Text IR
• Discusses the classic problem of searching a collection of documents
for useful information
• The focus is on documents that are predominantly text (rather than
pictures)
• These are called textual images and are amenable to automatic
extraction of keywords
– Multimedia IR
• Discusses how to index document images and other binary data by
extracting features from their content and how to search them
efficiently
– Human computer interaction (HCI) for IR
• Discusses current trends in IR towards improved user interface and
better data visualization tools
– Application of IR
• Covers modern applications of IR (such as the Web, bibliographic
systems, and digital libraries)
Components of an IR system
– Vocabulary subsystem
• In indexing we need to use a controlled vocabulary, i.e., a list of selected
subject terms used to represent a document. The indexing is updated
on the basis of this vocabulary
– DBMS:
• Records, tuples; no ranking
• Query specification: complete (requires precise retrieval criteria)
• A single erroneous object implies failure
– IR system
• Query specification: incomplete
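The contrast can be illustrated with a small Python sketch. It is an assumption-laden toy, not how a real DBMS or IR engine works: dbms_select and ir_search are hypothetical names, the "DBMS" side is reduced to an exact comparison on one field, and the "IR" side ranks by the number of shared words.

```python
# Hypothetical sample data for the two kinds of system.
records = [
    {"id": 1, "title": "information retrieval"},
    {"id": 2, "title": "database systems"},
]
documents = {
    "d1": "information storage and retrieval",
    "d2": "database management systems",
}


def dbms_select(rows, title):
    """Exact match: a row either satisfies the criterion or is left out."""
    return [row["id"] for row in rows if row["title"] == title]


def ir_search(docs, query):
    """Best match: rank documents by the number of shared query words."""
    query_words = set(query.lower().split())
    scores = {doc_id: len(query_words & set(text.lower().split()))
              for doc_id, text in docs.items()}
    return sorted((doc_id for doc_id in scores if scores[doc_id] > 0),
                  key=lambda doc_id: (-scores[doc_id], doc_id))


exact_hit = dbms_select(records, "information retrieval")     # precise criterion
exact_miss = dbms_select(records, "retrieval of information")  # imprecise criterion
ranked = ir_search(documents, "retrieval of information")
print(exact_hit, exact_miss, ranked)  # [1] [] ['d1']
```

The imprecise query returns nothing from the exact-match selection, while the ranked search still returns the partially matching document.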
[Figure: components shown are Text Operations, Query Operations, Indexing, DB Manager Module, Searching and Ranking; labelled flows are logical view, query, inverted file, Index, Text Database, retrieved docs, ranked docs and user feedback]
For queries
• The query arises from an information need on the part of
the user
• The query is then a representation of the information need and
must be expressed in a language understood by the system
• Due to the inherent difficulty of accurately representing the
information need, the query in an IR system is always regarded
as approximate and imperfect
…..cont
IR ?
1. Regulatory compliance/amenability/agreement
• A well-organized information storage and retrieval system
that follows compliance regulations and tax record-keeping
guidelines significantly increases a business owner's
confidence that the business is fully compliant.
2. Efficiency and Productivity
Chapter 2:
Automatic Term Selection and
Term Weighting
Definition of Term Selection
• Before a computerized information retrieval system can actually operate
to retrieve some information, that information must have already been
stored inside the computer.
• The computer, however, is not likely to have stored the complete text
of each document in the natural language in which it was written.
• It will have, instead, a document representative which may have been
produced from the documents either manually or automatically
Why term selection?
• If a full-text representation is adopted, then all words are used for
indexing (this is not very efficient, because it adds time and
space overhead)
Index term
• Is also called keyword
• Assumption
• Index terms can be extracted from the title, abstract and text of the
document
Indexing
• Comes from the use of index terms
• Is a critical process
• Ways to do indexing
– Manual
– Non-weighted indexing
– Weighted indexing
Indexing
• Non-weighted indexing
– Indexing vocabulary
– Collection characteristics
• Indexers are normally provided with guidelines (input
sheets, manuals and instructions, printed thesaurus) to
determine the contents of a given document
• Is usually done in the library environment
Limitations of Manual Indexing
• Labor intensive
– Information overload
• Enormous amount of information is being generated from
day to day activities
– Cost effectiveness
• Human indexing is expensive and labor intensive.
Procedures for Building an Index Automatically
• Documents (text)
• Tokenizing: break the text into words
• Noise reduction: remove the words on the stoplist
• Feature normalization: stemming* of the non-stoplist words
• Term weighting* of the stemmed words
*Indicates optional operation
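The procedure above can be sketched in Python as follows. Everything here is an illustrative assumption rather than a reference implementation: the stoplist is a tiny hypothetical one, stem is a crude suffix stripper (not Porter's algorithm), and the term weight is simply the raw term frequency in each document.

```python
import re
from collections import Counter, defaultdict

# Hypothetical stoplist; real systems use much longer lists.
STOPLIST = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}


def tokenize(text):
    """Tokenizing: break the text into lowercase words."""
    return re.findall(r"[a-z]+", text.lower())


def remove_noise(words):
    """Noise reduction: drop the words that appear on the stoplist."""
    return [word for word in words if word not in STOPLIST]


def stem(word):
    """Feature normalization (optional): crude suffix stripping."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_index(docs):
    """Map each stemmed term to {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        terms = [stem(word) for word in remove_noise(tokenize(text))]
        # Term weighting (optional): raw term frequency is used as the weight.
        for term, frequency in Counter(terms).items():
            index[term][doc_id] = frequency
    return dict(index)


# Hypothetical two-document collection.
docs = {
    "d1": "Indexing documents is the storing of index terms.",
    "d2": "The index terms represent the documents.",
}
index = build_index(docs)
print(index["index"])  # {'d1': 2, 'd2': 1}
```

In d1 both "Indexing" and "index" are reduced to the same stem, which is why the term "index" has weight 2 there. The resulting structure, a mapping from each term to the documents that contain it, is a small inverted file.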
• Some words are good content indicators, some are bad and some are indifferent
• It was Luhn (1957) who first suggested that certain words could
be automatically extracted from texts to represent their content
• Much of text analysis has been built on Luhn's original idea
Automatic Text Analysis
• Luhn’s proposal
• Still today, the search engines that operate on the Internet index
the documents based on this principle
Automatic Text Analysis
• Luhn’s observation
• Luhn used the work of Zipf to specify two cut-offs,
an upper and a lower, and thus exclude non-significant words
• Luhn’s Assumption
• Therefore, Luhn suggested using the words in the middle of the frequency range
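Selecting words from the middle of the frequency range can be sketched as below. This is a minimal illustration under stated assumptions: significant_words is a hypothetical helper, the sample text is artificial, and the cut-off values 2 and 3 are chosen by hand for this text; the slides do not say how the cut-offs should be set.

```python
import re
from collections import Counter


def significant_words(text, lower_cutoff, upper_cutoff):
    """Keep words whose frequency f satisfies lower_cutoff <= f <= upper_cutoff."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return sorted(word for word, frequency in counts.items()
                  if lower_cutoff <= frequency <= upper_cutoff)


# Artificial sample text: "the" occurs 9 times, "and" and "of" 4 times each,
# "index" and "term" 3 times each, and the remaining words once.
text = (
    "the index and the term and the weight and the query and the index "
    "of the term of the index of the term of the document"
)
selected = significant_words(text, lower_cutoff=2, upper_cutoff=3)
print(selected)  # ['index', 'term']
```

The upper cut-off removes the very frequent function words, and the lower cut-off removes the words that occur only once.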