IR Chapter 1&2

The document provides an overview of information storage and retrieval systems. It defines key concepts like information, storage, retrieval and queries. It discusses the goals of IR systems as helping users find relevant information with minimal effort. It outlines some challenges in IR like representation of information and user needs, and matching them. Finally, it describes the major functions of IR systems as analyzing, representing, matching and retrieving relevant information to users.

Uploaded by

Magarsa Bedasa

WOLKITE UNIVERSITY

COLLEGE OF COMPUTING AND INFORMATICS

DEPARTMENT OF INFORMATION SYSTEMS

Course : Introduction to Information Storage & Retrieval


INSY 2063: IR

BSc(IS) Third Year, First Semester, 2021


ISAYAS W.
INFORMATION SYSTEMS
Points to be covered

• Introduction
• Definition of IR
• Challenges in IR
• Assumptions in IR
• Goal of IR
• IR systems
Information Retrieval

Chapter 1:
Information Storage and Retrieval
Introduction
• The practice of archiving written information can be traced
back to around 3000 BC, when the Sumerians designated
special areas to store clay tablets with cuneiform inscriptions
(Amit Singhal, 2001).
• The need to store and retrieve written information became
increasingly important over centuries, especially with
inventions like paper and the printing press.
• After computers were invented, people realized that they
could be used for storing and mechanically retrieving large
amounts of information
Cont……
• Approaching the end of the twentieth century, societies all
over the world are changing.
• In countries of many different kinds, information now plays
an increasingly important part in economic, social, cultural
and political life.
• This phenomenon is taking place regardless of a country’s
size, state of development or political philosophy.
Cont……..
• Changes that are happening in Singapore, with a population
of 2.5 million, are similar to those taking place in Japan with
its population of 125 million.
• Developing countries like Thailand are striving to build
information-intensive social and economic systems just as
hard as countries like the United Kingdom or France.
The storage of information from the earliest times to now

• Clay tablets
• Paper and other soft materials
• Computers
• Cloud
What is information, storage and retrieval?

• Information:
– Data that has been processed so that it has meaning in itself;
the meaning may be useful, but does not have to be.
– Information is a critical business resource and, like any other
critical resource, must be properly managed.
Storage
– The action or method of storing something.
– The place where data is held in an electromagnetic or optical
form for access by a computer processor.

Retrieval: the process of getting something back from
somewhere easily.
– The action of obtaining or consulting material stored in a
computer system.
Example: find “BRUTUS AND CAESAR AND NOT
CALPURNIA” in the big book of Shakespeare.
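The Boolean request in the example can be sketched with plain set operations. This is a minimal illustration only: the `plays` dictionary is made-up toy data, and the helper names `docs_with` and `brutus_caesar_not_calpurnia` are hypothetical, not part of any real IR library.

```python
# Toy collection (assumed data): title -> set of terms the text contains.
plays = {
    "Julius Caesar": {"brutus", "caesar", "calpurnia"},
    "Antony and Cleopatra": {"antony", "brutus", "caesar", "cleopatra"},
    "Hamlet": {"brutus", "caesar", "hamlet"},
    "Macbeth": {"macbeth"},
}


def docs_with(term):
    """Return the set of titles whose term set contains `term`."""
    return {title for title, terms in plays.items() if term in terms}


def brutus_caesar_not_calpurnia():
    """Evaluate BRUTUS AND CAESAR AND NOT CALPURNIA with set algebra."""
    # AND is set intersection; AND NOT is set difference.
    return (docs_with("brutus") & docs_with("caesar")) - docs_with("calpurnia")


result = sorted(brutus_caesar_not_calpurnia())
print(result)
```

Intersection and difference map directly onto AND and AND NOT, which is why Boolean retrieval is often introduced this way.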
Information Storage

• Computers can store different types of information in
different ways, depending on what the information is, how
much storage it requires and how quickly it needs to be
accessed.
IR and IR systems
What is IR (Information Retrieval)?
Example: Google (Web)
Information Retrieval (IR)
• The term “information retrieval” was coined by Calvin
Mooers (1950).
• Definition: an important sub-discipline of Information Science
that is concerned with developing theories and methods of access
to information.
– The focus is on helping users find information that matches their
information need (user-centered view).
• It is also a branch of applied Computer Science that focuses on the
representation, storage, organization of, and access to information
items (system-centered view).
…cont
• A good formal definition of information retrieval is given in
Baeza-Yates and Ribeiro-Neto (1999, p. 1):
“Information retrieval deals with representation, storage,
organization of, and access to information items. The
organization and access of information items should provide
the user with easy access to the information in which he is
interested”
• As a field, IR focuses on advanced applications of
computers.
• It is about finding relevant information in large collections of
data.
IR can be defined:
• Conceptually, IR is used to cover all related problems in
finding needed information.
• Historically, information retrieval is about document retrieval,
emphasizing the document as the basic unit.
– Until recently, in the above sense, IR was considered a narrow area
of interest for librarians and information experts.
– Today, IR includes modelling, document classification, user interfaces
and visualization, multimedia retrieval, digital libraries, filtering, natural
languages, etc.
• Technically, information retrieval refers to (text) string
manipulation, indexing, matching, querying, etc.
Generally;
• Information Retrieval (IR) is finding material (usually
documents) of an unstructured nature (usually text) that
satisfies an information need from within large collections
(usually stored on computers). Information retrieval
technology has been central to the success of the Web.

• Question 1: What is the difference between structured
and unstructured data? What about semi-structured data?
Query

• (Computing) A set of instructions passed to a database to
retrieve particular data (dictionary definition).
• Queries are formal statements of information needs that are
put to an IR system by the user to search for documents.
• The user’s query is matched to the documents stored in a
database through the documents’ index (for example, a book’s
index).

• They are different from Site Seeker


Query

• When formulating a query, the user can employ search
facilities such as search limits (by date of publication,
language, publication type, and so on) and Boolean operators
(AND/OR/NEAR/NOT) to make the query more or less specific
(i.e. refine or relax the query).
• The user can also often control the output in terms of, for
example, the number of retrieved documents to display and the
highlighting of search terms.
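Refining and relaxing a query with search limits can be sketched as follows. This is an illustrative sketch under stated assumptions: the `Record` class, the `search` function, its parameter names and the sample records are all hypothetical, and matching is plain whole-word lookup.

```python
from dataclasses import dataclass


@dataclass
class Record:
    title: str
    year: int
    language: str
    text: str


def search(records, must=(), must_not=(), year_from=None, language=None, limit=10):
    """Return titles matching Boolean terms and optional search limits."""
    hits = []
    for rec in records:
        words = set(rec.text.lower().split())
        if not all(term in words for term in must):      # AND terms
            continue
        if any(term in words for term in must_not):      # NOT terms
            continue
        if year_from is not None and rec.year < year_from:
            continue                                     # limit by date
        if language is not None and rec.language != language:
            continue                                     # limit by language
        hits.append(rec.title)
    return hits[:limit]                                  # output control


# Hypothetical sample records.
records = [
    Record("Doc A", 1999, "en", "modern information retrieval"),
    Record("Doc B", 2008, "en", "introduction to information retrieval"),
    Record("Doc C", 2010, "de", "information retrieval im web"),
    Record("Doc D", 2012, "en", "database systems"),
]

broad = search(records, must=["information", "retrieval"])
narrow = search(records, must=["information", "retrieval"], year_from=2000, language="en")
print(broad, narrow)
```

Adding limits refines the query (fewer hits); dropping them relaxes it again.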
Goal of IR

• The general goal of IR is to:
– Help users find useful information based on their information
needs (with a minimum of effort), despite the increasing
complexity of information and the changing needs of users
– Provide immediate random access to the data

Remark
– Retrieval systems such as Google are developed with this aim


What IR assumes?

• Information is stored (or available)
• A user has an information need
• An automated system exists from which information can
be retrieved
• The system works!!
Challenges in IR
• Representation of information items and information needs
(first problem)
– Document representation is one area of IR
– Query representation is another area of IR
• Matching (second problem)
– How to match the information need against information items
• Modification of the representation as a result of judgment (query
expansion or reformulation)
IR SYSTEMS
Data IR Systems
• Are systems which are built to retrieve documents highly
likely to be relevant to the user
• Are systems built to reduce the user’s workload in searching
through the store of documents to find relevant ones
• Are systems that give information about the presence or
absence of documents in accordance with the query
• Are computer-based systems (we are talking about
automation)
….cont
• Are systems that attempt to find relevant documents to
respond to a user’s request
• Are devices interposed between a potential user of
information and the information collection itself
– For a given information problem, the purpose of the
system is to capture wanted items and to filter out
unwanted items
…….cont
Information retrieval system
• Consists of:
– A set of information items (documents)
• Objects that hold the information we need
– A set of requests (information needs)
– A mechanism for determining which information items meet the
requirements of a request (matching function)
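The three parts listed above (documents, requests and a matching function) can be made concrete with a very small sketch. The scoring rule used here, the count of shared words, is only one simple assumption among many possible matching functions, and the document texts are invented.

```python
def matching_function(request, document):
    """Score a document by the number of distinct words it shares with the request."""
    request_terms = set(request.lower().split())
    document_terms = set(document.lower().split())
    return len(request_terms & document_terms)


# A set of information items (hypothetical documents).
documents = {
    "d1": "storage of clay tablets",
    "d2": "retrieval of stored information",
    "d3": "information storage and retrieval systems",
}

# A request expressing an information need.
request = "information retrieval"

scores = {doc_id: matching_function(request, text) for doc_id, text in documents.items()}
matched = [doc_id for doc_id, score in scores.items() if score > 0]
print(scores, matched)
```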
Examples of IR systems

• Typical examples of IR systems are search engines that
can be found on the Web or in a library
– They concentrate on finding documents, performing
full-text retrieval
– After a user types in several keywords, the system
returns the documents that are most interesting
according to the system
What should an IR system do?
• Store/archive information
• Provide access to that information
• Answer queries with relevant information
• Understand the user’s queries
• Understand the user’s need
• Act as an assistant
Major functions of an IR system
– Analyze the contents of information items
– Represent the contents of the analyzed sources in a way suitable
for matching with users’ queries
– Analyze users’ information needs and represent them in a form
that will be suitable for matching with the database
– Match the search statement with the stored database
– Retrieve or generate information that is relevant, in a ranking
which reflects relevance
– Make necessary adjustments in the system based on feedback
from users
Types of IR Systems
• IR can be structured for ease of discussion as:
– Text IR
• Discusses the classic problem of searching a collection of documents
for useful information
• Focuses on documents that are predominantly text (rather than
pictures)
• These are called textual images and are amenable to automatic
extraction of keywords
– Multimedia IR
• Discusses how to index document images and other binary data by
extracting features from their content and how to search them
efficiently
– Human computer interaction (HCI) for IR
• Discusses current trends in IR towards improved user interface and
better data visualization tools
– Application of IR
• Covers modern applications of IR (such as the Web, bibliographic
systems, and digital libraries)
Components of an IR system

• An IR system comprises the following major subsystems:
– Document selection subsystem
• Documents are held in the database. How are we going to select those
documents that are relevant (matched with user requests)?
– Vocabulary subsystem
• Indexing uses a controlled vocabulary, i.e., a list of selected
subject terms to represent a document. The index is updated
on the basis of this vocabulary
– Text operations subsystem
• Tokenization, stopword removal, stemming
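The three text operations named above can be sketched in a few lines. This is a rough illustration, not a production pipeline: the stopword list is a tiny assumed sample, and `crude_stem` is a naive suffix stripper written for this example (it is not the Porter stemmer or any other published algorithm).

```python
import re

# Tiny assumed stopword list; real systems use much longer lists.
STOPWORDS = {"the", "of", "and", "a", "an", "in", "is", "are", "to"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def remove_stopwords(tokens):
    """Drop tokens that appear in the stopword list."""
    return [t for t in tokens if t not in STOPWORDS]


def crude_stem(token):
    """Strip one common suffix if at least three characters remain (naive)."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def text_operations(text):
    """Tokenize, remove stopwords, then stem what is left."""
    return [crude_stem(t) for t in remove_stopwords(tokenize(text))]


terms = text_operations("The indexing of documents is stored in the databases")
print(terms)
```

The odd-looking stems such as `stor` and `databas` show why stemmers produce index keys rather than dictionary words.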
Data versus Information Retrieval
• Information items
– DBMS
• Highly structured data (of known nature), often
homogeneous records, often semantically unambiguous (well-defined
semantics)
– IR systems
• Unstructured or unformatted data (as opposed to a relational
database). An individual document is not
structured as in a DB
• Free text
– Text data: papers, technical reports, news articles (completely
untagged or plain text)
– Web pages: HTML and XML files (semi-structured)
• Non-textual data: images, graphics, etc.
• Heterogeneous, semantically ambiguous (semantics is
frequently loose; we want an approximate match)
……cont
• Answers
– DBMS:
• Records, tuples; no ranking
• Well-defined results
• Perfect precision and recall; each item is relevant
– IR systems:
• Documents, returned as a ranked list. The issue of ranking is
very important (users page through the top k documents)
• Fuzzy results

• Imperfect precision and recall; each item has a specific degree of relevance


….cont
• Matching
– DBMS:
• Analogous to DB querying: which records contain a set of
keywords?
• Exact match: every record
either matches or fails to match a query; there is no notion of relevance
• A single erroneous object implies failure!
– IR systems:
• Information about a subject or topic
• Partial or best match: we talk of possibly relevant items, not
exactly matched items
• The notion of relevance is most important and needs a model
• Small errors are tolerated (and are in fact inevitable)
• Interpret the contents of information items
• Generate a ranking which reflects relevance
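The contrast between exact match and best match can be sketched on the same toy data. The records, the query and the overlap-based score are assumptions made for illustration; real IR systems use richer relevance models.

```python
# Hypothetical records, each described by a set of keywords.
records = [
    {"id": 1, "keywords": {"zipf", "law"}},
    {"id": 2, "keywords": {"zipf", "law", "ranking"}},
    {"id": 3, "keywords": {"ranking", "model"}},
]
query = {"zipf", "ranking"}


def exact_match(records, query):
    """DBMS style: a record either contains every query keyword or it fails."""
    return [r["id"] for r in records if query <= r["keywords"]]


def best_match(records, query):
    """IR style: rank records by the fraction of query keywords they contain."""
    scored = [(len(query & r["keywords"]) / len(query), r["id"]) for r in records]
    scored = [(score, rid) for score, rid in scored if score > 0]
    # Highest score first; ties are broken by record id.
    return [rid for score, rid in sorted(scored, key=lambda p: (-p[0], p[1]))]


print(exact_match(records, query))
print(best_match(records, query))
```

Exact match returns only record 2, while best match also returns the partially matching records, ranked below it.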
….cont
• Items wanted
– DBMS:
• Matching
– IR systems:
• Relevant
• Model
– DBMS:
• Deterministic (the answer can be predetermined)
– IR systems:
• Probabilistic, not deterministic; the answer is not
predetermined
…cont
• Querying
– DBMS:
• The DB query assumes that the data is in a standardized
format
– IR systems:
• The query assumes that we work on plain, unformatted data
• Query language
– DBMS:
• Artificial language
– IR systems:
• Natural language
….cont

• Query specification
– DBMS:
• Complete (requires precise retrieval criteria)
• A single erroneous object implies failure
– IR systems:
• Incomplete
• Small errors are tolerated


Summary of Comparison (data retrieval Vs information retrieval)
IR and the Retrieval Process
• The purpose of an information retrieval strategy is to retrieve
all the relevant documents whilst at the same time retrieving as
few non-relevant ones as possible
• The process involves a certain amount of feedback
and is best illustrated using the diagram in the next slide
• It can be seen or interpreted in terms of component sub-processes
whose study yields many of the topics that will be
covered in the course
Retrieval Process
[Figure: A simple and generic software architecture to describe the retrieval process. Labelled components: User Interface, Text Operations, Query Operations, Indexing, DB Manager Module, Searching, Ranking, Index (inverted file), Text Database. Labelled flows: user need, text, logical view, query, user feedback, retrieved docs, ranked docs.]


….cont
• There are three main ingredients to the IR process
– Texts or documents
– Queries
– The process of evaluation
For texts
• For texts, the main problem is to obtain a representation of the
text in a form which is amenable (agreeable) to automatic
indexing
• This is achieved (i.e., the representation) by creating an
abbreviated form of the text, known as a text surrogate
• A typical surrogate would consist of a set of index terms or
keywords or descriptors
….cont

For queries
• For queries, the query has arisen as a result of an information
need on the part of the user
• The query is then a representation of the information need and
must be expressed in a language understood by the system
• Due to the inherent difficulty of accurately representing the
information need, the query in an IR system is always regarded
as approximate and imperfect
…..cont

For the evaluation
• The evaluation process involves a comparison of the texts
actually retrieved with those the user expected to retrieve
• This often leads to some modification, typically of the query,
though possibly of the information need or even of the
surrogates
• The extent to which modification is required is closely linked
with the process of measuring the effectiveness of the retrieval
operation (recall and precision)
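Recall and precision compare what was retrieved with what was relevant. A minimal sketch using their standard set-based definitions follows; the function name `precision_recall` and the document ids are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a retrieved set against a relevant set.

    precision = |retrieved ∩ relevant| / |retrieved|
    recall    = |retrieved ∩ relevant| / |relevant|
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical run: four documents retrieved, three documents actually relevant.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d2", "d4", "d7"])
print(p, r)
```

Here two of the four retrieved documents are relevant (precision 0.5) and two of the three relevant documents were found (recall about 0.67).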
Why IR?
1. Regulatory compliance (amenability, agreement)
• A well-organized information storage and retrieval system
that follows compliance regulations and tax record-keeping
guidelines significantly increases a business owner’s
confidence that the business is fully complying.
2. Efficiency and Productivity

• A good information storage and retrieval system, including an
effective indexing system, not only decreases the chances that
information will be misfiled but also speeds up the storing and
retrieval of information.
• The resulting time savings increase office efficiency
and productivity while decreasing stress and anxiety.
3. Improving working environment
• It can be disheartening to anyone walking through an office area
to see vital business documents and other information stacked
on top of file cabinets or in boxes next to office workstations.
• Not only does this create a stressful and poor working
environment, but customers who see it may
form a negative perception of the business.
• Contrast this with an office area in which file cabinets, walkways
and workstations are clear and neatly organized to see how
important it is for even a small business to have a well-organized
information storage and retrieval system.
Information Retrieval

Chapter 2:
Automatic Term Selection and
Term Weighting
Definition of Term Selection
• Before a computerized information retrieval system can actually operate
to retrieve some information, that information must have already been
stored inside the computer.

• Originally it will usually have been in the form of documents.

• The computer, however, is not likely to have stored the complete text
of each document in the natural language in which it was written.
• It will have, instead, a document representative which may have been
produced from the documents either manually or automatically
Why term selection?

• Some words are not good for representing documents
• Using all words has a computational cost: it increases searching
time and storage requirements
• Using the set of all words in a collection to index
documents generates too much noise for the retrieval
task
Objective or aim of term selection
• Represent textual documents by a set of keywords called index
terms or simply terms
• Increase efficiency by extracting from the resulting
document a selected set of terms to be used for
indexing the document
• If full-text representation is adopted, then all words are used for
indexing (this is less efficient, as it adds overhead in time
and space)
Index term
• Is also called a keyword
• Is a word (a single word) or phrase (multiword) in a document
whose semantics gives an indication of the document’s theme
(main idea)
– Captures the subject discussed
– Helps in remembering the document’s main theme
• Is mainly a noun (because nouns have meaning by themselves)


Index Terms

• Assumption
– The index terms selected are assumed to reflect the content of
the text (are descriptions of content)
• Index terms can be extracted from the title, abstract and text of the
document
Indexing
• Comes from the use of index terms
• Is a critical process
– The user’s ability to find documents on a particular subject is
limited by the indexing process used to create index terms for
the subject
Indexing
• Some definitions
– Is the art of organizing information
– Is an association of descriptors (keywords, concepts) with
documents in view of future retrieval
– Is a process of constructing document surrogates by
assigning identifiers to text items
– Is the process of storing data in a particular way in order to
locate and retrieve the data
– Is the process of analyzing the information content in the
language of the indexing system
Indexing
• Purpose/objective
– To give access points to a collection that are expected to be
most useful to the users of information
– To allow easy identification of documents (e.g., find
documents by topic)
– To relate documents to each other
– To allow prediction of document relevance to a particular
information need
Indexing

• Ways to do indexing

– Manual

– Automatic (focus of the course)


Indexing

• Indexing may also assign weights to terms

– Non-weighted indexing

– Weighted indexing
Indexing
• Non-weighted indexing
– No attempt is made to determine the value of the different terms
assigned to a document
– It is not possible to distinguish between major topics and casual
references
– All retrieved documents are equal in value
– Typical of commercial systems through the 1980s


Indexing
• Weighted indexing
– An attempt is made to place a value on each term of the
description of the document
– This value is related to the frequency of occurrence of the
term in the document (higher is better), but also to the
number of documents in the collection that use this term (lower is
better)
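One common way to combine the two factors above is a tf-idf style weight, where the term frequency in the document is multiplied by the logarithm of the collection size divided by the number of documents containing the term. The slides do not name a specific formula, so the exact form used here (tf × log(N/df)) is an assumption, as are the toy documents and the function name `tf_idf_weights`.

```python
import math
from collections import Counter

# Hypothetical, already tokenized documents.
docs = {
    "d1": ["information", "retrieval", "information", "system"],
    "d2": ["database", "system"],
    "d3": ["retrieval", "model", "system"],
}


def tf_idf_weights(docs):
    """Weight each term by tf * log(N / df) (one assumed weighting variant)."""
    n_docs = len(docs)
    df = Counter()                      # document frequency per term
    for terms in docs.values():
        df.update(set(terms))           # count each term once per document
    weights = {}
    for doc_id, terms in docs.items():
        tf = Counter(terms)             # term frequency within this document
        weights[doc_id] = {t: tf[t] * math.log(n_docs / df[t]) for t in tf}
    return weights


weights = tf_idf_weights(docs)
print(weights["d1"])
```

In this toy collection "system" occurs in every document, so its weight is zero, while "information" is frequent in d1 and absent elsewhere, so it gets the highest weight.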
Manual Indexing
• Indexers decide which keywords to assign to documents based
on a controlled vocabulary
– Human indexers assign index terms to documents
• The indexers try to summarize the contents or aboutness of the
whole document in a few keywords
• That is, indexers analyze and represent the content of a document
through keywords
• Is based on the indexers’ intellectual judgment and semantic interpretation
(of concepts, themes)
Manual Indexing
• Indexers’ prior knowledge of the following is important for coming
up with good keywords or index terms:
– Terms that will be used by the user
– Indexing vocabulary
– Collection characteristics
• Indexers are normally provided with guidelines (input
sheets, manuals and instructions, printed thesaurus) to
determine the contents of a given document
• Is usually done in the library environment
Advantages of Manual Indexing

• Ability to perform abstraction (conclude what the subject is) and
determine additional related terms

• Ability to judge the value of concepts


Disadvantages of Manual Indexing
• Slow and expensive (significant cost)
– Professional indexers are very expensive
• Is based on intellectual judgment and semantic interpretation
(concepts, themes)
– High probability of inconsistency among
indexers (maintaining consistency is difficult)
• Labor intensive
• Automatic indexing goes some way towards solving all these
problems
Automatic Indexing
• Is the assignment of content identifiers with the help of modern
computing technology
– A computer system is used to record the descriptors generated
by the human
• The system extracts “typical”/“significant” terms
• The human may contribute by setting the parameters or
thresholds, or by choosing components or algorithms
• The original texts of information items are used as the basis of
indexing
Why automatic indexing?
• Reasons for the necessity of automatic indexing

– Information overload
• Enormous amount of information is being generated from
day to day activities

– Explosion of machine-readable text
• Massive amounts of information are available in electronic format and on
the Internet.

– Cost effectiveness
• Human indexing is expensive and labor intensive.
Procedures for Building an Index Automatically
[Figure: pipeline for building an index automatically. Documents → tokenizing (break text into words) → noise reduction (apply stoplist, keep non-stoplist words) → feature normalization (stemming*, giving stemmed words) → term weighting* (terms with weights) → index database. * indicates an optional operation.]
Procedures for Building an Index Automatically
• Thus, automatic indexing consists of two processes:
– Assigning terms or concepts capable of representing document
content
– Assigning a weight or value to each term reflecting its
presumed importance for the purpose of content identification
• Important words are assigned higher weights
• Less important words are assigned lower weights
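The two processes, selecting terms and weighting them, can be sketched as a small index builder. This is an illustrative sketch only: the stoplist is a tiny assumed sample, the optional stemming step is skipped, raw term frequency stands in for the weight, and `build_index` is a hypothetical name.

```python
import re
from collections import Counter, defaultdict

# Tiny assumed stoplist for noise reduction.
STOPLIST = {"the", "of", "and", "a", "in", "is", "to"}


def build_index(documents):
    """Build an inverted index: term -> {doc_id: weight (raw term frequency)}."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        tokens = re.findall(r"[a-z]+", text.lower())        # tokenizing
        terms = [t for t in tokens if t not in STOPLIST]    # noise reduction
        for term, freq in Counter(terms).items():           # term weighting
            index[term][doc_id] = freq
    return dict(index)


# Hypothetical two-document collection.
documents = {
    1: "The storage of information and the retrieval of information",
    2: "Retrieval of documents",
}
index = build_index(documents)
print(index)
```

Each index entry lists the documents that contain the term together with a weight, which is the structure a search component would consult at query time.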


Advantages of Automatic Indexing
• Reduced processing time (Fast)
• Reduced cost (inexpensive)
– Once initial hardware cost is amortized (pay back),
operational cost is cheaper than wages for human indexers
– Indexing entries are generated at a lower cost than manual
indexing
• Easy to maintain
• Improved consistency
– No inconsistency or high consistency
– Algorithms select index terms much more consistently than
humans.
• Better retrieval (achieved)
Disadvantages of Automatic Indexing
• Mechanical execution of algorithms, with no intelligent interpretation (of
aboutness / relevance)
Automatic Text Analysis
• Not all words in a text are good index terms
• Some are good, some are bad and some are indifferent
• How do we know whether a term is good, bad or indifferent for
indexing?
• Luhn’s idea gives us an answer to this question

Automatic Text Analysis

• It was Luhn (1957) who first suggested that certain words could
be automatically extracted from texts to represent their content
– He was one of the earliest researchers in IR
• He discovered that the distribution patterns of words could give
significant information about the property of being content
bearing
• Much of text analysis has been built on Luhn’s original idea
Automatic Text Analysis
• Luhn’s proposal
“The frequency of word occurrences in an article furnishes a
useful measure of word significance…”
• The quote fairly summarizes Luhn’s contribution to automatic
analysis
– However, a high-frequency term will be acceptable for indexing purposes
only if its occurrence frequency is not equally high in all documents of the
collection
Automatic Text Analysis

• According to his assumption, frequency data can be used to
extract words (and sentences) to represent documents
• Still today, the search engines that operate on the Internet index
documents based on this principle
Automatic Text Analysis
• Luhn’s observation
– He noted that high-frequency words tend to be common,
non-content-bearing words
– He also recognized that one or two occurrences of a word in a
relatively long text could not be taken as significant in defining
the subject matter
• He came up with a model for selecting terms based on their
frequency of occurrence
Automatic Text Analysis
• Luhn’s model
– Words which occur very infrequently in a collection are of little
importance for indexing since they are unlikely to be specified in queries
• Such rare terms are likely to be specific to individual documents and
may not occur in users’ queries
– Words which occur very frequently in a collection are of little importance
for indexing since they do not discriminate sufficiently between documents
• Such terms are unlikely to distinguish one document from
others, so they are not important for indexing
– The most important words for indexing are those which occur with
intermediate frequencies
• Thus, according to Luhn, medium-frequency terms are better
candidates for indexing
• Therefore, mechanisms should be devised to get rid of rare and
frequent terms
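Luhn's model amounts to keeping only the terms whose collection frequency lies between a lower and an upper cut-off. A minimal sketch follows; the function name `select_terms`, the cut-off values 2 and 5 and the toy token list are all assumptions chosen for illustration.

```python
from collections import Counter


def select_terms(tokens, lower_cutoff, upper_cutoff):
    """Keep terms whose frequency falls within [lower_cutoff, upper_cutoff]."""
    freq = Counter(tokens)
    return {term for term, f in freq.items() if lower_cutoff <= f <= upper_cutoff}


# Toy collection: "the" is very frequent, "cuneiform" is very rare.
tokens = ["the"] * 10 + ["index"] * 4 + ["term"] * 3 + ["cuneiform"]

# Cut-off values are arbitrary and picked only for this example.
selected = select_terms(tokens, lower_cutoff=2, upper_cutoff=5)
print(sorted(selected))
```

Both the very frequent "the" and the very rare "cuneiform" are discarded, leaving the medium-frequency terms as index candidates.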
Automatic Text Analysis
• Let f be the frequency of occurrence of various word types in
a given position of text
• Let r be their rank order, the order of their frequency of
occurrence
• Then a plot relating f and r yields a curve similar to a
hyperbola
[Figure: hyperbolic curve of frequency f against rank r]
• The curve, in fact, demonstrates Zipf’s law
Automatic Text Analysis

• Luhn used the work of Zipf to enable him to specify two cut-offs,
an upper and a lower, and thus exclude non-significant words
• The words exceeding the upper cut-off were considered to be
common and those below the lower cut-off rare, and therefore not
contributing significantly to the content of the document
Automatic Text Analysis
• That is, there is a relationship between the Zipfian curve and Luhn’s concept of
where the significant words are
• Luhn’s assumption
“…resolving power of significant words reach a peak at a rank order position
half way between the two cut-offs and from the peak fell off in either direction
reducing to almost zero at the cut off points”
• The resolving power of words is the ability of words to discriminate
content (i.e., document content)
• Words with low significance are at both tails of the distribution
• Therefore, Luhn suggested using the words in the middle of the frequency range
• These findings are the basis of a number of classical weighting schemes


Problems with Luhn’s Selection Mechanism
• Finding a threshold value for elimination of high and low frequency words

– Certain arbitrariness is involved in determining the cut-offs

– That is, there is no oracle which gives their values

– They have to be determined by trial and error

• The risk of loss of retrieval performance

– The removal of high frequency words may reduce recall

– The removal of low frequency words may bring losses in precision


Zipf’s Law in IR
• Is a commonly used model of the distribution of words in a
collection
• The law states that there is an inverse relation between the
frequency of a word f and its rank r (the highest-frequency term has
rank 1, the second-highest has rank 2, etc.)
• If the terms in a collection are ranked (r) by their frequency (f),
they roughly fit the relation r_term * f_term = C, which is
known as Zipf’s law
– f = C * 1/r
– In other words, the law states that the product of the frequency of use of
words and their rank order is approximately constant:
rank * frequency ≈ constant
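The relation rank * frequency ≈ constant can be checked with a short rank-frequency table. The token counts below are synthetic and were chosen so that the product is exactly 12 at every rank; real text only follows the law approximately. The function name `rank_frequency_table` is hypothetical.

```python
from collections import Counter


def rank_frequency_table(tokens):
    """Return rows of (rank, term, frequency, rank * frequency)."""
    counts = Counter(tokens)
    # Rank 1 is the most frequent term; ties are broken alphabetically.
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [
        (rank, term, freq, rank * freq)
        for rank, (term, freq) in enumerate(ordered, start=1)
    ]


# Synthetic counts constructed to satisfy f = C / r with C = 12.
tokens = ["the"] * 12 + ["of"] * 6 + ["and"] * 4 + ["index"] * 3
table = rank_frequency_table(tokens)
for row in table:
    print(row)
```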
•End
