
01 - Lect - Introd

Uploaded by

Mahmoud Nasser

Information Storage and Retrieval
CS418

Dr. Ebtsam AbdelHakam

Computer Science Dept.


Minia University
Course Goals
• To help you understand the fundamentals of search engines:
  ‣ How to crawl, index, and search documents
  ‣ How to evaluate and compare different search engines
  ‣ How to modify search engines for specific applications
• To provide broad coverage of the major issues in information retrieval
• To take a closer look at particular applications of information retrieval in industry
What is the difference between information storage and information retrieval?

• Information storage is how the information is organized and how the objects in a collection are represented.
• Information retrieval is how the objects stored in the collection are located.
• An Information Retrieval System (IRS) handles both parts, which makes locating information less complex.
Information Retrieval intro.

What is Information Retrieval (IR)?

Information retrieval is a field concerned with the structure, analysis, organization, storage, searching, and retrieval of information.
(Gerard Salton, IR pioneer, 1968)
Information retrieval intro.
• An information retrieval process begins when a user enters a query into the system.
• Queries are formal statements of information needs, for example search strings in web search engines.
• In information retrieval, a query does not uniquely identify a single object in the collection.
• Instead, several objects may match the query, perhaps with different degrees of relevance.
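
The idea that one query can match several objects to different degrees can be sketched as follows. This is a minimal illustration, not a method from the lecture: the three-document collection, the whitespace tokenizer, and the `overlap_score` measure (fraction of query terms found in a document) are all assumptions chosen for brevity.

```python
def tokenize(text):
    """Lowercase and split on whitespace (a deliberately crude tokenizer)."""
    return set(text.lower().split())


def overlap_score(query, document):
    """Fraction of distinct query terms that also appear in the document."""
    q = tokenize(query)
    d = tokenize(document)
    return len(q & d) / len(q) if q else 0.0


def search(query, collection):
    """Return (score, doc_id) pairs for documents sharing a term, best first."""
    scored = [(overlap_score(query, text), doc_id)
              for doc_id, text in collection.items()]
    matches = [(score, doc_id) for score, doc_id in scored if score > 0]
    return sorted(matches, key=lambda pair: (-pair[0], pair[1]))


# Hypothetical collection, invented for this example.
COLLECTION = {
    "d1": "FastMath randomize() throws NullPointerException on empty input",
    "d2": "Introduction to the FastMath library",
    "d3": "Cooking with seasonal vegetables",
}

results = search("NullPointerException randomize() FastMath", COLLECTION)
for score, doc_id in results:
    print(f"{doc_id}: {score:.2f}")
```

Two documents match the same query here, one fully and one partially, while the third is not returned at all. A database-style exact lookup would instead return one record or none.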
Information need vs. query

INFO NEED: I need to understand why I'm getting a NullPointerException when calling randomize() in the FastMath library.

QUERY: NullPointerException randomize() FastMath

[Figure: web documents that may be relevant]
Users and Information Needs

• What is the difference between an information need and a query?
• An information need may be only a vague idea of what the user is looking for.
• A query is the concrete search request: the terms and words the user enters to find the information they need.
• A user may start with a vague information need and turn it into precise search terms, formulating a query that finds information to suit that need.
Information Retrieval - Example

[Figure: Input Query -> IR System -> Related Documents]

Web search
Site-specific search
Product search
And answering everyday questions

http://news.cnet.com/8301-13579_3-57615135-37/siri-battles-google-now-in-new-contest/
Information Retrieval: General applications

Digital libraries
Media search
  • Image retrieval
  • Music retrieval
  • Video retrieval

Search engines
  • Site search
  • Desktop search
  • Mobile search
  • Social search
  • Web search
IR vs. databases
Structured vs. unstructured data
• Structured data tends to refer to information in "tables" in a database.

Employee   Manager   Salary
Smith      Jones     50000
Chang      Smith     60000
Ivy        Smith     50000

Structured data typically allows numerical range and exact match (for text) queries, e.g.,
Salary < 60000 AND Manager = Smith.
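
The example query can be evaluated directly against the table above. The sketch below is only an illustration in plain Python; the `select` helper and the dictionary representation of rows are assumptions for this example, and a real database would express the same condition in its own query language.

```python
# The three rows from the table above, one dictionary per row.
EMPLOYEES = [
    {"employee": "Smith", "manager": "Jones", "salary": 50000},
    {"employee": "Chang", "manager": "Smith", "salary": 60000},
    {"employee": "Ivy", "manager": "Smith", "salary": 50000},
]


def select(rows, predicate):
    """Return the rows for which the predicate is true (hypothetical helper)."""
    return [row for row in rows if predicate(row)]


# Salary < 60000 AND Manager = Smith
matches = select(
    EMPLOYEES,
    lambda row: row["salary"] < 60000 and row["manager"] == "Smith",
)
names = [row["employee"] for row in matches]
print(names)  # ['Ivy']
```

Every row either satisfies the condition or it does not. Chang is excluded because 60000 is not strictly less than 60000, and Smith is excluded because his manager is Jones.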
Databases vs. IR

                        Database                        IR
What we're retrieving   Structured data. Clear          Unstructured data. Free text
                        semantics based on a formal     with metadata. Videos, images,
                        model.                          music.
Queries we're posing    Unambiguous, formally           Imprecise queries.
                        defined queries.
Results we get          Exact. Always correct in a      Sometimes relevant, sometimes
                        formal sense.                   not.

Note: From a user perspective, the distinction may be seamless, e.g., asking Siri a question about nearby restaurants with good reviews.
Applications of IR
• Vertical search is a specialized form of web search where the domain of the search is restricted to a particular topic.
• Enterprise search involves finding the required information in the huge variety of computer files scattered across a corporate intranet.
  ‣ Web pages are certainly a part of that distributed information store, but most information will be found in sources such as email, reports, presentations, spreadsheets, and structured data in corporate databases.
• Desktop search is the personal version of enterprise search, where the information sources are the files stored on an individual computer, including email messages and web pages that have recently been browsed.
Tasks of IR

• Ad hoc search is search based on a user query. It is called ad hoc because the range of possible queries is huge and not pre-specified. It is not the only text-based task studied in information retrieval.
• Filtering or tracking involves detecting stories of interest based on a person's interests and providing an alert using email or some other mechanism.
• Classification or categorization uses a defined set of labels or classes (such as the categories listed in the Yahoo! Directory) and automatically assigns those labels to documents.
http://dir.yahoo.com/
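
The filtering task listed above differs from ad hoc search in that the interest profile stays fixed while new documents keep arriving. A minimal sketch, assuming a profile is just a set of keywords and a story matches when it contains at least `min_hits` of them; the profile, the stories, and all function names are hypothetical.

```python
def make_profile(*terms):
    """Build a standing interest profile as a set of lowercase keywords."""
    return {term.lower() for term in terms}


def matches_profile(profile, story, min_hits=1):
    """True if the story contains at least min_hits distinct profile keywords."""
    words = {word.strip(".,;:!?").lower() for word in story.split()}
    return len(profile & words) >= min_hits


def filter_stream(profile, stories, min_hits=1):
    """Return the stories that should trigger an alert, in arrival order."""
    return [s for s in stories if matches_profile(profile, s, min_hits)]


# Hypothetical profile and incoming stories.
PROFILE = make_profile("retrieval", "search", "indexing")
STREAM = [
    "New search engine improves indexing speed.",
    "Local team wins the championship.",
    "Advances in music retrieval announced.",
]

alerts = filter_stream(PROFILE, STREAM)
for story in alerts:
    print("ALERT:", story)
```

Raising `min_hits` makes the filter stricter: with `min_hits=2`, only the first story would produce an alert.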
Applications of IR

Table 1: Some dimensions of information retrieval


Basic Concepts of IR

• What is relevance?
Simple (and simplistic) definition: A relevant document contains the information that a person was looking for when they submitted a query to the search engine.

• Many factors influence a person's decision about what is relevant, e.g., context, novelty, style.
• These factors must be taken into account when designing algorithms for comparing text and ranking documents.
• Simply comparing the text of a query with the text of a document and looking for an exact match does not take these factors into account.
Basic Concepts of IR

• Retrieval models define a particular view of relevance based on some idea of what users want.
• Ranking algorithms used in search engines are based on retrieval models.
• Most models are based on statistical properties of text rather than deep linguistic analysis,
  ‣ i.e., counting simple text features such as words instead of parsing and analyzing the sentences.
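
To make "counting simple text features" concrete, the sketch below ranks documents using nothing but word counts. The slides do not name a specific weighting, so tf-idf (term frequency multiplied by the log of inverse document frequency) is used here as one common choice. The documents and function names are invented for the example.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace; no parsing or linguistic analysis."""
    return text.lower().split()


def tf_idf_scores(query, docs):
    """Score each document by the sum of tf * idf over the query terms."""
    tokenized = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each word occur?
    df = Counter()
    for tokens in tokenized.values():
        df.update(set(tokens))

    scores = {}
    for doc_id, tokens in tokenized.items():
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term in tf:
                score += tf[term] * math.log(n_docs / df[term])
        scores[doc_id] = score
    return scores


# Hypothetical documents.
DOCS = {
    "d1": "search engines rank documents",
    "d2": "search engines crawl the web and index documents for search",
    "d3": "databases store structured tables",
}

scores = tf_idf_scores("search index", DOCS)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

The rarer word "index" contributes more to a document's score than the more widespread word "search", which is the statistical intuition behind weightings of this kind.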
Challenges of IR

• Search evaluation is user-centered.
• Keyword queries are often poor descriptions of actual information needs.
• Interaction and context are important for understanding user intent.
• Query refinement techniques such as query expansion, query suggestion, and relevance feedback improve ranking.
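
Of the refinement techniques just listed, query expansion is the simplest to sketch: related terms are added to the user's query so that documents using different wording can still match. The synonym table below is a hypothetical stand-in. Real systems derive related terms from thesauri, query logs, or the collection itself.

```python
# Hypothetical synonym table, invented for illustration.
SYNONYMS = {
    "car": ["automobile", "vehicle"],
    "cheap": ["inexpensive", "affordable"],
}


def expand_query(query, synonyms=SYNONYMS):
    """Return the query terms plus their synonyms, in order, without duplicates."""
    expanded = []
    for term in query.lower().split():
        for candidate in [term] + synonyms.get(term, []):
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded


terms = expand_query("cheap car rental")
print(terms)
```

A document that mentions "affordable vehicle rental" shares only one word with the original query but three with the expanded one.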
Users and Information Needs

• User queries fall into three broad categories:

1. informational
2. navigational
3. transactional

• We now explain these categories; it should be clear that some queries will fall in more than one of these categories, while others will fall outside them.
Users and Information Needs

• Informational queries seek general information on a broad topic, such as Covid-19. There is typically not a single web page that contains all the information sought; indeed, users with informational queries typically try to assimilate information from multiple web pages.
• Navigational queries seek the website or home page of a single entity that the user has in mind, say Nase airlines. In such cases, the user's expectation is that the very first search result should be the home page of Nase airlines. The user is not interested in a plethora of documents containing the term Nase airlines.
• A transactional query is one that is a prelude to the user performing a transaction on the Web, such as purchasing a product, downloading a file or making a reservation. In such cases, the search engine should return results listing services
