Irs 3
CHAPTER 3
Query Processing and Operations

Syllabus
Query Languages: Keyword-based Querying, Pattern Matching, Structural Queries, Query Protocols; Query Operations: User relevance feedback; Multimedia IR models: Data Modeling.
Self-learning Topics: Proximity Queries and Wildcard Queries
Information Retrieval System (MU) (Query Proc. & Operations)

3.2 KEYWORD-BASED QUERYING

GQ. Write a short note on Keyword-based Querying.

It is the simplest and most widely used kind of IR query.
It requires the user to simply enter keyword combinations to look for similar documents online.
The majority of times, people look for documents using keywords.
A logical AND operator creates an implied connection between query keyword terms.
For example, when searching for "information retrieval," the retrieved result will be documents that contain both "information" and "retrieval".
Additionally, the majority of systems also retrieve documents that merely contain the words "information" or "retrieval" in them.
Before delivering the filtered query keywords to the IR engine, most systems preprocess the data by removing the frequent words (stopwords), such as a, the, of, and so on.
The order of these terms in the query is typically ignored by IR systems.
Keyword searches are supported by all retrieval models.

3.2.1 Types of Keyword-Based Querying

GQ. What are the different types of Keyword-based Querying?

(a) Single Word Queries
(b) Context Queries
    Phrase
    Proximity
(c) Boolean Queries
    OR, AND, BUT
(d) Natural Language
(e) Wildcard Queries

(a) Single Word Queries
A query is formulated by a word.
A document is formulated by long sequences of words.
A word is a sequence of letters surrounded by separators, for example, the word 'online'.
The division of the text into words is not arbitrary.
Word queries return a list of documents that contain at least one of the query words.
The level of similarity between the returned documents and the query determines their ranking.
Term frequency and inverse document frequency are commonly used to support ranking.

(b) Context Queries
Search words in a given context, that is, near other words.
Words that are near together suggest a higher possibility of relevance than words that are far apart.
Types : (1) Phrase (2) Proximity

Phrase
It is a sequence of single-word queries.
An occurrence of the phrase is a sequence of words, for example, "enhance retrieval".
The phrase is generally enclosed within double quotes.
Each retrieved document must contain at least one instance of the exact phrase.

Proximity
Proximity refers to search that accounts for how close within a record multiple items should be to each other.
It is a more relaxed version of the phrase query.
Here, a sequence of single words or phrases, and a maximum allowed distance between them, are specified.
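The difference between a phrase query and a proximity query can be sketched in a few lines of Python. This is only an illustrative sketch: the function names, the simple tokenizer, and the choice to measure distance as the difference between word positions are assumptions made here, not the behaviour of any particular IR system.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into words on non-alphanumeric separators."""
    return re.findall(r"[a-z0-9]+", text.lower())

def positions(tokens, word):
    """Return every index at which `word` occurs in `tokens`."""
    return [i for i, tok in enumerate(tokens) if tok == word]

def phrase_match(text, phrase):
    """True if the words of `phrase` occur adjacently and in the same order."""
    tokens, words = tokenize(text), tokenize(phrase)
    n = len(words)
    return any(tokens[i:i + n] == words for i in range(len(tokens) - n + 1))

def proximity_match(text, first, second, max_distance, ordered=True):
    """True if `first` and `second` occur at most `max_distance` positions apart.

    With ordered=True, `first` must come before `second` (assumed default).
    """
    tokens = tokenize(text)
    for i in positions(tokens, first):
        for j in positions(tokens, second):
            gap = j - i if ordered else abs(j - i)
            if 0 < gap <= max_distance:
                return True
    return False

doc = "New index structures enhance the power of retrieval systems"
print(phrase_match(doc, "enhance retrieval"))            # exact phrase is absent
print(proximity_match(doc, "enhance", "retrieval", 4))   # but the words are close
```

The phrase test fails on this sample text because the two words are not adjacent, while the proximity test succeeds because they lie four positions apart.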
(New Syll. w.e.f. academic year 22-23) (M7-87) Tech-Neo Publications
For example, "enhance retrieval" should occur within 4 words.
A match could be "...enhance the power of retrieval...".
The words or phrases may or may not be required to appear in the same order as in the query.

(c) Boolean Queries
Boolean queries give a syntax composed of atoms that retrieve documents, and of Boolean operators (AND, OR, NOT) which work on their operands.
Some IR systems allow keyword combinations using the formulations + and - as Boolean operators.
For example, the query translation AND (syntax OR syntactic) is shown as a syntax tree in Fig. 3.2.1.

[Figure: syntax tree with AND at the root; its operands are "translation" and an OR node over "syntax" and "syntactic"]
Fig. 3.2.1 : A query syntax tree

AND requires that both terms be found.
OR lets either term be found.
NOT means any record containing the second term will be excluded.
( ) means the Boolean operators can be nested using parentheses.
+ is equivalent to AND, requiring the term; the + should be placed directly in front of the search term.
- is equivalent to AND NOT and means to exclude the term; the - should be placed directly in front of the search term not wanted.
Complex Boolean queries can be built out of these operators and their combinations, and they are evaluated according to the classical rules of Boolean algebra.
No ranking is possible, because a document either satisfies such a query (is "relevant") or does not satisfy it (is "nonrelevant").
A document is retrieved for a Boolean query if the query is logically true as an exact match in the document.

Fuzzy Boolean
Retrieve documents appearing in some operands (the AND may require it to appear in more operands than the OR).

(d) Natural Language
It is a generalization of "fuzzy Boolean".
A query is an enumeration of words and context queries.
All the documents matching a portion of the user query are retrieved.
A few natural language search engines aim to understand the structure and meaning of queries written in natural language text, generally as a question or narrative.
The system tries to formulate answers for these queries from retrieved results.
Semantic models can provide support for this query type.

3.3 PATTERN MATCHING

GQ. Define Pattern Matching.
GQ. Write a short note on Pattern Matching.

Data retrieval: allows the retrieval of pieces of text that have some property (match a pattern).
A pattern is a set of syntactic features that must occur in a text segment.

3.3.1 Types of Pattern Based Querying

GQ. What are the different types of Pattern Matching based Querying?

(a) Words
Basic pattern
A string which must be a word in the text

(b) Prefixes
A string which must form the beginning of the text word
For example, 'inter' in words 'interactive', 'international', etc.
(c) Suffixes
A string which must form the termination of the text word
For example, 'dom' in words 'freedom', 'kingdom'

(d) Substrings
A string which can appear within a text word
For example, 'pal' in words 'palm', 'pals', 'principal', 'municipality'

(e) Ranges
Matches any word lying between a pair of strings in lexicographical order (alphabetical order)
For example, between 'held' and 'hold', retrieve words such as 'hoax' and 'hissing'

(f) Allowing errors
A word together with an error threshold
Retrieve all text words which are 'similar' to the given word
The pattern or text may have errors from typing, spelling, or from optical character recognition.
Models which can be used for information retrieval are :
    Edit distance
        the minimum number of character insertions, deletions, and replacements needed to make two strings equal
        for example, 'flower' and 'flo wer' (edit distance 1)
    Maximum allowed edit distance
        query specifies the maximum number of allowed errors for a word to match the pattern
        extended to search substrings and not only words

(g) Regular expressions
General pattern built up by simple strings and the following operators :
    union: if e1 and e2 are regular expressions, then (e1|e2) matches what e1 or e2 matches
    concatenation: if e1 and e2 are regular expressions, the occurrences of (e1 e2) are formed by the occurrences of e1 immediately followed by those of e2
    repetition: if e is a regular expression, then (e*) matches a sequence of zero or more contiguous occurrences of e
For example, 'pro(blem|tein)(s|ε)(0|1|2)*' matches 'problem2' and 'proteins'.

3.4 STRUCTURAL QUERIES

GQ. What are structural Queries? Give different structures.
GQ. Explain the following data structures giving suitable examples : (a) Fixed (b) Hypertext (c) Hierarchical

Allow the user to query the text based on its structure
Mixing contents and structure in queries
    Contents : words, phrases, or patterns
    Structural constraints : containment, proximity, or other restrictions on structural elements
Three main structures
    (a) Fixed structure
    (b) Hypertext structure
    (c) Hierarchical structure

3.4.1 Fixed Structure
Fixed structure for text retrieval, such as a form, is shown in Fig. 3.4.1.
For example : Mail archive
Each mail has a sender, a receiver, a date, a subject and a body field as a fixed structure.
Easy to search a mail based on a date, a receiver, a subject and so on
Other examples : Log file (Document : a fixed set of fields)
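A search over a fixed structure such as the mail archive of Section 3.4.1 can be sketched as a filter on named fields. The `Mail` class, the `search` helper, the substring matching rule, and the sample addresses are all hypothetical and are introduced here only for illustration.

```python
import datetime
from dataclasses import dataclass

@dataclass(frozen=True)
class Mail:
    """A document with a fixed set of fields (assumed layout)."""
    sender: str
    receiver: str
    date: datetime.date
    subject: str
    body: str

def search(archive, **criteria):
    """Return the mails whose fields satisfy every criterion.

    String criteria match as case-insensitive substrings; any other value
    (for example a date) must be equal to the field value.
    """
    def matches(mail):
        for field_name, wanted in criteria.items():
            value = getattr(mail, field_name)
            if isinstance(wanted, str):
                if wanted.lower() not in value.lower():
                    return False
            elif value != wanted:
                return False
        return True
    return [mail for mail in archive if matches(mail)]

archive = [
    Mail("alice@example.org", "bob@example.org", datetime.date(2023, 1, 5),
         "Query protocols", "Notes on Z39.50."),
    Mail("carol@example.org", "bob@example.org", datetime.date(2023, 1, 6),
         "Lunch", "Are you free at noon?"),
]
hits = search(archive, receiver="bob@example.org", subject="query")
print([m.subject for m in hits])
```

Because every document has the same fields, each criterion can be checked independently, which is what makes this kind of structure easy to search.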
[Figure: a form whose fields each hold text]
Fig. 3.4.1 : Form-like fixed structure

3.4.2 Hypertext Structure
A hypertext is a directed graph where nodes hold some text (text contents).
The links represent connections between nodes or between positions inside nodes (structural connectivity).
Search by content and structure
The user had to manually traverse the hypertext nodes following links to search what he wanted.
Fig. 3.4.2 represents a hypertext structure with nodes and links.

[Figure: nodes connected by directed links]
Fig. 3.4.2 : Hypertext structure

3.4.3 Hierarchical Structure
Intermediate level of flexibility
Lies between fixed structure and hypertext structure
Represents the recursive decomposition of the text
Fig. 3.4.3 represents a schematic view of the hierarchical structure.

Fig. 3.4.3 : Hierarchical structure

Fig. 3.4.4 gives an example of hierarchical structure with the page of a book, its semantic view and a parsed query to retrieve the figure.

[Fig. 3.4.4 : labels recovered from the figure: chapter, section, section, in]
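A query over a hierarchical structure combines a structural constraint (containment) with a content constraint (a word). The sketch below is a minimal illustration under assumed names: the `Node` class, the node kinds, and the sample book are hypothetical and do not reproduce the figures of this section.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """One element of the recursive decomposition of a text."""
    kind: str                      # e.g. "chapter", "section", "title", "figure"
    text: str = ""
    children: list = field(default_factory=list)

def descendants(node):
    """Yield every node below `node`, depth first."""
    for child in node.children:
        yield child
        yield from descendants(child)

def figures_in_sections_titled(root, word):
    """Return figures contained in sections whose title contains `word`."""
    results = []
    for node in [root, *descendants(root)]:
        if node.kind != "section":
            continue
        titles = [c for c in node.children if c.kind == "title"]
        if any(word.lower() in t.text.lower() for t in titles):
            results.extend(d for d in descendants(node) if d.kind == "figure")
    return results

book = Node("chapter", children=[
    Node("title", "Query Languages"),
    Node("section", children=[
        Node("title", "Introduction"),
        Node("figure", "Fig. A"),
    ]),
    Node("section", children=[
        Node("title", "Structural Queries"),
        Node("figure", "Fig. B"),
    ]),
])
print([f.text for f in figures_in_sections_titled(book, "structural")])
```

The query reads as "a figure in a section with a title containing the word", which mixes structure and content in the way described above.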
3.5 SAMPLE OF HIERARCHICAL MODELS

GQ. Discuss different structures with suitable examples.
GQ. Explain PAT in detail.

3.5.1 PAT Expressions
Built on the same index as the text index
Structure is presumptively indicated in the text by tags
Structure is defined in terms of initial and final tags
Region is defined by each pair of initial and final tags
The model allows for the areas of a region to overlap or nest.
PAT is a text searching system
Developed at University of Waterloo
PAT interprets text as a set of suffix strings
For example, indexing every word in this sentence yields the 12 strings :
    For example, indexing every word in this sentence yields the 12 strings
    example, indexing every word in this sentence yields the 12 strings
    indexing every word in this sentence yields the 12 strings
    every word in this sentence yields the 12 strings
    word in this sentence yields the 12 strings
    in this sentence yields the 12 strings
    this sentence yields the 12 strings
    sentence yields the 12 strings
    yields the 12 strings
    the 12 strings
    12 strings
    strings

3.5.2 Overlapped Lists
The model allows set union and the combination of regions.
The model allows for the areas of a region to overlap, but not to nest.
A 'followed by' operator adds the extra restriction that the first region must come before the second.
An 'n words' operator creates the region containing all of the text's sequences of n words.
It is not clear whether overlapping is good or not for capturing the structural properties.

3.5.3 Lists of References
Model makes the definition and querying of structured text uniform
The structure of the document is fixed and hierarchical
All possible regions are defined at indexing time
Overlap and nest are not allowed
All elements must be of the same type, e.g. only sections, or only paragraphs.
Answer to the query is seen as a list of 'references'
A reference is a pointer to a region of the database.

3.5.4 Proximal Nodes
This model tries to find a good compromise between expressiveness and efficiency.
It does not define a specific language, but a model in which it is shown that a number of useful operators can be included achieving good efficiency.
The structure of the document is fixed and hierarchical.
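The suffix-string view of text used by PAT in Section 3.5.1 can be illustrated with a short sketch. It only shows how the 12 strings arise and how a sorted list of them supports prefix lookup; the function names are assumptions, and the sorted list with binary search is a simple stand-in chosen here, not PAT's actual index structure.

```python
import bisect

def suffix_strings(text):
    """Return one suffix string per word start, in text order."""
    suffixes = []
    start = 0
    for word in text.split():
        # Search from the previous position so repeated words are handled.
        start = text.index(word, start)
        suffixes.append(text[start:])
        start += len(word)
    return suffixes

def find_prefix(sorted_suffixes, prefix):
    """Return the suffix strings that begin with `prefix` (binary search)."""
    lo = bisect.bisect_left(sorted_suffixes, prefix)
    matches = []
    for s in sorted_suffixes[lo:]:
        if not s.startswith(prefix):
            break
        matches.append(s)
    return matches

sentence = "For example, indexing every word in this sentence yields the 12 strings"
strings = suffix_strings(sentence)
print(len(strings))
print(find_prefix(sorted(strings), "th"))
```

Because every suffix starts at a word boundary, a prefix lookup on the sorted list finds every position in the text where a word, or a phrase, begins with the given characters.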
3.6 QUERY PROTOCOLS

GQ. Write a short note on Query Protocols.

(a) Z39.50
Approved by American National Standards Institute (ANSI) and National Information Standards Organization (NISO) in 1995
Can be implemented on any platform
Query bibliographical information using a standard interface between the client and the host database manager
With query language, the protocol also specifies a way in which client and server establish a session, communicate and exchange information, etc.
Z39.50 protocol is part of WAIS

Z39.50 Brief history
Work on the Z39.50 protocol began in the 1970s and led to successive versions in 1988, 1992, 1995 and 2003 :
    Z39.50-1988 (version 1)
    Z39.50-1992 (version 2)
    Z39.50-1995 (version 3)
    Z39.50-2003 (version 4)

[Figure: WWW client, WWW server acting as Z39.50 client, Z39.50 server, repository (digital library)]
Fig. 3.6.1 : Using Z39.50 over the WWW

(b) WAIS
Beginning in the 1990s
Network publishing protocol
Query databases through the Internet

(c) CCL
Common Command Language
NISO proposal based on Z39.50
Defines 19 commands
More popular in Europe
Based on the classical Boolean model

(d) CD-RDx
Compact Disk Read only Data exchange
Uses client server architecture on most platforms
Client is generic
Server is designed and provided by the CD-ROM publisher
Allows fixed length fields, images and audio
Supported by CIA, NASA and GSA

(e) SFQL
Structured Full-text Query Language
Based on SQL
Uses client server architecture
Adopted as a standard by aerospace community
Documents are rows in relational tables which are tagged using SGML
Answer format has header and message area

3.7 TRENDS AND RESEARCH ISSUES
Table 3.7.1 shows the different basic queries allowed in the different models.
Probabilistic and Bayesian Belief Network (BBN) models incorporate set operations.

Table 3.7.1 : Relationship between types of queries and models

Model                      Queries allowed
Boolean                    Words
Vector                     Words
Probabilistic              Words
Bayesian Belief Network    Words

3.8 QUERY LANGUAGE TAXONOMY
Fig. 3.8.1 represents the types of operations covered so far and how they can be structured.

[Figure: taxonomy with the labels Boolean queries, fuzzy Boolean, natural language, structured queries, basic queries, proximity, phrases, pattern matching, errors, words, substrings, regular expressions, prefixes, suffixes, extended patterns, keywords and context]
Fig. 3.8.1 : Query Language Taxonomy

3.9 QUERY OPERATIONS
It is difficult to formulate queries which are well designed for retrieval purposes.
Improving the initial query formulation through query expansion and term reweighting
Approaches based on :
    feedback information from the user
    information derived from the set of documents initially retrieved (called the local set of documents)
    global information derived from the document collection

3.10 USER RELEVANCE FEEDBACK

GQ. Define Relevance feedback model.
GQ. Give short notes for User Relevance Feedback. OR Give brief notes about user Relevance feedback method and how it is used in query expansion.
GQ. What are the two basic approaches in User Relevance Feedback for query processing?

User receives a list of searched documents and, after reviewing, marks the relevant documents.
A selection of key terms or expressions that are attached to the document and identified by the user as relevant

Definition : Relevance Feedback Model
After initial retrieval, results are presented, allowing the user to provide feedback on the relevance of one or more of the retrieved documents.
Use this feedback information to reformulate the query and produce new results based on the reformulated query. This allows a more interactive multi-pass process.

Two basic operations :
    Query expansion : addition of new terms from relevant documents (expand queries with the vector model)
    Term reweighting : modification of term weights based on the user relevance judgement

The usage of user relevance feedback to :
    (a) expand queries with the vector model
    (b) reweight query terms with the probabilistic model
    (c) reweight query terms with a variant of the probabilistic model
3.11 VECTOR MODEL

GQ. How do you calculate the term weighting in document and query in Vector Model?
GQ. Define term weight in the Vector Model.

Weight :
Let $k_i$ be a generic index term in the set $K = \{k_1, \ldots, k_t\}$.
A weight $w_{i,j} > 0$ is associated with each index term $k_i$ of a document $d_j$.
Document index term vector : a document $d_j$ is associated with an index term vector $\vec{d}_j$, and a query $q$ with a query vector $\vec{q}$.

3.11.1 Query Expansion and Term Reweighting for the Vector Model

GQ. What are the three classic and similar ways to calculate the modified query $q_m$?

Ideal case : the complete set $C_r$ of documents relevant to a given query $q$ is known, and the best query vector can be obtained from it.
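One classic way to compute a modified query from relevance judgements is the Standard Rocchio formulation, and the sketch below assumes that form. The constants `alpha`, `beta` and `gamma`, their default values, the dictionary representation of vectors, and the decision to drop non-positive weights are assumptions made for illustration.

```python
def rocchio(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Return a modified query vector as a dict of term -> weight.

    q_m = alpha * q + beta * mean(relevant docs) - gamma * mean(non-relevant docs)
    """
    terms = set(query)
    for doc in relevant + non_relevant:
        terms.update(doc)

    modified = {}
    for term in terms:
        pos = (sum(d.get(term, 0.0) for d in relevant) / len(relevant)
               if relevant else 0.0)
        neg = (sum(d.get(term, 0.0) for d in non_relevant) / len(non_relevant)
               if non_relevant else 0.0)
        weight = alpha * query.get(term, 0.0) + beta * pos - gamma * neg
        if weight > 0:  # assumption: terms with non-positive weight are dropped
            modified[term] = weight
    return modified

query = {"retrieval": 1.0}
relevant = [{"retrieval": 0.8, "feedback": 0.6}, {"retrieval": 0.4, "query": 0.2}]
non_relevant = [{"database": 0.9}]

q_m = rocchio(query, relevant, non_relevant, alpha=1.0, beta=0.5, gamma=0.25)
print(sorted(q_m.items()))
```

The modified query both expands (the terms "feedback" and "query" are added from the relevant documents) and reweights (the weight of "retrieval" increases), which are the two basic operations of relevance feedback.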
3.12 TERM REWEIGHTING FOR THE PROBABILISTIC MODEL

GQ. How do you calculate the term reweighting in Probabilistic Model?

Similarity : the correlation between the vectors $\vec{d}_j$ and $\vec{q}$ can be quantified as $sim(d_j, q)$.

The probabilities are estimated from the user's relevance judgements :

$$P(k_i \mid \bar{R}) = \frac{n_i - |D_{r,i}|}{N - |D_r|} \qquad (3.12.3)$$

$$P(k_i \mid R) = \frac{|D_{r,i}|}{|D_r|} \qquad (3.12.4)$$

where
    $D_r$ is the set of relevant documents according to the user judgement
    $D_{r,i}$ is the subset of $D_r$ composed of the documents that contain the term $k_i$
    $N$ is the total number of documents and $n_i$ is the number of documents that contain $k_i$

The similarity of $d_j$ to $q$ :

$$sim(d_j, q) = \sum_{i=1}^{t} w_{i,q}\, w_{i,j} \log\left[\frac{|D_{r,i}|}{|D_r| - |D_{r,i}|} \div \frac{n_i - |D_{r,i}|}{N - |D_r| - (n_i - |D_{r,i}|)}\right] \qquad (3.12.5)$$

Disadvantages
(1) Document term weights are not taken into account during the feedback loop.
(2) Weights of terms in the previous query formulations are also disregarded.
(3) No query expansion is used.
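The reweighting of Eq. (3.12.5) can be computed directly from four counts. The following is a minimal sketch: the function and parameter names are assumptions, the natural logarithm is assumed, and no smoothing is applied, so it only works when 0 < |D_{r,i}| < |D_r| and the term also occurs in some, but not all, of the remaining documents.

```python
import math

def feedback_term_weight(N, n_i, D_r, D_ri):
    """Log-odds weight of one term from relevance feedback counts.

    N    : total number of documents
    n_i  : number of documents containing the term
    D_r  : number of documents judged relevant by the user
    D_ri : number of relevant documents containing the term
    """
    relevant_odds = D_ri / (D_r - D_ri)
    non_relevant_odds = (n_i - D_ri) / (N - D_r - (n_i - D_ri))
    return math.log(relevant_odds / non_relevant_odds)

def similarity(query_weights, doc_weights, stats, N, D_r):
    """Sum the feedback weights of the terms shared by query and document.

    `stats` maps each term to the pair (n_i, D_ri).
    """
    total = 0.0
    for term, w_q in query_weights.items():
        w_d = doc_weights.get(term, 0)
        if w_q and w_d:
            n_i, D_ri = stats[term]
            total += w_q * w_d * feedback_term_weight(N, n_i, D_r, D_ri)
    return total

N, D_r = 1000, 10
stats = {"feedback": (50, 8), "query": (400, 5)}
print(feedback_term_weight(N, 50, D_r, 8))
print(similarity({"feedback": 1, "query": 1}, {"feedback": 1}, stats, N, D_r))
```

A term that appears in most relevant documents but in few others ("feedback" here) receives a much larger weight than a term that is common everywhere ("query").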
3.13 A VARIANT OF PROBABILISTIC TERM REWEIGHTING

The probabilistic term reweighting formula is modified to include the within-document frequency :

$$\bar{f}_{i,j} = K + (1 - K)\,\frac{f_{i,j}}{\max(f_{i,j})}$$

$\bar{f}_{i,j}$ is a normalized within-document frequency; the constants C and K should be adjusted according to the collection.

Feedback searches :

Fig. 3.16.2 : Complete conceptual structure of the type Business_Product Letter

GQ. Write a short note on data retrieval in Multimedia IR system.
(5) Data whose structure may not match, or only partially match, the structure prescribed by the data schema
(6) The system must typically extract some features from the multimedia objects.
Data retrieval

(2) It gives good results : this is observed experimentally and is due to the fact that the modified query vector does reflect a portion of the intended query semantics.
Ans. :
(1) It shields the user from the details of the query reformulation process, because all the user has to provide is a relevance judgement on documents.
(2) It breaks down the whole searching task into a sequence of small steps.

Ans. :
(1) Term Frequency (tf) : Term Frequency is the number of times a term i appears in document j ($tf_{ij}$).
(2) Document Frequency (df) : Number of documents a term i appears in ($df_i$).
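Both quantities can be counted in a few lines. This is a small sketch under simple assumptions: documents are plain strings split on whitespace and lowercased, and the function names and sample documents are made up for illustration.

```python
from collections import Counter

def term_frequencies(documents):
    """Return one Counter per document: tf[j][term] = occurrences of term in document j."""
    return [Counter(doc.lower().split()) for doc in documents]

def document_frequencies(tf_per_doc):
    """Return df[term] = number of documents in which the term appears."""
    df = Counter()
    for tf in tf_per_doc:
        df.update(tf.keys())   # each document counts at most once per term
    return df

documents = [
    "query expansion adds query terms",
    "term reweighting changes weights",
    "expansion and reweighting",
]
tf = term_frequencies(documents)
df = document_frequencies(tf)
print(tf[0]["query"], df["query"], df["expansion"])
```

Note that "query" occurs twice in the first document (term frequency 2) but in only one document overall (document frequency 1).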