Web Search
Our Discussion of Web Search
Begin with traditional information retrieval
Document models
Stemming and stop words
Web-specific issues
Crawlers and robots.txt
Scalability
Models for exploiting hyperlinks in ranking
Google and PageRank
Latent Semantic Indexing
Information Retrieval
Traditional information retrieval is basically text search
A corpus or body of text documents, e.g., in a document collection in a library or on a CD
Documents are generally high-quality and designed to convey information
Documents are assumed to have no structure beyond words
Searches are generally based on meaningful phrases, perhaps including predicates over categories, dates, etc.
The goal is to find the document(s) that best match the search phrase, according to a search model
Assumptions are typically different from the Web: quality text, limited-size corpus, no hyperlinks
Motivation for Information Retrieval
Information Retrieval (IR) is about:
Representation
Storage
Organization of, and access to, “information items”
Focus is on the user’s “information need” rather than a precise query:
“March Madness” – Find information on college basketball teams which: (1) are maintained by a US university and (2) participate in the NCAA tournament
Emphasis is on the retrieval of information (not data)
Data vs. Information Retrieval
Data retrieval, analogous to database querying: which docs contain a set of keywords?
Well-defined, precise logical semantics
A single erroneous object implies failure!
Information retrieval:
Information about a subject or topic
Semantics is frequently loose; we want approximate matches
Small errors are tolerated (and in fact inevitable)
IR system:
Interpret contents of information items
Generate a ranking which reflects relevance
Notion of relevance is most important – needs a model
Basic Model
[Diagram: the user’s information need is expressed as a query; docs are represented by index terms; matching the query against the index terms produces a ranking.]
Information Retrieval as a Field
IR addressed many issues in the last 20 years:
Classification and categorization of documents
Systems and languages for searching
User interfaces and visualization of results
Area was seen as of narrow interest – libraries, mainly
Sea-change event – the advent of the web:
Universal “library”
Free (low cost) universal access
No central editorial board
Many problems in finding information:
IR seen as key to finding the solutions!
The Full Info Retrieval Process
[Diagram: a crawler gathers text; text operations and indexing build an inverted index; the user’s interest is expressed as a query; searching accesses the index to retrieve candidate docs; ranking orders them; the browser/UI presents the ranked docs, and user feedback flows back into query operations.]
Terminology
IR systems usually adopt index terms to process queries
Index term:
a keyword or group of selected words
any word (more general)
Stemming might be used:
connect: connecting, connection, connections
An inverted index is built for the chosen index terms
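As a rough illustration of the ideas on this slide, here is a minimal sketch (in Python, over a made-up toy corpus) of building an inverted index, with a tiny stop-word list and a crude suffix stripper standing in for a real stemmer such as Porter's:

from collections import defaultdict

STOP_WORDS = {"a", "an", "and", "of", "the", "to"}   # tiny illustrative list

def crude_stem(word):
    # Stand-in for a real stemmer: strip a few common suffixes.
    for suffix in ("ions", "ion", "ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def build_inverted_index(docs):
    # Map each index term to the set of ids of documents containing it.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            if word not in STOP_WORDS:
                index[crude_stem(word)].add(doc_id)
    return index

docs = {1: "connecting networks", 2: "network connections and routing"}
print(build_inverted_index(docs)["connect"])   # {1, 2}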
What’s a Meaningful Result?
Matching at index term level is quite imprecise
Users are frequently dissatisfied
One problem: users are generally poor at posing queries
Frequent dissatisfaction of Web users (who often give single-keyword queries)
Issue of deciding relevance is critical for IR systems: ranking
Rankings
A ranking is an ordering of the documents retrieved that (hopefully) reflects the relevance of the documents to the user query
A ranking is based on fundamental premises regarding the notion of relevance, such as:
common sets of index terms
sharing of weighted terms
likelihood of relevance
Each set of premises leads to a distinct IR model
Types of IR Models
[Taxonomy: the user task is either Retrieval (ad hoc or filtering) or Browsing.]
Classic Models: boolean, vector, probabilistic
Set Theoretic: fuzzy, extended Boolean
Algebraic: generalized vector, latent semantic indexing, neural networks
Probabilistic: inference network, belief network
Structured Models: non-overlapping lists, proximal nodes
Browsing: flat, structure guided, hypertext
Classic IR Models – Basic Concepts
Each document is represented by a set of representative keywords or index terms
An index term is a document word useful for remembering the document’s main themes
Traditionally, index terms were nouns, because nouns have meaning by themselves
However, search engines assume that all words are index terms (full text representation)
Classic IR Models – Ranking
Not all terms are equally useful for representing the document contents: less frequent terms allow identifying a narrower set of documents
The importance of the index terms is represented by weights associated with them
Let
ki be an index term
dj be a document
wij be a weight associated with (ki, dj)
The weight wij quantifies the importance of the index term for describing the document contents
Classic IR Models – Notation
ki is an index term (keyword)
dj is a document
t is the total number of index terms
K = (k1, k2, …, kt) is the set of all index terms
wij >= 0 is a weight associated with (ki, dj)
wij = 0 indicates that the term does not appear in the doc
vec(dj) = (w1j, w2j, …, wtj) is the weighted vector associated with the document dj
gi(vec(dj)) = wij is a function which returns the weight associated with the pair (ki, dj)
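A direct, minimal rendering of this notation as Python data (the particular weights below are placeholders, not taken from the slides):

# t = 3 index terms; a document is its t-dimensional weight vector.
K = ("k1", "k2", "k3")
d_j = (0.5, 0.0, 1.2)   # vec(dj) = (w1j, w2j, ..., wtj); wij = 0 means ki is absent

def g(i, doc_vec):
    # gi(vec(dj)) = wij, the weight of index term ki in dj (i is 1-based, as in the notation).
    return doc_vec[i - 1]

print(g(3, d_j))        # 1.2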
Boolean Model
Simple model based on set theory
Queries specified as boolean expressions
precise semantics
neat formalism
q = ka ∧ (kb ∨ ¬kc)
Terms are either present or absent. Thus, wij ∈ {0,1}
An example query:
q = ka ∧ (kb ∨ ¬kc)
Disjunctive normal form: vec(qdnf) = (1,1,1) ∨ (1,1,0) ∨ (1,0,0)
Conjunctive component: vec(qcc) = (1,1,0)
Boolean Model for Similarity
q = ka ∧ (kb ∨ ¬kc)
[Venn diagram over the document sets Ka, Kb, Kc: the region covered by q corresponds to the conjunctive components (1,1,1), (1,1,0), and (1,0,0).]
sim(q,dj) = 1 if ∃ vec(qcc) such that (vec(qcc) ∈ vec(qdnf)) ∧ (∀ ki, gi(vec(dj)) = gi(vec(qcc))), and 0 otherwise
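A minimal sketch of this matching rule for the example query, assuming binary document weights over the terms (ka, kb, kc): a document matches exactly when its binary term vector equals one of the conjunctive components of the query’s DNF.

# DNF of q = ka AND (kb OR NOT kc), as binary weight triples over (ka, kb, kc).
Q_DNF = {(1, 1, 1), (1, 1, 0), (1, 0, 0)}

def sim(doc_vec, q_dnf=Q_DNF):
    # sim(q, dj) = 1 if some conjunctive component agrees with dj on every term, else 0.
    return 1 if tuple(doc_vec) in q_dnf else 0

print(sim((1, 1, 0)))   # 1: ka and kb present, kc absent
print(sim((0, 1, 1)))   # 0: ka missing, so the document is not retrieved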
Drawbacks of Boolean Model
Retrieval based on binary decision criteria with no notion of partial matching
No ranking of the documents is provided (absence of a grading scale)
Information need has to be translated into a Boolean expression, which most users find awkward
The Boolean queries formulated by the users are most often too simplistic
As a consequence, the Boolean model frequently returns either too few or too many documents in response to a user query
Vector Model
A refinement of the boolean model, which focused strictly on exact matches
Non-binary weights provide consideration for partial matches
These term weights are used to compute a degree of similarity between a query and each document
Ranked set of documents provides for better matching
Vector Model
Define:
wij > 0 whenever ki ∈ dj
wiq >= 0 associated with the pair (ki,q)
vec(dj) = (w1j, w2j, ..., wtj)
vec(q) = (w1q, w2q, ..., wtq)
With each term ki, associate a unit vector vec(i)
The unit vectors vec(i) and vec(j) are assumed to be orthonormal (i.e., index terms are assumed to occur independently within the documents)
The t unit vectors vec(i) form an orthonormal basis for a t-dimensional space
In this space, queries and documents are represented as weighted vectors
Vector Model
[Diagram: vec(dj) and vec(q) as vectors in term space, with θ the angle between them.]
Sim(q,dj) = cos(θ) = [vec(dj) · vec(q)] / (|dj| * |q|) = [Σi wij * wiq] / (|dj| * |q|)
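A minimal sketch of this cosine computation in plain Python (treating a zero-length vector as having similarity 0):

import math

def cosine_sim(d, q):
    # Sim(q, dj) = (vec(dj) . vec(q)) / (|dj| * |q|)
    dot = sum(wd * wq for wd, wq in zip(d, q))
    norm_d = math.sqrt(sum(w * w for w in d))
    norm_q = math.sqrt(sum(w * w for w in q))
    return dot / (norm_d * norm_q) if norm_d and norm_q else 0.0

print(round(cosine_sim([1, 0, 1], [1, 1, 1]), 3))   # 0.816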
Weights in the Vector Model
Sim(q,dj) = [Σi wij * wiq] / (|dj| * |q|)
How do we compute the weights wij and wiq?
A good weight must take into account two effects:
quantification of intra-document contents (similarity)
tf factor, the term frequency within a document
quantification of inter-document separation (dissimilarity)
idf factor, the inverse document frequency
wij = tf(i,j) * idf(i)
TF and IDF Factors
Let:
N be the total number of docs in the collection
ni be the number of docs which contain ki
freq(i,j) be the raw frequency of ki within dj
Then the normalized tf factor and the idf factor are: tf(i,j) = freq(i,j) / max_l freq(l,j), and idf(i) = log(N / ni)
Vector Model Example I
[Diagram: documents d1 through d7 plotted in the space spanned by index terms k1, k2, k3.]
Binary document weights, query q = (1,1,1); the last column is the score q · dj:

      k1  k2  k3   q · dj
d1     1   0   1     2
d2     1   0   0     1
d3     0   1   1     2
d4     1   0   0     1
d5     1   1   1     3
d6     1   1   0     2
d7     0   1   0     1
q      1   1   1
Vector Model Example II
[Diagram: documents d1 through d7 in the k1, k2, k3 term space.]
Binary document weights, weighted query q = (1,2,3); the last column is the score q · dj:

      k1  k2  k3   q · dj
d1     1   0   1     4
d2     1   0   0     1
d3     0   1   1     5
d4     1   0   0     1
d5     1   1   1     6
d6     1   1   0     3
d7     0   1   0     2
q      1   2   3
Vector Model Example III
[Diagram: documents d1 through d7 in the k1, k2, k3 term space.]
Weighted document vectors, weighted query q = (1,2,3); the last column is the score q · dj:

      k1  k2  k3   q · dj
d1     2   0   1     5
d2     1   0   0     1
d3     0   1   3    11
d4     2   0   0     2
d5     1   2   4    17
d6     1   2   0     5
d7     0   5   0    10
q      1   2   3
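The q · dj column in these three examples is just the unnormalized dot product of the query and document vectors. A quick sketch reproducing the scores of Example III:

docs = {"d1": (2, 0, 1), "d2": (1, 0, 0), "d3": (0, 1, 3), "d4": (2, 0, 0),
        "d5": (1, 2, 4), "d6": (1, 2, 0), "d7": (0, 5, 0)}
q = (1, 2, 3)

# Score each document by q . dj and rank in decreasing order (no length normalization here).
scores = {name: sum(w * wq for w, wq in zip(vec, q)) for name, vec in docs.items()}
print(sorted(scores.items(), key=lambda kv: -kv[1]))
# [('d5', 17), ('d3', 11), ('d7', 10), ('d1', 5), ('d6', 5), ('d4', 2), ('d2', 1)]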
Vector Model, Summarized
The best term-weighting schemes use tf-idf weights:
wij = tf(i,j) * log(N / ni)
For the query term weights, a suggestion is
wiq = (0.5 + [0.5 * freq(i,q) / max_l freq(l,q)]) * log(N / ni)
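A minimal sketch of these two weighting formulas, assuming the raw frequencies and document counts are already available:

import math

def doc_weight(freq_ij, max_freq_j, N, n_i):
    # wij = tf(i,j) * idf(i) = (freq(i,j) / max_l freq(l,j)) * log(N / ni)
    return (freq_ij / max_freq_j) * math.log(N / n_i)

def query_weight(freq_iq, max_freq_q, N, n_i):
    # wiq = (0.5 + 0.5 * freq(i,q) / max_l freq(l,q)) * log(N / ni)
    return (0.5 + 0.5 * freq_iq / max_freq_q) * math.log(N / n_i)

# E.g., a term occurring 3 times in a doc whose most frequent term occurs 10 times,
# and appearing in 100 of 10,000 documents:
print(round(doc_weight(3, 10, 10_000, 100), 3))    # 1.382
print(round(query_weight(1, 1, 10_000, 100), 3))   # 4.605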
Comparison of Classic Models
Boolean model does not provide for partial matches and is considered to be the weakest classic model
Some experiments indicate that the vector model outperforms the third alternative, the probabilistic model, in general
Recent IR research has focused on improving probabilistic models – but these haven’t made their way to Web search
Generally we use a variation of the vector model in most text search systems
Next Time: The Web
… And in particular, PageRank!