
UNIT-V

DOCUMENT TEXT MINING

Information filtering; organization and relevance feedback – Text Mining – Text classification
and clustering – Categorization algorithms: naive Bayes; decision trees; and nearest neighbor –
Clustering algorithms: agglomerative clustering; k-means; expectation maximization (EM).

5.1 INFORMATION FILTERING; ORGANIZATION AND RELEVANCE FEEDBACK

PART-A

Differentiate between information filtering and information retrieval. (Nov/Dec'17)

What are the characteristics of information filtering? (Nov/Dec'16)

• Relevance feedback: user feedback on relevance of docs in initial set of results
User issues a (short, simple) query
The user marks some results as relevant or non-relevant.

The system computes a better representation of the information need based on
feedback.
Relevance feedback can go through one or more iterations.
We will use ad hoc retrieval to refer to regular retrieval without relevance feedback.
Relevance Feedback Example:

Results for Initial Query

Results after Relevance feedback

The process of query modification is commonly referred to as:

Relevance feedback, when the user provides information on relevant documents
to a query, or

Query expansion, when information related to the query is used to expand it.

Two basic approaches of feedback methods:

Explicit feedback, in which the information for query reformulation is provided
directly by the users.

In an explicit relevance feedback cycle, the feedback information is provided
directly by the users. However, collecting feedback information is expensive and time
consuming.

On the Web, user clicks on search results constitute a new source of feedback
information. A click indicates a document that is of interest to the user in the context of the
current query.

Explicit Feedback

Implicit feedback, in which the information for query reformulation is implicitly
derived by the system.

There are two basic approaches for compiling implicit feedback information:

local analysis, which derives the feedback information from the top ranked documents in
the result set

global analysis, which derives the feedback information from external sources such as a
thesaurus

Implicit Feedback

Centroid

▪ The centroid is the center of mass of a set of points


▪ Recall that we represent documents as points in a high-dimensional space
Centroid:

    \vec{\mu}(C) = \frac{1}{|C|} \sum_{\vec{d} \in C} \vec{d}

where C is a set of documents and \vec{d} is the vector of document d.

The Rocchio algorithm uses the vector space model to pick a relevance feedback
query
Rocchio seeks the query \vec{q}_{opt} that maximizes the separation between relevant and nonrelevant documents:

    \vec{q}_{opt} = \arg\max_{\vec{q}} \left[ \cos(\vec{q}, \vec{\mu}(C_r)) - \cos(\vec{q}, \vec{\mu}(C_{nr})) \right]
▪ Tries to separate docs marked relevant and non-relevant

Theoretical Best Query:

    \vec{q}_{opt} = \frac{1}{|C_r|} \sum_{\vec{d}_j \in C_r} \vec{d}_j - \frac{1}{|C_{nr}|} \sum_{\vec{d}_j \in C_{nr}} \vec{d}_j

Rocchio 1971 Algorithm (SMART):

    \vec{q}_m = \alpha \vec{q}_0 + \beta \frac{1}{|D_r|} \sum_{\vec{d}_j \in D_r} \vec{d}_j - \gamma \frac{1}{|D_{nr}|} \sum_{\vec{d}_j \in D_{nr}} \vec{d}_j

• Dr = set of known relevant doc vectors


• Dnr = set of known irrelevant doc vectors
Different from Cr and Cnr
• qm = modified query vector; q0 = original query vector; α,β,γ: weights (hand-
chosen or set empirically)
• New query moves toward relevant documents and away from irrelevant
documents
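
A minimal Python sketch of the Rocchio update described above, assuming documents and the query are already represented as term-weight dictionaries; the values of alpha, beta, gamma and the example vectors are illustrative, not prescribed:

# Rocchio relevance-feedback sketch: q_m = alpha*q0 + beta*centroid(Dr) - gamma*centroid(Dnr)
from collections import defaultdict

def rocchio(q0, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Return the modified query vector given judged relevant/nonrelevant documents."""
    qm = defaultdict(float)
    for term, w in q0.items():
        qm[term] += alpha * w
    for docs, sign, weight in ((relevant, +1, beta), (nonrelevant, -1, gamma)):
        if not docs:
            continue
        for doc in docs:
            for term, w in doc.items():
                qm[term] += sign * weight * w / len(docs)
    # Negative term weights are usually clipped to zero.
    return {t: w for t, w in qm.items() if w > 0}

# Example: query "cheap cpu", one relevant and one nonrelevant judged document.
q0 = {"cheap": 1.0, "cpu": 1.0}
dr = [{"cheap": 0.5, "cpu": 0.8, "ryzen": 0.6}]
dnr = [{"cheap": 0.9, "flight": 0.7}]
print(rocchio(q0, dr, dnr))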

Relevance feedback on initial query

• We can modify the query based on relevance feedback and apply standard vector space
model.
• Use only the docs that were marked.
• Relevance feedback can improve recall and precision
• Relevance feedback is most useful for increasing recall in situations where recall is
important
Users can be expected to review results and to take time to iterate

Positive feedback is more valuable than negative feedback

Query Expansion

• In relevance feedback, users give additional input (relevant/non-relevant) on documents,
which is used to reweight terms in the documents
• In query expansion, users give additional input (good/bad search term) on words or
phrases
• For each term, t, in a query, expand the query with synonyms and related words of t from
the thesaurus
feline → feline cat
• May weight added terms less than original query terms.
• Generally increases recall

• Widely used in many science/engineering fields
• May significantly decrease precision, particularly with ambiguous terms.
“interest rate” → “interest rate fascinate evaluate”
• There is a high cost of manually producing a thesaurus
And for updating it for scientific changes
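
A small Python sketch of thesaurus-based query expansion as described above; the thesaurus entries and the 0.5 down-weight for added terms are made-up illustrations:

# Thesaurus-based query expansion sketch; synonyms and weights are illustrative.
THESAURUS = {"feline": ["cat"], "interest": ["fascinate"], "rate": ["evaluate"]}

def expand_query(terms, weight_original=1.0, weight_added=0.5):
    """Expand each query term with thesaurus synonyms, weighting added terms lower."""
    expanded = {t: weight_original for t in terms}
    for t in terms:
        for syn in THESAURUS.get(t, []):
            expanded.setdefault(syn, weight_added)
    return expanded

print(expand_query(["feline"]))            # {'feline': 1.0, 'cat': 0.5}
print(expand_query(["interest", "rate"]))  # illustrates the ambiguity problem noted above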

5.2 TEXT MINING

Another way to view text data mining is as a process of exploratory data analysis that
leads to heretofore unknown information, or to answers for questions for which the
answer is not currently known.

Text Mining Methods

• Information Retrieval
– Indexing and retrieval of textual documents

• Information Extraction
– Extraction of partial knowledge in the text

• Web Mining
– Indexing and retrieval of textual documents and extraction of partial knowledge
using the web

• Clustering
– Generating collections of similar text documents

Text Mining Process

• Text preprocessing

Syntactic/Semantic text analysis

• Part Of Speech (pos) tagging


• Find the corresponding pos for each word
e.g., John (noun) gave (verb) the (det) ball (noun)
• Word sense disambiguation
• Context based or proximity based
• Very accurate
• Parsing
• Generates a parse tree (graph) for each sentence
• Each sentence is a stand-alone graph

Features Generation
Bag of words
• Text document is represented by the words it contains (and their occurrences)
– e.g., “Lord of the rings” → {“the”, “Lord”, “rings”, “of”}
– Highly efficient
– Makes learning far simpler and easier
– Order of words is not that important for certain applications
• Stemming: identifies a word by its root
– Reduce dimensionality
– e.g., flying, flew → fly
– Use Porter Algorithm
• Stop words: The most common words are unlikely to help text mining
– e.g., “the”, “a”, “an”, “you” …
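
A compact preprocessing sketch combining bag-of-words, stop-word removal, and stemming; a crude suffix stripper stands in for the full Porter algorithm, and the stop-word list is abbreviated:

import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "you", "of", "and", "to", "in"}  # abbreviated list

def crude_stem(word):
    """Very rough stand-in for Porter stemming: strip a few common suffixes."""
    for suffix in ("ing", "ed", "ies", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def bag_of_words(text):
    """Tokenize, lowercase, drop stop words, stem, and count occurrences."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(crude_stem(t) for t in tokens if t not in STOP_WORDS)

print(bag_of_words("The Lord of the Rings"))  # Counter({'lord': 1, 'ring': 1})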
Features Selection
Simple counting
Statistics
• Reduce dimensionality
• Learners have difficulty addressing tasks with high dimensionality
• Irrelevant features
• Not all features help!
• e.g., the existence of a noun in a news article is unlikely to help classify it as “politics” or
“sport”
• Use term weighting (e.g., tf-idf)

Text/Data Mining
Classification- Supervised learning
Clustering- Unsupervised learning
• Given: a collection of labeled records (training set)
– Each record contains a set of features (attributes), and the true class (label)
• Find: a model for the class as a function of the values of the features
• Goal: previously unseen records should be assigned a class as accurately as possible
– A test set is used to determine the accuracy of the model. Usually, the given data
set is divided into training and test sets, with training set used to build the model
and test set used to validate it
• Analyzing results
Correct classification: The known label of test sample is identical with the class result
from the classification model
Accuracy ratio: the percentage of test set samples that are correctly classified by the
model
A distance measure between classes can be used
– e.g., classifying “football” document as a “basketball” document is not as bad as
classifying it as “crime”.
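
A minimal sketch of the train/test split and accuracy ratio described above; the split fraction and the classifier passed in are placeholders:

import random

def train_test_split(records, test_fraction=0.3, seed=0):
    """Shuffle labeled records and hold out a fraction as the test set."""
    records = list(records)
    random.Random(seed).shuffle(records)
    cut = int(len(records) * (1 - test_fraction))
    return records[:cut], records[cut:]

def accuracy(classifier, test_set):
    """Accuracy ratio: fraction of test samples whose predicted class equals the true label."""
    correct = sum(1 for features, label in test_set if classifier(features) == label)
    return correct / len(test_set)

# Usage: train, test = train_test_split(labeled_records); print(accuracy(my_classifier, test))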

Text Mining Application

Email: Spam filtering


News Feeds: Discover what is interesting
Medical: Identify relationships and link information from different medical fields
Marketing: Discover distinct groups of potential buyers and make suggestions
for other products
Industry: Identifying groups of competitors' web pages
Job Seeking: Identify parameters in searching for jobs
5.3 TEXT CLASSIFICATION AND CLUSTERING

Text classification and Naive Bayes

PART-A

What are the desirable properties of a clustering algorithm? (Nov/Dec'16, '17)

A class is a more general subject area like China or coffee. Such more
general classes are usually referred to as topics, and the classification task is then
called text classification, text categorization, topic classification, or topic spotting.
In text classification, we are given a description d ∈ X of a document,
where X is the document space; and a fixed set of classes C = {c1, c2, . . . , cJ}. Classes
are also called categories or labels. Typically, the document space X is some type
of high-dimensional space, and the classes are human-defined for the needs of an
application, as in the examples China and documents that talk about

multicore computer chips above. We are given a training set D of labeled documents
(d, c), where (d, c) ∈ X × C. For example:
(d, c) = (Beijing joins the World Trade Organization, China)
for the one-sentence document Beijing joins the World Trade Organization and the
class (or label) China.
Using a learning method or learning algorithm, we then wish to learn a classifier
or classification function γ that maps documents to classes:
γ:X→C
This type of learning is called supervised learning because a supervisor (the human
who defines the classes and labels training documents) serves as a teacher directing
the learning process. We denote the supervised learning method by Γ and write Γ(D)
= γ. The learning method Γ takes the training
set D as input and returns the learned classification function γ.

5.4 TEXT CATEGORIZATION

PART-B
Explain in detail the multiple-Bernoulli and multinomial models. (Nov/Dec'17)
Discuss in detail the working of naïve Bayes classification and its limitations in web search.
(Nov/Dec'16)
State Bayes' theorem. Explain naïve Bayes classification with an example. (Apr/May'17)

Assign labels to each document or web-page:

• Labels are most often topics such as Yahoo-categories


▪ e.g., "finance," "sports," "news>world>asia>business"

• Labels may be genres


▪ e.g., "editorials" "movie-reviews" "news“

• Labels may be opinion


▪ e.g., “like”, “hate”, “neutral”

• Labels may be domain-specific binary


▪ e.g., "interesting-to-me" : "not-interesting-to-me”
▪ e.g., “spam” : “not-spam”
▪ e.g., “contains adult language” :“doesn’t”

• Given:
o A description of an instance, x ∈ X, where X is the instance language or instance
space.
▪ Issue: how to represent text documents.
o A fixed set of categories:
C = {c1, c2, …, cn}
• Determine:
o The category of x: c(x) ∈ C, where c(x) is a categorization function whose domain
is X and whose range is C.

▪ We want to know how to build categorization functions ("classifiers").

5.4.1 Naive Bayes text classification

The first supervised learning method we introduce is the multinomial Naïve
Bayes or multinomial NB model, a probabilistic learning method. The probability
of a document d being in class c is computed as

    P(c | d) \propto P(c) \prod_{1 \le k \le n_d} P(t_k | c)

where P(t_k | c) is the conditional probability of term t_k occurring in a document of
class c. We interpret P(t_k | c) as a measure of how much evidence t_k contributes that c is
the correct class.
P(c) is the prior probability of a document occurring in class c. If a document's
terms do not provide clear evidence for one class versus another,
we choose the one that has a higher prior probability. (t_1, t_2, . . . , t_{n_d}) are the tokens
in d that are part of the vocabulary we use for classification and n_d is the number
of such tokens in d.

Naive Bayes algorithm (multinomial model): Training and testing

TRAINMULTINOMIALNB(C, D)
  V ← EXTRACTVOCABULARY(D)
  N ← COUNTDOCS(D)
  for each c ∈ C
    do Nc ← COUNTDOCSINCLASS(D, c)
       prior[c] ← Nc / N
       textc ← CONCATENATETEXTOFALLDOCSINCLASS(D, c)
       for each t ∈ V
         do Tct ← COUNTTOKENSOFTERM(textc, t)
       for each t ∈ V
         do condprob[t][c] ← (Tct + 1) / (Σt′ (Tct′ + 1))
  return V, prior, condprob

APPLYMULTINOMIALNB(C, V, prior, condprob, d)
  W ← EXTRACTTOKENSFROMDOC(V, d)
  for each c ∈ C
    do score[c] ← log prior[c]
       for each t ∈ W
         do score[c] += log condprob[t][c]
  return arg maxc∈C score[c]
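
The listing above can be turned into a short runnable Python sketch. It assumes documents arrive as token lists and uses add-one smoothing as in the pseudocode; the toy training set is the standard China/not-China example that the Bernoulli worked example below also appears to use:

import math
from collections import Counter, defaultdict

def train_multinomial_nb(classes, docs):
    """docs: list of (tokens, class). Returns vocabulary, priors, and conditional probabilities."""
    vocab = {t for tokens, _ in docs for t in tokens}
    n = len(docs)
    prior, condprob = {}, defaultdict(dict)
    for c in classes:
        class_docs = [tokens for tokens, label in docs if label == c]
        prior[c] = len(class_docs) / n
        counts = Counter(t for tokens in class_docs for t in tokens)
        total = sum(counts.values())
        for t in vocab:
            condprob[t][c] = (counts[t] + 1) / (total + len(vocab))  # add-one smoothing
    return vocab, prior, condprob

def apply_multinomial_nb(classes, vocab, prior, condprob, tokens):
    """Score each class in log space and return the argmax class."""
    score = {}
    for c in classes:
        score[c] = math.log(prior[c])
        for t in tokens:
            if t in vocab:
                score[c] += math.log(condprob[t][c])
    return max(score, key=score.get)

# Toy example: three China documents, one not-China document.
docs = [(["Chinese", "Beijing", "Chinese"], "China"),
        (["Chinese", "Chinese", "Shanghai"], "China"),
        (["Chinese", "Macao"], "China"),
        (["Tokyo", "Japan", "Chinese"], "not-China")]
model = train_multinomial_nb(["China", "not-China"], docs)
print(apply_multinomial_nb(["China", "not-China"], *model,
                           ["Chinese", "Chinese", "Chinese", "Tokyo", "Japan"]))  # -> China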

5.4.2 The Bernoulli model

There are two different ways we can set up an NB classifier. The model we
introduced in the previous section is the multinomial model. It generates one term
from the vocabulary in each position of the document.

An alternative to the multinomial model is the multivariate Bernoulli model or
Bernoulli model. It is equivalent to the binary independence model, which
generates an indicator for each term of the vocabulary, either 1 indicating presence
of the term in the document or 0 indicating absence.

Applying the Bernoulli model to the example (three training documents in class c = China
and one in class c̄ = not-China), we have the same estimates for the priors as before:
P̂(c) = 3/4, P̂(c̄) = 1/4. The conditional probabilities are:

P̂(Chinese|c) = (3 + 1)/(3 + 2) = 4/5
P̂(Japan|c) = P̂(Tokyo|c) = (0 + 1)/(3 + 2) = 1/5
P̂(Beijing|c) = P̂(Macao|c) = P̂(Shanghai|c) = (1 + 1)/(3 + 2) = 2/5
P̂(Chinese|c̄) = (1 + 1)/(1 + 2) = 2/3
P̂(Japan|c̄) = P̂(Tokyo|c̄) = (1 + 1)/(1 + 2) = 2/3
P̂(Beijing|c̄) = P̂(Macao|c̄) = P̂(Shanghai|c̄) = (0 + 1)/(1 + 2) = 1/3

The denominators are (3 + 2) and (1 + 2) because there are three documents in c and one
document in c̄, and because the smoothing constant B is 2: for each term there are two cases
to consider, occurrence and nonoccurrence.

The scores of the test document d5 for the two classes are:

P̂(c|d5) ∝ P̂(c) · P̂(Chinese|c) · P̂(Japan|c) · P̂(Tokyo|c)
          · (1 − P̂(Beijing|c)) · (1 − P̂(Shanghai|c)) · (1 − P̂(Macao|c))
        = 3/4 · 4/5 · 1/5 · 1/5 · (1 − 2/5) · (1 − 2/5) · (1 − 2/5)
        ≈ 0.005

and, analogously,

P̂(c̄|d5) ∝ 1/4 · 2/3 · 2/3 · 2/3 · (1 − 1/3) · (1 − 1/3) · (1 − 1/3)
        ≈ 0.022

Thus, the classifier assigns the test document to c̄ = not-China. When looking
only at binary occurrence and not at term frequency, Japan and Tokyo are indicators
for c̄ (2/3 > 1/5) and the conditional probabilities of Chinese for c and c̄ are not
different enough (4/5 vs. 2/3) to affect the classification decision.
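
The arithmetic above can be checked with a few lines of Python; this sketch simply plugs in the estimates derived above for test document d5, which (as the scores imply) contains the terms Chinese, Tokyo, and Japan but not Beijing, Shanghai, or Macao:

# Reproduce the Bernoulli-model scores for the test document d5.
vocab = ["Chinese", "Beijing", "Shanghai", "Macao", "Tokyo", "Japan"]
d5 = {"Chinese", "Tokyo", "Japan"}

# P_hat(term | class) estimates from above; p_c is class China, p_cbar is class not-China.
p_c =    {"Chinese": 4/5, "Beijing": 2/5, "Shanghai": 2/5, "Macao": 2/5, "Tokyo": 1/5, "Japan": 1/5}
p_cbar = {"Chinese": 2/3, "Beijing": 1/3, "Shanghai": 1/3, "Macao": 1/3, "Tokyo": 2/3, "Japan": 2/3}

def bernoulli_score(prior, p):
    """Multiply prior by p(t) for present terms and (1 - p(t)) for absent terms."""
    score = prior
    for t in vocab:
        score *= p[t] if t in d5 else (1 - p[t])
    return score

print(round(bernoulli_score(3/4, p_c), 3))     # ~0.005 for class China
print(round(bernoulli_score(1/4, p_cbar), 3))  # ~0.022 for class not-China (wins)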

Properties of Naive Bayes

P(d) is the same for all classes and does not affect the argmax.
To generate a document, we first choose class c with probability P(c)

The two models differ in the formalization of the second step, the generation of the
document given the class, corresponding to the conditional distribution P(d|c):

Multinomial: P(d|c) = P((t_1, . . . , t_k, . . . , t_{n_d}) | c)

Bernoulli: P(d|c) = P((e_1, . . . , e_i, . . . , e_M) | c),

where (t_1, . . . , t_{n_d}) is the sequence of terms as it occurs in d (minus terms that were
excluded from the vocabulary) and (e_1, . . . , e_i, . . . , e_M) is a binary vector of
dimensionality M that indicates for each term whether it occurs in d or not.

5.4.3 k Nearest Neighbor Classification

To classify document d into class c


• Define k-neighborhood N as k nearest neighbors of d
• Count number of documents i in N that belong to c
• Estimate P(c|d) as i/k
• Choose as class argmaxc P(c|d) [ = majority class]
• Learning is just storing the representations of the
training examples in D.
· Testing instance x:
Compute similarity between x and all examples in D.
• Assign x the category of the most similar example in D.
• Does not explicitly compute a generalization or category prototypes.
• Also called:
• Case-based learning

• Memory-based learning
• Lazy learning
Using only the closest example to determine the categorization is subject to errors due to:
A single atypical example.
Noise (i.e. error) in the category label of a single training example.
· More robust alternative is to find the k most-similar examples and return the
majority category of these k examples.
· Value of k is typically odd to avoid ties; 3 and 5 are most common.
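
A minimal kNN text classifier sketch using cosine similarity over term-count vectors; the value of k and the toy training documents are illustrative:

import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two term-count dictionaries."""
    dot = sum(a[t] * b.get(t, 0) for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def knn_classify(train, x, k=3):
    """train: list of (term_counts, label). Return the majority label among the k nearest neighbors."""
    neighbors = sorted(train, key=lambda d: cosine(d[0], x), reverse=True)[:k]
    labels = Counter(label for _, label in neighbors)
    return labels.most_common(1)[0][0]

train = [(Counter("ball goal match".split()), "sport"),
         (Counter("goal score team".split()), "sport"),
         (Counter("election vote party".split()), "politics")]
print(knn_classify(train, Counter("goal team ball".split()), k=3))  # -> sport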

Decision Tree Classification

• Tree with internal nodes labeled by terms


• Branches are labeled by tests on the weight that the term has
• Leaves are labeled by categories
• Classifier categorizes document by descending tree following tests to leaf
• The label of the leaf node is then assigned to the document
• Most decision trees are binary trees (never disadvantageous; may require extra
internal nodes)
• DTs make good use of a few high-leverage features

Learn a sequence of tests on features, typically using top-down, greedy search
At each stage choose the unused feature with highest Information Gain
(feature/class MI)

Binary (yes/no) or continuous decisions

o Fully grown trees tend to have decision rules that are overly specific and are
therefore unable to categorize documents well
▪ Therefore, pruning or early stopping methods for Decision Trees are
normally a standard part of classification packages
o Use of a small number of features is potentially bad in text categorization, but in practice
decision trees do well for some text classification tasks

o Decision trees are very easily interpreted by humans – much more easily than
probabilistic methods like Naive Bayes
o Decision Trees are normally regarded as a symbolic machine learning algorithm,
though they can be used probabilistically
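
A sketch of the information-gain computation used to choose the splitting feature at each stage; the binary term features and the labels are illustrative:

import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def information_gain(examples, feature):
    """examples: list of (feature_dict, label). Gain = H(labels) - sum_v P(v) * H(labels | feature=v)."""
    labels = [label for _, label in examples]
    gain = entropy(labels)
    values = {f.get(feature, 0) for f, _ in examples}
    for v in values:
        subset = [label for f, label in examples if f.get(feature, 0) == v]
        gain -= len(subset) / len(examples) * entropy(subset)
    return gain

examples = [({"goal": 1}, "sport"), ({"goal": 1}, "sport"), ({"goal": 0}, "politics")]
print(information_gain(examples, "goal"))  # the term "goal" perfectly splits the classes here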

5.6 CLUSTERING

PART-A
Differentiate supervised learning and unsupervised learning(Nov/Dec’17)
PART-B
Explain the process of choosing K in Nearest neighbor clustering. (Nov/Dec’17)

• Clustering: the process of grouping a set of objects into classes of similar objects
Documents within a cluster should be similar.

Documents from different clusters should be dissimilar.

• The commonest form of unsupervised learning

Unsupervised learning = learning from raw data, as opposed to supervised data


where a classification of examples is given
A common and important task that finds many applications in IR and other places.

Whole corpus analysis/navigation


Better user interface: search without typing
For improving recall in search applications
Better search results (like pseudo RF)
For better navigation of search results
Effective “user recall” will be higher

For speeding up vector space retrieval
Cluster-based retrieval gives faster search

www.yahoo.com/Science

Scatter/Gather: Cutting, Karger, and Pedersen

5.6.1 K-MEANS

Assumes documents are real-valued vectors.


Clusters based on centroids (aka the center of gravity or mean) of points in a cluster, c:
    \vec{\mu}(c) = \frac{1}{|c|} \sum_{\vec{x} \in c} \vec{x}

Reassignment of instances to clusters is based on distance to the current cluster centroids.


(Or one can equivalently phrase it in terms of similarities)

Select K random docs {s1, s2,… sK} as seeds.

Until clustering converges (or other stopping criterion):


For each doc di:
Assign di to the cluster cj such that dist(di, sj) is minimal.
(Next, update the seeds to the centroid of each cluster)

For each cluster cj:

    sj = μ(cj)

K Means Example (K=2)

5.6.2 AGGLOMERATIVE CLUSTERING

Starts with each doc in a separate cluster


then repeatedly joins the closest pair of clusters, until there is only one cluster.

The history of merging forms a binary tree or hierarchy.

Many variants to defining closest pair of clusters

Single-link

Similarity of the most cosine-similar (single-link)

Complete-link

Similarity of the “furthest” points, the least cosine-similar

Centroid
Clusters whose centroids (centers of gravity) are the most cosine-similar

Average-link
Average cosine between pairs of elements

Single Link Agglomerative Clustering

▪ Use maximum similarity of pairs:


    sim(c_i, c_j) = \max_{x \in c_i,\, y \in c_j} sim(x, y)

▪ Can result in “straggly” (long and thin) clusters due to chaining effect.

▪ After merging ci and cj, the similarity of the resulting cluster to another cluster, ck, is:

    sim(c_i \cup c_j, c_k) = \max(sim(c_i, c_k), sim(c_j, c_k))

Single Link:

Complete Link:

▪ Use minimum similarity of pairs:

    sim(c_i, c_j) = \min_{x \in c_i,\, y \in c_j} sim(x, y)

▪ Makes “tighter,” spherical clusters that are typically preferable.


After merging ci and cj, the similarity of the resulting cluster to another cluster, ck, is:
    sim(c_i \cup c_j, c_k) = \min(sim(c_i, c_k), sim(c_j, c_k))

Complete Link Example
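
A naive agglomerative clustering sketch illustrating the single-link versus complete-link choice; cosine similarity and the small example vectors are illustrative, and no attention is paid to efficiency:

import math

def cosine(a, b):
    """Cosine similarity between two numeric tuples."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def hac(vectors, linkage=max, num_clusters=2):
    """Naive HAC: linkage=max gives single-link, linkage=min gives complete-link."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > num_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                sim = linkage(cosine(vectors[i], vectors[j])
                              for i in clusters[a] for j in clusters[b])
                if best is None or sim > best[0]:
                    best = (sim, a, b)
        _, a, b = best
        clusters[a] += clusters.pop(b)  # merge the closest pair of clusters
    return clusters

docs = [(1, 0, 0), (0.9, 0.2, 0), (0, 1, 0.1), (0, 0.8, 0.3)]
print(hac(docs, linkage=max))  # single-link
print(hac(docs, linkage=min))  # complete-link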

Internal criterion: A good clustering will produce high quality clusters in which:

The intra-class (that is, intra-cluster) similarity is high


The inter-class similarity is low
The measured quality of a clustering depends on both the document
representation and the similarity measure used

5.7 EXPECTATION MAXIMIZATION (EM)

PART-B
Brief about the Expectation Maximization algorithm. (Nov/Dec'17)
Give an account of the Expectation Maximization problem. (Nov/Dec'16)

• Iterative method for learning probabilistic categorization model from


unsupervised data.

• Initially assume random assignment of examples to categories.

• Learn an initial probabilistic model by estimating the model parameters θ from this
randomly labeled data.

• Iterate following two steps until convergence:


Expectation (E-step): Compute P(ci | E) for each example given the current
model, and probabilistically re-label the examples based on these posterior
probability estimates.
Maximization (M-step): Re-estimate the model parameters, θ, from the
probabilistically re-labeled data.
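
A minimal sketch of the E-step/M-step loop above for unsupervised document categorization, using a two-component multinomial mixture on toy documents; the documents, k, and the iteration count are illustrative, and a semi-supervised variant (as in the experiment below) would simply keep the labeled documents' assignments fixed:

import math
import random
from collections import Counter

def em_multinomial_mixture(docs, k=2, iterations=30, seed=1):
    """docs: list of token lists. Returns soft assignments P(c_i | d) after EM."""
    vocab = sorted({t for d in docs for t in d})
    counts = [Counter(d) for d in docs]
    rng = random.Random(seed)
    # Initially assume a random (soft) assignment of examples to categories.
    resp = []
    for _ in docs:
        row = [rng.random() for _ in range(k)]
        s = sum(row)
        resp.append([r / s for r in row])
    for _ in range(iterations):
        # M-step: re-estimate priors and term probabilities from the current soft labels.
        prior = [sum(r[c] for r in resp) / len(docs) for c in range(k)]
        condprob = []
        for c in range(k):
            tc = {t: sum(r[c] * cnt[t] for r, cnt in zip(resp, counts)) for t in vocab}
            total = sum(tc.values())
            condprob.append({t: (tc[t] + 1) / (total + len(vocab)) for t in vocab})
        # E-step: recompute P(c | d) for every document under the current model.
        for i, cnt in enumerate(counts):
            logp = [math.log(prior[c]) +
                    sum(n * math.log(condprob[c][t]) for t, n in cnt.items())
                    for c in range(k)]
            m = max(logp)
            weights = [math.exp(l - m) for l in logp]
            total_w = sum(weights)
            resp[i] = [w / total_w for w in weights]
    return resp

docs = [["ball", "goal"], ["goal", "match"], ["vote", "party"], ["election", "vote"]]
for d, r in zip(docs, em_multinomial_mixture(docs)):
    print(d, [round(x, 2) for x in r])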

EM Experiment :

Semi-supervised: some labeled and unlabeled data

Take a completely labeled corpus D, and randomly select a subset as DK.


Also use the set of unlabeled documents in the EM procedure.
Correct classification of a document

Concealed class label = class with largest probability

Accuracy with unlabeled documents > accuracy without unlabeled documents


Keeping labeled set of same size

EM beats naïve Bayes with same size of labeled document set

Largest boost for small size of labeled set

Comparable or poorer performance of EM for large labeled sets.

******************************************************************

