DOCUMENT TEXT MINING
Information filtering; organization and relevance feedback – Text Mining – Text classification and clustering – Categorization algorithms: naive Bayes; decision trees; and nearest neighbor – Clustering algorithms: agglomerative clustering; k-means; expectation maximization (EM).
PART-A
• Relevance feedback: user feedback on relevance of docs in initial set of results
User issues a (short, simple) query
The user marks some results as relevant or non-relevant.
The system computes a better representation of the information need based on
feedback.
Relevance feedback can go through one or more iterations.
We will use ad hoc retrieval to refer to regular retrieval without relevance feedback.
Relevance Feedback Example:
Results after Relevance feedback
Relevance feedback, when the user provides information on relevant documents to a query, or
Query expansion, when information related to the query is used to expand it.
On the Web, user clicks on search results constitute a new source of feedback information: a click indicates a document that is of interest to the user in the context of the current query.
Explicit Feedback
There are two basic approaches for compiling implicit feedback information:
local analysis, which derives the feedback information from the top ranked documents in
the result set
global analysis, which derives the feedback information from external sources such as a
thesaurus
Implicit Feedback
Centroid: the centroid of a set of documents D is the vector average μ(D) = (1/|D|) Σ_{d ∈ D} d.
The Rocchio algorithm uses the vector space model to pick a relevance feedback query.
Rocchio seeks the query q_opt that maximizes similarity to the relevant documents while minimizing similarity to the non-relevant documents:

q_opt = argmax_q [ cos(q, μ(C_r)) − cos(q, μ(C_nr)) ]

where C_r is the set of relevant documents and C_nr the set of non-relevant documents.
Under cosine similarity, the optimal query is the difference between the centroids of the relevant and non-relevant documents:

q_opt = (1/|C_r|) Σ_{d_j ∈ C_r} d_j − (1/|C_nr|) Σ_{d_j ∈ C_nr} d_j

In practice, only a small set of judged documents is available, so the Rocchio update modifies the original query q_0 using the known relevant documents D_r and known non-relevant documents D_nr, with weights α, β, γ:

q_m = α q_0 + β (1/|D_r|) Σ_{d_j ∈ D_r} d_j − γ (1/|D_nr|) Σ_{d_j ∈ D_nr} d_j
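To make the update above concrete, here is a minimal Rocchio sketch in Python (numpy-based; the weights α = 1.0, β = 0.75, γ = 0.15, the toy vectors, and the function name rocchio_update are illustrative choices, not values prescribed by these notes):

```python
import numpy as np

def rocchio_update(q0, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Return the modified query q_m = alpha*q0 + beta*centroid(Dr) - gamma*centroid(Dnr)."""
    q_m = alpha * q0
    if len(relevant) > 0:
        q_m = q_m + beta * np.mean(relevant, axis=0)      # centroid of judged-relevant docs
    if len(nonrelevant) > 0:
        q_m = q_m - gamma * np.mean(nonrelevant, axis=0)  # centroid of judged-non-relevant docs
    return np.maximum(q_m, 0.0)  # negative term weights are usually clipped to zero

# Toy example: 3-term vocabulary, one relevant and one non-relevant document.
q0 = np.array([1.0, 0.0, 0.0])
relevant = np.array([[0.8, 0.6, 0.0]])
nonrelevant = np.array([[0.0, 0.1, 0.9]])
print(rocchio_update(q0, relevant, nonrelevant))
```

Positive feedback is usually weighted more heavily than negative feedback (β > γ), and many systems simply set γ = 0.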
Relevance feedback on initial query
• We can modify the query based on relevance feedback and apply standard vector space
model.
• Use only the docs that were marked.
• Relevance feedback can improve recall and precision
• Relevance feedback is most useful for increasing recall in situations where recall is
important
Users can be expected to review results and to take time to iterate
Query Expansion
• Widely used in many science/engineering fields
• May significantly decrease precision, particularly with ambiguous terms.
“interest rate” → “interest rate fascinate evaluate”
There is a high cost of manually producing a thesaurus, and of keeping it up to date as scientific terminology changes.
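A minimal sketch of thesaurus-based expansion, assuming a hand-built synonym dictionary (the THESAURUS entries and the function name expand_query are purely illustrative, and deliberately reproduce the "interest rate" problem mentioned above):

```python
# Minimal sketch of thesaurus-based query expansion.
THESAURUS = {
    "car": ["automobile", "vehicle"],
    "rate": ["evaluate"],        # ambiguous expansions like this one can hurt precision
    "interest": ["fascinate"],
}

def expand_query(query):
    """Append thesaurus entries for every query term to the original query."""
    terms = query.lower().split()
    expanded = list(terms)
    for t in terms:
        expanded.extend(THESAURUS.get(t, []))
    return " ".join(expanded)

print(expand_query("interest rate"))   # -> "interest rate fascinate evaluate"
```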
Another way to view text data mining is as a process of exploratory data analysis that leads to heretofore unknown information, or to answers for questions for which the answer is not currently known.
• Information Retrieval
– Indexing and retrieval of textual documents
• Information Extraction
– Extraction of partial knowledge in the text
• Web Mining
– Indexing and retrieval of textual documents and extraction of partial knowledge
using the web
• Clustering
– Generating collections of similar text documents
Syntactic/Semantic text analysis
Feature Generation
Bag of words
• Text document is represented by the words it contains (and their occurrences)
– e.g., “Lord of the rings” → {“the”, “Lord”, “rings”, “of”}
– Highly efficient
– Makes learning far simpler and easier
– Order of words is not that important for certain applications
• Stemming: identifies a word by its root
– Reduce dimensionality
– e.g., flying, flew → fly
– Use Porter Algorithm
• Stop words: The most common words are unlikely to help text mining
– e.g., “the”, “a”, “an”, “you” …
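A minimal bag-of-words sketch tying the points above together (the tiny stop-word list and the crude suffix stripping are illustrative stand-ins for a real stop list and the Porter stemmer):

```python
from collections import Counter

STOP_WORDS = {"the", "a", "an", "of", "you"}   # tiny illustrative stop list

def crude_stem(word):
    """Very crude suffix stripping; a real system would use the Porter stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def bag_of_words(text):
    """Lowercase, drop stop words, stem, and count occurrences."""
    tokens = [t for t in text.lower().split() if t not in STOP_WORDS]
    return Counter(crude_stem(t) for t in tokens)

print(bag_of_words("the Lord of the rings"))   # Counter({'lord': 1, 'ring': 1})
```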
Feature Selection
Simple counting
Statistics
• Reduce dimensionality
• Learners have difficulty addressing tasks with high dimensionality
• Irrelevant features
• Not all features help!
• e.g., the existence of a noun in a news article is unlikely to help classify it as “politics” or
“sport”
• Use term weighting (e.g., TF-IDF; see the sketch below)
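One common weighting scheme is TF-IDF, which rewards terms that are frequent in a document but rare across the collection. A minimal sketch (the function name and the toy documents are illustrative):

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute TF-IDF weights for a list of token lists (one common weighting scheme)."""
    n = len(docs)
    df = Counter()                      # document frequency of each term
    for doc in docs:
        df.update(set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weights

docs = [["china", "beijing", "china"], ["tokyo", "japan"], ["china", "macao"]]
for w in tf_idf(docs):
    print(w)
```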
Text/Data Mining
Classification- Supervised learning
Clustering- Unsupervised learning
• Given: a collection of labeled records (training set)
– Each record contains a set of features (attributes), and the true class (label)
• Find: a model for the class as a function of the values of the features
• Goal: previously unseen records should be assigned a class as accurately as possible
– A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
• Analyzing results:
Correct classification: the known label of a test sample is identical to the class predicted by the classification model.
Accuracy ratio: the percentage of test-set samples that are correctly classified by the model.
A distance measure between classes can be used
– e.g., classifying “football” document as a “basketball” document is not as bad as
classifying it as “crime”.
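A minimal sketch of the train/test methodology described above (the split ratio, the toy records, and the trivial baseline "model" are all illustrative):

```python
import random

def train_test_split(records, test_fraction=0.3, seed=0):
    """Shuffle labeled records and split them into training and test sets."""
    records = records[:]
    random.Random(seed).shuffle(records)
    cut = int(len(records) * (1 - test_fraction))
    return records[:cut], records[cut:]

def accuracy(model, test_set):
    """Fraction of test samples whose predicted class matches the known label."""
    correct = sum(1 for features, label in test_set if model(features) == label)
    return correct / len(test_set)

# Toy labeled records: (features, class label).
data = [({"len": i}, "long" if i > 5 else "short") for i in range(10)]
train, test = train_test_split(data)
baseline = lambda features: "long" if features["len"] > 5 else "short"  # illustrative "model"
print(accuracy(baseline, test))
```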
PART-A
A class is a more general subject area like China or coffee. Such more general classes are usually referred to as topics, and the classification task is then called text classification, text categorization, topic classification, or topic spotting.
In text classification, we are given a description d ∈ X of a document,
where X is the document space; and a fixed set of classes C = {c1, c2, . . . , cJ}. Classes
are also called categories or labels. Typically, the document space X is some type
of high-dimensional space, and the classes are human defined for the needs of an
application, as in the examples China and documents that talk about
multicore computer chips above. We are given a training set D of labeled documents
(d, c), where (d, c) ∈ X × C. For example:
(d, c) = (Beijing joins the World Trade Organization, China)
for the one-sentence document Beijing joins the World Trade Organization and the
class (or label) China.
Using a learning method or learning algorithm, we then wish to learn a classifier or classification function γ that maps documents to classes:
γ : X → C
This type of learning is called supervised learning because a supervisor (the human who defines the classes and labels training documents) serves as a teacher directing the learning process. We denote the supervised learning method by Γ and write Γ(D) = γ. The learning method Γ takes the training set D as input and returns the learned classification function γ.
PART-B
Explain in detail the multiple –Bernoulli and multinomial model(Nov/Dec’17)
Discuss in detail about the working of naïve Bayesian and its limitation in web search.
(Nov/Dec’16)
State bayes Theorem, Explain Naïve bayes classification with an example(Apr/may’17)
• Given:
o A description of an instance, x ∈ X, where X is the instance language or instance space.
▪ Issue: how to represent text documents.
o A fixed set of categories: C = {c1, c2, …, cn}
• Determine:
o The category of x: c(x) ∈ C, where c(x) is a categorization function whose domain is X and whose range is C.
The first supervised learning method we introduce is the multinomial Naïve Bayes or multinomial NB model, a probabilistic learning method. The probability of a document d being in class c is computed as

P(c | d) ∝ P(c) · Π_{1 ≤ k ≤ n_d} P(t_k | c)

where P(t_k | c) is the conditional probability of term t_k occurring in a document of class c, and t_1, …, t_nd are the tokens of d that are part of the vocabulary.
There are two different ways we can set up an NB classifier. The model we introduced in the previous section is the multinomial model: it generates one term from the vocabulary in each position of the document. The alternative is the multivariate Bernoulli (multiple-Bernoulli) model, which generates a binary indicator for each term of the vocabulary, recording whether the term occurs in the document or not.
In the worked example, the Bernoulli model gives P(China | d) ≈ 0.005 and, analogously, P(not-China | d) ≈ 0.022. Thus, the classifier assigns the test document to the class not-China. When looking only at binary occurrence and not at term frequency, Japan and Tokyo are indicators for not-China (2/3 > 1/5) and the conditional probabilities of Chinese for China and not-China are not different enough (4/5 vs. 2/3) to affect the classification decision.
P(d) is the same for all classes and does not affect the argmax.
To generate a document, we first choose class c with probability P(c).
The two models differ in the formalization of the second step, the generation of the document given the class, corresponding to the conditional distribution P(d | c):

Multinomial: P(d | c) = P(⟨t_1, …, t_nd⟩ | c)
Bernoulli: P(d | c) = P(⟨e_1, …, e_M⟩ | c)

where ⟨t_1, …, t_nd⟩ is the sequence of terms as it occurs in d (minus terms that were excluded from the vocabulary) and ⟨e_1, …, e_i, …, e_M⟩ is a binary vector of dimensionality M that indicates for each term whether it occurs in d or not.
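A minimal multinomial Naive Bayes sketch with add-one smoothing (the helper names and the toy China / not-China training set are illustrative; note that, unlike the Bernoulli result quoted above, the multinomial model assigns this test document to China, because it counts the three occurrences of Chinese):

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(training_docs):
    """Estimate priors P(c) and smoothed term probabilities P(t|c) from (tokens, class) pairs."""
    class_counts = Counter(c for _, c in training_docs)
    term_counts = defaultdict(Counter)
    for tokens, c in training_docs:
        term_counts[c].update(tokens)
    vocab = {t for tokens, _ in training_docs for t in tokens}
    priors = {c: class_counts[c] / len(training_docs) for c in class_counts}
    cond = {}
    for c in class_counts:
        total = sum(term_counts[c].values())
        # add-one (Laplace) smoothing over the vocabulary
        cond[c] = {t: (term_counts[c][t] + 1) / (total + len(vocab)) for t in vocab}
    return priors, cond

def apply_multinomial_nb(priors, cond, tokens):
    """Return the class maximizing log P(c) + sum_k log P(t_k|c), ignoring unseen terms."""
    scores = {}
    for c in priors:
        scores[c] = math.log(priors[c]) + sum(
            math.log(cond[c][t]) for t in tokens if t in cond[c])
    return max(scores, key=scores.get)

# Toy training set in the spirit of the China / not-China example above.
train = [
    (["chinese", "beijing", "chinese"], "China"),
    (["chinese", "chinese", "shanghai"], "China"),
    (["chinese", "macao"], "China"),
    (["tokyo", "japan", "chinese"], "not-China"),
]
priors, cond = train_multinomial_nb(train)
print(apply_multinomial_nb(priors, cond, ["chinese", "chinese", "chinese", "tokyo", "japan"]))
```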
Nearest Neighbor (kNN) Classification
• Memory-based learning
• Lazy learning
Using only the closest example to determine the categorization is subject to errors due to:
A single atypical example.
Noise (i.e. error) in the category label of a single training example.
· A more robust alternative is to find the k most-similar examples and return the majority category of these k examples.
· Value of k is typically odd to avoid ties; 3 and 5 are most common.
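A minimal kNN sketch using cosine similarity over sparse term vectors (the toy training documents and the function names are illustrative):

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def knn_classify(training, doc, k=3):
    """Return the majority class among the k training documents most similar to doc."""
    neighbors = sorted(training, key=lambda item: cosine(item[0], doc), reverse=True)[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

# Toy training documents as term-count dicts.
training = [
    ({"goal": 2, "match": 1}, "sport"),
    ({"match": 2, "team": 1}, "sport"),
    ({"election": 2, "vote": 1}, "politics"),
    ({"vote": 2, "party": 1}, "politics"),
]
print(knn_classify(training, {"goal": 1, "team": 1}, k=3))   # -> "sport"
```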
Decision Tree Classification
Learn a sequence of tests on features, typically using top-down, greedy search.
At each stage, choose the unused feature with the highest information gain (feature/class mutual information; see the sketch after this list).
o Fully grown trees tend to have decision rules that are overly specific and are
therefore unable to categorize documents well
▪ Therefore, pruning or early stopping methods for Decision Trees are
normally a standard part of classification packages
o Use of a small number of features is potentially bad in text categorization, but in practice decision trees do well for some text classification tasks
o Decision trees are very easily interpreted by humans – much more easily than
probabilistic methods like Naive Bayes
o Decision Trees are normally regarded as a symbolic machine learning algorithm,
though they can be used probabilistically
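A sketch of the information-gain criterion mentioned above, measured as the entropy reduction from splitting on a feature (the toy documents and function names are illustrative):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def information_gain(examples, feature):
    """Entropy reduction from splitting (features, label) examples on a binary feature."""
    labels = [label for _, label in examples]
    with_f = [label for feats, label in examples if feature in feats]
    without_f = [label for feats, label in examples if feature not in feats]
    split_entropy = (len(with_f) / len(examples)) * entropy(with_f) + \
                    (len(without_f) / len(examples)) * entropy(without_f)
    return entropy(labels) - split_entropy

# Toy documents: set of terms plus class label.
docs = [({"goal", "match"}, "sport"), ({"match", "team"}, "sport"),
        ({"election", "vote"}, "politics"), ({"vote", "party"}, "politics")]
print(information_gain(docs, "vote"))   # perfectly separates the classes -> 1.0
print(information_gain(docs, "goal"))   # only weakly informative -> about 0.31
```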
5.6 CLUSTERING
PART-A
Differentiate supervised learning and unsupervised learning. (Nov/Dec’17)
PART-B
Explain the process of choosing K in Nearest neighbor clustering. (Nov/Dec’17)
• Clustering: the process of grouping a set of objects into classes of similar objects
Documents within a cluster should be similar.
For speeding up vector space retrieval
Cluster-based retrieval gives faster search
www.yahoo.com/Science
5.6.1 K-MEANS
K Means Example (K=2)
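K-means alternates two steps: assign each point to its nearest centroid, then recompute each centroid as the mean of its assigned points. A minimal sketch (the toy 2-D points, K = 2, and the fixed iteration count are illustrative):

```python
import random

def kmeans(points, k=2, iterations=20, seed=0):
    """Plain k-means on 2-D points: assign each point to its nearest centroid,
    then recompute each centroid as the mean of its assigned points."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: (p[0] - centroids[i][0]) ** 2 +
                                                  (p[1] - centroids[i][1]) ** 2)
            clusters[nearest].append(p)
        for i, cluster in enumerate(clusters):
            if cluster:  # keep the old centroid if a cluster emptied
                centroids[i] = (sum(p[0] for p in cluster) / len(cluster),
                                sum(p[1] for p in cluster) / len(cluster))
    return centroids, clusters

points = [(1, 1), (1.5, 2), (1, 1.5), (8, 8), (9, 9), (8.5, 8)]
centroids, clusters = kmeans(points, k=2)
print(centroids)
```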
Agglomerative (Hierarchical) Clustering
Agglomerative clustering starts with each document in its own cluster and repeatedly merges the two most similar clusters. Variants differ in how the similarity between clusters is measured:
Single-link: similarity of the two most cosine-similar members (the closest pair).
▪ Can result in “straggly” (long and thin) clusters due to the chaining effect.
Complete-link: similarity of the two least cosine-similar members (the furthest pair).
Centroid: clusters whose centroids (centers of gravity) are the most cosine-similar.
Average-link: average cosine similarity between pairs of elements.
▪ After merging ci and cj, the similarity of the resulting cluster to another cluster, ck, is:
Single link: sim((ci ∪ cj), ck) = max(sim(ci, ck), sim(cj, ck))
Complete link: sim((ci ∪ cj), ck) = min(sim(ci, ck), sim(cj, ck))
Internal criterion: a good clustering will produce high-quality clusters in which intra-cluster similarity is high and inter-cluster similarity is low.
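A minimal agglomerative clustering sketch with a pluggable linkage rule, illustrating the max/min update above (the pairwise similarity values are illustrative):

```python
# Minimal sketch of agglomerative clustering with a pluggable linkage rule.
def hac(sim, n, linkage=max):
    """sim maps frozenset({i, j}) -> similarity between items i and j.
    Repeatedly merge the two most similar clusters; linkage is
    max for single-link or min for complete-link."""
    clusters = [frozenset([i]) for i in range(n)]
    merges = []
    def cluster_sim(a, b):
        return linkage(sim[frozenset({i, j})] for i in a for j in b)
    while len(clusters) > 1:
        a, b = max(((a, b) for a in clusters for b in clusters if a != b),
                   key=lambda pair: cluster_sim(*pair))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
        merges.append((set(a), set(b)))
    return merges

# Pairwise similarities for four items (illustrative numbers).
sim = {frozenset({0, 1}): 0.9, frozenset({0, 2}): 0.2, frozenset({0, 3}): 0.1,
       frozenset({1, 2}): 0.3, frozenset({1, 3}): 0.2, frozenset({2, 3}): 0.8}
print(hac(sim, 4, linkage=max))   # single-link merge order
print(hac(sim, 4, linkage=min))   # complete-link merge order
```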
PART-B
Brief about the Expectation Maximization algorithm. (Nov/Dec’17)
Give an account of the Expectation Maximization problem. (Nov/Dec’16)
• Randomly assign class labels to the unlabeled documents (or label them probabilistically at random).
• Learn an initial probabilistic model by estimating model parameters from this randomly labeled data.
• Iterate until convergence: use the current model to re-label the data probabilistically (E-step), then re-estimate the model parameters from the re-labeled data (M-step).
EM Experiment:
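The details of the original EM experiment are not reproduced here. As a generic illustration of the E-step / M-step loop, a minimal sketch for a mixture of two 1-D Gaussians (the data, initialization, and fixed iteration count are illustrative):

```python
import math

def em_two_gaussians(xs, iterations=50):
    """EM for a mixture of two 1-D Gaussians: the E-step computes soft responsibilities,
    the M-step re-estimates means, variances, and mixing weights from them."""
    mu = [min(xs), max(xs)]   # simple deterministic initialization
    var = [1.0, 1.0]
    pi = [0.5, 0.5]
    for _ in range(iterations):
        # E-step: responsibility of each component for each point
        resp = []
        for x in xs:
            dens = [pi[k] * math.exp(-(x - mu[k]) ** 2 / (2 * var[k])) /
                    math.sqrt(2 * math.pi * var[k]) for k in range(2)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate parameters from the soft assignments
        for k in range(2):
            nk = sum(r[k] for r in resp)
            mu[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var[k] = sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, xs)) / nk or 1e-6
            pi[k] = nk / len(xs)
    return mu, var, pi

data = [1.0, 1.2, 0.8, 1.1, 5.0, 5.2, 4.9, 5.1]
print(em_two_gaussians(data))   # the two means should approach roughly 1.0 and 5.0
```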
******************************************************************