04-textcat: Text Classification using Classical ML
Mausam
(based on slides by Dan Weld, Dan Jurafsky, Prabhakar Raghavan, Hinrich Schutze, Guillaume Obozinski, David D. Lewis, Fei Xia, Michael Collins, Emily Fox, Alexander Ihler, Dan Klein, Chris Manning, Ray Mooney, Mark Schmidt, Alex Yates, Luke Zettlemoyer)
Categorization
• Given:
– A description of an instance, x ∈ X, where X is the instance language or instance space.
– A fixed set of categories: C = {c1, c2, …, cn}
• Determine:
– The category of x: c(x) ∈ C, where c(x) is a categorization function whose domain is X and whose range is C.
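As a concrete (toy) illustration of a categorization function c: X → C, here is a minimal Python sketch; the category names and the keyword rule are illustrative assumptions, not part of the slides:

```python
from typing import List

Instance = str                    # an instance x: here, a document as raw text
Category = str                    # a category label c
C: List[Category] = ["sports", "politics", "tech"]   # a fixed (example) category set

def c_of(x: Instance) -> Category:
    """A toy categorization function with domain X and range C."""
    # Hand-written keyword rule; the rest of the lecture replaces this with learned models.
    return "sports" if "match" in x.lower() else "tech"
```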
County vs. Country?
Male or female author?
• The main aim of this article is to propose an exercise in stylistic analysis which can be employed in the teaching of English language. It details the design and results of a workshop activity on narrative carried out with undergraduates in a university department of English. The methods proposed are intended to enable students to obtain insights into aspects of cohesion and narrative structure: insights, it is suggested, which are not as readily obtainable through more traditional techniques of stylistic analysis.
• My aim in this article is to show that given a relevance theoretic approach to utterance interpretation, it is possible to develop a better understanding of what some of these so-called apposition markers indicate. It will be argued that the decision to put something in other words is essentially a decision about style, a point which is, perhaps, anticipated by Burton-Roberts when he describes loose apposition as a rhetorical device. However, he does not justify this suggestion by giving the criteria for classifying a mode of expression as a rhetorical device.
Female writers use more first person/second person pronouns and more gender-laden third person pronouns (overall more personalization).
S. Argamon, M. Koppel, J. Fine, A. R. Shimoni, 2003. “Gender, Genre, and Writing Style in Formal Written Texts,” Text, volume 23, number 3,
pp. 321–346
Positive or negative movie review?
• unbelievably disappointing
• Full of zany characters and richly applied
satire, and some great plot twists
• this is the greatest screwball comedy ever
filmed
• It was pathetic. The worst part about it was
the boxing scenes.
What is the subject of this article?
• MACHINE LEARNING
Bayesian Methods
• Learning and classification methods based
on probability theory.
– Bayes theorem plays a critical role in
probabilistic learning and classification.
– Uses prior probability of each category given
no information about an item.
• Categorization produces a posterior
probability distribution over the possible
categories given a description of an item.
The bag of words representation
γ( "adventure scenes are fun… It manages to be whimsical and romantic while laughing at the conventions of the fairy tale genre. I would recommend it to just about anyone. I've seen it several times, and I'm always happy to see it again whenever I have a friend who hasn't seen it yet." ) = c
The bag of words representation:
using a subset of words
γ( "xxxxxxxxxxxxxxxxxxx fun xxxx xxxxxxxxxxxxx whimsical xxxx romantic xxxx laughing xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxx recommend xxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xx several xxxxxxxxxxxxxxxxx xxxxx happy xxxxxxxxx again xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxx" ) = c
The bag of words representation
γ(
  great      2
  love       2
  recommend  1
  laugh      1
  happy      1
  ...        ...
) = c
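A minimal sketch of turning a document into the bag-of-words count representation shown above (pure Python; the crude tokenizer is a simplifying assumption):

```python
from collections import Counter
import re

def bag_of_words(document: str) -> Counter:
    """Map a document to word counts, ignoring word order and position."""
    tokens = re.findall(r"[a-z']+", document.lower())   # crude tokenizer (assumption)
    return Counter(tokens)

print(bag_of_words("I love this movie. It's great... great fun, and I love the plot."))
# Counter({'i': 2, 'love': 2, 'great': 2, ...})
```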
Bayes’ Rule Applied to Documents and
Classes
P(c | d) = P(d | c) P(c) / P(d)
Naïve Bayes Classifier (I)
The most probable class: argmax_{c ∈ C} P(c | d)
  = argmax_{c ∈ C} P(d | c) P(c) / P(d)    (Bayes rule)
  = argmax_{c ∈ C} P(d | c) P(c)           (dropping the denominator, which does not depend on c)
Naïve Bayes Classifier (II)
We need to estimate P(x1, x2, …, xn | c).
• Bag of Words assumption: assume position doesn't matter
• Conditional Independence: assume the feature probabilities P(xi | cj) are independent given the class c:
  P(x1, …, xn | c) = P(x1 | c) · P(x2 | c) · … · P(xn | c)
Learning the Multinomial Naïve Bayes Model
count("fantastic", positive)
P̂("fantastic" positive) = = 0
å count(w, positive)
wÎV
count(wi , c) +1
P̂(wi | c) =
å (count(w, c))+1)
wÎV
count(wi , c) +1
=
æ ö
çç å count(w, c)÷÷ + V
è wÎV ø
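A minimal sketch of training a multinomial Naïve Bayes model with the add-1 smoothed estimates above (pure Python; the function and variable names are illustrative, not from the slides):

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(docs, labels):
    """docs: list of token lists; labels: list of class labels.
    Returns log priors, per-class log likelihoods, and the vocabulary."""
    vocab = {w for doc in docs for w in doc}
    class_docs = defaultdict(list)
    for doc, c in zip(docs, labels):
        class_docs[c].append(doc)

    log_prior, log_likelihood = {}, {}
    for c, cdocs in class_docs.items():
        log_prior[c] = math.log(len(cdocs) / len(docs))       # P(c)
        counts = Counter(w for doc in cdocs for w in doc)      # count(w, c)
        total = sum(counts.values())
        # add-1 smoothing: (count(w, c) + 1) / (total + |V|)
        log_likelihood[c] = {w: math.log((counts[w] + 1) / (total + len(vocab)))
                             for w in vocab}
    return log_prior, log_likelihood, vocab
```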
Easy to Implement
• But…
Probabilities: Important Detail!
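The detail usually flagged at this point is numerical: a product of many small probabilities underflows floating point, so implementations sum log probabilities instead. A sketch continuing the (hypothetical) trainer above:

```python
def predict(doc, log_prior, log_likelihood, vocab):
    """Pick argmax_c [ log P(c) + sum_i log P(x_i | c) ]; summing logs avoids underflow."""
    scores = {}
    for c in log_prior:
        scores[c] = log_prior[c] + sum(log_likelihood[c][w] for w in doc if w in vocab)
    return max(scores, key=scores.get)
```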
Generative Model for Multinomial Naïve Bayes
[Figure omitted: generating the words of a document from the class-conditional distribution, illustrated for c = China]
Advantages
• Simple to implement
– No numerical optimization, matrix algebra, etc
• Efficient to train and use
– Easy to update with new data
– Fast to apply
• Binary/multi-class
• Good in domains with many equally important features
– Decision Trees suffer from fragmentation in such cases, especially with little data
• Comparatively good effectiveness with small training sets
• A good dependable baseline for text classification
– But we will see other classifiers that give better accuracy
Disadvantages
• Generative model
– Generally lower effectiveness than
discriminative techniques
Experimental Evaluation
Evaluation: Cross Validation
• Partition the examples into k disjoint subsets (folds)
• Now create k training sets
– Each training set is the union of all folds except one; the held-out fold is used for testing
– So each training set has (k-1)/k of the original training data
[Figure: k splits of the data, each with one fold marked Test and the rest marked Train]
Cross-Validation (2)
• Leave-one-out
– Use if < 100 examples (rough estimate)
– Hold out one example, train on remaining examples
• 10-fold
– If you have 100s to 1000s of examples
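A minimal sketch of k-fold cross-validation as described above (pure Python; `train` and `evaluate` are assumed placeholders for whatever learner and metric are being evaluated):

```python
def cross_validate(examples, k, train, evaluate):
    """Split examples into k folds; train on k-1 folds, test on the held-out fold."""
    folds = [examples[i::k] for i in range(k)]        # k disjoint subsets
    scores = []
    for i in range(k):
        test_fold = folds[i]
        train_set = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        model = train(train_set)                       # (k-1)/k of the data
        scores.append(evaluate(model, test_fold))
    return sum(scores) / k                             # average held-out score
```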
Joint vs. Conditional Models
Exponential Models
(log-linear, maxent, Logistic, Gibbs)
Model: use the scores as probabilities:
  P(y | x, w) = exp( Σ_j w_j f_j(x, y) ) / Σ_{y'} exp( Σ_j w_j f_j(x, y') )
– exp(·) makes the scores positive
– the denominator normalizes them to sum to one
The derivative of the conditional log-likelihood with respect to the weight of feature j for class k is
  [total count of feature j in candidates with class k] − [expected count of feature j in candidates predicted to be class k]
Proof
(Conditional Likelihood Derivative)
• Recall the conditional likelihood of the training data:
  P(Y | X, w) = Π_{(x,y) ∈ D} p(y | x, w)
• Differentiating its log with respect to w_j gives
  ∂/∂w_j Σ_{(x,y) ∈ D} log p(y | x, w) = Σ_{(x,y) ∈ D} f_j(x, y) − Σ_{(x,y) ∈ D} Σ_{y'} p(y' | x, w) f_j(x, y')
  where the second term is the expected count of feature j predicted with class k under the current model.
Proof (concluded)
• The optimum parameters are the ones for which each feature’s
predicted expectation equals its empirical expectation. The optimum
distribution is:
– Always unique (but parameters may not be unique)
– Always exists (if feature counts are from actual data).
• These models are also called maximum entropy models because we
find the model has the maximum entropy while satisfying the
constraints:
  E_p[f_i] = E_p̃[f_i], for all features i
Gradient Ascent
• Basic idea: move uphill from the current guess
• Gradient ascent / descent follows the gradient incrementally
• At a local optimum, the derivative vector is zero
• Will converge if step sizes are small enough, but not efficient
• All we need is to be able to evaluate the function and its derivative
• For convex functions, a local optimum will be global
• Basic gradient ascent isn't very efficient, but there are simple enhancements which take into account previous gradients: conjugate gradient, L-BFGS
• There are special-purpose optimization techniques for maxent, like iterative scaling, but they aren't better
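A minimal sketch of batch gradient ascent for a log-linear classifier, using the derivative above (empirical feature counts minus expected counts). Written with numpy; all names and the feature encoding are illustrative assumptions:

```python
import numpy as np

def train_loglinear(X, y, num_classes, lr=0.1, iters=100):
    """X: (n, d) feature matrix; y: (n,) integer class labels.
    One weight vector per class; features f_j(x, k) = x_j for the active class k."""
    n, d = X.shape
    W = np.zeros((num_classes, d))
    for _ in range(iters):
        scores = X @ W.T                                   # raw scores, shape (n, K)
        probs = np.exp(scores - scores.max(axis=1, keepdims=True))
        probs /= probs.sum(axis=1, keepdims=True)          # P(y | x, w)
        empirical = np.zeros_like(W)
        np.add.at(empirical, y, X)                         # total count of feature j in class k
        expected = probs.T @ X                             # expected count under the model
        W += lr * (empirical - expected) / n               # move uphill along the gradient
    return W
```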
What About Overfitting?
• For Naïve Bayes, we were worried about zero counts in MLE
estimates
– Can that happen here?
– At the optimum, the total count of feature j in correct candidates equals the expected count of feature j in predicted candidates; for rare features this match may only be achievable with very large weights
– Big weights are bad
L1 and L2 Regularization
L2 Regularization for Log-linear models
– Instead, we worry about large feature weights
– Add a regularization term to the likelihood to push weights towards zero:
  maximize Σ_{(x,y) ∈ D} log p(y | x, w) − (λ/2) ||w||²    (λ is the regularization constant)
L1 Regularization for Log-linear models
– Instead, we worry about the number of active (non-zero) features
– Add a regularization term to the likelihood to push weights to zero:
  maximize Σ_{(x,y) ∈ D} log p(y | x, w) − λ ||w||₁
– For L1 regularization, we need to compute subgradients.
L1 vs L2
• Optimizing L1 is harder
– Discontinuous objective function
– Subgradient descent versus gradient descent
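A sketch of how the two regularizers change the gradient in the hypothetical trainer above: L2 subtracts λ·w, while L1 subtracts λ·sign(w) (a subgradient), which tends to drive weights exactly to zero. Everything here is illustrative:

```python
import numpy as np

def regularized_gradient(empirical, expected, W, n, lam, penalty="l2"):
    """Gradient of the penalized conditional log-likelihood (illustrative sketch)."""
    grad = (empirical - expected) / n
    if penalty == "l2":
        grad -= lam * W               # derivative of (lam/2) * ||W||^2
    elif penalty == "l1":
        grad -= lam * np.sign(W)      # a subgradient of lam * ||W||_1
    return grad
```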
How to pick weights?
• Goal: choose “best” vector w given training data
– For now, we mean “best for classification”
• The ideal: the weights which have greatest test set accuracy /
F1 / whatever
– But, don’t have the test set
– Must compute weights from training set
• Instead, use a dev set (held-out development data) to choose among weight settings
Diving Deeper into Feature Engineering
Construct Better Features
• Ideas??
Issues in document representation
• 3/12/91
• Mar. 12, 1991
• 55 B.C.
• B-52
• 100.2.86.144
– Generally, don't represent these as text
– e.g., creation dates for docs
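One common way to handle such tokens is to map them to placeholder classes rather than keeping the raw strings; a small illustrative sketch (the regexes and class names are assumptions, not from the slides):

```python
import re

def normalize_token(tok: str) -> str:
    """Map dates, IPs, and other numeric tokens to coarse placeholder classes."""
    if re.fullmatch(r"\d{1,3}(\.\d{1,3}){3}", tok):
        return "<IP>"
    if re.fullmatch(r"\d{1,2}/\d{1,2}/\d{2,4}", tok):
        return "<DATE>"
    if re.fullmatch(r"[A-Z]-?\d+", tok):          # e.g., B-52
        return "<MODEL-NUM>"
    if re.fullmatch(r"\d+(\.\d+)?", tok):
        return "<NUM>"
    return tok

print([normalize_token(t) for t in ["3/12/91", "B-52", "100.2.86.144", "movie"]])
# ['<DATE>', '<MODEL-NUM>', '<IP>', 'movie']
```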
Case folding
• Reduce all letters to lower case
• Exception: upper case in mid-sentence
– e.g., General Motors
– Fed vs. fed
– SAIL vs. sail
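A minimal sketch of the heuristic described above: lowercase sentence-initial words but keep mid-sentence capitalization (the sentence-boundary rule is a simplifying assumption):

```python
def case_fold(tokens):
    """Lowercase sentence-initial words; keep mid-sentence capitals (e.g., 'Fed', 'SAIL')."""
    folded, sentence_start = [], True
    for tok in tokens:
        folded.append(tok.lower() if sentence_start else tok)
        sentence_start = tok in {".", "!", "?"}
    return folded

print(case_fold(["The", "Fed", "raised", "rates", "."]))
# ['the', 'Fed', 'raised', 'rates', '.']
```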
Properties of Text
• Word frequencies - skewed distribution
• "The" and "of" account for 10% of all words
• The six most common words account for 40%
Associated Press Corpus 'AP89'
From [Croft, Metzler & Strohman 2010]
Middle Ground
From [Croft, Metzler & Strohman 2010]
Word Frequency
TF x IDF
Tk   = term k in document Di
tfik = frequency of term Tk in document Di
idfk = inverse document frequency of term Tk in collection C
     = log(N / nk)
N    = total number of documents in the collection C
nk   = the number of documents in C that contain Tk
Inverse Document Frequency
The tf-idf weight of term k in document i, normalized by the document's vector length:
  wik = tfik · (1 + log(N / nk)) / sqrt( Σ_{k=1}^{t} (tfik)² · [1 + log(N / nk)]² )
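A minimal sketch of computing these tf-idf weights for a small collection, following the formula above with the (1 + log(N/nk)) idf factor (pure Python; the code itself is an illustrative assumption):

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    N = len(docs)
    df = Counter(term for doc in docs for term in set(doc))        # n_k
    idf = {t: math.log(N / df[t]) for t in df}                     # log(N / n_k)
    weights = []
    for doc in docs:
        tf = Counter(doc)                                          # tf_ik
        w = {t: tf[t] * (1 + idf[t]) for t in tf}
        norm = math.sqrt(sum(v * v for v in w.values()))           # length normalization
        weights.append({t: v / norm for t, v in w.items()})
    return weights
```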
Evaluation in
Multi-class Problems
Evaluation:
Classic Reuters-21578 Data Set Sec. 15.2.4
• Most (over)used data set, 21,578 docs (each 90 types, 200 tokens)
• 9603 training, 3299 test articles (ModApte/Lewis split)
• 118 categories
– An article can be in more than one category
– Learn 118 binary category distinctions
• Average document (with at least one category) has 1.24 classes
• Only about 10 out of 118 categories are large
Common categories (#train, #test):
• Earn (2877, 1087)
• Acquisitions (1650, 179)
• Money-fx (538, 179)
• Grain (433, 149)
• Crude (389, 189)
• Trade (369, 119)
• Interest (347, 131)
• Ship (197, 89)
• Wheat (212, 71)
• Corn (182, 56)
Reuters Text Categorization data set (Reuters-21578) document Sec. 15.2.4
[Figure: a sample Reuters-21578 article in SGML markup, ending with </BODY></TEXT></REUTERS>]
Precision & Recall
Two-class situation (contingency table):

             Predicted "P"   Predicted "N"
  Actual P        TP              FN
  Actual N        FP              TN

Precision = TP / (TP + FP)
Recall    = TP / (TP + FN)
F-measure = 2pr / (p + r)

Multi-class situation: build one such table per class (one-vs-rest) and combine the per-class numbers, as on the following slides.
Micro- vs. Macro-Averaging
• If we have more than one class, how do we combine
multiple performance measures into one quantity?
• Macroaveraging
– Compute performance for each class, then average.
• Microaveraging
– Collect decisions for all classes, compute contingency table,
evaluate
Precision & Recall
Multi-class, multi-label situation: compute precision pi and recall ri for each of the N classes, then aggregate.
Average Macro Precision pM = Σ pi / N
Average Macro Recall rM = Σ ri / N
Average Macro F-measure = 2 pM rM / (pM + rM)
(False positives for a class are the classifier's "hallucinations": items assigned to the class that don't belong to it.)
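A minimal sketch of computing macro- and micro-averaged precision/recall from per-class contingency counts as above (pure Python; the counts data structure is an assumed illustration, and it assumes no zero denominators):

```python
def macro_micro(counts):
    """counts: {class: (TP, FP, FN)}. Returns macro and micro precision/recall."""
    precisions = [tp / (tp + fp) for tp, fp, fn in counts.values()]
    recalls    = [tp / (tp + fn) for tp, fp, fn in counts.values()]
    macro_p = sum(precisions) / len(counts)          # average per-class precision
    macro_r = sum(recalls) / len(counts)
    TP = sum(tp for tp, fp, fn in counts.values())   # pool all decisions, then compute
    FP = sum(fp for tp, fp, fn in counts.values())
    FN = sum(fn for tp, fp, fn in counts.values())
    micro_p, micro_r = TP / (TP + FP), TP / (TP + FN)
    return macro_p, macro_r, micro_p, micro_r
```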