
Text Categorization

using Classical ML
Mausam
(based on slides of Dan Weld, Dan Jurafsky, Prabhakar Raghavan, Hinrich Schütze, Guillaume Obozinski, David D. Lewis, Fei Xia, Michael Collins, Emily Fox, Alexander Ihler, Dan Klein, Chris Manning, Ray Mooney, Mark Schmidt, Alex Yates, Luke Zettlemoyer)

1
Categorization

• Given:
– A description of an instance, x ∈ X, where X is the instance language or instance space.
– A fixed set of categories: C = {c1, c2, …, cn}
• Determine:
– The category of x: c(x) ∈ C, where c(x) is a categorization function whose domain is X and whose range is C.

2
County vs. Country?

3
Male or female author?
• The main aim of this article is to propose an exercise in stylistic analysis which can be employed in the teaching of English language. It details the design and results of a workshop activity on narrative carried out with undergraduates in a university department of English. The methods proposed are intended to enable students to obtain insights into aspects of cohesion and narrative structure: insights, it is suggested, which are not as readily obtainable through more traditional techniques of stylistic analysis.

• My aim in this article is to show that given a relevance theoretic approach to utterance interpretation, it is possible to develop a better understanding of what some of these so-called apposition markers indicate. It will be argued that the decision to put something in other words is essentially a decision about style, a point which is, perhaps, anticipated by Burton-Roberts when he describes loose apposition as a rhetorical device. However, he does not justify this suggestion by giving the criteria for classifying a mode of expression as a rhetorical device.

Female writers use more first person/second person pronouns and more gender-laden third person pronouns (overall more personalization).
S. Argamon, M. Koppel, J. Fine, A. R. Shimoni, 2003. “Gender, Genre, and Writing Style in Formal Written Texts,” Text, volume 23, number 3,
pp. 321–346
Positive or negative movie review?

• unbelievably disappointing
• Full of zany characters and richly applied
satire, and some great plot twists
• this is the greatest screwball comedy ever
filmed
• It was pathetic. The worst part about it was
the boxing scenes.
6
What is the subject of this article?

MeSH Subject Category Hierarchy

[Figure: a MEDLINE article to be assigned one or more categories from the MeSH hierarchy]
• Antagonists and Inhibitors
• Blood Supply
• Chemistry
• Drug Therapy
• Embryology
• Epidemiology
• …

7
Text Classification
• Assigning documents to a fixed set of categories, e.g.
• Web pages
– Yahoo-like classification
– Assigning subject categories, topics, or genres
• Email messages
– Spam filtering
– Prioritizing
– Folderizing
• Blogs/Letters/Books
– Authorship identification
– Age/gender identification
• Reviews/Social media
– Language Identification
– Sentiment analysis
– …
Classification Methods:
Hand-coded rules
• Rules based on combinations of words or
other features
– spam: black-list-address OR (“dollars” AND
“have been selected”)
• Accuracy can be high
– If rules carefully refined by expert
• But building and maintaining these rules is
expensive
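
As a concrete illustration of the rule above, here is a minimal sketch of such a hand-coded spam rule in Python; the blacklist addresses and trigger phrases are made-up examples, not from the slides.

```python
# Minimal sketch of a hand-coded spam rule (illustrative values only).
BLACKLIST = {"offers@spam.example", "win@lottery.example"}

def is_spam(sender: str, body: str) -> bool:
    text = body.lower()
    # spam: black-listed address OR ("dollars" AND "have been selected")
    return sender in BLACKLIST or ("dollars" in text and "have been selected" in text)

print(is_spam("offers@spam.example", "Hello"))   # True: black-listed sender
print(is_spam("friend@mail.example",
              "You have been selected to receive 1000 dollars"))  # True: both phrases
print(is_spam("friend@mail.example", "Lunch tomorrow?"))          # False
```

Maintaining many such rules by hand quickly becomes the expensive part, which motivates the learned classifiers below.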

• MACHINE LEARNING
Bayesian Methods
• Learning and classification methods based
on probability theory.
– Bayes theorem plays a critical role in
probabilistic learning and classification.
– Uses prior probability of each category given
no information about an item.
• Categorization produces a posterior
probability distribution over the possible
categories given a description of an item.

10
The bag of words representation

γ( "I love this movie! It's sweet, but with satirical humor. The dialogue is great and the adventure scenes are fun… It manages to be whimsical and romantic while laughing at the conventions of the fairy tale genre. I would recommend it to just about anyone. I've seen it several times, and I'm always happy to see it again whenever I have a friend who hasn't seen it yet." ) = c
The bag of words representation:
using a subset of words

γ( "x love xxxxxxxxxxxxxxxx sweet xxxxxxx satirical xxxxxxxxxx xxxxxxxxxxx great xxxxxxx xxxxxxxxxxxxxxxxxxx fun xxxx xxxxxxxxxxxxx whimsical xxxx romantic xxxx laughing xxxxxxxxxxxxxx recommend xxxxx xx several xxxxxxxxxxxxxxxxx xxxxx happy xxxxxxxxx again xxxxxxxxxxxxxxxxx" ) = c
The bag of words representation

γ(
   great      2
   love       2
   recommend  1
   laugh      1
   happy      1
   ...        ...
) = c
Bayes’ Rule Applied to Documents and
Classes

• For a document d and a class c:

  P(c | d) = P(d | c) P(c) / P(d)
Naïve Bayes Classifier (I)

cMAP = argmax_{c∈C} P(c | d)               MAP is "maximum a posteriori" = most likely class

     = argmax_{c∈C} P(d | c) P(c) / P(d)   Bayes rule

     = argmax_{c∈C} P(d | c) P(c)          Dropping the denominator
Naïve Bayes Classifier (II)

cMAP = argmax_{c∈C} P(d | c) P(c)

     ≈ argmax_{c∈C} P(x1, x2, …, xn | c) P(c)

Document d is represented as features x1 … xn.
Naïve Bayes Classifier (IV)

cMAP  argmax P( x1 , x2 ,, xn | c) P(c)


cC

O(|X|n•|C|) parameters How often does this class


occur?

Could only be estimated if a very,


very large number of training We can just count the
relative frequencies in a
examples was available. corpus
Multinomial Naïve Bayes Classifier

cMAP  argmax P( x1 , x2 ,, xn | c) P(c)


cC

cNB = argmax P(c j )Õ P(x | c)


cÎC xÎX
Multinomial Naïve Bayes Independence
Assumptions

P( x1 , x2 ,, xn | c)
• Bag of Words assumption: Assume position doesn’t
matter
• Conditional Independence: Assume the feature
probabilities P(xi|cj) are independent given the class c.

P( x1 ,, xn | c)  P( x1 | c)  P( x2 | c)  P( x3 | c)  ...  P( xn | c)
Learning the Multinomial Naïve Bayes Model

• First attempt: maximum likelihood estimates


– simply use the frequencies in the data
P̂(cj) = doccount(C = cj) / Ndoc

P̂(wi | cj) = count(wi, cj) / Σ_{w∈V} count(w, cj)
Problem with Maximum Likelihood Sec. 13.3
• What if we have seen no training documents
with the word fantastic and classified in the
topic positive (thumbs-up)?

count("fantastic", positive)
P̂("fantastic" positive) = = 0
å count(w, positive)
wÎV

• Zero probabilities cannot be conditioned


away, no matter the other evidence!
cMAP = argmax_c P̂(c) ∏_i P̂(xi | c)
Laplace (add-1) smoothing for Naïve Bayes

P̂(wi | c) = (count(wi, c) + 1) / Σ_{w∈V} (count(w, c) + 1)

           = (count(wi, c) + 1) / ( (Σ_{w∈V} count(w, c)) + |V| )
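
A minimal sketch of these estimates in code; the toy corpus, class names, and function names are illustrative, not from the slides.

```python
from collections import Counter, defaultdict
import math

# Toy training data: (document tokens, class)
train = [
    ("fun couple love love".split(), "pos"),
    ("fast furious shoot".split(), "action"),
    ("couple fly fast fun fun".split(), "pos"),
    ("furious shoot shoot fun".split(), "action"),
    ("fly fast shoot love".split(), "action"),
]

vocab = {w for doc, _ in train for w in doc}
doc_count = Counter(c for _, c in train)        # doccount(C = c)
word_count = defaultdict(Counter)               # count(w, c)
for doc, c in train:
    word_count[c].update(doc)

def log_prior(c):
    # log P̂(c) = log( doccount(c) / Ndoc )
    return math.log(doc_count[c] / len(train))

def log_likelihood(w, c):
    # Laplace (add-1) smoothing:
    # P̂(w|c) = (count(w,c) + 1) / (Σ_w' count(w',c) + |V|)
    return math.log((word_count[c][w] + 1) /
                    (sum(word_count[c].values()) + len(vocab)))
```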
Easy to Implement

• But…

• If you do… it probably won’t work…

32
Probabilities: Important Detail!

 We are multiplying lots of small numbers: danger of underflow!
   – 0.5^57 ≈ 7E-18

 Solution? Use logs and add!
   – p1 × p2 = e^(log(p1) + log(p2))
   – Always keep in log form
33
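
Continuing the sketch above, classification then sums log probabilities instead of multiplying raw probabilities, so nothing underflows:

```python
def classify(doc):
    # cNB = argmax_c [ log P̂(c) + Σ_i log P̂(x_i | c) ]
    scores = {}
    for c in doc_count:
        scores[c] = log_prior(c) + sum(log_likelihood(w, c)
                                       for w in doc if w in vocab)
    return max(scores, key=scores.get)

print(classify("fast shoot furious".split()))   # "action" on the toy data above
print(classify("love love fun".split()))        # "pos"
```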
Generative Model for Multinomial Naïve Bayes

[Figure: class node c = China generates the words X1 = Shanghai, X2 = and, X3 = Shenzhen, X4 = issue, X5 = bonds]

34
Advantages
• Simple to implement
– No numerical optimization, matrix algebra, etc
• Efficient to train and use
– Easy to update with new data
– Fast to apply
• Binary/multi-class
• Good in domains with many equally important features
– Decision Trees suffer from fragmentation in such cases –
especially if little data
• Comparatively good effectiveness with small training sets
• A good dependable baseline for text classification
– But we will see other classifiers that give better accuracy

35
Disadvantages

• Independence assumption wrong


– Absurd estimates of class probabilities
• Output probabilities close to 0 or 1
– Thresholds must be tuned; not set analytically

• Generative model
– Generally lower effectiveness than
discriminative techniques

36
Experimental Evaluation

Question: How do we estimate the


performance of classifier on unseen data?
• Can't just look at accuracy on training data – this will yield an over-optimistic estimate of performance
• Solution: Cross-validation
• Note: this is sometimes called estimating
how well the classifier will generalize

37
Evaluation: Cross Validation
• Partition examples into k disjoint sets
• Now create k training sets
– Each set is union of all equiv classes except one
– So each set has (k-1)/k of the original training data
[Figure: k-fold partition of the data; in each round one fold is held out as Test and the remaining folds form Train]

38
Cross-Validation (2)
• Leave-one-out
– Use if < 100 examples (rough estimate)
– Hold out one example, train on remaining examples

• 10-fold
– If have 100-1000’s of examples

39
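
A minimal sketch of k-fold cross-validation; the train_fn/eval_fn interface is an assumption for illustration, not something defined in the slides.

```python
import random

def cross_validate(examples, k, train_fn, eval_fn, seed=0):
    """Partition examples into k disjoint folds; train on k-1 folds, test on the held-out one."""
    examples = examples[:]
    random.Random(seed).shuffle(examples)
    folds = [examples[i::k] for i in range(k)]           # k disjoint sets
    scores = []
    for i in range(k):
        test = folds[i]
        train = [x for j, f in enumerate(folds) if j != i for x in f]   # (k-1)/k of the data
        model = train_fn(train)
        scores.append(eval_fn(model, test))
    return sum(scores) / k                               # average held-out performance

# Leave-one-out is the special case k = len(examples).
```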
Joint vs. Conditional Models

• We have some data {(d, c)} of paired


observations d and hidden classes c.
• Joint (generative) models place probabilities
over both observed data and the hidden stuff
(generate the observed data from hidden
stuff):
– All the classic Stat-NLP models:
• n-gram models, Naive Bayes classifiers, hidden
Markov models, probabilistic context-free
grammars, IBM machine translation alignment
models
Joint vs. Conditional Models

• Discriminative (conditional) models take


the data as given, and put a probability over
hidden structure given the data:
• Logistic regression, conditional loglinear or
maximum entropy models, conditional random
fields
• Also, SVMs, (averaged) perceptron, etc. are
discriminative classifiers (but not directly
probabilistic)
Conditional vs. Joint Likelihood

• A joint model gives probabilities P(d,c) and


tries to maximize this joint likelihood.
– It turns out to be trivial to choose weights: just
relative frequencies.
• A conditional model gives probabilities
P(c|d). It takes the data as given and models
only the conditional probability of the class.
– We seek to maximize conditional likelihood.
– Harder to do (as we’ll see…)
– More closely related to classification error.
Text Categorization with Word Features

Data (Zhang and Oles 2001)


BUSINESS: Stocks hit a yearly low …
Label: BUSINESS
Features: {…, stocks, hit, a, yearly, low, …}

• Features are presence of each word in a document and the document class (they do feature selection to use reliable indicator words)
• Tests on classic Reuters data set (and others)
  – Naïve Bayes: 77.0% F1
  – Logistic regression: 86.4%
  – Support vector machine: 86.5%
Feature-Based Linear Classifiers
• Linear classifiers at classification time:
– Linear function from feature sets {ϕi} to
classes{y}.
– Assign a weight wi to each feature ϕi.
– We consider each class for an observed
datum x

– For a pair (x,y), features vote with their weights:
  • vote(y) = Σi wiϕi(x,y)
  • Choose the class y which maximizes Σi wiϕi(x,y)
Features for Multi-Class Problems

• ϕi(x,y) = 1 if ϕi(x) = 1 and label(x) = y


= 0 otherwise

Assign a weight for each feature ϕi(x,y), i.e., a different


weight for each prediction y

For a pair (x,y), features vote with their weights:


• vote(y) = wiϕi(x,y)
• Choose the class y which maximizes wiϕi(x,y)
• This can be written in linear algebra notation as WTX and it will
yield a |X|x|Y| matrix with a score for each (x,y)
48
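
A minimal sketch of this voting scheme with joint features ϕi(x, y); the feature template and weights are illustrative only, not learned here.

```python
# vote(y) = Σ_i w_i · ϕ_i(x, y); predict the class with the highest vote.
def features(x_words, y):
    # Joint features: each (word, label) pair fires only for that label.
    return {(w, y): 1.0 for w in x_words}

def predict(x_words, weights, classes):
    def vote(y):
        return sum(weights.get(f, 0.0) * v for f, v in features(x_words, y).items())
    return max(classes, key=vote)

# Illustrative weights (hand-set, not learned):
w = {("stocks", "BUSINESS"): 1.2, ("yearly", "BUSINESS"): 0.4, ("stocks", "SPORTS"): -0.3}
print(predict(["stocks", "hit", "a", "yearly", "low"], w, ["BUSINESS", "SPORTS"]))  # BUSINESS
```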
“all models are wrong
some are useful!”
-- George Box

49
Exponential Models
(log-linear, maxent, Logistic, Gibbs)
 Model: use the scores as probabilities:

   p(y | x; w) = exp( Σi wiϕi(x,y) ) / Σ_{y'} exp( Σi wiϕi(x,y') )
   (the exponential makes the scores positive; the denominator normalizes them)

 Learning: maximize the (log) conditional likelihood of training data

 Prediction: output argmax_y p(y | x; w)
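
A minimal sketch of these three pieces, reusing the (word, label) feature convention from the previous sketch; variable names are mine, and the formula is the standard softmax over linear scores.

```python
import math

def probs(x_words, weights, classes):
    # p(y | x; w) ∝ exp(Σ_i w_i ϕ_i(x, y)): exponentiate to make scores positive, then normalize.
    scores = {y: sum(weights.get((w_, y), 0.0) for w_ in x_words) for y in classes}
    m = max(scores.values())                      # subtract the max for numerical stability
    exps = {y: math.exp(s - m) for y, s in scores.items()}
    z = sum(exps.values())
    return {y: e / z for y, e in exps.items()}

def log_conditional_likelihood(data, weights, classes):
    # L(w) = Σ_(x,y) log p(y | x; w): the objective maximized during learning
    return sum(math.log(probs(x, weights, classes)[y]) for x, y in data)

def predict(x_words, weights, classes):
    p = probs(x_words, weights, classes)
    return max(p, key=p.get)                      # argmax_y p(y | x; w)
```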


Derivative of
Log-linear Model
• Unfortunately, argmax_w L(w) doesn't have a closed-form solution
• We will have to differentiate and use gradient ascent

  ∂L/∂wj = [total count of feature j in candidates with class k] − [expected count of feature j in predicted candidates of class k]
Proof
(Conditional Likelihood Derivative)
• Recall

  P(Y | X, w) = ∏_{(x,y)∈D} p(y | x, w)

• We can separate the log of this into two components: the numerator terms N(w) and the denominator terms D(w)
• The derivative is the difference between the derivatives of each component:

  log P(Y | X, w) = N(w) − D(w)
Proof: Numerator

Proof: Denominator
• The derivative of the denominator term D(w) works out to the expected count of feature j predicted with class k
Proof (concluded)

• The optimum parameters are the ones for which each feature’s
predicted expectation equals its empirical expectation. The optimum
distribution is:
– Always unique (but parameters may not be unique)
– Always exists (if feature counts are from actual data).
• These models are also called maximum entropy models because we find the model that has the maximum entropy while satisfying the constraints:

  Ep(ϕi) = Ep̃(ϕi), for all i
 Basic idea: move uphill from current guess
 Gradient ascent / descent follows the gradient incrementally
 At local optimum, derivative vector is zero
 Will converge if step sizes are small enough, but not efficient
 All we need is to be able to evaluate the function and its derivative
 For convex functions, a local optimum will be global
 Basic gradient ascent isn’t very efficient, but there are simple
enhancements which take into account previous gradients:
conjugate gradient, L-BFGS
 There are special-purpose optimization techniques for maxent,
like iterative scaling, but they aren’t better
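
A minimal sketch of batch gradient ascent on the conditional likelihood, using the empirical-minus-expected-counts gradient from the earlier slide and the probs function sketched above; the step size and iteration count are arbitrary choices, and real implementations would use L-BFGS or similar.

```python
from collections import defaultdict

def gradient(data, weights, classes):
    # dL/dw_j = Σ ϕ_j(x, y_true) − Σ Σ_y p(y | x; w) ϕ_j(x, y)
    grad = defaultdict(float)
    for x, y_true in data:
        for w_ in x:
            grad[(w_, y_true)] += 1.0             # empirical (observed) feature count
        p = probs(x, weights, classes)
        for y in classes:
            for w_ in x:
                grad[(w_, y)] -= p[y]             # expected feature count under the model
    return grad

def gradient_ascent(data, classes, step=0.1, iters=100):
    weights = defaultdict(float)
    for _ in range(iters):                        # move uphill from the current guess
        for f, g in gradient(data, weights, classes).items():
            weights[f] += step * g
    return weights
```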
What About Overfitting?
• For Naïve Bayes, we were worried about zero counts in MLE
estimates
– Can that happen here?

• Regularization (smoothing) for Log-linear models


– Instead, we worry about large feature weights
– Add a regularization term to the likelihood to push weights
towards zero
Derivative for Regularized Maximum Entropy
• Unfortunately, argmax_w L(w) still doesn't have a closed-form solution
• We will have to differentiate and use gradient ascent

  ∂L/∂wj = [total count of feature j in correct candidates] − [expected count of feature j in predicted candidates] − [regularization term: big weights are bad]
L1 and L2 Regularization
L2 Regularization for Log-linear models
– Instead, we worry about large feature weights
– Add a regularization term to the likelihood to push weights towards zero:

  Lreg(w) = L(w) − λ Σj wj²        (λ is the regularization constant)

L1 Regularization for Log-linear models
– Instead, we worry about the number of active features
– Add a regularization term to the likelihood to push weights to zero:

  Lreg(w) = L(w) − λ Σj |wj|

– For L1 regularization, we need to compute subgradients.
L1 vs L2
• Optimizing L1 is harder
  – Objective is non-differentiable where weights are zero
  – Subgradient descent versus gradient descent
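
A small sketch of how the penalty changes the gradient computed above; λ is the regularization constant, and the L1 branch uses the sign of the weight, i.e. a subgradient at zero.

```python
def regularized_gradient(data, weights, classes, lam=0.1, kind="l2"):
    grad = gradient(data, weights, classes)       # empirical minus expected counts, as before
    for f, w_val in weights.items():
        if kind == "l2":
            grad[f] -= 2.0 * lam * w_val          # derivative of λ Σ w_j²: pushes big weights down
        else:                                     # L1: subgradient of λ Σ |w_j|
            grad[f] -= lam * (1.0 if w_val > 0 else -1.0 if w_val < 0 else 0.0)
    return grad
```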
How to pick weights?
• Goal: choose “best” vector w given training data
– For now, we mean “best for classification”

• The ideal: the weights which have greatest test set accuracy /
F1 / whatever
– But, don’t have the test set
– Must compute weights from training set

• Maybe we want weights which give best training set


accuracy?
– May not (does not) generalize to test set
– Easy to overfit

• Use devset
Diving Deeper into Feature Engineering
Construct Better Features

• Key to machine learning is having good


features

• In gen 2 ML, large effort devoted to


constructing appropriate features

• Ideas??

67
Issues in document representation

Cooper’s concordance of Wordsworth was published in


1911. The applications of full-text retrieval are legion:
they include résumé scanning, litigation support and
searching published journals on-line.

• Cooper’s vs. Cooper vs. Coopers.


• Full-text vs. full text vs. {full, text} vs. fulltext.
• résumé vs. resume.

slide from Raghavan, Schütze,


Larson
Punctuation

• Ne’er: use language-specific, handcrafted


“locale” to normalize.
• State-of-the-art: break up hyphenated
sequence.
• U.S.A. vs. USA
• a.out

slide from Raghavan, Schütze,


Numbers

• 3/12/91
• Mar. 12, 1991
• 55 B.C.
• B-52
• 100.2.86.144
– Generally, don’t represent as text
– Creation dates for docs

slide from Raghavan, Schütze,


Larson
Possible Feature Ideas

• Look at capitalization (may indicate a proper noun)

• Look for commonly occurring sequences


• E.g. New York, New York City
• Limit to 2-3 consecutive words
• Keep all that meet minimum threshold (e.g.
occur at least 5 or 10 times in corpus)

71
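
A minimal sketch of the frequent-sequence idea above: collect 2- and 3-word sequences and keep those above a minimum corpus count. The threshold and corpus format are illustrative.

```python
from collections import Counter

def frequent_ngrams(tokenized_docs, n_values=(2, 3), min_count=5):
    counts = Counter()
    for tokens in tokenized_docs:
        for n in n_values:
            counts.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    # Keep sequences that occur at least min_count times in the corpus, e.g. ("new", "york")
    return {ng for ng, c in counts.items() if c >= min_count}
```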
Case folding
• Reduce all letters to lower case
• Exception: upper case in mid-sentence
– e.g., General Motors
– Fed vs. fed
– SAIL vs. sail

slide from Raghavan, Schütze,


Larson
Thesauri and Soundex

• Handle synonyms and spelling variations


– Hand-constructed equivalence classes
• e.g., car = automobile

slide from Raghavan, Schütze,


Spell Correction

• Look for all words within (say) edit distance


3 (Insert/Delete/Replace) at query time
– e.g., arfiticial inteligence
• Spell correction is expensive and slows the
processing significantly
– Invoke only when index returns zero matches?

slide from Raghavan, Schütze,


Stemming

• Are there different index terms?


– retrieve, retrieving, retrieval, retrieved, retrieves…
• Stemming algorithm:
– (retrieve, retrieving, retrieval, retrieved, retrieves) → retriev
– Strips prefixes or suffixes (-s, -ed, -ly, -ness)
– Morphological stemming
• Problems: sand / sander & wand / wander

Copyright © Weld 2002-2007    76


Stemming Continued

• Can reduce vocabulary by ~ 1/3


• C, Java, Perl, Python, C# versions available:
www.tartarus.org/~martin/PorterStemmer
• Criterion for removing a suffix
  – Does "a document is about w1" mean the same as "a document is about w2"?
• Problems: sand / sander & wand / wander

• Commercial SEs use giant in-memory tables

Copyright © Weld 2002-2007    77
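
For reference, the Porter stemmer linked above also ships with NLTK; a quick usage sketch, assuming NLTK is installed:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for w in ["retrieve", "retrieving", "retrieval", "retrieved", "retrieves"]:
    print(w, "->", stemmer.stem(w))   # all collapse to (roughly) the same stem, "retriev"
```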


Features Sec. 15.3.2

• Domain-specific features and weights: very


important in real performance

• Upweighting: Counting a word as if it occurred


twice:
– title words (Cohen & Singer 1996)
– first sentence of each paragraph (Murata, 1999)
– In sentences that contain title words (Ko et al, 2002)

79
Properties of Text
• Word frequencies - skewed distribution
• `The’ and `of’ account for 10% of all words
• Six most common words account for 40%

80
From [Croft, Metzler & Strohman 2010]
Associated Press Corpus `AP89'
[Table: AP89 corpus statistics, from Croft, Metzler & Strohman 2010]

81
Middle Ground

• Very common words → bad features


• Language-based stop list:
words that bear little meaning
20-500 words
http://www.dcs.gla.ac.uk/idom/ir_resources/linguistic_utils/stop_words

• Subject-dependent stop lists

• Very rare words are also bad features
  – Drop words appearing fewer than k times in the corpus

82
Word Frequency

• Which word is more indicative of document similarity?


– ‘book,’ or ‘Rumplestiltskin’?
– Need to consider “document frequency”--- how frequently the
word appears in doc collection.

• Which doc is a better match for the query “Kangaroo”?


– One with a single mention of Kangaroos… or a doc that
mentions it 10 times?
– Need to consider “term frequency”--- how many times the
word appears in the current document.

83
TF x IDF

wik  tfik * log( N / nk )

Tk  term k in document Di
tfik  frequency of term Tk in document Di
idfk  inverse document frequency of term Tk in C
idfk  log N 
 nk 
N  total number of documents in the collection C
nk  the number of documents in C that contain Tk
84
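
A minimal sketch of the weight formula above; the toy documents are illustrative, and log base 10 matches the worked examples on the next slide.

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """docs: list of token lists. Returns one {term: w_ik} dict per document."""
    N = len(docs)
    df = Counter()                                # n_k: number of docs containing term k
    for tokens in docs:
        df.update(set(tokens))
    weights = []
    for tokens in docs:
        tf = Counter(tokens)                      # tf_ik
        weights.append({t: tf[t] * math.log10(N / df[t]) for t in tf})
    return weights

docs = [["kangaroo", "zoo", "visit"], ["zoo", "zoo", "ticket"], ["book", "ticket"]]
print(tfidf_weights(docs)[0]["kangaroo"])         # 1 * log10(3/1) ≈ 0.477
```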
Inverse Document Frequency

• IDF provides high values for rare words and


low values for common words
 10000 
log 0
 10000 
 10000 
log   0.301
 5000 
 10000 
log   2.698
 20 
 10000 
log 4
 1 
• Add 1 to avoid 0. 85
TF-IDF normalization

• Normalize the term weights


– so longer docs not given more weight (fairness)
– force all values to fall within a certain range: [0, 1]

wik = tfik (1 + log(N / nk)) / √( Σ_{k=1..t} (tfik)² [1 + log(N / nk)]² )
86
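
A small follow-up sketch: dividing each document's weight vector by its Euclidean length, so long documents are not favoured and all values fall in [0, 1]. It reuses the tfidf_weights sketch above.

```python
import math

def normalize(weight_dict):
    # Divide by the vector's Euclidean (L2) length.
    norm = math.sqrt(sum(w * w for w in weight_dict.values()))
    return {t: w / norm for t, w in weight_dict.items()} if norm > 0 else weight_dict

# normalized = [normalize(w) for w in tfidf_weights(docs)]
```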
Evaluation in
Multi-class Problems

87
Evaluation:
Classic Reuters-21578 Data Set Sec. 15.2.4

• Most (over)used data set, 21,578 docs (each 90 types, 200 tokens)
• 9603 training, 3299 test articles (ModApte/Lewis split)
• 118 categories
– An article can be in more than one category
– Learn 118 binary category distinctions
• Average document (with at least one category) has 1.24 classes
• Only about 10 out of 118 categories are large
• Common categories (#train, #test):
  – Earn (2877, 1087)          – Trade (369, 119)
  – Acquisitions (1650, 179)   – Interest (347, 131)
  – Money-fx (538, 179)        – Ship (197, 89)
  – Grain (433, 149)           – Wheat (212, 71)
  – Crude (389, 189)           – Corn (182, 56)

88
Reuters Text Categorization data set
(Reuters-21578) document Sec. 15.2.4

<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET" OLDID="12981"


NEWID="798">
<DATE> 2-MAR-1987 16:51:43.42</DATE>
<TOPICS><D>livestock</D><D>hog</D></TOPICS>
<TITLE>AMERICAN PORK CONGRESS KICKS OFF TOMORROW</TITLE>
<DATELINE> CHICAGO, March 2 - </DATELINE><BODY>The American Pork Congress kicks off tomorrow,
March 3, in Indianapolis with 160 of the nations pork producers from 44 member states determining industry positions
on a number of issues, according to the National Pork Producers Council, NPPC.
Delegates to the three day Congress will be considering 26 resolutions concerning various issues, including the future
direction of farm policy and the tax law as it applies to the agriculture sector. The delegates will also debate whether to
endorse concepts of a national PRV (pseudorabies virus) control and eradication program, the NPPC said.
A large trade show, in conjunction with the congress, will feature the latest in technology in all areas of the industry,
the NPPC added. Reuter

&#3;</BODY></TEXT></REUTERS>

89
Precision & Recall
Two-class situation:

                Predicted
                "P"    "N"
  Actual   P    TP     FN
           N    FP     TN

  Precision = TP/(TP+FP)
  Recall    = TP/(TP+FN)
  F-measure = 2pr/(p+r)

[Figure: in the multi-class situation, the same counts are read off a larger confusion matrix: TP is a class's diagonal cell, FP the rest of its predicted column, FN the rest of its actual row]

90
Micro- vs. Macro-Averaging
• If we have more than one class, how do we combine
multiple performance measures into one quantity?

• Macroaveraging
– Compute performance for each class, then average.

• Microaveraging
– Collect decisions for all classes, compute contingency table,
evaluate

91
Precision & Recall
Multi-class, multi-label situation:

[Figure: multi-class confusion matrix; e.g. Precision(class 1) = 251 / (sum of column 1), Recall(class 1) = 251 / (sum of row 1)]

Per class i:
  Precision(class i) = TPi / (TPi + FPi)
  Recall(class i)    = TPi / (TPi + FNi)
  F-measure(class i) = 2 pi ri / (pi + ri)

Aggregate:
  Average Macro Precision  = Σ pi / N
  Average Macro Recall     = Σ ri / N
  Average Macro F-measure  = 2 pM rM / (pM + rM)

  Average Micro Precision  = Σi TPi / Σi Coli
  Average Micro Recall     = Σi TPi / Σi Rowi
  Average Micro F-measure  = 2 pμ rμ / (pμ + rμ)

92
Precision & Recall
Multi-class situation with missed predictions and classifier hallucinations:

[Figure: the confusion matrix gains a column for missed predictions (no class predicted) and a row for classifier hallucinations (predictions with no corresponding gold instance)]

Aggregate (as before):
  Average Macro Precision  = Σ pi / N
  Average Macro Recall     = Σ ri / N
  Average Macro F-measure  = 2 pM rM / (pM + rM)

  Average Micro Precision  = Σi TPi / Σi Coli
  Average Micro Recall     = Σi TPi / Σi Rowi
  Average Micro F-measure  = 2 pμ rμ / (pμ + rμ)

• Aren't micro precision and micro recall the same? Not here: missed predictions and hallucinations make the column totals (number of predictions) differ from the row totals (number of gold instances).

Per class i:
  Precision(class i) = TPi / (TPi + FPi)       e.g. Precision(class 1) = 251 / (sum of column 1)
  Recall(class i)    = TPi / (TPi + FNi)       e.g. Recall(class 1)    = 251 / (sum of row 1)
  F-measure(class i) = 2 pi ri / (pi + ri)

93
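
A minimal sketch of macro- vs. micro-averaging from per-class TP/FP/FN counts; the counts in the example call are illustrative only.

```python
def prf(tp, fp, fn):
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

def macro_micro(per_class):
    """per_class: list of (TP_i, FP_i, FN_i) tuples, one per class."""
    prfs = [prf(*counts) for counts in per_class]
    macro_p = sum(p for p, _, _ in prfs) / len(prfs)     # average of per-class precisions
    macro_r = sum(r for _, r, _ in prfs) / len(prfs)     # average of per-class recalls
    tp = sum(c[0] for c in per_class)                    # pool the contingency tables,
    fp = sum(c[1] for c in per_class)                    # then compute P/R once
    fn = sum(c[2] for c in per_class)
    micro_p, micro_r, _ = prf(tp, fp, fn)
    return (macro_p, macro_r), (micro_p, micro_r)

print(macro_micro([(251, 20, 30), (40, 10, 5)]))         # illustrative counts
```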
