
CS447: Natural Language Processing

http://courses.engr.illinois.edu/cs447

Lecture 3:
Language Models
(Intro to Probability Models for NLP)

Julia Hockenmaier
juliahmr@illinois.edu
3324 Siebel Center
Lecture 03, Part 1: Overview



Last lecture’s key concepts
Dealing with words:
— Tokenization, normalization
— Zipf’s Law

Morphology (word structure):


— Stems, affixes
— Derivational vs. inflectional morphology
— Compounding
— Stem changes
— Morphological analysis and generation

Finite-state methods in NLP


— Finite-state automata vs. finite-state transducers
— Composing finite-state transducers



Finite-state transducers
– FSTs define a relation between two regular languages.
– Each state transition maps (transduces) a character from the input language to a character (or a sequence of characters) in the output language (written x:y).
– By using the empty character (ε), characters can be deleted (x:ε) or inserted (ε:y).
– FSTs can be composed (cascaded), allowing us to define intermediate representations.
Today’s lecture
How can we distinguish word salad, spelling errors, and grammatical sentences?

Language models define probability distributions over the strings in a language.

N-gram models are the simplest and most common kind of language model.

We’ll look at how these models are defined, how to estimate (learn) their parameters, and what their shortcomings are.

We’ll also review some very basic probability theory.


Why do we need language models?
Many NLP tasks require natural language output:
—Machine translation: return text in the target language
—Speech recognition: return a transcript of what was spoken
—Natural language generation: return natural language text
—Spell-checking: return corrected spelling of input

Language models define probability distributions over (natural language) strings or sentences.
➔ We can use a language model to generate strings
➔ We can use a language model to score/rank candidate strings
so that we can choose the best (i.e. most likely) one:
if PLM(A) > PLM(B), return A, not B



Hmmm, but…
… what does it mean for a language model to “define a probability distribution”?

… why would we want to define probability distributions over languages?

… how can we construct a language model such that it actually defines a probability distribution?

… how do we know how well our model works?

You should be able to answer these questions after this lecture.
Today’s class
Part 1: Overview (this video)

Part 2: Review of Basic Probability

Part 3: Language Modeling with N-Grams

Part 4: Generating Text with Language Models

Part 5: Evaluating Language Models



Today’s key concepts
N-gram language models
Independence assumptions
Getting from n-grams to a distribution over a language
Relative frequency (maximum likelihood) estimation
Smoothing
Intrinsic evaluation: perplexity
Extrinsic evaluation: word error rate (WER)

Today’s reading:
Chapter 3 (3rd Edition)

Next lecture: Basic intro to machine learning for NLP


Lecture 03, Part 2: Review of Basic Probability Theory
Sampling with replacement
Pick a random shape, then put it back in the bag.

[Figure: a bag of 15 colored shapes. The slide lists the probabilities of individual shapes (e.g. 2/15, 1/15), P(blue) = 5/15, P(red) = 5/15, and conditional probabilities such as P(shape | red) = 3/5 and P(blue | shape) = 2/5.]



Sampling with replacement
Pick a random shape, then put it back in the bag.
What sequence of shapes will you draw?
[Figure: two example four-shape sequences drawn from the bag, with probabilities computed as the product of the individual draws: 1/15 × 1/15 × 1/15 × 2/15 = 2/50625 and 3/15 × 2/15 × 2/15 × 3/15 = 36/50625. The individual shape probabilities are as on the previous slide.]



Now let’s look at natural language
Text as a bag of words
Alice was beginning to get very tired of
sitting by her sister on the bank, and of
having nothing to do: once or twice she
had peeped into the book her sister was
reading, but it had no pictures or
conversations in it, 'and what is the use
of a book,' thought Alice 'without
pictures or conversation?'

P(of) = 3/66 P(to) = 2/66 P(,) = 4/66


P(Alice) = 2/66 P(her) = 2/66 P(') = 4/66
P(was) = 2/66 P(sister) = 2/66



Sampling with replacement
A sampled sequence of words
beginning by, very Alice but was and?
reading no tired of to into sitting
sister the, bank, and thought of without
her nothing: having conversations Alice
once do or on she it get the book her had
peeped was conversation it pictures or
sister in, 'what is the use had twice of
a book''pictures or' to

P(of) = 3/66 P(to) = 2/66 P(,) = 4/66


P(Alice) = 2/66 P(her) = 2/66 P(') = 4/66
P(was) = 2/66 P(sister) = 2/66
In this model, P(English sentence) = P(word salad)



Probability theory: terminology
Trial (aka “experiment”)
Picking a shape, predicting a word
Sample space Ω:
The set of all possible outcomes
(all shapes; all words in Alice in Wonderland)
Event ω ⊆ Ω:
An actual outcome (a subset of Ω)
(predicting ‘the’, picking a triangle)
Random variable X: Ω → T
A function from the sample space (often the identity function)
Provides a ‘measurement of interest’ from a trial/experiment
(Did we pick ‘Alice’/a noun/a word starting with “x”/…?
How often does the word ‘Alice’ occur?
How many words occur in each sentence?)



What is a probability distribution?
P(ω) defines a distribution over Ω iff

1) Every event ω has a probability P(ω) between 0 and 1:
   0 ≤ P(ω) ≤ 1

2) The null event ∅ has probability P(∅) = 0, and the entire sample space has probability P(Ω) = 1.

3) The probabilities of disjoint events that cover Ω sum to 1:
   ∑_i P(ωi) = 1   if ∀j ≠ i: ωi ∩ ωj = ∅ and ⋃_i ωi = Ω



Discrete probability distributions:
Single Trials
‘Discrete’: a fixed (often finite) number of outcomes

Bernoulli distribution (two possible outcomes, e.g. head or tail):


Defined by the probability of success (= head/yes)
The probability of head is p. The probability of tail is 1−p.

Categorical distribution (N possible outcomes c1…cN)


The probability of category/outcome ci is pi (0 ≤ pi ≤ 1; ∑i pi = 1).
e.g. the probability of getting a six when rolling a die once
e.g. the probability of the next word (picked among a vocabulary of N words)
(NB: Most of the distributions we will see in this class are categorical.
Some people call them multinomial distributions, but those refer to sequences
of trials, e.g. the probability of getting five sixes when rolling a die ten times)



Joint and Conditional Probability
The conditional probability of X given Y, P(X | Y),
is defined in terms of the probability of Y, P(Y),
and the joint probability of X and Y, P(X, Y):
P(X | Y) = P(X, Y) / P(Y)

What is the probability that we get a blue shape if we pick a square? P(blue | square) = 2/5



The chain rule
The joint probability P(X, Y) can also be expressed in terms of the conditional probability P(X | Y):

P(X, Y) = P(X | Y) P(Y)

Generalizing this to N joint events (or random variables) leads to the so-called chain rule:

P(X1, X2, …, Xn) = P(X1) · P(X2 | X1) · P(X3 | X2, X1) · … · P(Xn | X1, …, Xn−1)
                 = P(X1) ∏_{i=2}^{n} P(Xi | X1 … Xi−1)



Independence
Two events or random variables X and Y
are independent if
P (X, Y ) = P (X)P (Y )
If X and Y are independent, then P(X | Y) = P(X):

P(X | Y) = P(X, Y) / P(Y)
         = P(X) P(Y) / P(Y)    (X, Y independent)
         = P(X)



Probability models
Building a probability model consists of two steps:
1. Defining the model
2. Estimating the model’s parameters (= training/learning)

Probability models (almost) always make


independence assumptions.
— Even though X and Y are not actually independent,
our model may treat them as independent.
— This can drastically reduce the number of parameters to estimate.
— Models without independence assumptions have (way)
too many parameters to estimate reliably from the data we have
— But since independence assumptions are often incorrect,
those models are often incorrect as well:
they assign probability mass to events that cannot occur
Lecture 03, Part 3: Language Modeling with N-Grams



Language modeling with N-grams
A language model over a vocabulary V assigns probabilities to strings drawn from V*.

How do we compute the probability of a string w(1) … w(i)?

Recall the chain rule:
P(w(1) … w(i)) = P(w(1)) · P(w(2) | w(1)) · … · P(w(i) | w(i−1), …, w(1))

An n-gram language model assumes each word depends only on the last n−1 words:
Pngram(w(1) … w(i)) = P(w(1)) · P(w(2) | w(1)) · … · P(w(i) | w(i−1), …, w(i−n+1))



N-gram models
N-gram models assume each word (event) depends only on the previous n−1 words (events):

Unigram model:  P(w(1) … w(N)) = ∏_{i=1}^{N} P(w(i))

Bigram model:   P(w(1) … w(N)) = ∏_{i=1}^{N} P(w(i) | w(i−1))

Trigram model:  P(w(1) … w(N)) = ∏_{i=1}^{N} P(w(i) | w(i−1), w(i−2))

NB: Independence assumptions where the n-th event in a sequence depends only on the last n−1 events are called Markov assumptions (of order n−1).
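To make the independence assumption concrete, here is a minimal sketch (not from the slides) of how a bigram model scores a string, assuming the conditional probabilities are already available in a dictionary; the probabilities and the sentence are made up for illustration:

    # Minimal bigram scoring sketch; the probabilities below are illustrative, not estimated from data.
    unigram_prob = {"alice": 0.03, "was": 0.03, "reading": 0.015}
    bigram_prob = {("alice", "was"): 0.5, ("was", "reading"): 0.5}

    def bigram_sentence_prob(tokens):
        """P(w(1)...w(N)) ~ P(w(1)) * prod_i P(w(i) | w(i-1)) under the bigram assumption."""
        prob = unigram_prob.get(tokens[0], 0.0)
        for prev, cur in zip(tokens, tokens[1:]):
            prob *= bigram_prob.get((prev, cur), 0.0)  # unseen bigrams get probability 0 under MLE
        return prob

    print(bigram_sentence_prob(["alice", "was", "reading"]))  # 0.03 * 0.5 * 0.5 = 0.0075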



How many parameters do n-gram models have?
Given a vocabulary V of |V| word types (so, for |V| = 10^4):

Unigram model: |V| parameters (here: 10^4 parameters)
(one distribution P(w(i)) with |V| outcomes [each w ∈ V is one outcome])

Bigram model: |V|^2 parameters (here: 10^8 parameters)
(|V| distributions P(w(i) | w(i−1)), one distribution for each w ∈ V, with |V| outcomes each [each w ∈ V is one outcome])

Trigram model: |V|^3 parameters (here: 10^12 parameters)
(|V|^2 distributions P(w(i) | w(i−1), w(i−2)), one per bigram w’w’’, with |V| outcomes each [each w ∈ V is one outcome])
Sampling with replacement
beginning by, very Alice but was and?
reading no tired of to into sitting
sister the, bank, and thought of without
her nothing: having conversations Alice
once do or on she it get the book her had
peeped was conversation it pictures or
sister in, 'what is the use had twice of
a book''pictures or' to

P(of) = 3/66 P(to) = 2/66 P(,) = 4/66


P(Alice) = 2/66 P(her) = 2/66 P(') = 4/66
P(was) = 2/66 P(sister) = 2/66
In this model, P(English sentence) = P(word salad)



A bigram model for Alice
Alice was beginning to get very tired of
sitting by her sister on the bank, and of
having nothing to do: once or twice she
had peeped into the book her sister was
reading, but it had no pictures or
conversations in it, 'and what is the use
of a book,' thought Alice 'without
pictures or conversation?'

P(w(i) = of | w(i–1) = tired) = 1 P(w(i) = bank | w(i–1) = the) = 1/3


P(w(i) = of | w(i–1) = use) = 1 P(w(i) = book | w(i–1) = the) = 1/3
P(w(i) = sister | w(i–1) = her) = 1 P(w(i) = use | w(i–1) = the) = 1/3
P(w(i) = beginning | w(i–1) = was) = 1/2
P(w(i) = reading | w(i–1) = was) = 1/2



Using a bigram model for Alice
English:
Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, 'and what is the use of a book,' thought Alice 'without pictures or conversation?'

Word Salad:
beginning by, very Alice but was and? reading no tired of to into sitting sister the, bank, and thought of without her nothing: having conversations Alice once do or on she it get the book her had peeped was conversation it pictures or sister in, 'what is the use had twice of a book''pictures or' to

Now, P(English) ⪢ P(word salad)

P(w(i) = of | w(i–1) = tired) = 1 P(w(i) = bank | w(i–1) = the) = 1/3


P(w(i) = of | w(i–1) = use) = 1 P(w(i) = book | w(i–1) = the) = 1/3
P(w(i) = sister | w(i–1) = her) = 1 P(w(i) = use | w(i–1) = the) = 1/3
P(w(i) = beginning | w(i–1) = was) = 1/2
P(w(i) = reading | w(i–1) = was) = 1/2



From n-gram probabilities to language models

Recall: a language L ⊆ V* is a (possibly infinite) set of strings over a (finite) vocabulary V.

P(w(i) | w(i−1)) defines a distribution over the words in V:
∀w ∈ V: ∑_{w′∈V} P(w(i) = w′ | w(i−1) = w) = 1

By multiplying this distribution N times, we get one distribution over all strings of the same length N (V^N):

Probability of one N-word string: P(w1 … wN) = ∏_{i=1…N} P(w(i) = wi | w(i−1) = wi−1)

Probability of all N-word strings: P(V^N) = ∑_{w1…wN ∈ V^N} ∏_{i=1…N} P(w(i) = wi | w(i−1) = wi−1) = 1

But instead of N separate distributions, we want one distribution over strings of any length.
From n-gram probabilities to language models

We have just seen how to use n-gram probabilities to define one distribution P(V^N) for each string length N.

But a language model P(L) = P(V*) should define one distribution P(V*) that sums to one over all strings in L ⊆ V*, regardless of their length:
P(L) = P(V) + P(V^2) + P(V^3) + … + P(V^n) + … = 1

Solution:
Add an End-of-Sentence (EOS) token to V
Assume a) that each string ends in EOS and
b) that EOS can only appear at the end of a string.
From n-gram probabilities to language models
with EOS
Think of a language model as a stochastic process:
— At each time step, randomly pick one more word.
— Stop generating more words when the word you pick
is a special end-of-sentence (EOS) token.

To be able to pick the EOS token, we have to modify our training data so that each sentence ends in EOS.
This means our vocabulary is now V_EOS = V ∪ {EOS}

We then get an actual language model, i.e. a distribution over strings of any length.
Technically, this is only true because P(EOS | …) will be high enough that we are always guaranteed to stop after having generated a finite number of words. A leaky or inconsistent language model would have P(L) < 1. That could happen if EOS had a very small probability (but doesn’t really happen in practice).



Why do we want one distribution over L?

Why do we care about having one probability distribution for all lengths?

This allows us to compare the probabilities of strings of different lengths, because they’re computed by the same distribution.

This also allows us to generate strings of arbitrary length with one model.



Parameter
Estimation
Or: Where do we get the probabilities from?



Learning (estimating) a language model
Where do we get the parameters of our model
(its actual probabilities) from?
P(w(i) = ‘the’ | w(i–1) = ‘on’) = ???
We need (a large amount of) text as training data
to estimate the parameters of a language model.

The most basic parameter estimation technique: relative frequency estimation (frequency = counts)
P(w(i) = ‘the’ | w(i–1) = ‘on’) = C(‘on the’) / C(‘on’)
Also called Maximum Likelihood Estimation (MLE)
C(‘on the’) [or f(‘on the’) for frequency]:
How often does ‘on the’ appear in the training data?
NB: C(‘on’) = ∑w∈VC(‘on’ w)
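A minimal sketch of relative frequency estimation for bigrams, assuming a tokenized training corpus; the toy corpus below is made up:

    from collections import Counter

    # Toy training data; in practice this is a large tokenized corpus.
    corpus = "on the bank on the table on a chair".split()

    bigram_counts = Counter(zip(corpus, corpus[1:]))  # C(w' w)
    context_counts = Counter(corpus[:-1])             # C(w') as a bigram context, i.e. sum_w C(w' w)

    def p_mle(word, prev):
        """Relative frequency (MLE) estimate P(w | w') = C(w' w) / C(w')."""
        return bigram_counts[(prev, word)] / context_counts[prev]

    print(p_mle("the", "on"))  # C('on the') / C('on') = 2/3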
Handling unknown words: UNK
Training:
— Define a fixed vocabulary V such that all words in V
appear at least n times in the training data
(e.g. all words that occur at least 5 times in the training corpus,
or the most common 10,000 words in training)
— Add a new token UNK to V, and replace all other words
in the corpus that are not in V by this token UNK
— Estimate the model on this modified training corpus.

Testing (when computing the probability of a string):
Replace any words not in the vocabulary by UNK

Refinements:
Use different UNK tokens for different types of words
(numbers, capitalized words, lower-case words, etc.)
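A minimal sketch of this preprocessing step, with an arbitrary frequency threshold of 2 and a single generic UNK token:

    from collections import Counter

    def replace_rare_with_unk(sentences, min_count=2):
        """Keep tokens that occur at least min_count times in training; map the rest to UNK."""
        counts = Counter(tok for sent in sentences for tok in sent)
        vocab = {tok for tok, c in counts.items() if c >= min_count} | {"<UNK>"}
        modified = [[tok if tok in vocab else "<UNK>" for tok in sent] for sent in sentences]
        return modified, vocab

    sents = [["alice", "was", "reading"], ["alice", "was", "tired"]]
    print(replace_rare_with_unk(sents)[0])  # [['alice', 'was', '<UNK>'], ['alice', 'was', '<UNK>']]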



What about the beginning of the sentence?
In a trigram model
P(w(1) w(2) w(3)) = P(w(1)) P(w(2) | w(1)) P(w(3) | w(2), w(1)),
only the third term P(w(3) | w(2), w(1)) is an actual trigram probability. What about P(w(1)) and P(w(2) | w(1))?

If this bothers you: add n−1 beginning-of-sentence (BOS) symbols to each sentence for an n-gram model:
BOS1 BOS2 Alice was …
Now the unigram and bigram probabilities involve only BOS symbols.



Summary: Estimating a bigram model with
BOS (<s>), EOS (</s>) and UNK using MLE

1. Replace all words not in V in the training corpus with UNK


2. Bracket each sentence by special start and end symbols:
<s> Alice was beginning to get very tired … </s>
3. Define the Vocabulary V’ = all tokens in modified training corpus
(all common words, UNK, <s>, </s>)
4. Count the frequency of each bigram….
C(<s> Alice) = 1, C(Alice was) = 1, …
5. .... and normalize these frequencies to get probabilities:
P(was | Alice) = C(Alice was) / ∑_{wi ∈ V′} C(Alice wi)
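Putting steps 1–5 together, a minimal sketch of the whole recipe on toy data (the corpus and the threshold are made up; <s> and </s> play the roles of BOS and EOS):

    from collections import Counter, defaultdict

    def train_bigram_lm(sentences, min_count=2):
        """UNK-replace rare words, bracket with <s>/</s>, count bigrams, normalize by context counts."""
        counts = Counter(tok for sent in sentences for tok in sent)
        vocab = {t for t, c in counts.items() if c >= min_count} | {"<UNK>", "<s>", "</s>"}
        bigram_counts = Counter()
        for sent in sentences:
            toks = ["<s>"] + [t if t in vocab else "<UNK>" for t in sent] + ["</s>"]
            bigram_counts.update(zip(toks, toks[1:]))
        context_totals = defaultdict(int)
        for (prev, _), c in bigram_counts.items():
            context_totals[prev] += c
        return {(prev, w): c / context_totals[prev] for (prev, w), c in bigram_counts.items()}

    lm = train_bigram_lm([["alice", "was", "reading"], ["alice", "was", "tired"]])
    print(lm[("alice", "was")])  # 1.0 on this tiny corpus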



Lecture 03, Part 4: Generating Text with Language Models



How do we use language models?
Independently of any application, we could use
a language model as a random sentence generator
(we sample sentences according to their language model probability)
NB: There are very few real world use cases where you want to actually generate language
randomly, but understanding how to do this and what happens when you do so will allow us to do
more interesting things later.

We can use a language model as a sentence ranker.
Systems for applications such as machine translation, speech recognition, spell-checking, generation, etc. often produce many candidate sentences as output.
We prefer output sentences S_Out that have a higher language model probability. We can use a language model P(S_Out) to score and rank these different candidate output sentences, e.g. as follows:
argmax_{S_Out} P(S_Out | Input) = argmax_{S_Out} P(Input | S_Out) P(S_Out)



Generating from a distribution
How do you generate text from an n-gram model?

That is, how do you sample from a distribution P(X |Y=y)?


- Assume X has N possible outcomes (values): {x1, …, xN}
and P(X=xi | Y=y) = pi
- Divide the interval [0,1] into N smaller intervals according to
the probabilities of the outcomes
- Generate a random number r between 0 and 1.
- Return the xi whose interval the random number r falls in.

x1 x2 x3 x4 x5
0 p1 p1+p2 p1+p2+p3 p1+p2+p3+p4 1
r
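A minimal sketch of this procedure (sampling from a categorical distribution by dividing [0,1] into intervals); the outcomes and probabilities are illustrative:

    import random

    def sample_categorical(outcomes, probs):
        """Pick r in [0,1) and return the outcome whose cumulative-probability interval contains r."""
        r = random.random()
        cumulative = 0.0
        for outcome, p in zip(outcomes, probs):
            cumulative += p
            if r < cumulative:
                return outcome
        return outcomes[-1]  # guard against floating-point rounding

    # e.g. sample the next word from P(X | Y = y) for some bigram context y
    print(sample_categorical(["the", "a", "book", "</s>"], [0.5, 0.2, 0.2, 0.1]))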



Generating the Wall Street Journal



Generating Shakespeare



Shakespeare as corpus
The Shakespeare corpus has N=884,647 word tokens
for a vocabulary of V=29,066 word types

Shakespeare used 300,000 bigram types out of |V|^2 = 844 million possible bigram types:
99.96% of possible bigrams don’t occur in this corpus.

Corollary: A relative frequency estimate based on this corpus assigns non-zero probability to only 0.04% of possible bigrams.
That percentage is even lower for trigrams, 4-grams, etc.
4-grams look like Shakespeare because they are Shakespeare!



The UNK token
What would happen if we used an UNK token
on a corpus the size of Shakespeare’s?

1. If we set the frequency threshold for which words to replace too high, a very large fraction of tokens become UNK.

2. Even with a low threshold, UNK will have a very high probability, because in such a small corpus, many words appear only once.

3. But we would still only observe a small fraction of possible bigrams (or trigrams, quadrigrams, etc.)
MLE doesn’t capture unseen events
We estimated a model on 884K word tokens, but:

Only 30,000 word types occur in the training data.
Any word that does not occur in the training data has zero probability!

Only 0.04% of all possible bigrams (for 30K word types) occur in the training data.
Any bigram that does not occur in the training data has zero probability (even if we have seen both words in the bigram by themselves).



How can you assign non-zero probability
to unseen events?
We have to “smooth” our distributions to assign some
probability mass to unseen events
[Figure: under the MLE model, all probability mass goes to seen events (P(seen) = 1.0, P(unseen) = 0); a smoothed model reserves some mass for unseen events (P(seen) < 1.0, P(unseen) > 0).]


We won’t talk much about smoothing this year.
Smoothing methods
Add-one smoothing:
Hallucinate counts that didn’t occur in the data.

Linear interpolation (see the sketch after this list):
P̃(w | w′, w′′) = λ P̂(w | w′, w′′) + (1 − λ) P̃(w | w′)
Interpolate the n-gram model with the (n−1)-gram model.

Absolute Discounting: subtract a constant count from frequent events and add it to rare events.
Kneser-Ney: absolute discounting (AD) with modified unigram probabilities.

Good-Turing: use the probability of rare events to estimate the probability of unseen events.
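As a sketch of the interpolation formula above (the probability dictionaries and the value of λ are assumptions for illustration; λ would normally be tuned on held-out data):

    def interpolated_prob(w, prev1, prev2, p_trigram, p_bigram, lam=0.7):
        """P~(w | w', w'') = lam * P^(w | w', w'') + (1 - lam) * P~(w | w'),
        where prev1 is the immediately preceding word and prev2 the one before it."""
        return (lam * p_trigram.get((prev2, prev1, w), 0.0)
                + (1 - lam) * p_bigram.get((prev1, w), 0.0))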
Add-One (Laplace) Smoothing
A really simple way to do smoothing:
Increment the actual observed count of every possible
event (e.g. bigram) by a hallucinated count of 1
(or by a hallucinated count of some k with 0 < k < 1).

Shakespeare bigram model (roughly):
0.88 million actual bigram counts
+ 844.xx million hallucinated bigram counts

Oops. Now almost none of the counts in our model come from actual data. We’re back to word salad.
k needs to be really small. But it turns out that that still doesn’t work very well.
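A minimal sketch of the add-k estimate for bigrams, assuming count dictionaries like those built earlier; k is the hallucinated count (1 for add-one, or some 0 < k < 1):

    def p_add_k(word, prev, bigram_counts, context_counts, vocab_size, k=1.0):
        """Add-k estimate: P(w | w') = (C(w' w) + k) / (C(w') + k * |V|)."""
        return ((bigram_counts.get((prev, word), 0) + k)
                / (context_counts.get(prev, 0) + k * vocab_size))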
Lecture 03, Part 5: Evaluating Language Models
Intrinsic vs Extrinsic Evaluation
How do we know whether one language model
is better than another?

There are two ways to evaluate models:


- intrinsic evaluation measures how well the model captures
what it is supposed to capture (e.g. probabilities)
- extrinsic (task-based) evaluation measures how useful the
model is in a particular task.

Both cases require an evaluation metric that allows us to measure and compare the performance of different models.



Intrinsic Evaluation
of Language Models:
Perplexity



Intrinsic evaluation
Define an evaluation metric (scoring function).
We will want to measure how similar the predictions
of the model are to real text.

Train the model on a ‘seen’ training set.
Perhaps: tune some parameters based on held-out data (disjoint from the training data, meant to emulate unseen data).

Test the model on an unseen test set (usually from the same source, e.g. WSJ, as the training data).
Test data must be disjoint from training and held-out data.
Compare models by their scores (more on this in the next lecture).



Perplexity
The perplexity of a language model is defined as the inverse (1/P(w1 … wN)) of the probability of the test set, normalized (by taking the N-th root) by the number of tokens N in the test set.

If a LM assigns probability P(w1, …, wN) to a test corpus w1 … wN, the LM’s perplexity, PP(w1 … wN), is

PP(w1 … wN) = P(w1 … wN)^(−1/N)
            = ( 1 / P(w1 … wN) )^(1/N)
            = ( ∏_{i=1}^{N} 1 / P(wi | w1 … wi−1) )^(1/N)
            =def ( ∏_{i=1}^{N} 1 / P(wi | wi−n+1 … wi−1) )^(1/N)   (for an n-gram model)

A LM with lower perplexity is better because it assigns a higher probability to the unseen test corpus.
LM1 and LM2’s perplexity can only be compared if they use the same vocabulary.
— Trigram models have lower perplexity than bigram models;
— Bigram models have lower perplexity than unigram models, etc.

Practical issues: Use logarithms!
Since language model probabilities are very small, multiplying them together often leads to underflow.

It is often better to use logarithms instead, so replace

PP(w1 … wN) =def ( ∏_{i=1}^{N} 1 / P(wi | wi−1, …, wi−n+1) )^(1/N)

with

PP(w1 … wN) =def exp( −(1/N) ∑_{i=1}^{N} log P(wi | wi−1, …, wi−n+1) )
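A minimal sketch of perplexity computed in log space, assuming a function lm_logprob(w, history) (not defined here) that returns the natural log of P(w | history) for the model being evaluated:

    import math

    def perplexity(test_tokens, lm_logprob, n=2):
        """PP = exp( -(1/N) * sum_i log P(w_i | w_{i-n+1} ... w_{i-1}) )."""
        total = 0.0
        for i, w in enumerate(test_tokens):
            history = tuple(test_tokens[max(0, i - n + 1):i])
            total += lm_logprob(w, history)  # natural log probability from the model
        return math.exp(-total / len(test_tokens))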



Extrinsic (Task-Based)
Evaluation of LMs:
Word Error Rate



Intrinsic vs. Extrinsic Evaluation
Perplexity tells us which LM assigns a higher probability to unseen text.

This doesn’t necessarily tell us which LM is better for our task (i.e. is better at scoring candidate sentences).

Task-based evaluation:
- Train model A, plug it into your system for performing task T
- Evaluate performance of system A on task T.
- Train model B, plug it in, evaluate system B on same task T.
- Compare scores of system A and system B on task T.



Word Error Rate (WER)
Originally developed for speech recognition.

How much does the predicted sequence of words differ from the actual sequence of words in the correct transcript?

WER = (Insertions + Deletions + Substitutions) / (Actual words in transcript)

Insertions: “eat lunch” → “eat a lunch”


Deletions: “see a movie” → “see movie”
Substitutions: “drink ice tea”→ “drink nice tea”
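A minimal sketch of WER computed with word-level edit distance (insertions, deletions, and substitutions all cost 1):

    def word_error_rate(reference, hypothesis):
        """WER = (insertions + deletions + substitutions) / (words in the reference transcript)."""
        ref, hyp = reference.split(), hypothesis.split()
        # dp[i][j] = minimum number of edits to turn ref[:i] into hyp[:j]
        dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            dp[i][0] = i
        for j in range(len(hyp) + 1):
            dp[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                sub = 0 if ref[i - 1] == hyp[j - 1] else 1
                dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                               dp[i][j - 1] + 1,        # insertion
                               dp[i - 1][j - 1] + sub)  # substitution (or match)
        return dp[len(ref)][len(hyp)] / len(ref)

    print(word_error_rate("drink ice tea", "drink nice tea"))  # 1 substitution / 3 words ≈ 0.33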



The End
