
CS447: Natural Language Processing

http://courses.engr.illinois.edu/cs447

Lecture 3:
Language Models
(Intro to Probability Models for NLP)

Julia Hockenmaier
juliahmr@illinois.edu
3324 Siebel Center
Lecture 03, Part 1: Overview



Last lecture’s key concepts
Dealing with words:
— Tokenization, normalization
— Zipf’s Law

Morphology (word structure):


— Stems, affixes
— Derivational vs. inflectional morphology
— Compounding
— Stem changes
— Morphological analysis and generation

Finite-state methods in NLP


— Finite-state automata vs. finite-state transducers
— Composing finite-state transducers



Finite-state transducers
– FSTs define a relation between two regular languages.
– Each state transition maps (transduces) a character from the input language to a character (or a sequence of characters) in the output language (written x:y).
– By using the empty character (ε), characters can be deleted (x:ε) or inserted (ε:y).
– FSTs can be composed (cascaded), allowing us to define intermediate representations.
Today’s lecture
How can we distinguish word salad, spelling errors, and grammatical sentences?

Language models define probability distributions over the strings in a language.

N-gram models are the simplest and most common kind of language model.

We’ll look at how these models are defined, how to estimate (learn) their parameters, and what their shortcomings are.

We’ll also review some very basic probability theory.


Why do we need language models?
Many NLP tasks require natural language output:
—Machine translation: return text in the target language
—Speech recognition: return a transcript of what was spoken
—Natural language generation: return natural language text
—Spell-checking: return corrected spelling of input

Language models define probability distributions over (natural language) strings or sentences.
➔ We can use a language model to generate strings
➔ We can use a language model to score/rank candidate strings
so that we can choose the best (i.e. most likely) one:
if PLM(A) > PLM(B), return A, not B



Hmmm, but…
… what does it mean for a language model to “define a probability distribution”?

… why would we want to define probability distributions over languages?

… how can we construct a language model such that it actually defines a probability distribution?

… how do we know how well our model works?

You should be able to answer these questions after this lecture.
Today’s class
Part 1: Overview (this video)

Part 2: Review of Basic Probability

Part 3: Language Modeling with N-Grams

Part 4: Generating Text with Language Models

Part 5: Evaluating Language Models



Today’s key concepts
N-gram language models
Independence assumptions
Getting from n-grams to a distribution over a language
Relative frequency (maximum likelihood) estimation
Smoothing
Intrinsic evaluation: perplexity
Extrinsic evaluation: word error rate (WER)

Today’s reading:
Chapter 3 (3rd Edition)

Next lecture: Basic intro to machine learning for NLP


Lecture 03, Part 2: Review of Basic Probability Theory
Sampling with replacement
Pick a random shape, then put it back in the bag.

[Figure: a bag of 15 colored shapes. The slide lists the probabilities of individual shapes (e.g. 2/15, 1/15), P(blue) = 5/15, P(red) = 5/15, and conditional probabilities such as P(shape | red) = 3/5 and P(blue | shape) = 2/5.]



Sampling with replacement
Pick a random shape, then put it back in the bag.
What sequence of shapes will you draw?
[Figure: two example four-shape sequences drawn from the bag, with probabilities computed as the product of the individual draws: 1/15 × 1/15 × 1/15 × 2/15 = 2/50625 and 3/15 × 2/15 × 2/15 × 3/15 = 36/50625. The individual shape probabilities are as on the previous slide.]



Now let’s look at natural language
Text as a bag of words
Alice was beginning to get very tired of
sitting by her sister on the bank, and of
having nothing to do: once or twice she
had peeped into the book her sister was
reading, but it had no pictures or
conversations in it, 'and what is the use
of a book,' thought Alice 'without
pictures or conversation?'

P(of) = 3/66 P(to) = 2/66 P(,) = 4/66


P(Alice) = 2/66 P(her) = 2/66 P(') = 4/66
P(was) = 2/66 P(sister) = 2/66



Sampling with replacement
A sampled sequence of words
beginning by, very Alice but was and?
reading no tired of to into sitting
sister the, bank, and thought of without
her nothing: having conversations Alice
once do or on she it get the book her had
peeped was conversation it pictures or
sister in, 'what is the use had twice of
a book''pictures or' to

P(of) = 3/66 P(to) = 2/66 P(,) = 4/66


P(Alice) = 2/66 P(her) = 2/66 P(') = 4/66
P(was) = 2/66 P(sister) = 2/66
In this model, P(English sentence) = P(word salad)



Probability theory: terminology
Trial (aka “experiment”)
Picking a shape, predicting a word
Sample space Ω:
The set of all possible outcomes
(all shapes; all words in Alice in Wonderland)
Event ω ⊆ Ω:
An actual outcome (a subset of Ω)
(predicting ‘the’, picking a triangle)
Random variable X: Ω → T
A function from the sample space (often the identity function)
Provides a ‘measurement of interest’ from a trial/experiment
(Did we pick ‘Alice’/a noun/a word starting with “x”/…?
How often does the word ‘Alice’ occur?
How many words occur in each sentence?)



What is a probability distribution?
P(ω) defines a distribution over Ω iff

1) Every event ω has a probability P(ω) between 0 and 1:
   0 ≤ P(ω) ≤ 1

2) The null event ∅ has probability P(∅) = 0, and the entire sample space has probability P(Ω) = 1.

3) The probabilities of disjoint events that cover Ω sum to 1:
   ∑_i P(ωi) = 1   if ∀j ≠ i: ωi ∩ ωj = ∅ and ⋃_i ωi = Ω



Discrete probability distributions:
Single Trials
‘Discrete’: a fixed (often finite) number of outcomes

Bernoulli distribution (two possible outcomes, e.g. head or tail):


Defined by the probability of success (= head/yes)
The probability of head is p. The probability of tail is 1−p.

Categorical distribution (N possible outcomes c1…cN)


The probability of category/outcome ci is pi (0 ≤ pi ≤ 1; ∑i pi = 1).
e.g. the probability of getting a six when rolling a die once
e.g. the probability of the next word (picked among a vocabulary of N words)
(NB: Most of the distributions we will see in this class are categorical.
Some people call them multinomial distributions, but those refer to sequences
of trials, e.g. the probability of getting five sixes when rolling a die ten times)



Joint and Conditional Probability
The conditional probability of X given Y, P(X | Y),
is defined in terms of the probability of Y, P(Y),
and the joint probability of X and Y, P(X, Y):
P(X | Y) = P(X, Y) / P(Y)

What is the probability that we get a blue shape if we pick a square? P(blue | square) = 2/5



The chain rule
The joint probability P(X, Y) can also be expressed in terms of the conditional probability P(X | Y):

P(X, Y) = P(X | Y) P(Y)

Generalizing this to N joint events (or random variables) leads to the so-called chain rule:

P(X1, X2, …, Xn) = P(X1) · P(X2 | X1) · P(X3 | X2, X1) · … · P(Xn | X1, …, Xn−1)
                 = P(X1) ∏_{i=2}^{n} P(Xi | X1 … Xi−1)



Independence
Two events or random variables X and Y
are independent if
P (X, Y ) = P (X)P (Y )
If X and Y are independent, then P(X | Y) = P(X):

P(X | Y) = P(X, Y) / P(Y)
         = P(X) P(Y) / P(Y)    (X, Y independent)
         = P(X)



Probability models
Building a probability model consists of two steps:
1. Defining the model
2. Estimating the model’s parameters (= training/learning)

Probability models (almost) always make


independence assumptions.
— Even though X and Y are not actually independent,
our model may treat them as independent.
— This can drastically reduce the number of parameters to estimate.
— Models without independence assumptions have (way)
too many parameters to estimate reliably from the data we have
— But since independence assumptions are often incorrect,
those models are often incorrect as well:
they assign probability mass to events that cannot occur
Lecture 03, Part 3: Language Modeling with N-Grams



Language modeling with N-grams
A language model over a vocabulary V assigns probabilities to strings drawn from V*.

How do we compute the probability of a string w(1) … w(i)?

Recall the chain rule:
P(w(1) … w(i)) = P(w(1)) · P(w(2) | w(1)) · … · P(w(i) | w(i−1), …, w(1))

An n-gram language model assumes each word depends only on the last n−1 words:
Pngram(w(1) … w(i)) = P(w(1)) · P(w(2) | w(1)) · … · P(w(i) | w(i−1), …, w(i−n+1))



N-gram models
N-gram models assume each word (event) depends only on the previous n−1 words (events):

Unigram model:  P(w(1) … w(N)) = ∏_{i=1}^{N} P(w(i))

Bigram model:   P(w(1) … w(N)) = ∏_{i=1}^{N} P(w(i) | w(i−1))

Trigram model:  P(w(1) … w(N)) = ∏_{i=1}^{N} P(w(i) | w(i−1), w(i−2))

NB: Independence assumptions where the n-th event in a sequence depends only on the last n−1 events are called Markov assumptions (of order n−1).
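To make the independence assumption concrete, here is a minimal sketch (not from the slides) of how a bigram model scores a string, assuming the conditional probabilities are already available in a dictionary; the probabilities and the sentence are made up for illustration:

    # Minimal bigram scoring sketch; the probabilities below are illustrative, not estimated from data.
    unigram_prob = {"alice": 0.03, "was": 0.03, "reading": 0.015}
    bigram_prob = {("alice", "was"): 0.5, ("was", "reading"): 0.5}

    def bigram_sentence_prob(tokens):
        """P(w(1)...w(N)) ~ P(w(1)) * prod_i P(w(i) | w(i-1)) under the bigram assumption."""
        prob = unigram_prob.get(tokens[0], 0.0)
        for prev, cur in zip(tokens, tokens[1:]):
            prob *= bigram_prob.get((prev, cur), 0.0)  # unseen bigrams get probability 0 under MLE
        return prob

    print(bigram_sentence_prob(["alice", "was", "reading"]))  # 0.03 * 0.5 * 0.5 = 0.0075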



How many parameters do n-gram models have?
Given a vocabulary V of |V| word types (so, for |V| = 10^4):

Unigram model: |V| parameters (here: 10^4 parameters)
(one distribution P(w(i)) with |V| outcomes [each w ∈ V is one outcome])

Bigram model: |V|^2 parameters (here: 10^8 parameters)
(|V| distributions P(w(i) | w(i−1)), one distribution for each w ∈ V, with |V| outcomes each [each w ∈ V is one outcome])

Trigram model: |V|^3 parameters (here: 10^12 parameters)
(|V|^2 distributions P(w(i) | w(i−1), w(i−2)), one per bigram w’w’’, with |V| outcomes each [each w ∈ V is one outcome])
Sampling with replacement
beginning by, very Alice but was and?
reading no tired of to into sitting
sister the, bank, and thought of without
her nothing: having conversations Alice
once do or on she it get the book her had
peeped was conversation it pictures or
sister in, 'what is the use had twice of
a book''pictures or' to

P(of) = 3/66 P(to) = 2/66 P(,) = 4/66


P(Alice) = 2/66 P(her) = 2/66 P(') = 4/66
P(was) = 2/66 P(sister) = 2/66
In this model, P(English sentence) = P(word salad)



A bigram model for Alice
Alice was beginning to get very tired of
sitting by her sister on the bank, and of
having nothing to do: once or twice she
had peeped into the book her sister was
reading, but it had no pictures or
conversations in it, 'and what is the use
of a book,' thought Alice 'without
pictures or conversation?'

P(w(i) = of | w(i–1) = tired) = 1 P(w(i) = bank | w(i–1) = the) = 1/3


P(w(i) = of | w(i–1) = use) = 1 P(w(i) = book | w(i–1) = the) = 1/3
P(w(i) = sister | w(i–1) = her) = 1 P(w(i) = use | w(i–1) = the) = 1/3
P(w(i) = beginning | w(i–1) = was) = 1/2
P(w(i) = reading | w(i–1) = was) = 1/2



Using a bigram model for Alice
English:
Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, 'and what is the use of a book,' thought Alice 'without pictures or conversation?'

Word Salad:
beginning by, very Alice but was and? reading no tired of to into sitting sister the, bank, and thought of without her nothing: having conversations Alice once do or on she it get the book her had peeped was conversation it pictures or sister in, 'what is the use had twice of a book''pictures or' to

Now, P(English) ⪢ P(word salad)

P(w(i) = of | w(i–1) = tired) = 1 P(w(i) = bank | w(i–1) = the) = 1/3


P(w(i) = of | w(i–1) = use) = 1 P(w(i) = book | w(i–1) = the) = 1/3
P(w(i) = sister | w(i–1) = her) = 1 P(w(i) = use | w(i–1) = the) = 1/3
P(w(i) = beginning | w(i–1) = was) = 1/2
P(w(i) = reading | w(i–1) = was) = 1/2



From n-gram probabilities to language models

Recall: a language L ⊆ V* is a (possibly infinite) set of strings over a (finite) vocabulary V.

P(w(i) | w(i−1)) defines a distribution over the words in V:
∀w ∈ V: ∑_{w′∈V} P(w(i) = w′ | w(i−1) = w) = 1

By multiplying this distribution N times, we get one distribution over all strings of the same length N (V^N):

Probability of one N-word string: P(w1 … wN) = ∏_{i=1…N} P(w(i) = wi | w(i−1) = wi−1)

Probability of all N-word strings: P(V^N) = ∑_{w1…wN ∈ V^N} ∏_{i=1…N} P(w(i) = wi | w(i−1) = wi−1) = 1

But instead of N separate distributions, we want one distribution over strings of any length.
From n-gram probabilities to language models

We have just seen how to use n-gram probabilities to define one distribution P(V^N) for each string length N.

But a language model P(L) = P(V*) should define one distribution P(V*) that sums to one over all strings in L ⊆ V*, regardless of their length:
P(L) = P(V) + P(V^2) + P(V^3) + … + P(V^n) + … = 1

Solution:
Add an End-of-Sentence (EOS) token to V
Assume a) that each string ends in EOS and
b) that EOS can only appear at the end of a string.
From n-gram probabilities to language models
with EOS
Think of a language model as a stochastic process:
— At each time step, randomly pick one more word.
— Stop generating more words when the word you pick
is a special end-of-sentence (EOS) token.

To be able to pick the EOS token, we have to modify our training data so that each sentence ends in EOS.
This means our vocabulary is now V_EOS = V ∪ {EOS}

We then get an actual language model, i.e. a distribution over strings of any length.
Technically, this is only true because P(EOS | …) will be high enough that we are always guaranteed to stop after having generated a finite number of words. A leaky or inconsistent language model would have P(L) < 1. That could happen if EOS had a very small probability (but doesn’t really happen in practice).



Why do we want one distribution over L?

Why do we care about having one probability distribution for all lengths?

This allows us to compare the probabilities of strings of different lengths, because they’re computed by the same distribution.

This also allows us to generate strings of arbitrary length with one model.



Parameter
Estimation
Or: Where do we get the probabilities from?



Learning (estimating) a language model
Where do we get the parameters of our model
(its actual probabilities) from?
P(w(i) = ‘the’ | w(i–1) = ‘on’) = ???
We need (a large amount of) text as training data
to estimate the parameters of a language model.

The most basic parameter estimation technique: relative frequency estimation (frequency = counts)
P(w(i) = ‘the’ | w(i–1) = ‘on’) = C(‘on the’) / C(‘on’)
Also called Maximum Likelihood Estimation (MLE)
C(‘on the’) [or f(‘on the’) for frequency]:
How often does ‘on the’ appear in the training data?
NB: C(‘on’) = ∑w∈VC(‘on’ w)
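A minimal sketch of relative frequency estimation for bigrams, assuming a tokenized training corpus; the toy corpus below is made up:

    from collections import Counter

    # Toy training data; in practice this is a large tokenized corpus.
    corpus = "on the bank on the table on a chair".split()

    bigram_counts = Counter(zip(corpus, corpus[1:]))  # C(w' w)
    context_counts = Counter(corpus[:-1])             # C(w') as a bigram context, i.e. sum_w C(w' w)

    def p_mle(word, prev):
        """Relative frequency (MLE) estimate P(w | w') = C(w' w) / C(w')."""
        return bigram_counts[(prev, word)] / context_counts[prev]

    print(p_mle("the", "on"))  # C('on the') / C('on') = 2/3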
Handling unknown words: UNK
Training:
— Define a fixed vocabulary V such that all words in V
appear at least n times in the training data
(e.g. all words that occur at least 5 times in the training corpus,
or the most common 10,000 words in training)
— Add a new token UNK to V, and replace all other words
in the corpus that are not in V by this token UNK
— Estimate the model on this modified training corpus.

Testing (when computing the probability of a string):
Replace any words not in the vocabulary by UNK

Refinements:
Use different UNK tokens for different types of words
(numbers, capitalized words, lower-case words, etc.)
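A minimal sketch of this preprocessing step, with an arbitrary frequency threshold of 2 and a single generic UNK token:

    from collections import Counter

    def replace_rare_with_unk(sentences, min_count=2):
        """Keep tokens that occur at least min_count times in training; map the rest to UNK."""
        counts = Counter(tok for sent in sentences for tok in sent)
        vocab = {tok for tok, c in counts.items() if c >= min_count} | {"<UNK>"}
        modified = [[tok if tok in vocab else "<UNK>" for tok in sent] for sent in sentences]
        return modified, vocab

    sents = [["alice", "was", "reading"], ["alice", "was", "tired"]]
    print(replace_rare_with_unk(sents)[0])  # [['alice', 'was', '<UNK>'], ['alice', 'was', '<UNK>']]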



What about the beginning of the sentence?
In a trigram model
P(w(1) w(2) w(3)) = P(w(1)) P(w(2) | w(1)) P(w(3) | w(2), w(1)),
only the third term P(w(3) | w(2), w(1)) is an actual trigram probability. What about P(w(1)) and P(w(2) | w(1))?

If this bothers you: add n−1 beginning-of-sentence (BOS) symbols to each sentence for an n-gram model:
BOS1 BOS2 Alice was …
Now the unigram and bigram probabilities involve only BOS symbols.



Summary: Estimating a bigram model with
BOS (<s>), EOS (</s>) and UNK using MLE

1. Replace all words not in V in the training corpus with UNK


2. Bracket each sentence by special start and end symbols:
<s> Alice was beginning to get very tired … </s>
3. Define the Vocabulary V’ = all tokens in modified training corpus
(all common words, UNK, <s>, </s>)
4. Count the frequency of each bigram….
C(<s> Alice) = 1, C(Alice was) = 1, …
5. .... and normalize these frequencies to get probabilities:
P(was | Alice) = C(Alice was) / ∑_{wi ∈ V′} C(Alice wi)
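Putting steps 1–5 together, a minimal sketch of the whole recipe on toy data (the corpus and the threshold are made up; <s> and </s> play the roles of BOS and EOS):

    from collections import Counter, defaultdict

    def train_bigram_lm(sentences, min_count=2):
        """UNK-replace rare words, bracket with <s>/</s>, count bigrams, normalize by context counts."""
        counts = Counter(tok for sent in sentences for tok in sent)
        vocab = {t for t, c in counts.items() if c >= min_count} | {"<UNK>", "<s>", "</s>"}
        bigram_counts = Counter()
        for sent in sentences:
            toks = ["<s>"] + [t if t in vocab else "<UNK>" for t in sent] + ["</s>"]
            bigram_counts.update(zip(toks, toks[1:]))
        context_totals = defaultdict(int)
        for (prev, _), c in bigram_counts.items():
            context_totals[prev] += c
        return {(prev, w): c / context_totals[prev] for (prev, w), c in bigram_counts.items()}

    lm = train_bigram_lm([["alice", "was", "reading"], ["alice", "was", "tired"]])
    print(lm[("alice", "was")])  # 1.0 on this tiny corpus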



Lecture 03, Part 4: Generating Text with Language Models



How do we use language models?
Independently of any application, we could use
a language model as a random sentence generator
(we sample sentences according to their language model probability)
NB: There are very few real world use cases where you want to actually generate language
randomly, but understanding how to do this and what happens when you do so will allow us to do
more interesting things later.

We can use a language model as a sentence ranker.
Systems for applications such as machine translation, speech recognition, spell-checking, generation, etc. often produce many candidate sentences as output.
We prefer output sentences S_Out that have a higher language model probability. We can use a language model P(S_Out) to score and rank these different candidate output sentences, e.g. as follows:
argmax_{S_Out} P(S_Out | Input) = argmax_{S_Out} P(Input | S_Out) P(S_Out)



Generating from a distribution
How do you generate text from an n-gram model?

That is, how do you sample from a distribution P(X |Y=y)?


- Assume X has N possible outcomes (values): {x1, …, xN}
and P(X=xi | Y=y) = pi
- Divide the interval [0,1] into N smaller intervals according to
the probabilities of the outcomes
- Generate a random number r between 0 and 1.
- Return the xi whose interval the random number r falls in.

x1 x2 x3 x4 x5
0 p1 p1+p2 p1+p2+p3 p1+p2+p3+p4 1
r
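A minimal sketch of this procedure (sampling from a categorical distribution by dividing [0,1] into intervals); the outcomes and probabilities are illustrative:

    import random

    def sample_categorical(outcomes, probs):
        """Pick r in [0,1) and return the outcome whose cumulative-probability interval contains r."""
        r = random.random()
        cumulative = 0.0
        for outcome, p in zip(outcomes, probs):
            cumulative += p
            if r < cumulative:
                return outcome
        return outcomes[-1]  # guard against floating-point rounding

    # e.g. sample the next word from P(X | Y = y) for some bigram context y
    print(sample_categorical(["the", "a", "book", "</s>"], [0.5, 0.2, 0.2, 0.1]))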



Generating the Wall Street Journal



Generating Shakespeare



Shakespeare as corpus
The Shakespeare corpus has N=884,647 word tokens
for a vocabulary of V=29,066 word types

Shakespeare used 300,000 bigram types out of |V|^2 = 844 million possible bigram types:
99.96% of possible bigrams don’t occur in this corpus.

Corollary: A relative frequency estimate based on this corpus assigns non-zero probability to only 0.04% of possible bigrams.
That percentage is even lower for trigrams, 4-grams, etc.
4-grams look like Shakespeare because they are Shakespeare!



The UNK token
What would happen if we used an UNK token
on a corpus the size of Shakespeare’s?

1. If we set the frequency threshold for which words to replace too high, a very large fraction of tokens become UNK.

2. Even with a low threshold, UNK will have a very high probability, because in such a small corpus, many words appear only once.

3. But we would still only observe a small fraction of possible bigrams (or trigrams, quadrigrams, etc.)
MLE doesn’t capture unseen events
We estimated a model on 884K word tokens, but:

Only 30,000 word types occur in the training data.
Any word that does not occur in the training data has zero probability!

Only 0.04% of all possible bigrams (for 30K word types) occur in the training data.
Any bigram that does not occur in the training data has zero probability (even if we have seen both words in the bigram by themselves).



How can you assign non-zero probability
to unseen events?
We have to “smooth” our distributions to assign some
probability mass to unseen events
[Figure: under the MLE model, all probability mass goes to seen events (P(seen) = 1.0, P(unseen) = 0); a smoothed model reserves some mass for unseen events (P(seen) < 1.0, P(unseen) > 0).]


We won’t talk much about smoothing this year.
Smoothing methods
Add-one smoothing:
Hallucinate counts that didn’t occur in the data.

Linear interpolation (see the sketch after this list):
P̃(w | w′, w′′) = λ P̂(w | w′, w′′) + (1 − λ) P̃(w | w′)
Interpolate the n-gram model with the (n−1)-gram model.

Absolute Discounting: subtract a constant count from frequent events and add it to rare events.
Kneser-Ney: absolute discounting (AD) with modified unigram probabilities.

Good-Turing: use the probability of rare events to estimate the probability of unseen events.
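As a sketch of the interpolation formula above (the probability dictionaries and the value of λ are assumptions for illustration; λ would normally be tuned on held-out data):

    def interpolated_prob(w, prev1, prev2, p_trigram, p_bigram, lam=0.7):
        """P~(w | w', w'') = lam * P^(w | w', w'') + (1 - lam) * P~(w | w'),
        where prev1 is the immediately preceding word and prev2 the one before it."""
        return (lam * p_trigram.get((prev2, prev1, w), 0.0)
                + (1 - lam) * p_bigram.get((prev1, w), 0.0))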
Add-One (Laplace) Smoothing
A really simple way to do smoothing:
Increment the actual observed count of every possible
event (e.g. bigram) by a hallucinated count of 1
(or by a hallucinated count of some k with 0 < k < 1).

Shakespeare bigram model (roughly):
0.88 million actual bigram counts
+ 844.xx million hallucinated bigram counts

Oops. Now almost none of the counts in our model come from actual data. We’re back to word salad.
k needs to be really small. But it turns out that that still doesn’t work very well.
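A minimal sketch of the add-k estimate for bigrams, assuming count dictionaries like those built earlier; k is the hallucinated count (1 for add-one, or some 0 < k < 1):

    def p_add_k(word, prev, bigram_counts, context_counts, vocab_size, k=1.0):
        """Add-k estimate: P(w | w') = (C(w' w) + k) / (C(w') + k * |V|)."""
        return ((bigram_counts.get((prev, word), 0) + k)
                / (context_counts.get(prev, 0) + k * vocab_size))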
Lecture 03, Part 5: Evaluating Language Models
Intrinsic vs Extrinsic Evaluation
How do we know whether one language model
is better than another?

There are two ways to evaluate models:


- intrinsic evaluation measures how well the model captures
what it is supposed to capture (e.g. probabilities)
- extrinsic (task-based) evaluation measures how useful the
model is in a particular task.

Both cases require an evaluation metric that allows us to measure and compare the performance of different models.



Intrinsic Evaluation
of Language Models:
Perplexity



Intrinsic evaluation
Define an evaluation metric (scoring function).
We will want to measure how similar the predictions
of the model are to real text.

Train the model on a ‘seen’ training set.
Perhaps: tune some parameters based on held-out data (disjoint from the training data, meant to emulate unseen data).

Test the model on an unseen test set (usually from the same source, e.g. WSJ, as the training data).
Test data must be disjoint from training and held-out data.
Compare models by their scores (more on this in the next lecture).



Perplexity
The perplexity of a language model is defined as the inverse (1/P(w1 … wN)) of the probability of the test set, normalized (by taking the N-th root) by the number of tokens N in the test set.

If a LM assigns probability P(w1, …, wN) to a test corpus w1 … wN, the LM’s perplexity, PP(w1 … wN), is

PP(w1 … wN) = P(w1 … wN)^(−1/N)
            = ( 1 / P(w1 … wN) )^(1/N)
            = ( ∏_{i=1}^{N} 1 / P(wi | w1 … wi−1) )^(1/N)
            =def ( ∏_{i=1}^{N} 1 / P(wi | wi−n+1 … wi−1) )^(1/N)   (for an n-gram model)

A LM with lower perplexity is better because it assigns a higher probability to the unseen test corpus.
LM1 and LM2’s perplexity can only be compared if they use the same vocabulary.
— Trigram models have lower perplexity than bigram models;
— Bigram models have lower perplexity than unigram models, etc.

Practical issues: Use logarithms!
Since language model probabilities are very small, multiplying them together often leads to underflow.

It is often better to use logarithms instead, so replace

PP(w1 … wN) =def ( ∏_{i=1}^{N} 1 / P(wi | wi−1, …, wi−n+1) )^(1/N)

with

PP(w1 … wN) =def exp( −(1/N) ∑_{i=1}^{N} log P(wi | wi−1, …, wi−n+1) )
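A minimal sketch of perplexity computed in log space, assuming a function lm_logprob(w, history) (not defined here) that returns the natural log of P(w | history) for the model being evaluated:

    import math

    def perplexity(test_tokens, lm_logprob, n=2):
        """PP = exp( -(1/N) * sum_i log P(w_i | w_{i-n+1} ... w_{i-1}) )."""
        total = 0.0
        for i, w in enumerate(test_tokens):
            history = tuple(test_tokens[max(0, i - n + 1):i])
            total += lm_logprob(w, history)  # natural log probability from the model
        return math.exp(-total / len(test_tokens))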



Extrinsic (Task-Based)
Evaluation of LMs:
Word Error Rate



Intrinsic vs. Extrinsic Evaluation
Perplexity tells us which LM assigns a higher probability to unseen text.

This doesn’t necessarily tell us which LM is better for our task (i.e. is better at scoring candidate sentences).

Task-based evaluation:
- Train model A, plug it into your system for performing task T
- Evaluate performance of system A on task T.
- Train model B, plug it in, evaluate system B on same task T.
- Compare scores of system A and system B on task T.



Word Error Rate (WER)
Originally developed for speech recognition.

How much does the predicted sequence of words differ from the actual sequence of words in the correct transcript?

WER = (Insertions + Deletions + Substitutions) / (Actual words in transcript)

Insertions: “eat lunch” → “eat a lunch”


Deletions: “see a movie” → “see movie”
Substitutions: “drink ice tea”→ “drink nice tea”
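A minimal sketch of WER computed with word-level edit distance (insertions, deletions, and substitutions all cost 1):

    def word_error_rate(reference, hypothesis):
        """WER = (insertions + deletions + substitutions) / (words in the reference transcript)."""
        ref, hyp = reference.split(), hypothesis.split()
        # dp[i][j] = minimum number of edits to turn ref[:i] into hyp[:j]
        dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            dp[i][0] = i
        for j in range(len(hyp) + 1):
            dp[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                sub = 0 if ref[i - 1] == hyp[j - 1] else 1
                dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                               dp[i][j - 1] + 1,        # insertion
                               dp[i - 1][j - 1] + sub)  # substitution (or match)
        return dp[len(ref)][len(hyp)] / len(ref)

    print(word_error_rate("drink ice tea", "drink nice tea"))  # 1 substitution / 3 words ≈ 0.33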



The End
