Lecture 03
http://courses.engr.illinois.edu/cs447
Lecture 3:
Language Models
(Intro to Probability Models for NLP)
Julia Hockenmaier
juliahmr@illinois.edu
3324 Siebel Center
Lecture 03, Part 1: Overview
Today’s reading:
Chapter 3 (3rd Edition)
P(X | Y) = P(X, Y) / P(Y)
         = P(X) P(Y) / P(Y)    (X, Y independent)
         = P(X)
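A small numeric illustration (this coin example is an added illustration, not from the slide): if X and Y are two independent fair coin flips, then P(X = heads | Y = heads) = (0.5 · 0.5) / 0.5 = 0.5 = P(X = heads), so conditioning on Y does not change the probability of X.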
∀w ∈ V: ∑_{w′ ∈ V} P(w^(i) = w′ | w^(i−1) = w) = 1

Prob. of all N-word strings:
P(V^N) = ∑_{w^(1) … w^(N) ∈ V^N} [ ∏_{i=1…N} P(w^(i) | w^(i−1)) ] = 1
But instead of N separate distributions, we want one distribution over strings of any length.
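A minimal sketch that checks this numerically for a toy vocabulary (the two-word vocabulary, the bigram table, and the fixed start context are illustrative assumptions):

import itertools

# Toy bigram table (illustrative): probs[context][word] = P(word | context).
# Every row sums to 1, so for each fixed length N the bigram probabilities
# define a proper distribution over the N-word strings in V^N.
V = ["a", "b"]
probs = {"a": {"a": 0.3, "b": 0.7},
         "b": {"a": 0.6, "b": 0.4}}

def string_prob(words, start="a"):
    """Product of bigram probabilities, conditioning the first word on `start`."""
    p, prev = 1.0, start
    for w in words:
        p *= probs[prev][w]
        prev = w
    return p

for N in (1, 2, 3):
    total = sum(string_prob(s) for s in itertools.product(V, repeat=N))
    print(N, round(total, 10))   # prints 1.0 for every N: one distribution per length

Each length N gets its own normalized distribution, which is exactly what the EOS construction below replaces with a single distribution.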
From n-gram probabilities to language models
Solution:
Add an End-of-Sentence (EOS) token to V
Assume a) that each string ends in EOS and
b) that EOS can only appear at the end of a string.
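For a bigram model, for example, these assumptions give (writing w_0 for the start-of-string context, an assumption made here for concreteness):

P(w_1 … w_n EOS) = [ ∏_{i=1…n} P(w_i | w_{i−1}) ] · P(EOS | w_n)

Because every string ends in EOS, these probabilities over strings of all lengths can form a single distribution, rather than one distribution per length (assuming the process eventually generates EOS with probability 1).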
From n-gram probabilities to language models with EOS
Think of a language model as a stochastic process:
— At each time step, randomly pick one more word.
— Stop generating more words when the word you pick
is a special end-of-sentence (EOS) token.
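A minimal sketch of this stochastic process for a bigram model (the probability table, the <BOS> start context, and the length cap are illustrative assumptions, not part of the slide):

import random

# Hypothetical bigram table: bigram_probs[context][word] = P(word | context).
bigram_probs = {
    "<BOS>": {"the": 0.6, "a": 0.4},
    "the":   {"cat": 0.5, "dog": 0.3, "<EOS>": 0.2},
    "a":     {"cat": 0.4, "dog": 0.4, "<EOS>": 0.2},
    "cat":   {"sat": 0.5, "<EOS>": 0.5},
    "dog":   {"sat": 0.5, "<EOS>": 0.5},
    "sat":   {"<EOS>": 1.0},
}

def generate(bigram_probs, max_len=50):
    """At each time step, randomly pick one more word; stop at <EOS>."""
    context, words = "<BOS>", []
    for _ in range(max_len):                       # safety cap on string length
        candidates = list(bigram_probs[context])
        weights = [bigram_probs[context][w] for w in candidates]
        word = random.choices(candidates, weights=weights, k=1)[0]
        if word == "<EOS>":
            break
        words.append(word)
        context = word
    return " ".join(words)

print(generate(bigram_probs))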
Refinements:
Use different UNK tokens for different types of words
(numbers, capitalized words, lower-case words, etc.)
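One possible way to implement typed UNK tokens (the specific categories and token names below are illustrative choices):

def unk_token(word, vocab):
    """Map an out-of-vocabulary word to a typed UNK token."""
    if word in vocab:
        return word
    if word.isdigit():
        return "<UNK-NUM>"       # numbers
    if word[0].isupper():
        return "<UNK-CAP>"       # capitalized words
    return "<UNK-LC>"            # lower-case and everything else

vocab = {"the", "cat", "sat"}
print([unk_token(w, vocab) for w in "The cat saw 42 unicorns".split()])
# ['<UNK-CAP>', 'cat', '<UNK-LC>', '<UNK-NUM>', '<UNK-LC>']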
[Figure: sampling a word x1 … x5 by drawing a random number r ∈ [0, 1) and comparing it against the cumulative probabilities 0, p1, p1+p2, p1+p2+p3, p1+p2+p3+p4, 1]
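A minimal sketch of the sampling scheme in the figure (the example words and probabilities are placeholders):

import random

def sample(words, probs):
    """Draw r in [0, 1) and return the first word whose cumulative
    probability exceeds r, as in the figure above."""
    r, cumulative = random.random(), 0.0
    for word, p in zip(words, probs):
        cumulative += p
        if r < cumulative:
            return word
    return words[-1]    # guard against floating-point round-off

print(sample(["x1", "x2", "x3", "x4", "x5"], [0.1, 0.2, 0.3, 0.25, 0.15]))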
[Figure: without smoothing, all probability mass goes to seen events (P(seen) = 1.0) and unseen events get P = ???; smoothing shrinks P(seen) to < 1.0 to reserve mass for unseen events]
Linear interpolation:
P̃(w | w′, w′′) = λ P̂(w | w′, w′′) + (1 − λ) P̃(w | w′)
Interpolate n-gram model with (n–1)-gram model.
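A minimal sketch of this interpolation (the placeholder estimators and the value of λ are illustrative; in practice λ is tuned on held-out data):

# Placeholder estimators (illustrative only); in practice these would be
# relative-frequency estimates from a training corpus.
def p_hat_trigram(w, w1, w2):   # P_hat(w | w', w'')
    return 0.10

def p_tilde_bigram(w, w1):      # already-smoothed P_tilde(w | w')
    return 0.02

def p_tilde_trigram(w, w1, w2, lam=0.7):
    """P_tilde(w | w', w'') = lam * P_hat(w | w', w'') + (1 - lam) * P_tilde(w | w')."""
    return lam * p_hat_trigram(w, w1, w2) + (1 - lam) * p_tilde_bigram(w, w1)

print(p_tilde_trigram("cat", "the", "sat"))   # 0.7 * 0.10 + 0.3 * 0.02 = 0.076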
Perplexity:

PP(w_1 … w_N) =_def P(w_1 … w_N)^(−1/N)
             = ( 1 / P(w_1 … w_N) )^(1/N)
             = ( ∏_{i=1…N} 1 / P(w_i | w_1 … w_{i−1}) )^(1/N)
             = ( ∏_{i=1…N} 1 / P(w_i | w_{i−n+1} … w_{i−1}) )^(1/N)    (for an n-gram model)

An LM with lower perplexity is better because it assigns a higher probability to the unseen test corpus.
LM1 and LM2's perplexity can only be compared if they use the same vocabulary.
— Trigram models have lower perplexity than bigram models;
— Bigram models have lower perplexity than unigram models, etc.
Practical issues: Use logarithms!
Since language model probabilities are very small, multiplying them together often leads to underflow.
Instead, compute perplexity with logarithms:

PP(w_1 … w_N) =_def exp( −(1/N) ∑_{i=1…N} log P(w_i | w_{i−1}, …, w_{i−n+1}) )
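A minimal sketch of this computation (the list of per-token probabilities is an assumed input; a real evaluation would obtain them from the model on a test corpus):

import math

def perplexity(token_probs):
    """PP = exp( -(1/N) * sum_i log P(w_i | history) ), the log-space form above.
    Mathematically equal to the product form, but avoids underflow."""
    n = len(token_probs)
    log_sum = sum(math.log(p) for p in token_probs)
    return math.exp(-log_sum / n)

print(perplexity([0.1, 0.2, 0.05, 0.1]))   # ≈ 10.0 (geometric mean of the probs is 0.1)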
Task-based evaluation:
- Train model A, plug it into your system for performing task T.
- Evaluate the performance of system A on task T.
- Train model B, plug it in, and evaluate system B on the same task T.
- Compare the scores of system A and system B on task T.