2 Corpora and Smoothing
Overview
• (Statistical) language models (n-gram models)
• Counting
• Data
• Definitions
• Evaluations
• Distributions
• Smoothing
• Some slides are based on content from Bob Carpenter, Dan Klein, Roger Levy, Josh
Goodman, Dan Jurafsky, Christopher Manning, Gerald Penn, and Bill MacCartney.
Training data
Training vs testing
[Figure: 'Loaded die' vs. 'Language']
• If you consider all of the past, you will never gather enough data to be useful in practice.
• Imagine you’ve only seen the Brown corpus.
Word prediction
• Guess the next word…
• *Spoilers* You can do quite well by counting how often certain tokens occur given their contexts, i.e., the count of (w_{t-1}, w_t) in the corpus.
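As a minimal sketch of this counting idea (the toy corpus and variable names below are invented for illustration, not taken from the course materials), we can tally how often each token follows a given context and predict the most frequent continuation:

from collections import Counter

# Toy corpus (made up for illustration); in practice this would be a large tokenized text.
tokens = "the cat sat on the mat and the cat ate".split()

# Count how often each word w_t follows a one-word context w_{t-1}.
bigram_counts = Counter(zip(tokens, tokens[1:]))

context = "the"
candidates = {w2: c for (w1, w2), c in bigram_counts.items() if w1 == context}
print(candidates)                           # {'cat': 2, 'mat': 1}
print(max(candidates, key=candidates.get))  # 'cat' -- the most frequent continuation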
Word prediction with N-grams
• N-grams: n.pl. token sequences of length N.
• The fragment ‘in this sentence is’ contains the following
2-grams (i.e., ‘bigrams’):
• (in this), (this sentence), (sentence is)
• What word is most likely to follow 'is'? Derived from the bigram counts Count(is, ⋅).
• E.g., P(is the) > P(is a) ∴ P(the|is) > P(a|is).
• Then we would predict 'the'.
• i.e., P(w_1, w_2, …, w_n) is the probability of the ordered sequence.
• The relative frequency of w in a corpus C is F_C(w) = Count(w, C) / |C|.
• In theory, lim_{|C|→∞} F_C(w) = P(w)   (the "frequentist view")
The chain rule
• Recall,
  P(A, B) = P(B|A) P(A) = P(A|B) P(B)
  P(B|A) = P(A, B) / P(A)
• This extends to longer sequences, e.g.,
  P(A, B, C, D) = P(A) P(B|A) P(C|A, B) P(D|A, B, C)
• Or, in general,
  P(w_1, w_2, …, w_n) = P(w_1) P(w_2|w_1) ⋯ P(w_n|w_1, w_2, …, w_{n−1})
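To make the chain-rule decomposition concrete, here is a small sketch (the uniform toy model is an assumption for illustration only) that multiplies the conditionals P(w_t | w_1, …, w_{t−1}) in log space:

import math

def sequence_log_prob(words, cond_prob):
    """Chain rule: log P(w_1..w_n) = sum_t log P(w_t | w_1..w_{t-1})."""
    total = 0.0
    for t, w in enumerate(words):
        total += math.log(cond_prob(w, words[:t]))  # history = all previous words
    return total

# A made-up conditional model: uniform over a 4-word vocabulary regardless of history,
# so the probability of any length-5 sequence is (1/4)**5.
vocab = ["the", "cat", "in", "hat"]
uniform = lambda w, history: 1.0 / len(vocab)

print(math.exp(sequence_log_prob("the cat in the hat".split(), uniform)))  # ~0.0009765625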
Very simple predictions
• Let’s return to word prediction.
• We want to know the probability of the next word given the previous words in a sequence: P(w_n | w_1, …, w_{n−1}).
• Bigram version: P(w_n | w_{n−1})
Count(wt-1, wt)        wt:  I     want   to     eat   Chinese  food   lunch  spend
wt-1   I                    5     827    0      9     0        0      0      2
       want                 2     0      608    1     6        6      5      1
       to                   2     0      4      686   2        0      6      211
       eat                  0     0      2      0     16       2      42     0
       Chinese              1     0      0      0     0        82     1      0
       food                 15    0      15     0     1        4      0      0
       lunch                2     0      0      0     0        1      0      0
       spend                1     0      1      0     0        0      0      0
Unigram counts:             2533  927    2417   746   158      1093   341    278
• E.g., P(english|want) = 0.0011    (world knowledge)
        P(chinese|want) = 0.0065    (world knowledge)
        P(to|want) = 0.66           (syntax)
        P(eat|to) = 0.28            (syntax)
        P(food|to) = 0              (syntax)
        P(i|<s>) = 0.25             (discourse)
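These values can be checked directly from the two tables above: each conditional is the bigram count divided by the unigram count of the context. A quick sketch (only the counts shown on the slide are used; the helper name is mine):

# Bigram and unigram counts from the tables above.
unigram = {"i": 2533, "want": 927, "to": 2417, "eat": 746,
           "chinese": 158, "food": 1093, "lunch": 341, "spend": 278}
bigram = {("want", "to"): 608, ("to", "eat"): 686,
          ("want", "chinese"): 6, ("to", "food"): 0}

def p(w, prev):
    """MLE bigram probability: Count(prev, w) / Count(prev)."""
    return bigram.get((prev, w), 0) / unigram[prev]

print(round(p("to", "want"), 2))       # 0.66
print(round(p("eat", "to"), 2))        # 0.28
print(round(p("chinese", "want"), 4))  # 0.0065
print(p("food", "to"))                 # 0.0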
• The probability of a sentence is the product of the conditional probabilities of its N-grams:
  P(s) = ∏_t P(w_t | w_{t−2}, w_{t−1})   (trigram)
  P(s) = ∏_t P(w_t | w_{t−1})            (bigram)
• Which of these two models is better?
Shannon’s method
•We can use a language model to generate
random sequences.
[Figure: sampling the next word from candidates {the, Cat, in, Hat, </s>}]
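A minimal sketch of Shannon-style generation (the toy bigram model below is invented for illustration): sample the next word from the model's conditional distribution, append it, and repeat until </s> is drawn.

import random

# Toy bigram model P(next | prev), invented for illustration.
model = {
    "<s>": {"the": 0.7, "Cat": 0.3},
    "the": {"Cat": 0.5, "Hat": 0.5},
    "Cat": {"in": 0.8, "</s>": 0.2},
    "in":  {"the": 1.0},
    "Hat": {"</s>": 1.0},
}

def generate(model, max_len=20):
    words, prev = [], "<s>"
    while len(words) < max_len:
        nxt = random.choices(list(model[prev]), weights=model[prev].values())[0]
        if nxt == "</s>":
            break
        words.append(nxt)
        prev = nxt
    return " ".join(words)

print(generate(model))  # e.g. "the Cat in the Hat"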
Problem with unigrams
•Unigrams give high probability to odd phrases.
  e.g., P(the the the the the </s>) = P(the)^5 ⋅ P(</s>) > P(the Cat in the Hat </s>)
[Figure: candidate words {the, Cat, in, Hat, </s>}]
Shannon’s method – bigrams
• Bigrams have fixed context once that context has been sampled.
• e.g., P(⋅ | the)
[Figure: candidate next words {the, Cat, in, Hat, </s>} at Time Step 1 and Time Step 2]
Shannon and the Wall Street Journal
• Unigram: Months the my and issue of year foreign new exchange’s September were recession exchange new endorsed a acquire to six executives.
• Bigram: Last December through the way to preserve the Hudson corporation N.B.E.C. Taylor would seem to complete the major central planners one point five percent of U.S.E. has already old M.X. corporation of living on information such as more frequently fishing to keep her.
• Trigram: They also point to ninety nine point six billion dollars from two hundred four oh six three percent of the rates of interest stores as Mexico and Brazil on market conditions.
Shannon and Shakespeare
• Unigram: Hill he late speaks; or! A more to leg less first you enter. Are where exeunt and sighs have rise excellency took of.. Sleep knave we. Near; vile like.
• Bigram: What means, sir. I confess she? Then all sorts, he is trim, captain. Why dost stand forth thy canopy, forsooth; he is this palpable hit the King Henry.
• Quadrigram: King Henry. What! I will go seek the traitor Gloucester. Exeunt some of the watch.
• E.g., for P(Gloucester | …), there are very few possible next words, given the training data.
Evaluating a language model
• How can we quantify the goodness of a model?
• To evaluate a model, we first need to estimate the probability of a corpus, P(C) = ∏_i P(s_i), over its sentences s_i.
Intrinsic evaluation
•We estimate 𝑃 ⋅ given a particular corpus, e.g.,
Brown.
• A good model of the Brown corpus is one that makes Brown very likely (even if that model is bad for other corpora).
• If P_1(Brown corpus) ≥ P_j(Brown corpus) ∀j, then model 1 (with its own estimates, e.g., P_1(to|want) = ⋯) is the best model of the Brown corpus.
Maximum likelihood estimate
• The maximum likelihood estimate (MLE) of the parameters θ in a model M, given training data T, is θ̂ = argmax_θ P(T | M, θ).
• In our tables, θ(to|want) is P(to|want).
• In fact, we have been doing MLE, within the N-gram context, all along with our simple counting*
  *(assuming an end-of-sentence token)
• Suppose every word in the vocabulary 𝒱 is equally likely, i.e., P(w) = 1/|𝒱|.
• Then
  PP(C) = 2^(−(1/|C|) Σ_i log_2 P(w_i)) = 2^(−(1/|C|) ⋅ |C| ⋅ log_2(1/|𝒱|)) = 2^(log_2 |𝒱|) = |𝒱|
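A quick numeric check of this result (a sketch; the tiny corpus is invented), using the equivalent form PP(C) = P(C)^(−1/|C|): a model that assigns 1/|𝒱| to every token has perplexity exactly |𝒱|.

import math

def perplexity(corpus, prob):
    """PP(C) = P(C)^(-1/|C|), computed in log space for numerical stability."""
    log_p = sum(math.log(prob(w)) for w in corpus)
    return math.exp(-log_p / len(corpus))

vocab = {"a", "b", "c", "d", "e"}          # |V| = 5
corpus = ["a", "b", "a", "e", "c", "d"]    # toy corpus, invented for the check
uniform = lambda w: 1.0 / len(vocab)

print(perplexity(corpus, uniform))  # ~5.0 == |V|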
Sparseness
•Problem with N-gram models:
•New words appear often as we read new data.
• e.g., interfrastic, espepsia, $182,321.09
•New bigrams occur even more often.
• Recall that Shakespeare only wrote ~0.04% of all
the bigrams he could have, given his vocabulary.
• Because there are so many possible bigrams, we
encounter new ones more frequently as we read.
• New trigrams occur even more even-more-often.
Sparseness of unigrams vs. bigrams
• Conversely, we can see lots of every unigram, but still
miss many bigrams:
                            I     want   to     eat   Chinese  food   lunch  spend
Unigram counts:             2533  927    2417   746   158      1093   341    278

Count(wt-1, wt)        wt:  I     want   to     eat   Chinese  food   lunch  spend
wt-1   I                    5     827    0      9     0        0      0      2
       want                 2     0      608    1     6        6      5      1
       to                   2     0      4      686   2        0      6      211
       eat                  0     0      2      0     16       2      42     0
       Chinese              1     0      0      0     0        82     1      0
       food                 15    0      15     0     1        4      0      0
       lunch                2     0      0      0     0        1      0      0
       spend                1     0      1      0     0        0      0      0
•Clearly, some words (e.g., want) are very popular and will
occur in many bigrams just from random chance.
Zipf’s Law
• A word’s frequency f is inversely proportional to its rank r: f ∝ 1/r, i.e., f ⋅ r = k for some constant k.(*)
(*) This does not make it true.
[Figure: rank–frequency plot, from Wikipedia]
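To eyeball Zipf's Law on any tokenized text, one can check whether frequency × rank stays roughly constant. A sketch (the repeated one-line "corpus" is only a placeholder; a real corpus such as Brown would show the effect much more clearly):

from collections import Counter

# Placeholder text; substitute any real tokenized corpus here.
tokens = ("the cat sat on the mat and the dog sat on the log " * 100).split()

freqs = sorted(Counter(tokens).values(), reverse=True)
for rank, f in enumerate(freqs, start=1):
    print(rank, f, rank * f)   # Zipf's Law predicts rank * f is roughly constant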
Zipf’s Law in perspective
• Zipf’s explanation of the phenomenon involved human
laziness.
Zero probability in Shakespeare
[Figure: word counts in Shakespeare — actual counts vs. adjusted ("imaginary") counts]
1. Add-1 smoothing (“Laplace discounting”)
• Given vocabulary size |𝒱| and corpus size N = |C|.
• Just add 1 to all the counts! No more zeros!
• MLE: P(w) = Count(w)/N
• Laplace estimate: P_Lap(w) = (Count(w) + 1)/(N + |𝒱|)
• Does this give a proper probability distribution? Yes:
  Σ_w P_Lap(w) = Σ_w (Count(w) + 1)/(N + |𝒱|) = (Σ_w Count(w) + Σ_w 1)/(N + |𝒱|) = (N + |𝒱|)/(N + |𝒱|) = 1
1. Add-1 smoothing for bigrams
• Same principle for bigrams:
  P_Lap(w_t | w_{t−1}) = (Count(w_{t−1}, w_t) + 1)/(Count(w_{t−1}) + |𝒱|)
• We are essentially holding out probability mass (|𝒱|/(N + |𝒱|) of it, in the unigram case) and spreading it uniformly over the |𝒱|⋅|𝒱| "imaginary" (w_{t−1}, w_t) events.
• Sometimes this works well (e.g., for text categorization), sometimes not…
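A sketch of the Laplace estimates for bigrams (the toy sentence and names are invented here): every unseen bigram now gets a small nonzero probability, and each context's distribution still sums to 1.

from collections import Counter

tokens = "the cat sat on the mat".split()
vocab = set(tokens)
V = len(vocab)

unigram = Counter(tokens)
bigram = Counter(zip(tokens, tokens[1:]))

def p_laplace(w, prev):
    """P_Lap(w_t | w_{t-1}) = (Count(w_{t-1}, w_t) + 1) / (Count(w_{t-1}) + |V|)."""
    return (bigram[(prev, w)] + 1) / (unigram[prev] + V)

print(p_laplace("cat", "the"))  # seen bigram:   (1 + 1) / (2 + 5) = 2/7
print(p_laplace("sat", "the"))  # unseen bigram: (0 + 1) / (2 + 5) = 1/7
# Each context's distribution still sums to 1 over the vocabulary:
print(sum(p_laplace(w, "the") for w in vocab))  # 1.0 (up to float rounding)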
Is there another way?
• The choice of δ (in add-δ smoothing) is ad hoc.
• Has Zipf taught us nothing?
• Unseen words should behave more like hapax legomena.
• Words that occur a lot should behave like other words
that occur a lot.
• If I keep reading from a corpus, by the time I see a new word like ‘zenzizenzizenzic’, I will have seen ‘the’ a lot more than once more.
• For some word in ‘bin’ N_c, the MLE is that I saw that word c times; Good-Turing instead adjusts c using the number of words that occur c + 1 times.
2. Good-Turing intuition/example
• Imagine you have this toy scenario:
  Word:       ship  pass  camp  frock  soccer  mother  tops
  Frequency:  8     7     3     2      1       1       1       (= 23 words total)
• What is the MLE prior probability of hearing ‘soccer’?
• P(soccer) = 1/23
• c = 0 (unseen words):
  MLE: p = 0/23
  Good-Turing: P*_GT[unseen] = N_1/N = 3/23
• c = 1 (seen once, e.g., soccer):
  MLE: p = 1/23
  Good-Turing: c*(soccer) = (c + 1) ⋅ N_2/N_1 = 2 ⋅ 1/3, so P*_GT(soccer) = (2/3)/23
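The toy computation above can be reproduced in a few lines (only the counts from the slide are used; the variable names are mine):

from collections import Counter

# Frequencies from the toy example on the slide.
freq = {"ship": 8, "pass": 7, "camp": 3, "frock": 2,
        "soccer": 1, "mother": 1, "tops": 1}
N = sum(freq.values())                     # 23 tokens
Nc = Counter(freq.values())                # Nc[c] = number of word types seen c times

# Good-Turing: mass reserved for unseen events, and adjusted count c* = (c+1) * N_{c+1} / N_c.
p_unseen = Nc[1] / N                       # 3/23
c_star_soccer = (1 + 1) * Nc[2] / Nc[1]    # 2 * 1/3
p_gt_soccer = c_star_soccer / N            # (2/3)/23

print(p_unseen, c_star_soccer, p_gt_soccer)  # 0.1304..., 0.666..., 0.0289...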
2. Good-Turing limitations
• Q: What happens when you want to estimate P(w) when w occurs c times, but no word occurs c + 1 times?
• E.g., what is P*_GT(camp), since N_4 = 0?
  Word:       ship  pass  camp  frock  soccer  mother  tops
  Frequency:  8     7     3     2      1       1       1
• A1: We can re-estimate the count as c* = (c + 1) ⋅ E[N_{c+1}] / E[N_c].
  • Uses Expectation-Maximization (a method used later).
5. Kneser-Ney smoothing: motivation
• ‘Francisco’ has a very high unigram probability because it occurs a lot.
• But ‘Francisco’ only occurs after ‘San’.
• Idea: We should give ‘Francisco’ a low unigram probability, because it only occurs within the well-modeled ‘San Francisco’.
5. Kneser-Ney smoothing
• Let the unigram count of w_t be the number of different words that it follows, i.e.:
  N_{1+}(• w_t) = |{ w_{t−1} : C(w_{t−1}, w_t) > 0 }|
  N_{1+}(• •) = Σ_{w_t} N_{1+}(• w_t)   ← the total number of bigram types.
• So, the unigram probability is P_KN(w_t) = N_{1+}(• w_t) / N_{1+}(• •), and:
  P_KN(w_t | w_{t−n+1:t−1}) =
      max(C(w_{t−n+1:t}) − δ, 0) / Σ_{w'} C(w_{t−n+1:t−1} w')
      + [δ ⋅ N_{1+}(w_{t−n+1:t−1} •) / Σ_{w'} C(w_{t−n+1:t−1} w')] ⋅ P_KN(w_t | w_{t−n+2:t−1}),
  where N_{1+}(w_{t−n+1:t−1} •) is the number of possible words that follow the context.
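A bigram-only sketch of interpolated Kneser-Ney as written above (the toy corpus and the fixed discount δ = 0.75 are assumptions for illustration; the slide leaves δ unspecified):

from collections import Counter, defaultdict

tokens = "we saw san francisco and they saw san francisco and i saw toronto".split()
delta = 0.75  # absolute discount; 0.75 is a common choice, assumed here

bigram = Counter(zip(tokens, tokens[1:]))
context_total = Counter(tokens[:-1])          # sum_w' C(w_{t-1}, w') for each context
followers = defaultdict(set)                  # distinct continuations of each context
preceders = defaultdict(set)                  # distinct contexts each word follows
for (w1, w2) in bigram:
    followers[w1].add(w2)
    preceders[w2].add(w1)
n_bigram_types = len(bigram)                  # N_{1+}(. .)

def p_continuation(w):
    """Unigram continuation probability: N_{1+}(. w) / N_{1+}(. .)."""
    return len(preceders[w]) / n_bigram_types

def p_kn(w, prev):
    """Interpolated Kneser-Ney bigram probability (prev must occur as a context)."""
    total = context_total[prev]
    discounted = max(bigram[(prev, w)] - delta, 0) / total
    backoff_weight = delta * len(followers[prev]) / total   # mass freed by discounting
    return discounted + backoff_weight * p_continuation(w)

# 'francisco' only ever follows 'san', so its continuation probability is low
# relative to a word like 'saw' that follows many different contexts:
print(p_continuation("francisco"), p_continuation("saw"))   # 1/9 vs 3/9
print(p_kn("francisco", "san"), p_kn("francisco", "and"))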
5. Modified Kneser-Ney smoothing
• Use different absolute discounts δ depending on the n-gram count, s.t. C(w_{t−n+1:t}) ≥ δ(C(w_{t−n+1:t})) ≥ 0:
  P_MKN(w_t | w_{t−n+1:t−1}) =
      [C(w_{t−n+1:t}) − δ(C(w_{t−n+1:t}))] / Σ_{w'} C(w_{t−n+1:t−1} w')
      + λ(w_{t−n+1:t−1}) ⋅ P_MKN(w_t | w_{t−n+2:t−1}),
  with λ(w_{t−n+1:t−1}) chosen so that the distribution sums to 1.
• δ(C(w_{t−n+1:t})) could be learned or approximated.
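The slide notes that δ(C(·)) could be learned or approximated. One widely used closed-form approximation (the Chen & Goodman estimates, recalled here from memory as an assumption rather than taken from the slide) derives three discounts from the count-of-count statistics N_1…N_4:

from collections import Counter

def mkn_discounts(ngram_counts):
    """Chen & Goodman style discounts: Y = N1/(N1 + 2*N2), D_c = c - (c+1)*Y*N_{c+1}/N_c."""
    Nc = Counter(ngram_counts.values())   # Nc[c] = number of n-gram types seen c times
    Y = Nc[1] / (Nc[1] + 2 * Nc[2])
    d1 = 1 - 2 * Y * Nc[2] / Nc[1]        # discount for n-grams seen once
    d2 = 2 - 3 * Y * Nc[3] / Nc[2]        # discount for n-grams seen twice
    d3_plus = 3 - 4 * Y * Nc[4] / Nc[3]   # discount for n-grams seen three or more times
    return d1, d2, d3_plus

# Toy bigram counts, invented for illustration only.
counts = {("a", "b"): 1, ("b", "c"): 1, ("c", "a"): 2, ("a", "c"): 3,
          ("c", "b"): 1, ("b", "a"): 2, ("a", "a"): 4, ("b", "b"): 3}
print(mkn_discounts(counts))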