2 Corpora and Smoothing
Overview
• (Statistical) language models (n-gram models)
• Counting
• Data
• Definitions
• Evaluations
• Distributions
• Smoothing
• Some slides are based on content from Bob Carpenter, Dan Klein, Roger Levy, Josh
Goodman, Dan Jurafsky, Christopher Manning, Gerald Penn, and Bill MacCartney.
Training data
Training vs testing
[Figure: 'Loaded die' vs. 'Language']
• If you consider all of the past, you will never gather enough data to be useful in practice.
• Imagine you’ve only seen the Brown corpus.
Word prediction
• Guess the next word…
• *Spoilers* You can do quite well by counting how often certain tokens occur given their contexts, i.e., the count of (w_{t-1}, w_t) in the corpus.
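As a minimal sketch of this counting idea (the toy corpus and variable names below are invented for illustration, not taken from the course materials), we can tally how often each token follows a given context and predict the most frequent continuation:

from collections import Counter

# Toy corpus (made up for illustration); in practice this would be a large tokenized text.
tokens = "the cat sat on the mat and the cat ate".split()

# Count how often each word w_t follows a one-word context w_{t-1}.
bigram_counts = Counter(zip(tokens, tokens[1:]))

context = "the"
candidates = {w2: c for (w1, w2), c in bigram_counts.items() if w1 == context}
print(candidates)                           # {'cat': 2, 'mat': 1}
print(max(candidates, key=candidates.get))  # 'cat' -- the most frequent continuation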
Word prediction with N-grams
• N-grams: n.pl. token sequences of length N.
• The fragment ‘in this sentence is’ contains the following
2-grams (i.e., ‘bigrams’):
• (in this), (this sentence), (sentence is)
• What word is most likely to follow 'is'? Derived from the bigram counts Count(is, ⋅).
• E.g., P(is the) > P(is a) ∴ P(the|is) > P(a|is).
• Then we would predict 'the'.
• i.e., P(w_1, w_2, …, w_n) is the probability of the ordered sequence.
• The relative frequency of w in a corpus C is F_C(w) = Count(w, C) / |C|.
• In theory, lim_{|C|→∞} F_C(w) = P(w)   (the "frequentist view")
The chain rule
• Recall,
  P(A, B) = P(B|A) P(A) = P(A|B) P(B)
  P(B|A) = P(A, B) / P(A)
• This extends to longer sequences, e.g.,
  P(A, B, C, D) = P(A) P(B|A) P(C|A, B) P(D|A, B, C)
• Or, in general,
  P(w_1, w_2, …, w_n) = P(w_1) P(w_2|w_1) ⋯ P(w_n|w_1, w_2, …, w_{n−1})
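To make the chain-rule decomposition concrete, here is a small sketch (the uniform toy model is an assumption for illustration only) that multiplies the conditionals P(w_t | w_1, …, w_{t−1}) in log space:

import math

def sequence_log_prob(words, cond_prob):
    """Chain rule: log P(w_1..w_n) = sum_t log P(w_t | w_1..w_{t-1})."""
    total = 0.0
    for t, w in enumerate(words):
        total += math.log(cond_prob(w, words[:t]))  # history = all previous words
    return total

# A made-up conditional model: uniform over a 4-word vocabulary regardless of history,
# so the probability of any length-5 sequence is (1/4)**5.
vocab = ["the", "cat", "in", "hat"]
uniform = lambda w, history: 1.0 / len(vocab)

print(math.exp(sequence_log_prob("the cat in the hat".split(), uniform)))  # ~0.0009765625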
Very simple predictions
• Let’s return to word prediction.
• We want to know the probability of the next word given the previous words in a sequence: P(w_n | w_1, …, w_{n−1}).
• Bigram version: P(w_n | w_{n−1})
Count(wt-1, wt)        wt:  I     want   to     eat   Chinese  food   lunch  spend
wt-1   I                    5     827    0      9     0        0      0      2
       want                 2     0      608    1     6        6      5      1
       to                   2     0      4      686   2        0      6      211
       eat                  0     0      2      0     16       2      42     0
       Chinese              1     0      0      0     0        82     1      0
       food                 15    0      15     0     1        4      0      0
       lunch                2     0      0      0     0        1      0      0
       spend                1     0      1      0     0        0      0      0
Unigram counts:             2533  927    2417   746   158      1093   341    278
• E.g., P(english|want) = 0.0011    (world knowledge)
        P(chinese|want) = 0.0065    (world knowledge)
        P(to|want) = 0.66           (syntax)
        P(eat|to) = 0.28            (syntax)
        P(food|to) = 0              (syntax)
        P(i|<s>) = 0.25             (discourse)
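These values can be checked directly from the two tables above: each conditional is the bigram count divided by the unigram count of the context. A quick sketch (only the counts shown on the slide are used; the helper name is mine):

# Bigram and unigram counts from the tables above.
unigram = {"i": 2533, "want": 927, "to": 2417, "eat": 746,
           "chinese": 158, "food": 1093, "lunch": 341, "spend": 278}
bigram = {("want", "to"): 608, ("to", "eat"): 686,
          ("want", "chinese"): 6, ("to", "food"): 0}

def p(w, prev):
    """MLE bigram probability: Count(prev, w) / Count(prev)."""
    return bigram.get((prev, w), 0) / unigram[prev]

print(round(p("to", "want"), 2))       # 0.66
print(round(p("eat", "to"), 2))        # 0.28
print(round(p("chinese", "want"), 4))  # 0.0065
print(p("food", "to"))                 # 0.0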
• The probability of a sentence is the product of the conditional probabilities of its N-grams:
  P(s) = ∏_t P(w_t | w_{t−2}, w_{t−1})   (trigram)
  P(s) = ∏_t P(w_t | w_{t−1})            (bigram)
• Which of these two models is better?
Shannon’s method
•We can use a language model to generate
random sequences.
[Figure: sampling the next word from candidates {the, Cat, in, Hat, </s>}]
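A minimal sketch of Shannon-style generation (the toy bigram model below is invented for illustration): sample the next word from the model's conditional distribution, append it, and repeat until </s> is drawn.

import random

# Toy bigram model P(next | prev), invented for illustration.
model = {
    "<s>": {"the": 0.7, "Cat": 0.3},
    "the": {"Cat": 0.5, "Hat": 0.5},
    "Cat": {"in": 0.8, "</s>": 0.2},
    "in":  {"the": 1.0},
    "Hat": {"</s>": 1.0},
}

def generate(model, max_len=20):
    words, prev = [], "<s>"
    while len(words) < max_len:
        nxt = random.choices(list(model[prev]), weights=model[prev].values())[0]
        if nxt == "</s>":
            break
        words.append(nxt)
        prev = nxt
    return " ".join(words)

print(generate(model))  # e.g. "the Cat in the Hat"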
Problem with unigrams
•Unigrams give high probability to odd phrases.
  e.g., P(the the the the the </s>) = P(the)^5 ⋅ P(</s>) > P(the Cat in the Hat </s>)
[Figure: candidate words {the, Cat, in, Hat, </s>}]
Shannon’s method – bigrams
• Bigrams have fixed context once that context has been sampled.
• e.g., P(⋅ | the)
[Figure: candidate next words {the, Cat, in, Hat, </s>} at Time Step 1 and Time Step 2]
Shannon and the Wall Street Journal
• Unigram: Months the my and issue of year foreign new exchange’s September were recession exchange new endorsed a acquire to six executives.
• Bigram: Last December through the way to preserve the Hudson corporation N.B.E.C. Taylor would seem to complete the major central planners one point five percent of U.S.E. has already old M.X. corporation of living on information such as more frequently fishing to keep her.
• Trigram: They also point to ninety nine point six billion dollars from two hundred four oh six three percent of the rates of interest stores as Mexico and Brazil on market conditions.
Shannon and Shakespeare
• Unigram: Hill he late speaks; or! A more to leg less first you enter. Are where exeunt and sighs have rise excellency took of.. Sleep knave we. Near; vile like.
• Bigram: What means, sir. I confess she? Then all sorts, he is trim, captain. Why dost stand forth thy canopy, forsooth; he is this palpable hit the King Henry.
• Quadrigram: King Henry. What! I will go seek the traitor Gloucester. Exeunt some of the watch.
• E.g., for P(Gloucester | …), there are very few possible next words, given the training data.
Evaluating a language model
• How can we quantify the goodness of a model?
• To evaluate a model, we first need to estimate the probability of a corpus, P(C) = ∏_i P(s_i), over its sentences s_i.
Intrinsic evaluation
•We estimate 𝑃 ⋅ given a particular corpus, e.g.,
Brown.
• A good model of the Brown corpus is one that makes Brown very likely (even if that model is bad for other corpora).
• If P_1(Brown corpus) ≥ P_j(Brown corpus) ∀j, then model 1 (with its own estimates, e.g., P_1(to|want) = ⋯) is the best model of the Brown corpus.
Maximum likelihood estimate
• The maximum likelihood estimate (MLE) of the parameters θ in a model M, given training data T, is θ̂ = argmax_θ P(T | M, θ).
• In our tables, θ(to|want) is P(to|want).
• In fact, we have been doing MLE, within the N-gram context, all along with our simple counting*
  *(assuming an end-of-sentence token)
• Suppose every word in the vocabulary 𝒱 is equally likely, i.e., P(w) = 1/|𝒱|.
• Then
  PP(C) = 2^(−(1/|C|) Σ_i log_2 P(w_i)) = 2^(−(1/|C|) ⋅ |C| ⋅ log_2(1/|𝒱|)) = 2^(log_2 |𝒱|) = |𝒱|
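A quick numeric check of this result (a sketch; the tiny corpus is invented), using the equivalent form PP(C) = P(C)^(−1/|C|): a model that assigns 1/|𝒱| to every token has perplexity exactly |𝒱|.

import math

def perplexity(corpus, prob):
    """PP(C) = P(C)^(-1/|C|), computed in log space for numerical stability."""
    log_p = sum(math.log(prob(w)) for w in corpus)
    return math.exp(-log_p / len(corpus))

vocab = {"a", "b", "c", "d", "e"}          # |V| = 5
corpus = ["a", "b", "a", "e", "c", "d"]    # toy corpus, invented for the check
uniform = lambda w: 1.0 / len(vocab)

print(perplexity(corpus, uniform))  # ~5.0 == |V|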
Sparseness
•Problem with N-gram models:
•New words appear often as we read new data.
• e.g., interfrastic, espepsia, $182,321.09
•New bigrams occur even more often.
• Recall that Shakespeare only wrote ~0.04% of all
the bigrams he could have, given his vocabulary.
• Because there are so many possible bigrams, we
encounter new ones more frequently as we read.
• New trigrams occur even more even-more-often.
Sparseness of unigrams vs. bigrams
• Conversely, we can see lots of every unigram, but still
miss many bigrams:
                            I     want   to     eat   Chinese  food   lunch  spend
Unigram counts:             2533  927    2417   746   158      1093   341    278

Count(wt-1, wt)        wt:  I     want   to     eat   Chinese  food   lunch  spend
wt-1   I                    5     827    0      9     0        0      0      2
       want                 2     0      608    1     6        6      5      1
       to                   2     0      4      686   2        0      6      211
       eat                  0     0      2      0     16       2      42     0
       Chinese              1     0      0      0     0        82     1      0
       food                 15    0      15     0     1        4      0      0
       lunch                2     0      0      0     0        1      0      0
       spend                1     0      1      0     0        0      0      0
•Clearly, some words (e.g., want) are very popular and will
occur in many bigrams just from random chance.
Zipf’s Law
• A word’s frequency f is inversely proportional to its rank r: f ∝ 1/r, i.e., f ⋅ r = k for some constant k.(*)
(*) This does not make it true.
[Figure: rank–frequency plot, from Wikipedia]
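To eyeball Zipf's Law on any tokenized text, one can check whether frequency × rank stays roughly constant. A sketch (the repeated one-line "corpus" is only a placeholder; a real corpus such as Brown would show the effect much more clearly):

from collections import Counter

# Placeholder text; substitute any real tokenized corpus here.
tokens = ("the cat sat on the mat and the dog sat on the log " * 100).split()

freqs = sorted(Counter(tokens).values(), reverse=True)
for rank, f in enumerate(freqs, start=1):
    print(rank, f, rank * f)   # Zipf's Law predicts rank * f is roughly constant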
Zipf’s Law in perspective
• Zipf’s explanation of the phenomenon involved human
laziness.
Zero probability in Shakespeare
[Figure: word counts in Shakespeare — actual counts vs. adjusted ("imaginary") counts]
1. Add-1 smoothing (“Laplace discounting”)
• Given vocabulary size |𝒱| and corpus size N = |C|.
• Just add 1 to all the counts! No more zeros!
• MLE: P(w) = Count(w)/N
• Laplace estimate: P_Lap(w) = (Count(w) + 1)/(N + |𝒱|)
• Does this give a proper probability distribution? Yes:
  Σ_w P_Lap(w) = Σ_w (Count(w) + 1)/(N + |𝒱|) = (Σ_w Count(w) + Σ_w 1)/(N + |𝒱|) = (N + |𝒱|)/(N + |𝒱|) = 1
1. Add-1 smoothing for bigrams
• Same principle for bigrams:
  P_Lap(w_t | w_{t−1}) = (Count(w_{t−1}, w_t) + 1)/(Count(w_{t−1}) + |𝒱|)
• We are essentially holding out probability mass (|𝒱|/(N + |𝒱|) of it, in the unigram case) and spreading it uniformly over the |𝒱|⋅|𝒱| "imaginary" (w_{t−1}, w_t) events.
• Sometimes this works well (e.g., for text categorization), sometimes not…
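A sketch of the Laplace estimates for bigrams (the toy sentence and names are invented here): every unseen bigram now gets a small nonzero probability, and each context's distribution still sums to 1.

from collections import Counter

tokens = "the cat sat on the mat".split()
vocab = set(tokens)
V = len(vocab)

unigram = Counter(tokens)
bigram = Counter(zip(tokens, tokens[1:]))

def p_laplace(w, prev):
    """P_Lap(w_t | w_{t-1}) = (Count(w_{t-1}, w_t) + 1) / (Count(w_{t-1}) + |V|)."""
    return (bigram[(prev, w)] + 1) / (unigram[prev] + V)

print(p_laplace("cat", "the"))  # seen bigram:   (1 + 1) / (2 + 5) = 2/7
print(p_laplace("sat", "the"))  # unseen bigram: (0 + 1) / (2 + 5) = 1/7
# Each context's distribution still sums to 1 over the vocabulary:
print(sum(p_laplace(w, "the") for w in vocab))  # 1.0 (up to float rounding)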
Is there another way?
• The choice of δ (in add-δ smoothing) is ad hoc.
• Has Zipf taught us nothing?
• Unseen words should behave more like hapax legomena.
• Words that occur a lot should behave like other words
that occur a lot.
• If I keep reading from a corpus, by the time I see a new word like ‘zenzizenzizenzic’, I will have seen ‘the’ a lot more than once more.
• For some word in ‘bin’ N_c, the MLE is that I saw that word c times; Good-Turing instead adjusts c using the number of words that occur c + 1 times.
2. Good-Turing intuition/example
• Imagine you have this toy scenario:
  Word:       ship  pass  camp  frock  soccer  mother  tops
  Frequency:  8     7     3     2      1       1       1       (= 23 words total)
• What is the MLE prior probability of hearing ‘soccer’?
• P(soccer) = 1/23
• c = 0 (unseen words):
  MLE: p = 0/23
  Good-Turing: P*_GT[unseen] = N_1/N = 3/23
• c = 1 (seen once, e.g., soccer):
  MLE: p = 1/23
  Good-Turing: c*(soccer) = (c + 1) ⋅ N_2/N_1 = 2 ⋅ 1/3, so P*_GT(soccer) = (2/3)/23
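The toy computation above can be reproduced in a few lines (only the counts from the slide are used; the variable names are mine):

from collections import Counter

# Frequencies from the toy example on the slide.
freq = {"ship": 8, "pass": 7, "camp": 3, "frock": 2,
        "soccer": 1, "mother": 1, "tops": 1}
N = sum(freq.values())                     # 23 tokens
Nc = Counter(freq.values())                # Nc[c] = number of word types seen c times

# Good-Turing: mass reserved for unseen events, and adjusted count c* = (c+1) * N_{c+1} / N_c.
p_unseen = Nc[1] / N                       # 3/23
c_star_soccer = (1 + 1) * Nc[2] / Nc[1]    # 2 * 1/3
p_gt_soccer = c_star_soccer / N            # (2/3)/23

print(p_unseen, c_star_soccer, p_gt_soccer)  # 0.1304..., 0.666..., 0.0289...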
2. Good-Turing limitations
• Q: What happens when you want to estimate P(w) when w occurs c times, but no word occurs c + 1 times?
• E.g., what is P*_GT(camp), since N_4 = 0?
  Word:       ship  pass  camp  frock  soccer  mother  tops
  Frequency:  8     7     3     2      1       1       1
• A1: We can re-estimate the count as c* = (c + 1) ⋅ E[N_{c+1}] / E[N_c].
  • Uses Expectation-Maximization (a method used later).
5. Kneser-Ney smoothing: motivation
• ‘Francisco’ has a very high unigram probability because it occurs a lot.
• But ‘Francisco’ only occurs after ‘San’.
• Idea: We should give ‘Francisco’ a low unigram probability, because it only occurs within the well-modeled ‘San Francisco’.
5. Kneser-Ney smoothing
• Let the unigram count of w_t be the number of different words that it follows, i.e.:
  N_{1+}(• w_t) = |{ w_{t−1} : C(w_{t−1}, w_t) > 0 }|
  N_{1+}(• •) = Σ_{w_t} N_{1+}(• w_t)   ← the total number of bigram types.
• So, the unigram probability is P_KN(w_t) = N_{1+}(• w_t) / N_{1+}(• •), and:
  P_KN(w_t | w_{t−n+1:t−1}) =
      max(C(w_{t−n+1:t}) − δ, 0) / Σ_{w'} C(w_{t−n+1:t−1} w')
      + [δ ⋅ N_{1+}(w_{t−n+1:t−1} •) / Σ_{w'} C(w_{t−n+1:t−1} w')] ⋅ P_KN(w_t | w_{t−n+2:t−1}),
  where N_{1+}(w_{t−n+1:t−1} •) is the number of possible words that follow the context.
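A bigram-only sketch of interpolated Kneser-Ney as written above (the toy corpus and the fixed discount δ = 0.75 are assumptions for illustration; the slide leaves δ unspecified):

from collections import Counter, defaultdict

tokens = "we saw san francisco and they saw san francisco and i saw toronto".split()
delta = 0.75  # absolute discount; 0.75 is a common choice, assumed here

bigram = Counter(zip(tokens, tokens[1:]))
context_total = Counter(tokens[:-1])          # sum_w' C(w_{t-1}, w') for each context
followers = defaultdict(set)                  # distinct continuations of each context
preceders = defaultdict(set)                  # distinct contexts each word follows
for (w1, w2) in bigram:
    followers[w1].add(w2)
    preceders[w2].add(w1)
n_bigram_types = len(bigram)                  # N_{1+}(. .)

def p_continuation(w):
    """Unigram continuation probability: N_{1+}(. w) / N_{1+}(. .)."""
    return len(preceders[w]) / n_bigram_types

def p_kn(w, prev):
    """Interpolated Kneser-Ney bigram probability (prev must occur as a context)."""
    total = context_total[prev]
    discounted = max(bigram[(prev, w)] - delta, 0) / total
    backoff_weight = delta * len(followers[prev]) / total   # mass freed by discounting
    return discounted + backoff_weight * p_continuation(w)

# 'francisco' only ever follows 'san', so its continuation probability is low
# relative to a word like 'saw' that follows many different contexts:
print(p_continuation("francisco"), p_continuation("saw"))   # 1/9 vs 3/9
print(p_kn("francisco", "san"), p_kn("francisco", "and"))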
5. Modified Kneser-Ney smoothing
• Use different absolute discounts δ depending on the n-gram count, s.t. C(w_{t−n+1:t}) ≥ δ(C(w_{t−n+1:t})) ≥ 0:
  P_MKN(w_t | w_{t−n+1:t−1}) =
      [C(w_{t−n+1:t}) − δ(C(w_{t−n+1:t}))] / Σ_{w'} C(w_{t−n+1:t−1} w')
      + λ(w_{t−n+1:t−1}) ⋅ P_MKN(w_t | w_{t−n+2:t−1}),
  with λ(w_{t−n+1:t−1}) chosen so that the distribution sums to 1.
• δ(C(w_{t−n+1:t})) could be learned or approximated.
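The slide notes that δ(C(·)) could be learned or approximated. One widely used closed-form approximation (the Chen & Goodman estimates, recalled here from memory as an assumption rather than taken from the slide) derives three discounts from the count-of-count statistics N_1…N_4:

from collections import Counter

def mkn_discounts(ngram_counts):
    """Chen & Goodman style discounts: Y = N1/(N1 + 2*N2), D_c = c - (c+1)*Y*N_{c+1}/N_c."""
    Nc = Counter(ngram_counts.values())   # Nc[c] = number of n-gram types seen c times
    Y = Nc[1] / (Nc[1] + 2 * Nc[2])
    d1 = 1 - 2 * Y * Nc[2] / Nc[1]        # discount for n-grams seen once
    d2 = 2 - 3 * Y * Nc[3] / Nc[2]        # discount for n-grams seen twice
    d3_plus = 3 - 4 * Y * Nc[4] / Nc[3]   # discount for n-grams seen three or more times
    return d1, d2, d3_plus

# Toy bigram counts, invented for illustration only.
counts = {("a", "b"): 1, ("b", "c"): 1, ("c", "a"): 2, ("a", "c"): 3,
          ("c", "b"): 1, ("b", "a"): 2, ("a", "a"): 4, ("b", "b"): 3}
print(mkn_discounts(counts))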