2 Corpora and Smoothing

CSC401/2511 – Spring 2022 1

Overview
•(Statistical) language models (n-gram models)
•Counting
•Data
•Definitions
•Evaluations
•Distributions
•Smoothing
• Some slides are based on content from Bob Carpenter, Dan Klein, Roger Levy, Josh
Goodman, Dan Jurafsky, Christopher Manning, Gerald Penn, and Bill MacCartney.

CSC401/2511 – Spring 2022 2


Statistics: what are we counting?
•Statistical language models are based on simple
counting.
•What are we counting?
First, we shape our tools and thereafter
our tools shape us.

•Tokens: n.pl. instances of words or punctuation (13 in the quote above).

•Types: n.pl. ‘kinds’ of words or punctuation (10 in the quote above).
CSC401/2511 – Spring 2022 3
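Aside: a minimal Python sketch (not from the lecture) of counting tokens vs. types for the quoted sentence; the regex tokenizer is an assumption, and different tokenization choices change the counts.

```python
import re

sentence = "First, we shape our tools and thereafter our tools shape us."

# Naive tokenization: words plus punctuation marks as separate tokens
# (an assumption; real tokenizers differ).
tokens = re.findall(r"\w+|[^\w\s]", sentence)
types = set(tokens)

print(len(tokens))  # 13 token instances
print(len(types))   # 10 types ('our', 'tools', and 'shape' each repeat)
```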
Confounding factors
•Are the following word pairs one type or two?
• (run, runs) (verb conjugation)
• (happy, happily) (adjective vs. adverb)
• (FRAGment, fragMENT) (spoken stress)
• (realize, realise) (spelling)
• (We, we) (capitalization)

•How do we count speech disfluencies?


• e.g., I uh main-mainly do data processing
• Answer: It depends on your task.
• e.g., if you’re doing
summarization, you usually
don’t care about ‘uh’.
CSC401/2511 – Spring 2022 4
Does it matter how we count things?
•Answer: See lecture on feature extraction.
•Preview: yes, it matters…(sometimes)
•E.g., to diagnose Alzheimer’s disease from a
patient’s speech, you may want to measure:
• Excessive pauses (disfluencies),
• Excessive word type repetition, and
• Simplistic or short sentences.

•Where do we count things?


CSC401/2511 – Spring 2022 5
Corpora
•Corpus: n. A body of language data of a
particular sort (pl. corpora).

•Most useful corpora occur naturally.


• e.g., newspaper articles, telephone conversations,
multilingual transcripts of the United Nations, tweets.

•We use corpora to gather statistics.


•More is better (typically between 10M and 1T words).
•Be aware of bias.
CSC401/2511 – Spring 2022 6
Statistical modelling
•Insofar as language can be modelled statistically, it
might help to think of it in terms of dice.
Fair die Language

• Vocabulary: numbers • Vocabulary: words


• Vocabulary size: 6 • Vocabulary size: 2– 200,000

CSC401/2511 – Spring 2022 7


Learning probabilities
•What if the symbols are not equally likely?
• We have to estimate the bias using training data.
• Loaded die: observe many rolls of the die, e.g., 1, 6, 5, 4, 1, 3, 2, 2, …
• Language: observe many words, e.g., … and then I will …
(These observations are the training data.)
CSC401/2511 – Spring 2022 8
Training vs testing
•So you’ve learned your probabilities.
• Do they model unseen data from the same source well?
• Loaded die: keep rolling the same dice. Do sides keep appearing in the same proportion as we expect?
• Language: keep reading words. Do words keep appearing in the same proportion as we expect?

CSC401/2511 – Spring 2022 9


Sequences with no dependencies
• If you ignore the past entirely, the probability of a sequence
is the product of prior probabilities.

Loaded die:
$P(2,1,4) = P(2)\,P(1)\,P(4) = P(2)\,P(4)\,P(1) = P(2,4,1)$

Language:
$P(\text{the old car}) = P(\text{the})\,P(\text{old})\,P(\text{car}) = P(\text{the})\,P(\text{car})\,P(\text{old}) = P(\text{the car old})$

• Language involves context. Ignoring it gives weird results, e.g., ‘the old car’ and ‘the car old’ receive the same probability.

CSC401/2511 – Spring 2022 10
Sequences with full dependencies
Magic die (with total memory):
$P(2,1,4) = P(2)\,P(1 \mid 2)\,P(4 \mid 2,1)$

Language:
$P(\text{the old car}) = P(\text{the})\,P(\text{old} \mid \text{the})\,P(\text{car} \mid \text{the old})$

• If you consider all of the past, you will never gather enough data to be useful in practice.
• Imagine you’ve only seen the Brown corpus.
• The sequence ‘the old car’ never appears therein.
• $P(\text{car} \mid \text{the old}) = 0 \;\therefore\; P(\text{the old car}) = 0$

CSC401/2511 – Spring 2022 11
Sequences with fewer dependencies?
Magic die (with recent memory):
$P(2,1,4) = P(2)\,P(1 \mid 2)\,P(4 \mid 1)$

Language:
$P(\text{the old car}) = P(\text{the})\,P(\text{old} \mid \text{the})\,P(\text{car} \mid \text{old})$

• Only consider two words at a time…
• Imagine you’ve only seen the Brown corpus.
• The sequences ‘the old’ & ‘old car’ do appear therein!
• $P(\text{old} \mid \text{the}) > 0,\; P(\text{car} \mid \text{old}) > 0 \;\therefore\; P(\text{the old car}) > 0$
• Also, $P(\text{the old car}) > P(\text{the car old})$

CSC401/2511 – Spring 2022 12
LANGUAGE MODELS

CSC401/2511 – Spring 2022 13
Word prediction
•Guess the next word…
•*Spoilers* You can do quite
well by counting how
often certain tokens occur
given their contexts.

•E.g., estimate $P(w_t \mid w_{t-1})$ from the count of $(w_{t-1}, w_t)$ in a corpus.
CSC401/2511 – Spring 2022 14
Word prediction with N-grams
• N-grams: n.pl. token sequences of length N.
• The fragment ‘in this sentence is’ contains the following
2-grams (i.e., ‘bigrams’):
• (in this), (this sentence), (sentence is)

• The next bigram must start with ‘is’: $(\text{is}, \cdot)$.
• What word is most likely to follow ‘is’?
• This is derived from the bigram counts.

CSC401/2511 – Spring 2022 15


Use of N-gram models
• Given the probabilities of N-grams, we can compute the
conditional probabilities of possible subsequent words.

• E.g., if $P(\text{is the}) > P(\text{is a})$, then $P(\text{the} \mid \text{is}) > P(\text{a} \mid \text{is})$.
Then we would predict:

‘the last word in this sentence is the.’

(The last word in this sentence is missing.)


CSC401/2511 – Spring 2022 16
Language models
•Language model: n. The statistical (or neural…)
model of a language.
• e.g., probabilities of words in an ordered sequence,
i.e., $P(w_1, w_2, \ldots, w_n)$.

•Word prediction is at the heart of language


modelling.

•What do we do with a language model?


CSC401/2511 – Spring 2022 17
Language model usage
•Language models can score and sort sentences.
• e.g., 𝑃 𝐼 𝑙𝑖𝑘𝑒 𝑎𝑝𝑝𝑙𝑒𝑠 ≫ 𝑃(𝐼 𝑙𝑖𝑐𝑘 𝑎𝑝𝑝𝑙𝑒𝑠)
• Commonly used to (re-)rank hypotheses in other tasks

•Infer properties of natural language


• e.g., $P(\text{les pommes rouges}) > P(\text{les rouges pommes})$
• Embedding spaces

•Efficiently compress text


•How do we calculate $P(\ldots)$?
CSC401/2511 – Spring 2022 18
Frequency statistics
•Term count ($Count$) of term $w$ in corpus $C$ is the number of tokens of term $w$ in $C$: $Count(w, C)$.

•Relative frequency ($F_C$) is defined relative to the total number of tokens in the corpus $C$:
$$F_C(w) = \frac{Count(w, C)}{|C|}$$

•In theory, $\lim_{|C| \to \infty} F_C(w) = P(w)$ (the “frequentist view”).
CSC401/2511 – Spring 2022 19
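A small sketch of term counts and relative frequencies; the toy corpus below is invented purely for illustration.

```python
from collections import Counter

corpus = "the cat sat on the mat and the dog sat".split()
N = len(corpus)                      # |C|, the total number of tokens
counts = Counter(corpus)             # Count(w, C) for every type w

# Relative frequency F_C(w) = Count(w, C) / |C|; in the frequentist view,
# this tends to P(w) as the corpus grows.
rel_freq = {w: c / N for w, c in counts.items()}

print(counts["the"], rel_freq["the"])   # 3 0.3
```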
The chain rule
•Recall,
$$P(A, B) = P(B \mid A)\,P(A) = P(A \mid B)\,P(B), \qquad P(B \mid A) = \frac{P(A, B)}{P(A)}$$
•This extends to longer sequences, e.g.,
$$P(A, B, C, D) = P(A)\,P(B \mid A)\,P(C \mid A, B)\,P(D \mid A, B, C)$$
• Or, in general,
$$P(w_1, w_2, \ldots, w_n) = P(w_1)\,P(w_2 \mid w_1) \cdots P(w_n \mid w_1, w_2, \ldots, w_{n-1})$$
CSC401/2511 – Spring 2022 20
Very simple predictions
• Let’s return to word prediction.
• We want to know the probability of the next word given
the previous words in a sequence.

• We can approximate conditional probabilities by counting occurrences in large corpora of data.
• E.g.,
$$P(\text{food} \mid \text{I like Chinese}) = \frac{P(\text{I like Chinese food})}{P(\text{I like Chinese}\ \cdot)} \approx \frac{Count(\text{I like Chinese food})}{Count(\text{I like Chinese})}$$
CSC401/2511 – Spring 2022 21
Problem with the chain rule
•There are many (∞?) possible sentences.
•In general, we won’t have enough data to compute
reliable statistics for long prefixes

•E.g.,
$$P(\text{pretty} \mid \text{I heard this guy talks too fast but at least his slides are}) = \frac{P(\text{I heard} \ldots \text{are pretty})}{P(\text{I heard} \ldots \text{are})} = \frac{0}{0}$$

•How can we avoid {0, ∞}-probabilities?


CSC401/2511 – Spring 2022 22
Independence!
•We can simplify things if we’re willing to break
from the distant past and focus on recent history.

•e.g.,
$$P(\text{pretty} \mid \text{I heard this guy talks too fast but at least his slides are}) \approx P(\text{pretty} \mid \text{slides are}) \approx P(\text{pretty} \mid \text{are})$$
•I.e., we assume statistical independence.
CSC401/2511 – Spring 2022 23
Markov assumption

•Assume each observation only depends on a short linear history of length $N - 1$:
$$P(w_n \mid w_{1:n-1}) \approx P(w_n \mid w_{n-N+1:n-1})$$

•Bigram version:
$$P(w_n \mid w_{1:n-1}) \approx P(w_n \mid w_{n-1})$$

CSC401/2511 – Spring 2022 24
Berkeley Restaurant Project corpus
• Let’s compute simple N-gram models of speech queries
about restaurants in Berkeley California.
• E.g.,
• can you tell me about any good cantonese
restaurants close by
• mid priced thai food is what i’m looking
for
• tell me about chez panisse
• can you give me a listing of the kinds of food that
are available
• i’m looking for a good place to eat breakfast
• when is caffe venezia open during the day
CSC401/2511 – Spring 2022 25
Example bigram counts
• Out of 9222 sentences,
• e.g., “I want” occurred 827 times
wt
Count(wt-1,wt)
I want to eat Chinese food lunch spend

I 5 827 0 9 0 0 0 2
want 2 0 608 1 6 6 5 1
to 2 0 4 686 2 0 6 211
eat 0 0 2 0 16 2 42 0
wt-1 Chinese 1 0 0 0 0 82 1 0
food 15 0 15 0 1 4 0 0
lunch 2 0 0 0 0 1 0 0
spend 1 0 1 0 0 0 0 0

CSC401/2511 – Spring 2022 26


Example bigram probabilities
• Obtain likelihoods by dividing bigram counts by unigram
counts. I want to eat Chinese food lunch spend

Unigram counts: 2533 927 2417 746 158 1093 341 278

P(wt|wt-1) I want to eat Chinese food lunch spend


I 0.002 0.33 0 0.0036 0 0 0 0.00079

$$P(\text{want} \mid \text{I}) \approx \frac{Count(\text{I want})}{Count(\text{I})} = \frac{827}{2533} \approx 0.33$$
$$P(\text{spend} \mid \text{I}) \approx \frac{Count(\text{I spend})}{Count(\text{I})} = \frac{2}{2533} \approx 7.9 \times 10^{-4}$$

CSC401/2511 – Spring 2022 27


Example bigram probabilities
• Obtain likelihoods by dividing bigram counts by unigram
counts. I want to eat Chinese food lunch spend

Unigram counts: 2533 927 2417 746 158 1093 341 278

P(wt|wt-1) I want to eat Chinese food lunch spend


I 0.002 0.33 0 0.0036 0 0 0 0.00079
want 0.0022 0 0.66 0.0011 0.0065 0.0065 0.0054 0.0011
to 0.00083 0 0.0017 0.28 0.00083 0 0.0025 0.087
eat 0 0 0.0027 0 0.021 0.0027 0.056 0
Chinese 0.0063 0 0 0 0 0.52 0.0063 0
food 0.014 0 0.014 0 0.00092 0.0037 0 0
lunch 0.0059 0 0 0 0 0.0029 0 0
spend 0.0036 0 0.0036 0 0 0 0 0

CSC401/2511 – Spring 2022 28
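A sketch of the maximum-likelihood bigram estimate used in these tables, $P(w_t \mid w_{t-1}) = Count(w_{t-1}\,w_t)/Count(w_{t-1})$; the helper function is illustrative, and the final prints simply reproduce the slide’s numbers.

```python
from collections import Counter

def bigram_mle(tokens):
    """Return an MLE estimator P(word | prev) built from raw counts."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))

    def prob(prev, word):
        # Count(prev word) / Count(prev); zero if the context was never seen.
        return bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0

    return prob

# Reproducing the table entries directly from the slide's counts:
print(827 / 2533)   # P(want | I)  ≈ 0.33
print(2 / 2533)     # P(spend | I) ≈ 7.9e-4
```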


Bigram estimate of an unseen phrase
• We can string bigram probabilities together to estimate
the probability of whole sentences.
• We use the start (<s>) and end (</s>) tags here.

• E.g.,
P(<s> I want english food </s>) ≈
P(I | <s>) ∙ P(want | I) ∙ P(english | want) ∙ P(food | english) ∙ P(</s> | food)
≈ 0.000031

CSC401/2511 – Spring 2022 29
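A sketch of stringing bigram probabilities together with <s> and </s>. P(I | <s>), P(want | I), and P(english | want) are the values quoted on these slides; the last two probabilities are placeholders chosen only so the product matches the quoted ≈ 0.000031.

```python
import math

# Bigram probabilities: the first three appear on these slides,
# the last two are assumed placeholders.
bigram_p = {
    ("<s>", "i"): 0.25,
    ("i", "want"): 0.33,
    ("want", "english"): 0.0011,
    ("english", "food"): 0.5,     # assumed, not on the slide
    ("food", "</s>"): 0.68,       # assumed, not on the slide
}

def sentence_prob(words):
    """P(<s> w1 ... wn </s>) as a product of bigram probabilities."""
    padded = ["<s>"] + words + ["</s>"]
    return math.prod(bigram_p.get(pair, 0.0) for pair in zip(padded, padded[1:]))

print(sentence_prob(["i", "want", "english", "food"]))   # ≈ 3.1e-05
```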


N-grams as linguistic knowledge
• Despite their simplicity, N-gram probabilities can crudely
capture interesting facts about language and the world.

• E.g.,
$P(\text{english} \mid \text{want}) = 0.0011$, $P(\text{chinese} \mid \text{want}) = 0.0065$ → world knowledge
$P(\text{to} \mid \text{want}) = 0.66$, $P(\text{eat} \mid \text{to}) = 0.28$, $P(\text{food} \mid \text{to}) = 0$ → syntax
$P(\text{i} \mid \text{<s>}) = 0.25$ → discourse

CSC401/2511 – Spring 2022 30


Probabilities of sentences
• The probability of a sentence $s$ is defined as the product of the conditional probabilities of its N-grams:
$$P(s) = \prod_{t} P(w_t \mid w_{t-2}\, w_{t-1}) \qquad \text{(trigram)}$$
$$P(s) = \prod_{t} P(w_t \mid w_{t-1}) \qquad \text{(bigram)}$$
• Which of these two models is better?

CSC401/2511 – Spring 2022 31


Aside - are N-grams still relevant?
• Appropriately smoothed N-gram LMs:
(Shareghi et al. 2019):
• Are often cheaper to train/query
than neural LMs
• Are interpolated with neural LMs to often achieve
state-of-the-art performance
• Occasionally outperform neural LMs
• At least are a good baseline
• Usually handle previously unseen tokens in a more
principled (and fairer) way than neural LMs
• N-gram probabilities are interpretable
• Convenient
CSC401/2511 – Spring 2022 32
EVALUATING LANGUAGE MODELS

CSC401/2511 – Spring 2022 33
Shannon’s method
•We can use a language model to generate
random sequences.

•We ought to see sequences that are similar to


those we used for training.

•This approach is attributed to Claude


Shannon.

CSC401/2511 – Spring 2022 34


Shannon’s method – unigrams
•Sample a model according to its probability.
•For unigrams, keep picking tokens.
•e.g., imagine throwing darts at this:

[Figure: a unigram distribution over the toy vocabulary {the, Cat, in, Hat, </s>}, with area proportional to probability]
CSC401/2511 – Spring 2022 35
Problem with unigrams
•Unigrams give high probability to odd phrases.
e.g., $P(\text{the the the the the </s>}) = P(\text{the})^5 \cdot P(\text{</s>}) > P(\text{the Cat in the Hat </s>})$

[Figure: the same unigram distribution over {the, Cat, in, Hat, </s>}]
CSC401/2511 – Spring 2022 36
Shannon’s method – bigrams
•Bigrams have fixed context once that context

has been sampled.
•e.g., sample from $P(\cdot \mid \text{the})$.

[Figure: bigram distributions over {the, Cat, in, Hat, </s>} at time step 1 and time step 2, each conditioned on the previously sampled word]
CSC401/2511 – Spring 2022 37
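A minimal sketch of Shannon’s method with a bigram model: repeatedly sample the next word from P(· | previous word) until </s>. The toy probability table is invented for illustration.

```python
import random

# Toy conditional distributions P(next | prev); numbers invented for illustration.
bigram = {
    "<s>": {"the": 0.8, "Cat": 0.2},
    "the": {"Cat": 0.5, "Hat": 0.4, "</s>": 0.1},
    "Cat": {"in": 0.9, "</s>": 0.1},
    "in":  {"the": 1.0},
    "Hat": {"</s>": 1.0},
}

def generate(max_len=20):
    words, prev = [], "<s>"
    while len(words) < max_len:
        choices = bigram[prev]
        # Sample the next token in proportion to its conditional probability.
        prev = random.choices(list(choices), weights=list(choices.values()))[0]
        if prev == "</s>":
            break
        words.append(prev)
    return " ".join(words)

print(generate())   # e.g., "the Cat in the Hat"
```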
Shannon and the Wall Street Journal
Unigram:
• Months the my and issue of year foreign new exchange’s September were recession exchange new endorsed a acquire to six executives.

Bigram:
• Last December through the way to preserve the Hudson corporation N.B.E.C. Taylor would seem to complete the major central planners one point five percent of U.S.E. has already old M.X. corporation of living on information such as more frequently fishing to keep her.

Trigram:
• They also point to ninety nine point six billion dollars from two hundred four oh six three percent of the rates of interest stores as Mexico and Brazil on market conditions.

CSC401/2511 – Spring 2022 38


Shannon’s method on Shakespeare
Unigram:
• To him swallowed confess hear both. Which. Of save on trail for are ay device and rote life have
• Hill he late speaks; or! A more to leg less first you enter
• Are where exeunt and sighs have rise excellency took of.. Sleep knave we. Near; vile like.

Bigram:
• What means, sir. I confess she? Then all sorts, he is trim, captain.
• Why dost stand forth thy canopy, forsooth; he is this palpable hit the King Henry. Live king. Follow.
• What we, hat got so she that I rest and sent to scold and nature bankrupt nor the first gentleman?

Trigram:
• Sweet prince, Falstaff shall die. Harry of Monmouth’s grave.
• This shall forbid it should be branded, if renown made it empty.
• Indeed the duke; and had a very good friend.

Quadrigram:
• King Henry. What! I will go seek the traitor Gloucester. Exeunt some of the watch.
• Will you not tell me who I am?
• It cannot be but so.
• Indeed the short and the long. Marry. ‘tis a noble Lepidus.

CSC401/2511 – Spring 2022 39


Shakespeare as a corpus
•884,647 tokens, vocabulary of $|V| = 29{,}066$ types.

•Shakespeare produced about 300,000 bigram types out of $|V|^2 \approx 845M$ possible bigram types.

•∴ 99.96% of possible bigrams were never seen
(i.e., they have 0 probability in the bigram table).

•Quadrigrams appear more similar to Shakespeare because, for increasing context, there are fewer possible next words, given the training data.
•E.g., $P(\text{Gloucester} \mid \ldots)$
CSC401/2511 – Spring 2022 40
Evaluating a language model
•How can we quantify the goodness of a model?

•How do we know whether one model is better


than another?
•There are 2 general ways of evaluating LMs:
• Extrinsic: in terms of some external
measure
(this depends on some task or application).
• Intrinsic: in terms of properties of the LM
itself.

CSC401/2511 – Spring 2022 41


Extrinsic evaluation
•The utility of a language model is
often determined in situ (i.e., in
practice).

• e.g.,
1. Alternately embed LMs $A$ and $B$ into a speech recognizer.
2. Run speech recognition using each model.
3. Compare recognition rates between the system that uses LM $A$ and the system that uses LM $B$.


CSC401/2511 – Spring 2022 42
Intrinsic evaluation
•To measure the intrinsic value of a language model, we first need to estimate the probability of a corpus, $P(C)$.

• This will also let us adjust/estimate model parameters (e.g., $P(\text{to} \mid \text{want})$) to maximize $P(Corpus)$.

•For a corpus of sentences, $C$, we sometimes make the assumption that the sentences are conditionally independent: $P(C) = \prod_i P(s_i)$.
CSC401/2511 – Spring 2022 43
Intrinsic evaluation
•We estimate $P(\cdot)$ given a particular corpus, e.g., Brown.
•A good model of the Brown corpus is one that makes Brown very likely (even if that model is bad for other corpora).
•E.g., two candidate models give different estimates $P_1(\text{to} \mid \text{want}) = \cdots$ and $P_2(\text{to} \mid \text{want}) = \cdots$.
•If $P_1(\text{Brown}) \geq P_j(\text{Brown}) \;\forall j$, then $P_1$ is the best model of the Brown corpus.
CSC401/2511 – Spring 2022 44
Maximum likelihood estimate
• The maximum likelihood estimate (MLE) of parameters $\theta$ in a model $M$, given training data $T$, is
$$\theta^* = \operatorname*{argmax}_{\theta} L_M(\theta \mid T), \quad \text{where } L_M(\theta \mid T) = P_{M(\theta)}(T)$$
• e.g., $T$ is the Brown corpus, $M$ is the bigram and unigram tables, and $\theta_{\text{to} \mid \text{want}}$ is $P(\text{to} \mid \text{want})$.
• In fact, we have been doing MLE, within the N-gram context, all along with our simple counting*
*(assuming an end-of-sentence token)

CSC401/2511 – Spring 2022 45


Perplexity
•Perplexity of a corpus $C$:
$$PP(C) = P(C)^{-1/|C|} = 2^{-\frac{1}{|C|}\log_2 P(C)}$$

• If you have a vocabulary $\mathcal{V}$ with $|\mathcal{V}|$ word types, and your LM is uniform (i.e., $P(w) = 1/|\mathcal{V}| \;\forall w \in \mathcal{V}$),
• then
$$PP(C) = P(C)^{-1/|C|} = \left( (1/|\mathcal{V}|)^{|C|} \right)^{-1/|C|} = |\mathcal{V}|$$

•Perplexity is sort of like a ‘branching factor’.

•Minimizing perplexity ≡ maximizing probability of corpus
CSC401/2511 – Spring 2022 46
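A sketch of computing perplexity from per-token probabilities (done in log space to avoid underflow); the uniform-model check mirrors the derivation above.

```python
import math

def perplexity(token_probs):
    """PP = exp(-(1/N) * sum(log p)) over the N per-token probabilities."""
    n = len(token_probs)
    return math.exp(-sum(math.log(p) for p in token_probs) / n)

# Sanity check: a uniform LM over |V| = 50 types gives every token p = 1/50,
# so the perplexity of any corpus is exactly |V| = 50.
vocab_size = 50
print(perplexity([1 / vocab_size] * 1000))   # 50.0 (up to floating-point error)
```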
Perplexity as an evaluation metric
•Lower perplexity → a better model.
•(more on this in the section on information theory)

•e.g., splitting WSJ corpus into a 38M word


training set and a 1.5M word test set gives:
N-gram order Unigram Bigram Trigram
Perplexity 962 170 109

CSC401/2511 – Spring 2022 47


Modelling language
• So far, we’ve modelled language as a surface phenomenon
using only our observations (i.e., words).

• Language is hugely complex and involves hidden structure


(recall: syntax, semantics, pragmatics).

• A ‘true’ model of language would take into account all


those things and the proper relations between them.

• Our first hint of modelling hidden structure will come with


uncovering grammatical roles (i.e., parts-of-speech)
CSC401/2511 – Spring 2022 48
ZIPF AND THE NATURAL DISTRIBUTIONS
IN LANGUAGE

CSC401/2511 – Spring 2022 49
Sparseness
•Problem with N-gram models:
•New words appear often as we read new data.
• e.g., interfrastic, espepsia, $182,321.09
•New bigrams occur even more often.
• Recall that Shakespeare only wrote ~0.04% of all
the bigrams he could have, given his vocabulary.
• Because there are so many possible bigrams, we
encounter new ones more frequently as we read.
•New trigrams occur even more even-more-
often.
CSC401/2511 – Spring 2022 50
Sparseness of unigrams vs. bigrams
• Conversely, we can see lots of every unigram, but still
miss many bigrams:
I want to eat Chinese food lunch spend

Unigram counts: 2533 927 2417 746 158 1093 341 278

wt
Count(wt-1,wt)
I want to eat Chinese food lunch spend

I 5 827 0 9 0 0 0 2
want 2 0 608 1 6 6 5 1
to 2 0 4 686 2 0 6 211
eat 0 0 2 0 16 2 42 0
wt-1 Chinese 1 0 0 0 0 82 1 0
food 15 0 15 0 1 4 0 0
lunch 2 0 0 0 0 1 0 0
spend 1 0 1 0 0 0 0 0

CSC401/2511 – Spring 2022 51


Why does sparseness happen?
•The bigram table appears to be filled in non-uniformly.

•Clearly, some words (e.g., want) are very popular and will
occur in many bigrams just from random chance.

•Other words are not-so-popular (e.g., hippopotomonstrosesquipedalian). They will occur infrequently, and when they do, their partner word will have its own $P(w)$.

•Is there some phenomenon that describes $P(w)$ in real language?
CSC401/2511 – Spring 2022 52
Patterns of unigrams
•Words in Tom Sawyer by Mark Twain:

Word  Frequency
the   3332
and   2972
a     1775
to    1725
of    1440
was   1161
it    1027
in    906
that  877
he    877
…     …

•A few words occur very frequently.
• Aside: the most frequent 256 English word types account for 50% of English tokens.
• Aside: for Hungarian, we need the top 4096 to account for 50%.
•Many words occur very infrequently.

CSC401/2511 – Spring 2022 53


Frequency of frequencies
• How many words occur $X$ number of times in Tom Sawyer?
• Hapax legomena: n.pl. words that occur once in a corpus.

Word frequency   # of word types with that frequency
1                3993
2                1292   (e.g., 1292 word types occur twice)
3                664
4                410
5                243
6                199
7                172
8                131
9                82
10               91
11–50            540
51–100           99
>100             102

• Notice how many word types are relatively rare!

CSC401/2511 – Spring 2022 54


Ranking words in Tom Sawyer
• Rank word types in order of decreasing frequency.

Word    Freq. (f)  Rank (r)  f·r    |  Word        Freq. (f)  Rank (r)  f·r
the     3332       1         3332   |  name        21         400       8400
and     2972       2         5944   |  comes       16         500       8000
a       1775       3         5325   |  group       13         600       7800
he      877        10        8770   |  lead        11         700       7700
but     410        20        8400   |  friends     10         800       8000
be      294        30        8820   |  begin       9          900       8100
there   222        40        8880   |  family      8          1000      8000
one     172        50        8600   |  brushed     4          2000      8000
about   158        60        9480   |  sins        2          3000      6000
more    138        70        9660   |  Could       2          4000      8000
never   124        80        9920   |  Applausive  1          8000      8000

• With some (relatively minor) exceptions, f·r is very consistent!

CSC401/2511 – Spring 2022 55


Zipf’s Law
• In Human Behavior and the Principle of Least Effort, Zipf
argues(*) that all human endeavour depends on laziness.
• Speaker minimizes effort by having a small vocabulary of
common words.
• Hearer minimizes effort by having a large vocabulary of
less ambiguous words.
• Compromise: frequency and rank are inversely proportional.

i.e., $f \propto \dfrac{1}{r}$; equivalently, $f \cdot r = k$ for some constant $k$.

(*) This does not make it true.

CSC401/2511 – Spring 2022 56
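A sketch of checking Zipf’s law (f · r ≈ k) on any plain-text file; ‘corpus.txt’ is a placeholder path.

```python
from collections import Counter

# 'corpus.txt' is a placeholder; any large plain-text corpus will do.
with open("corpus.txt", encoding="utf-8") as fh:
    words = fh.read().lower().split()

freqs = sorted(Counter(words).values(), reverse=True)   # frequencies by rank

# If Zipf's law holds, frequency * rank stays roughly constant.
for rank in (1, 10, 100, 1000):
    if rank <= len(freqs):
        print(rank, freqs[rank - 1], rank * freqs[rank - 1])
```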


Zipf’s Law on the Brown corpus

From Manning & Schütze


CSC401/2511 – Spring 2022 57
Zipf’s Law on the novel Moby Dick

From Wikipedia
CSC401/2511 – Spring 2022 58
Zipf’s Law in perspective
• Zipf’s explanation of the phenomenon involved human
laziness.

• Simon’s discourse model (1956) argued that the phenomenon


could equally be explained by two processes:
• People imitate relative frequencies of words they hear
• People innovate new words with small, constant
probability

• There are other explanations.

CSC401/2511 – Spring 2022 59


Aside – Zipf’s Law in perspective
• Zipf also observed that frequency correlates with several other
properties of words, e.g.:
• Age (frequent words are old)
• Polysemy (frequent words often have many meanings or higher-order functions of meaning, e.g., chair)
• Length (frequent words are spelled with few letters)
• He also showed that there are hyperbolic distributions in the world (crucially, they’re not Gaussian), just like:
• Yule’s Law: $B = 1 + g/s$
• s: probability of a mutation becoming dominant in a species
• g: probability of a mutation that expels a species from its genus
• Pareto distributions (wealth distribution)
CSC401/2511 – Spring 2022 60
SMOOTHING

CSC401/2511 – Spring 2022 61
Zero probability in Shakespeare

• Shakespeare’s collected writings account for about 300,000 bigrams out of a possible $|V|^2 \approx 845M$ bigrams, given his lexicon.
• So 99.96% of the possible bigrams were never seen.
• Now imagine that someone finds a new play and wants
to know whether it is Shakespearean…
• Shakespeare isn’t very predictable! Every time the
play uses one of those 99.96% bigrams, the sentence
that contains it (and the play!) gets 0 probability.
• This is bad.

CSC401/2511 – Spring 2022 62


Zero probability in general

•Some N-grams are just really rare.


• e.g., perhaps ‘negative press covfefe’

•If we had more data, perhaps we’d see them.


•If we have no way to determine the distribution
of unseen N-grams, how can we estimate them?

CSC401/2511 – Spring 2022 63


Smoothing mechanisms
•Smoothing methods we will cover:
1. Add-𝛿 smoothing (Laplace)
2. Good-Turing
3. Simple interpolation (Jelinek-Mercer)
4. Absolute discounting
5. Kneser-Ney smoothing
6. Modified Kneser-Ney smoothing

CSC401/2511 – Spring 2022 64


Smoothing as redistribution
• Make the distribution more uniform.
• This moves the probability mass from ‘the rich’ towards
‘the poor’.
[Figure: two bar charts over the same set of words, ‘Actual counts’ vs. ‘Adjusted (imaginary) counts’, showing counts flattened toward a more uniform distribution]
CSC401/2511 – Spring 2022 65
1. Add-1 smoothing (“Laplace discounting”)
• Given vocab size $|\mathcal{V}|$ and corpus size $N = |C|$.
• Just add 1 to all the counts! No more zeros!
• MLE: $P(w) = Count(w)/N$
• Laplace estimate: $P_{Lap}(w) = \dfrac{Count(w) + 1}{N + |\mathcal{V}|}$
• Does this give a proper probability distribution? Yes:
$$\sum_w P_{Lap}(w) = \sum_w \frac{Count(w) + 1}{N + |\mathcal{V}|} = \frac{N + |\mathcal{V}|}{N + |\mathcal{V}|} = 1$$
CSC401/2511 – Spring 2022 66
1. Add-1 smoothing for bigrams
• Same principle for bigrams:
$$P_{Lap}(w_t \mid w_{t-1}) = \frac{Count(w_{t-1}\, w_t) + 1}{Count(w_{t-1}) + |\mathcal{V}|}$$
• We are essentially holding out $|\mathcal{V}|/(N + |\mathcal{V}|)$ of the probability mass and spreading it uniformly over “imaginary” events.
• Does this work?

CSC401/2511 – Spring 2022 67
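A sketch of add-1 (Laplace) smoothing for bigrams, following the formula above; the toy token list is invented for illustration.

```python
from collections import Counter

tokens = "the cat sat on the mat".split()
V = len(set(tokens))                          # vocabulary size |V| = 5
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))

def p_laplace(prev, word):
    # P_Lap(w_t | w_{t-1}) = (Count(w_{t-1} w_t) + 1) / (Count(w_{t-1}) + |V|)
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + V)

print(p_laplace("the", "cat"))   # seen bigram:   (1 + 1) / (2 + 5) ≈ 0.29
print(p_laplace("the", "sat"))   # unseen bigram: (0 + 1) / (2 + 5) ≈ 0.14
```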


1. Laplace smoothed bigram counts
• Out of 9222 sentences in Berkeley restaurant corpus,
• e.g., “I want” occurred 827 times so Laplace gives 828
wt
Count(wt-1,wt)
I want to eat Chinese food lunch spend

I 5+1 827+1 1 9+1 1 1 1 2+1


want 2+1 1 608+1 1+1 6+1 6+1 5+1 1+1
to 2+1 1 4+1 686+1 2+1 1 6+1 211+1
eat 1 1 2+1 1 16+1 2+1 42+1 1
wt-1 Chinese 1+1 1 1 1 1 82+1 1+1 1
food 15+1 1 15+1 1 1+1 4+1 1 1
lunch 2+1 1 1 1 1 1+1 1 1
spend 1+1 1 1+1 1 1 1 1 1

CSC401/2511 – Spring 2022 68


1. Laplace smoothed probabilities
$$P_{Lap}(w_t \mid w_{t-1}) = \frac{C(w_{t-1}\, w_t) + 1}{C(w_{t-1}) + |\mathcal{V}|}$$

P(wt|wt-1) I want to eat Chinese food lunch spend

I 0.0015 0.21 0.00025 0.0025 0.00025 0.00025 0.00025 0.00075
want 0.0013 0.00042 0.26 0.00084 0.0029 0.0029 0.0025 0.00084
to 0.00083 0.00026 0.0013 0.18 0.00078 0.00026 0.0018 0.055
eat 0.00046 0.00046 0.0014 0.00046 0.0078 0.0014 0.02 0.00046
Chinese 0.0012 0.00062 0.00062 0.00062 0.00062 0.052 0.0012 0.00062
food 0.0063 0.00039 0.0063 0.00039 0.00079 0.002 0.00039 0.00039
lunch 0.0017 0.00056 0.00056 0.00056 0.00056 0.0011 0.00056 0.00056
spend 0.0012 0.00058 0.0012 0.00058 0.00058 0.00058 0.00058 0.00058

CSC401/2511 – Spring 2022 69


1. Add-1 smoothing
• According to this method, $P(\text{to} \mid \text{want})$ went from 0.66 to 0.26.
• That’s a huge change!
• In extrinsic evaluations, the results are not great.
• Sometimes ~90% of the probability mass is spread across unseen events.
• It only works if we know $|\mathcal{V}|$ beforehand.

CSC401/2511 – Spring 2022 70


1. Add-δ smoothing
• Generalize Laplace: add $\delta < 1$ to be a bit less generous.
• MLE: $P(w) = Count(w)/N$
• Add-δ estimate: $P_{add\text{-}\delta}(w) = \dfrac{Count(w) + \delta}{N + \delta|\mathcal{V}|}$
• Does this give a proper probability distribution? Yes:
$$\sum_w P_{add\text{-}\delta}(w) = \sum_w \frac{Count(w) + \delta}{N + \delta|\mathcal{V}|} = \frac{N + \delta|\mathcal{V}|}{N + \delta|\mathcal{V}|} = 1$$
• This sometimes works empirically (e.g., in text categorization), sometimes not…
CSC401/2511 – Spring 2022 71
Is there another way?
• Choice of 𝛿 is ad-hoc
• Has Zipf taught us nothing?
• Unseen words should behave more like hapax legomena.
• Words that occur a lot should behave like other words
that occur a lot.
• If I keep reading from a corpus, by the time I see a new word like ‘zenzizenzizenzic’, I will have seen ‘the’ a lot more than once more.

CSC401/2511 – Spring 2022 72


2. Good-Turing
• Define $N_c$ as the number of N-grams that occur $c$ times.
• “Count of counts”

Word frequency   # of words (i.e., unigrams) with that frequency
1                $N_1$ = 3993
2                $N_2$ = 1292
3                $N_3$ = 664
…                …
(from Tom Sawyer)

• For some word in ‘bin’ $N_c$, the MLE is that I saw that word $c$ times.
• Idea: get rid of zeros by re-estimating $c$ using the MLE of words that occur $c + 1$ times.
CSC401/2511 – Spring 2022 73
2. Good-Turing intuition/example
• Imagine you have this toy scenario:

Word        ship  pass  camp  frock  soccer  mother  tops
Frequency   8     7     3     2      1       1       1
(= 23 words total)

• What is the MLE prior probability of hearing ‘soccer’?
• $P(\text{soccer}) = 1/23$
• What is the probability of seeing something new?
• No way to tell, but 3/23 words are hapax legomena ($N_1 = 3$).
• If we use 3/23 to approximate things we’ve never seen, then we have to also adjust other probabilities (e.g., $P_{GT}(\text{soccer}) < 1/23$).
CSC401/2511 – Spring 2022 74
2. Good-Turing adjustments
• $P^*_{GT}[\text{unseen}] = N_1/N$
• Re-estimated count: $c^* = (c+1)\,\dfrac{N_{c+1}}{N_c}$

• $c = 0$ (unseen words):
  • MLE: $p = 0/23$
  • $P^*_{GT}[\text{unseen}] = N_1/N = 3/23$

• $c = 1$ (seen once, e.g., soccer):
  • MLE: $p = 1/23$
  • $c^*(\text{soccer}) = 2 \cdot \dfrac{N_2}{N_1} = 2 \cdot \dfrac{1}{3}$
  • $P^*_{GT}(\text{soccer}) = \dfrac{2/3}{23}$
CSC401/2511 – Spring 2022 75
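A sketch of the Good-Turing quantities on the toy scenario above; the counts of counts are computed from the listed frequencies.

```python
from collections import Counter

# Toy data from the slide: word -> observed frequency (23 tokens in total).
freq = {"ship": 8, "pass": 7, "camp": 3, "frock": 2,
        "soccer": 1, "mother": 1, "tops": 1}
N = sum(freq.values())              # 23
Nc = Counter(freq.values())         # N_c = number of word types seen c times

# Probability mass reserved for unseen words: N_1 / N = 3/23.
print(Nc[1] / N)

def c_star(c):
    # Re-estimated count c* = (c + 1) * N_{c+1} / N_c (undefined when N_{c+1} = 0).
    return (c + 1) * Nc[c + 1] / Nc[c] if Nc[c + 1] else None

cs1 = c_star(1)
print(cs1, cs1 / N)   # c*(soccer) = 2 * N_2/N_1 = 2/3;  P_GT(soccer) = (2/3)/23
print(c_star(3))      # None: N_4 = 0 -- the limitation discussed on the next slide
```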
2. Good-Turing limitations
• Q: What happens when you want to estimate $P(w)$ when $w$ occurs $c$ times, but no word occurs $c + 1$ times?
• E.g., what is $P^*_{GT}(\text{camp})$, since $N_4 = 0$?

Word        ship  pass  camp  frock  soccer  mother  tops
Frequency   8     7     3     2      1       1       1

• A1: We can re-estimate the count as $c^* = (c+1)\,\dfrac{E[N_{c+1}]}{E[N_c]}$.
  • This uses Expectation-Maximization (a method used later).
• A2: We can interpolate linearly, in log-log space, between the values of $c$ that we do have.
CSC401/2511 – Spring 2022 76


2. Good-Turing limitations
• Q: What happens when $Count(\text{McGill genius}) = 0$ and $Count(\text{McGill brainbox}) = 0$, and we smooth bigrams?
• A: $P(\text{genius} \mid \text{McGill}) = P(\text{brainbox} \mid \text{McGill})$
• But we’d expect $P(\text{genius} \mid \text{McGill}) > P(\text{brainbox} \mid \text{McGill})$ (context notwithstanding), because ‘genius’ is a more common word than ‘brainbox’.
• The solution may be to combine bigram and unigram models.

CSC401/2511 – Spring 2022 77
3. Simple interpolation (Jelinek-Mercer)
• Combine trigram, bigram, and unigram probabilities:
$$\hat{P}(w_t \mid w_{t-2}\, w_{t-1}) = \lambda_1 P(w_t \mid w_{t-2}\, w_{t-1}) + \lambda_2 P(w_t \mid w_{t-1}) + \lambda_3 P(w_t)$$
• With $\sum_i \lambda_i = 1$, this constitutes a real distribution.
• The $\lambda_i$ are determined from held-out (aka development) data.
• Expectation maximization
CSC401/2511 – Spring 2022 78
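A sketch of simple (Jelinek-Mercer) interpolation; the λ values here are arbitrary placeholders rather than ones tuned by EM on held-out data.

```python
def interpolated_prob(p_tri, p_bi, p_uni, lambdas=(0.6, 0.3, 0.1)):
    """lambda1*P(w|u,v) + lambda2*P(w|v) + lambda3*P(w), with lambdas summing to 1."""
    l1, l2, l3 = lambdas
    assert abs(l1 + l2 + l3 - 1.0) < 1e-9
    return l1 * p_tri + l2 * p_bi + l3 * p_uni

# Even when the trigram was never seen (p_tri = 0), the estimate stays non-zero.
print(interpolated_prob(p_tri=0.0, p_bi=0.28, p_uni=0.01))   # 0.085
```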
4. Absolute discounting
• Instead of multiplying the highest-order $N$-gram by a $\lambda_i$, just subtract a fixed discount $0 \leq \delta \leq 1$ from each non-zero count:
$$P_{abs}(w_t \mid w_{t-n+1:t-1}) = \frac{\max(C(w_{t-n+1:t}) - \delta,\, 0)}{C(w_{t-n+1:t-1})} + \lambda_{w_{t-n+1:t-1}}\, P_{abs}(w_t \mid w_{t-n+2:t-1})$$
• The first term is the discounted ML estimate given the $n-1$ words of context; the second term recurses using the $n-2$ words of context, weighted by the factor $\lambda_{w_{t-n+1:t-1}}$ for the $n-1$ words of context.
• The $\lambda_{w_{t-n+1:t-1}}$ are chosen s.t. $\sum_{w_t} P_{abs}(w_t \mid \ldots) = 1$.
•CSC401/2511
You can learn
– Spring 2022 𝛿 using held-out
79 data.
4. Why absolute discounting?
• Both simple interpolation and absolute discounting redistribute probability mass, so why absolute discounting?
• Compare Good-Turing re-estimated counts ($c^*$) to observed counts ($c$) on the AP newswire database (J&M 2nd Ed.):

c    0          1      2     3     4     5     6     7     8     9
c*   0.0000270  0.446  1.26  2.24  3.24  4.22  5.19  6.21  7.24  8.25

• As $c$ increases, $c - c^* \to 0.75$. A good $\delta$!
• A similar trend is observed when comparing counts from a training set ($c$) vs. a held-out set ($\approx c^*$).

CSC401/2511 – Spring 2022 80


5. Kneser-Ney smoothing
• In interpolation, lower-order (e.g., 𝑁 − 1) models
should only be useful if the 𝑁-gram counts are close to
0.
• E.g., unigram models should be optimized for when bigrams are not sufficient.
• Imagine the bigram ‘San Francisco’ is common ∴ ‘Francisco’ has a very high unigram probability because it occurs a lot.
• But ‘Francisco’ only occurs after ‘San’.
• Idea: We should give ‘Francisco’ a low unigram probability, because it only occurs within the well-modeled ‘San Francisco’.
CSC401/2511 – Spring 2022 81
5. Kneser-Ney smoothing
• Let the unigram count of a word be the number of different words that it follows. I.e.:
$$N_{1+}(\bullet\, w_t) = \left|\, \{ w_{t-1} : C(w_{t-1}\, w_t) > 0 \} \,\right|$$
$$N_{1+}(\bullet\, \bullet) = \sum_{w_t} N_{1+}(\bullet\, w_t) \quad \leftarrow \text{the total number of bigram types}$$
• So, the unigram probability is $P_{KN}(w_t) = \dfrac{N_{1+}(\bullet\, w_t)}{N_{1+}(\bullet\, \bullet)}$, and:
$$P_{KN}(w_t \mid w_{t-n+1:t-1}) = \frac{\max(C(w_{t-n+1:t}) - \delta,\, 0)}{\sum_{w_t} C(w_{t-n+1:t})} + \frac{\delta\, N_{1+}(w_{t-n+1:t-1}\, \bullet)}{\sum_{w_t} C(w_{t-n+1:t})}\, P_{KN}(w_t \mid w_{t-n+2:t-1})$$
where $N_{1+}(w_{t-n+1:t-1}\, \bullet)$ is the number of possible words that follow the context.
CSC401/2511 – Spring 2022 82
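A sketch of the Kneser-Ney continuation unigram, P_KN(w) = N_1+(• w) / N_1+(• •), on a toy corpus where ‘francisco’ is frequent but only ever follows ‘san’.

```python
from collections import Counter

tokens = ("san francisco is foggy but san francisco is fun and "
          "san francisco is big").split()
bigram_types = set(zip(tokens, tokens[1:]))        # N_1+(. .) = len(bigram_types)

# N_1+(. w): the number of *different* words that precede w.
continuations = Counter(w for (_prev, w) in bigram_types)

def p_kn_unigram(w):
    return continuations[w] / len(bigram_types)

# 'san' and 'francisco' both occur three times, but 'francisco' is only ever
# preceded by 'san', so its continuation probability is lower than that of
# 'san', which follows several different words.
print(p_kn_unigram("francisco"), p_kn_unigram("san"))   # ≈ 0.11 vs ≈ 0.22
```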
5. Modified Kneser-Ney smoothing
• Use different absolute discounts $\delta(C(w_{t-n+1:t}))$ depending on the n-gram count, s.t. $C(w_{t-n+1:t}) \geq \delta(C(w_{t-n+1:t})) \geq 0$:
$$P_{MKN}(w_t \mid w_{t-n+1:t-1}) = \frac{C(w_{t-n+1:t}) - \delta(C(w_{t-n+1:t}))}{\sum_{w_t} C(w_{t-n+1:t})} + \lambda\, P_{MKN}(w_t \mid w_{t-n+2:t-1})$$
• $\delta(C(w_{t-n+1:t}))$ could be learned or approximated, and is usually aggregated for counts above 3:

c    0          1      2     3
c*   0.0000270  0.446  1.26  2.24

• $\lambda$ is chosen so that the distribution sums to one.

CSC401/2511 – Spring 2022 83


Smoothing over smoothing
• Modified Kneser-Ney is arguably the most popular choice for n-
gram language modelling
• Popular open-source toolkits like KenLM, SRILM, and
IRSTLM all implement it
• New smoothing methods are occasionally published
• Huang et al., Interspeech 2020
• While n-gram LMs are still around, most interest in language
modelling research has shifted to neural networks
• We will discuss neural language modelling soon

CSC401/2511 – Spring 2022 84


Readings
•Chen & Goodman (1998) “An Empirical Study of
Smoothing Techniques for Language Modeling,”
Harvard Computer Science Technical Report

•Jurafsky & Martin (2nd ed): 4.1-4.7

•Manning & Schütze: 6.1-6.2.2, 6.2.5, 6.3

•Shareghi et al. (2019): https://www.aclweb.org/anthology/N19-1417.pdf (From the aside – completely optional)
CSC401/2511 – Spring 2022 85
