
CSN-528

Feb-11&13&17&18
Language Models - N-Grams
Language Model

- “I love to listen ______”? Which next word is most likely: music, your, or truth?


- Models that assign a probability to each possible next word (conditional), and thereby a
probability to the entire sentence (joint).
- Useful for noisy or ambiguous input (prefer "I will be back soonish" over "I will be bassoon dish").
- Next word suggestion
- Next POS tag suggestion
- Spelling correction (Their ar two midterms)
- Grammatical Error Correction (Everything has improve)
- Assigning probabilities to sequences of words can further improve machine translation:
he introduced reporters to the main contents of the statement

he briefed to reporters the main contents of the statement

he briefed reporters on the main contents of the statement (the most fluent candidate; a good LM assigns it the highest probability)
Language Model

Conditional Probability
P(w|h) // Probability of a word given its history
e.g., P(the | its water is so transparent that)

One possible way is to compute the probability directly from counts in the corpus or text:
P(the | its water is so transparent that) = C(its water is so transparent that the) / C(its water is so transparent that)

- Language is creative; counts change rapidly as new word sequences appear.

- No corpus is big enough.

- Even a single word added to the history can drive the count to zero, which leads to the data sparsity problem.

Solution:

- Break the sequence into its individual components.

- Store the counts of the individual components.

Language Model

Joint Probability of a sequence

One possible way is to compute the probability directly from the corpus or text:
P(its water is so transparent that) = C(its water is so transparent that) / C(all 6-word sequences in the corpus)

- Hard to compute from the corpus.

Simplify using the Chain Rule, which relates conditional and joint probability:

P(w1 w2 ... wn) = P(w1) P(w2|w1) P(w3|w1 w2) ... P(wn|w1 ... wn-1)

- The probability of a word is conditioned on the long sequence of preceding words: wn is conditioned on wn-1 down to w1.
N-Gram Language Model

A word depends only on the last few words of its history (Markov Assumption), not the entire
sequence.
N=3 -> Trigram (looks 2 words into the past)
N=2 -> Bigram (looks 1 word into the past)
N=1 -> Unigram (only the word itself)

P(the | Walden Pond’s water is so transparent that) ~ P(the | that) // bigram

P(the | Walden Pond’s water is so transparent that) ~ P(the | transparent that) // trigram


N-Gram Language Model

A word depends only on the last few words of its history (Markov Assumption), not the entire
sequence.

// Conditional probability under the bigram assumption:
P(wn | w1 ... wn-1) ~ P(wn | wn-1)

// Joint probability under the bigram assumption:
P(w1 w2 ... wn) ~ P(w1|<s>) P(w2|w1) ... P(wn|wn-1)

How to compute the likelihood?

P(wn | wn-1) = C(wn-1 wn) / C(wn-1)   // Also called the MLE estimate

This relative frequency computation is called Maximum Likelihood Estimation.

The model parameters (M) maximize the likelihood of the training corpus (T) given
the model M, that is, P(T|M).
N-Gram Language Model

<s> I am Sam </s>
<s> Sam I am </s>
<s> I do not like green eggs and ham </s>

What are the following probabilities?
- P(I|<s>)
- P(Sam|<s>)
- P(am|I)
- P(</s>|Sam)
- P(Sam|am)
- P(do|I)
- P(<s> I am Sam </s>)

- Start and end symbols complete the grammar of the n-gram model.
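A minimal sketch (illustrative, not from the slides) of the bigram MLE computation for this toy corpus; the expected values follow from C(wn-1 wn) / C(wn-1):

from collections import Counter

sentences = [
    "<s> I am Sam </s>".split(),
    "<s> Sam I am </s>".split(),
    "<s> I do not like green eggs and ham </s>".split(),
]

# Store the counts of the individual components (unigrams and bigrams).
unigrams, bigrams = Counter(), Counter()
for toks in sentences:
    unigrams.update(toks)
    bigrams.update(zip(toks, toks[1:]))

def p_mle(w, prev):
    """Relative-frequency (MLE) bigram estimate: C(prev w) / C(prev)."""
    return bigrams[(prev, w)] / unigrams[prev]

print(p_mle("I", "<s>"))     # 2/3
print(p_mle("Sam", "<s>"))   # 1/3
print(p_mle("am", "I"))      # 2/3
print(p_mle("</s>", "Sam"))  # 1/2
print(p_mle("Sam", "am"))    # 1/2
print(p_mle("do", "I"))      # 1/3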


N-Gram Language Model
How to Present a Language Model?

- Use training data to build the model.
- The slide shows the bigram counts, the unigram counts, and the resulting bigram probabilities (the model M) as tables.
- Exercise: compute P(<s> I want to eat chinese </s>) using the given bigram model.
N-Gram Language Model

Log probabilities:

- For longer sequences, multiplying many probabilities results in numerical underflow.
- Taking the log of each probability provides a solution: multiply probabilities by adding their logs,
  log(p1 p2 ... pn) = log p1 + log p2 + ... + log pn,
  and exponentiate at the end to recover the probability if needed.
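A quick illustrative check (not from the slides) of why this matters: the product of many small probabilities underflows to 0.0 in floating point, while the sum of their logs stays well-behaved.

import math

probs = [1e-5] * 80                         # e.g., 80 bigram probabilities of 1e-5 each
product = math.prod(probs)                  # 1e-400 underflows to 0.0 in double precision
log_sum = sum(math.log(p) for p in probs)   # about -921.0, perfectly representable

print(product)                 # 0.0
print(log_sum)                 # -921.03...
print(math.exp(log_sum / 80))  # recovers the per-word probability, 1e-5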
N-Gram Language Model

Performance Evaluation: measure whether the difference between two language models is statistically significant.
- Extrinsic Evaluation: measures how much a downstream application improves, for example, next-word
prediction.
- Intrinsic Evaluation: measures the quality of the language model on its own. Perplexity (PP)
is one such measure.

- In practice, we divide the data into training, development, and test sets. We report the performance
score on the test set.
N-Gram Language Model (Intrinsic Evaluation)
- W represents the complete word sequence of the test data; <s> and </s> are inserted before and
after each sentence in the data before the probability computation.
- The perplexity of a language model on a test set is the inverse probability of the test set, normalized by
the number of words N.

Perplexity: PP(W) = P(w1 w2 ... wN)^(-1/N)

For a bigram model: PP(W) = ( product for i = 1..N of 1 / P(wi | wi-1) )^(1/N)

- Minimizing perplexity is equivalent to maximizing the probability of the test data.

- Perplexity is inversely proportional to the likelihood of the test sequence.
- Perplexity is also called the weighted average branching factor of a language. The branching factor of a
language is the number of possible next words that can follow any word.

- For the Wall Street Journal corpus, perplexity drops as the n-gram order increases (unigram > bigram > trigram).
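A minimal sketch (illustrative only; p_bigram is assumed to return P(w | prev), for example the p_mle function from the earlier sketch, and no smoothing is applied) of the bigram perplexity computation:

import math

def bigram_perplexity(test_tokens, p_bigram):
    """PP(W) = exp(-(1/N) * sum_i log P(w_i | w_{i-1})), the bigram form above."""
    log_prob, n = 0.0, 0
    prev = "<s>"
    for w in test_tokens + ["</s>"]:
        log_prob += math.log(p_bigram(w, prev))  # raises a math error on zero probability (needs smoothing)
        prev = w
        n += 1
    return math.exp(-log_prob / n)

print(bigram_perplexity("I am Sam".split(), p_mle))  # about 1.73 on the toy Sam corpus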


Cautions: Intrinsic Evaluation Measures

- The vocabulary of the models must be the same for comparison.

- An (intrinsic) improvement in perplexity does not guarantee an (extrinsic) improvement in
the performance of a language processing task like speech recognition or machine
translation.

- We need to check the end-to-end performance with respect to the application.

N-Gram Language Model

Sentences randomly generated from four n-gram models (unigram to 4-gram) trained on Shakespeare’s works (shown on the slide).
The longer the context, the more coherent the generated sentences.
Issues with LMs

- LMs do better and better as we increase the training corpus.

- Differences between training and test data: a model
trained on Shakespeare’s text will not perform
well at predicting Wall Street Journal text.

- A similar domain helps: to build a language
model for translating legal documents, we need
a training corpus of legal documents.

- Data sparsity is still a prevalent issue.
N-Gram Language Model

- Data Sparsity
- Witnessing zero-probability or very low-probability n-grams. Counts in a training corpus:
denied the allegations: 5
denied the speculation: 2
denied the rumors: 1
denied the report: 1
Got the offer: 1

denied the offer: 0
denied the loan: 0

P(offer | denied the) = C(denied the offer) / C(denied the) = 0/9 = 0 under the MLE estimate.
N-Gram Language Model

- Data Sparsity
- In other cases we have to deal with words we haven’t seen before, which we call
unknown words, or out-of-vocabulary (OOV) words.
- The percentage of OOV words that appear in the test set is called the OOV rate.
- Unknown words are often replaced with a special token, UNK.
- How do we compute the probability of UNK?
N-Gram Language Model

- Dealing with Unknown Words

- Smoothing: shave off some probability mass from frequent events and assign it to unseen events, including UNK.
1. Laplace Smoothing (add-one smoothing): add one to all the bigram counts before
computing probabilities:

P_Laplace(wn | wn-1) = (C(wn-1 wn) + 1) / (C(wn-1) + V)

Laplace: Smoothed Bigram Probabilities

- Each unigram count in the denominator is augmented by the vocabulary size V (1446).

- P(to|want) decreases from .66 to .26.
- Probability mass is shifted to all the zero-count terms.
k-smoothing

- Unknown Words
- Smoothing: move a bit less of the probability mass from the seen terms to the unseen
terms.
1. Add-k smoothing: add a fractional count k (e.g., 0.5, 0.05, or 0.01):

P_add-k(wn | wn-1) = (C(wn-1 wn) + k) / (C(wn-1) + kV)

- k is to be optimized on the development/validation set.

- Gale and Church (1994) showed that add-k smoothing also leads to poor variance and
inappropriate discounts.
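A minimal sketch (illustrative only; it reuses the unigram and bigram Counters from the earlier sketch, with V taken as the number of distinct training tokens) of the add-k estimate, where k = 1 recovers Laplace smoothing:

def p_add_k(w, prev, unigrams, bigrams, vocab_size, k=1.0):
    """Add-k smoothed bigram: (C(prev w) + k) / (C(prev) + k*V)."""
    return (bigrams[(prev, w)] + k) / (unigrams[prev] + k * vocab_size)

V = len(unigrams)  # 12 distinct tokens in the toy Sam corpus
print(p_add_k("Sam", "I", unigrams, bigrams, V, k=1.0))   # unseen bigram (I, Sam): 1/15, about 0.067
print(p_add_k("Sam", "I", unigrams, bigrams, V, k=0.05))  # with a smaller k: about 0.014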
Better Mechanisms of Smoothing
Unknown Words
- Backoff: use less context: for a 4-gram use the trigram, for a trigram use the bigram, for a bigram use the
unigram.
- We “back off” to a lower-order n-gram if we have zero evidence for a higher-order n-gram.

- Katz Backoff: we rely on a discounted probability P* if we’ve seen this n-gram before. Otherwise,
we recursively back off to the Katz probability for the shorter-history (N-1)-gram:

P_BO(wn | wn-N+1 ... wn-1) = P*(wn | wn-N+1 ... wn-1)                               if C(wn-N+1 ... wn) > 0
                           = alpha(wn-N+1 ... wn-1) * P_BO(wn | wn-N+2 ... wn-1)    otherwise

- The function alpha distributes the discounted probability mass to the lower-order n-grams.

- Interpolation: combine different-order n-grams by linearly interpolating all the models, for example:

P_hat(wn | wn-2 wn-1) = lambda1 * P(wn) + lambda2 * P(wn | wn-1) + lambda3 * P(wn | wn-2 wn-1)

- The lambdas are hyperparameters that sum to 1 and are tuned on the development set.
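A minimal sketch (illustrative only; the lambda values are arbitrary placeholders rather than tuned ones, and the unigram, bigram, and trigram Counters are assumed to exist as in the earlier sketches) of the linear interpolation above:

def p_interpolated(w, prev2, prev1, unigrams, bigrams, trigrams, total_tokens,
                   lambdas=(0.1, 0.3, 0.6)):
    """lambda1*P(w) + lambda2*P(w|prev1) + lambda3*P(w|prev2 prev1); the lambdas sum to 1."""
    l1, l2, l3 = lambdas
    p_uni = unigrams[w] / total_tokens
    p_bi = bigrams[(prev1, w)] / unigrams[prev1] if unigrams[prev1] else 0.0
    p_tri = trigrams[(prev2, prev1, w)] / bigrams[(prev2, prev1)] if bigrams[(prev2, prev1)] else 0.0
    return l1 * p_uni + l2 * p_bi + l3 * p_tri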
The Vauquois Triangle

The Vauquois triangle for machine translation [Vauquois 1968]
