5) Lecture Feb 11, 13, 17 & 18
Language Models - N-Grams
Language Model
Conditional Probability
P(w|h) // probability of a word w given a history h
e.g., P(the | its water is so transparent that)
One possible way is to compute the probability directly from counts in the corpus:
P(the|its water is so transparent that) = C(its water is so transparent that the) / C(its water is so transparent that)
- Even adding a single word to the history can drive the count to zero, which leads to a data sparsity problem.
Solution:
One possible way is to compute the probability of the whole sequence directly from the corpus:
P(its water is so transparent that) = C(its water is so transparent that) / C(all 6-word sequences in the corpus)
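A minimal sketch of this counting approach, assuming the corpus is available as a list of tokens (the toy corpus and helper names below are illustrative, not from the lecture):

tokens = "its water is so transparent that the fish can be seen".split()   # toy corpus

def count_sequence(tokens, seq):
    # Count how often the word sequence `seq` occurs in `tokens`.
    n = len(seq)
    return sum(1 for i in range(len(tokens) - n + 1) if tokens[i:i + n] == seq)

history = "its water is so transparent that".split()
numerator = count_sequence(tokens, history + ["the"])     # C(its water is so transparent that the)
denominator = count_sequence(tokens, history)             # C(its water is so transparent that)
print(numerator / denominator if denominator else 0.0)    # 1.0 on this toy corpus; such counts are almost always 0 on real data (sparsity)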
Simplify using the Chain Rule, which shows the relation between conditional and joint probability.
A word depends only on the last few (n) words of the history, not the entire
sequence (Markov Assumption).
N=3 -> Trigram (looks 2 words into the past)
N=2 -> Bigram (looks 1 word into the past)
N=1 -> Unigram (only the word itself)
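Written out in the notation used above, the chain rule and the bigram approximation are:
P(w1 w2 ... wn) = P(w1) * P(w2|w1) * P(w3|w1 w2) * ... * P(wn|w1 ... wn-1)
Bigram (Markov) approximation: P(wn|w1 ... wn-1) ≈ P(wn|wn-1)
MLE estimate from counts: P(wn|wn-1) = C(wn-1 wn) / C(wn-1)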
// Conditional Probability
What are the following bigram probabilities? (A counting sketch follows the list.)
- P(I|<s>)
- P(Sam|<s>)
- P(am|I)
- P(</s>|Sam)
- P(Sam|am)
- P(do|I)
- P(<s> I am Sam </s>)
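A short sketch of how these could be computed, assuming the classic three-sentence "I am Sam" mini-corpus (if the lecture used a different corpus, substitute it below):

from collections import Counter

sentences = ["I am Sam", "Sam I am", "I do not like green eggs and ham"]   # assumed mini-corpus

unigrams, bigrams = Counter(), Counter()
for s in sentences:
    words = ["<s>"] + s.split() + ["</s>"]
    unigrams.update(words)
    bigrams.update(zip(words, words[1:]))

def p_bigram(w, prev):
    # MLE bigram probability P(w | prev) = C(prev w) / C(prev)
    return bigrams[(prev, w)] / unigrams[prev] if unigrams[prev] else 0.0

print(p_bigram("I", "<s>"))    # P(I|<s>)  = 2/3
print(p_bigram("Sam", "am"))   # P(Sam|am) = 1/2
print(p_bigram("do", "I"))     # P(do|I)   = 1/3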
Bi-gram Counts
Uni-gram counts
Log probabilities:
- For longer sequences, multiplying many probabilities results in numerical underflow.
- Working with log probabilities solves this; taking the exponent recovers the probability if needed.
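A quick illustration of the underflow problem and the log-space fix (the per-word probabilities are made up for illustration):

import math

probs = [0.01] * 200          # hypothetical per-word probabilities of a long sequence

product = 1.0
for p in probs:
    product *= p
print(product)                # 0.0 -- the true value 1e-400 underflows in float arithmetic

log_prob = sum(math.log(p) for p in probs)
print(log_prob)               # about -921.03, perfectly representable
# math.exp(log_prob) would give the probability back, but here it is too small to represent.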
N-Gram Language Model
Performance Evaluation: measure whether the difference between two language models is statistically significant.
- Extrinsic Evaluation: measures how much the downstream application improves, for example, next-word
prediction.
- Intrinsic Evaluation: measures the quality of the language model on its own. Perplexity (PP)
is one such measure.
- In practice, we often divide the data into training, development, and test sets. We report the performance
score on the test set.
N-Gram Language Model (Intrinsic Evaluation)
- W represents the complete word sequence of the test data; <s> and </s> are inserted before and
after each sentence before computing probabilities.
- The perplexity of a language model on a test set is the inverse probability of the test set, normalized by
the number of words.
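As a formula, PP(W) = P(w1 w2 ... wN)^(-1/N), which in log space becomes exp(-(1/N) * sum of log P). A minimal sketch, assuming per-token log probabilities have already been computed as above:

import math

def perplexity(log_probs):
    # log_probs: natural-log probability of each token in the test set (N tokens total)
    n = len(log_probs)
    return math.exp(-sum(log_probs) / n)

# e.g., 4 tokens, each assigned probability 0.1 by the model:
print(perplexity([math.log(0.1)] * 4))   # 10.0 -- lower perplexity is better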
Sentences randomly generated from four n-gram models trained on Shakespeare's works.
The longer the context, the more coherent the generated sentences.
Issues with LMs
● LMs do better and better as we increase the size of the training corpus.
● Differences between training and test data matter: a model trained on Shakespeare's text would not perform well at predicting Wall Street Journal text.
● A similar domain helps: to build a language model for translating legal documents, we need a training corpus of legal documents.
● Data sparsity is still a prevalent issue.
N-Gram Language Model
- Data Sparsity
- Witnessing zero-probability or very low-probability n-grams. Example counts:
denied the allegations: 5
denied the speculation: 2
denied the rumors: 1
denied the report: 1
Got the offer: 1
- Data Sparsity
- In other cases we have to deal with words we haven't seen before, which we call
unknown words, or out-of-vocabulary (OOV) words.
- The percentage of OOV words that appear in the test set is called the OOV rate.
- Such words are often tagged as <UNK>.
- How do we compute the probability of <UNK>?
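One common recipe (a sketch; the min_count threshold of 2 is an assumption, not from the lecture): fix a vocabulary from the training data, replace every out-of-vocabulary training token with <UNK>, estimate <UNK>'s probability like any other word, and map test-time OOV tokens to <UNK> as well.

from collections import Counter

def build_vocab(train_tokens, min_count=2):
    # Keep words seen at least `min_count` times; everything else will become <UNK>.
    counts = Counter(train_tokens)
    return {w for w, c in counts.items() if c >= min_count}

def replace_oov(tokens, vocab):
    return [w if w in vocab else "<UNK>" for w in tokens]

train = "the cat sat on the mat the cat ran".split()
vocab = build_vocab(train)                          # {'the', 'cat'}
print(replace_oov("the dog sat".split(), vocab))    # ['the', '<UNK>', '<UNK>']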
N-Gram Language Model
- Unknown Words
- Smoothing: move a small amount of probability mass from the seen terms to the unseen
terms.
1. Add-k smoothing: add a fractional count k (e.g., 0.5, 0.05, or 0.01) to every count.
- Gale and Church (1994) showed that add-k smoothing leads to poor variance and
inappropriate discounts.
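A sketch of add-k smoothing for bigram probabilities; the toy counts mirror the "denied the ..." example above, and the vocabulary size V is assumed:

from collections import Counter

def add_k_bigram_prob(w, prev, bigrams, unigrams, vocab_size, k=0.5):
    # Add-k smoothed bigram probability: P(w | prev) = (C(prev w) + k) / (C(prev) + k * V)
    return (bigrams[(prev, w)] + k) / (unigrams[prev] + k * vocab_size)

unigrams = Counter({"the": 9})
bigrams = Counter({("the", "allegations"): 5, ("the", "speculation"): 2,
                   ("the", "rumors"): 1, ("the", "report"): 1})
V = 10   # assumed vocabulary size

print(add_k_bigram_prob("allegations", "the", bigrams, unigrams, V))   # seen bigram: 5.5/14
print(add_k_bigram_prob("offer", "the", bigrams, unigrams, V))         # unseen bigram: 0.5/14, no longer zero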
Better mechanisms of Smoothing
Unknown Words
- Backoff: use less context: for a 4-gram, use the trigram; for a trigram, the bigram; for a bigram, the
unigram.
- We “back off” to a lower-order n-gram if we have zero evidence for a higher-order n-gram.
- Katz Backoff: We rely on a discounted probability P∗ if we’ve seen this n-gram before. Otherwise,
we recursively back off to the Katz probability for the shorter-history (N-1)-gram.
- Interpolation: combine different-order n-grams by linearly interpolating all the models.
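A sketch of simple linear interpolation for a trigram model; the lambda weights are placeholders (in practice they are tuned on the development set), and p_uni/p_bi/p_tri stand in for the component MLE estimates:

def interpolated_trigram_prob(w, w1, w2, p_uni, p_bi, p_tri, lambdas=(0.1, 0.3, 0.6)):
    # P_hat(w | w1 w2) = l1*P(w) + l2*P(w | w2) + l3*P(w | w1 w2), with l1 + l2 + l3 = 1
    l1, l2, l3 = lambdas
    return l1 * p_uni(w) + l2 * p_bi(w, w2) + l3 * p_tri(w, w1, w2)

# Usage with made-up component probabilities:
print(interpolated_trigram_prob("the", "its", "water",
                                p_uni=lambda w: 0.05,
                                p_bi=lambda w, prev: 0.2,
                                p_tri=lambda w, a, b: 0.0))   # 0.1*0.05 + 0.3*0.2 + 0.6*0.0 = 0.065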