Chapter 6: N-gram Language Models
Approaches to NLP
Introduction
• Which is correct?
• Output of an ASR system
• “Their are two terms” vs. “There are two terms” ?
• “Here you are !” vs. “Hear you are !” ?
• Output of an MT system
• “Strong wind tonight” vs. “Big wind tonight” ?
• Output of a handwriting recognition system
• Predict how frequently a phrase occurs in the natural use of a language
• Useful for spelling correction, grammatical error correction, and other NLP tasks
Probabilistic Language Model
(" ")
• P(“know”|“we wanted to”) = ? =
(" ")
6.1 n-gram Language models
• Bigram probability (maximum likelihood estimate):
  P(wn | wn−1) = C(wn−1 wn) / C(wn−1)
• Chain rule of probability:
  P(S) = P(w1, w2, w3, ..., wn) = P(w1:n) = P(w1) P(w2 | w1) P(w3 | w1:2) ... P(wn | w1:n−1)
• n-gram (Markov) approximation: condition on only the last N−1 words, e.g. for a bigram model P(wn | w1:n−1) ≈ P(wn | wn−1)
• Log probabilities:
• Avoid underflow
• Faster
p1 × p2 × p3 × p4 = exp(log p1 +log p2 +log p3 +log p4)
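A minimal sketch of this identity in Python (the probability values are hypothetical, chosen only for illustration):

```python
import math

# Multiplying many probabilities < 1 risks numerical underflow;
# summing their logs and exponentiating once is stable (and faster).
probs = [0.25, 0.33, 0.0065, 0.52]  # hypothetical bigram probabilities

naive = 1.0
for p in probs:
    naive *= p  # underflows for long sentences

log_prob = sum(math.log(p) for p in probs)
print(naive, math.exp(log_prob))  # both ~2.79e-4
```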
Building n-gram Language models
• Example
• Berkeley Restaurant Project
• Counts for unigrams
• Probability of bigrams: P(wn | wn−1) = C(wn−1 wn) / C(wn−1) (table rows: wn−1, columns: wn)
https://github.com/wooters/berp-trans
Jurafsky, D., et al. 1994. The Berkeley restaurant project. ICSLP.
• Estimation of the bigram probability of the sentence “<s> I want chinese food </s>” (see the sketch below):
  P(<s> I want chinese food </s>) = P(I | <s>) × P(want | I) × P(chinese | want) × P(food | chinese) × P(</s> | food)
• Given P(I | <s>) = 0.25 and P(</s> | food) = 0.68; the remaining factors come from the bigram table
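A minimal sketch of bigram estimation and sentence scoring, using a tiny hypothetical corpus rather than the real BeRP transcripts:

```python
from collections import Counter

corpus = [
    "<s> i want chinese food </s>",
    "<s> i want to eat </s>",
    "<s> tell me about chinese food </s>",
]

unigrams, bigrams = Counter(), Counter()
for sent in corpus:
    tokens = sent.split()
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

def p_bigram(prev, word):
    """MLE estimate: P(word | prev) = C(prev word) / C(prev)."""
    return bigrams[(prev, word)] / unigrams[prev]

def p_sentence(sent):
    """Chain of bigram probabilities over the sentence."""
    tokens = sent.split()
    p = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        p *= p_bigram(prev, word)
    return p

print(p_bigram("<s>", "i"))                        # 2/3 on this toy corpus
print(p_sentence("<s> i want chinese food </s>"))  # product of bigram MLEs
```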
6.2. Evaluating Language Models
• Intrinsic evaluation
  • uses a measure: perplexity
  • measures the quality of a model independently of any application
• A training set
  • for training the language model parameters
• A development set
  • for tuning the model parameters to fit the characteristics of the data
• A test set
  • data truly not seen during model training
• Perplexity indicates how well the model fits the test set
  • a better model assigns a higher probability to the test data
• Perplexity intuition
  • a better model is one that assigns a higher probability to the word that actually occurs
• The perplexity (PP) of a language model on a test set W = w1 w2 ... wN:
  PP(W) = P(w1 w2 ... wN)^(−1/N)
  • the inverse probability of the test set, normalized by the number of words N; lower perplexity = better model
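A minimal sketch, computing perplexity in log space for numerical stability (the probabilities are hypothetical):

```python
import math

def perplexity(word_log_probs):
    """PP(W) = exp(-(1/N) * sum(log P(w_i | context)))."""
    n = len(word_log_probs)
    return math.exp(-sum(word_log_probs) / n)

# A model that assigns each of 4 test words probability 0.1
# is as "perplexed" as choosing uniformly among 10 words:
print(perplexity([math.log(0.1)] * 4))  # 10.0
```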
• Perplexity
  • a measure independent of any specific application
  • but it is difficult to know whether a gain in perplexity will translate into a real gain in practice
  • a bad approximation unless the test data resembles the training data
  • generally useful in pilot experiments, before moving on to extrinsic evaluation
6.3. Smoothing algorithms
• Training data
  • an n-gram model is dependent on its training corpus
  • be sure to use a training corpus whose genre/domain is similar to the task
• Problem of sparsity
  • Unknown words
    • out-of-vocabulary (OOV) words can appear in an open-vocabulary system
    • convert OOV words to an <UNK> token (requires choosing a vocabulary in advance), or
    • replace low-frequency words (count < threshold) in the training data by <UNK>
  • Zero-probability n-grams
    • perfectly acceptable word sequences that are missing from the training corpus get zero probability
    • but they could appear in the test set!
  • Example: the works of Shakespeare
    • N = 884,647 tokens, V = 29,066 ⇒ number of possible bigrams = V² ≈ 844 million
    • Shakespeare produced 300,000 different bigrams
    • thus, 99.96% of the possible bigrams have never been seen
• Problem:
  • some of these zeros are really zeros... but some of them are only rare events that happened not to occur in the training data
  ⇒ smoothing algorithms
Backoff solutions
• Backoff
• “back off” to a lower-order n-gram when there is zero evidence for the higher-order n-gram:
  P(wn | wn−2 wn−1) → P(wn | wn−1) → P(wn) → ...
• Katz backoff
• Good-Turing backoff
• Interpolation (see the sketch below)
  • weight and combine the trigram, bigram, and unigram estimates:
    P̂(wn | wn−2 wn−1) = λ1 P(wn | wn−2 wn−1) + λ2 P(wn | wn−1) + λ3 P(wn), with Σi λi = 1
  • simple interpolation: the λ weights are fixed
  • conditional interpolation: the λ weights depend on the context
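A minimal sketch of simple interpolation; the λ weights here are hypothetical (in practice they are tuned on held-out data):

```python
LAMBDAS = (0.6, 0.3, 0.1)  # trigram, bigram, unigram weights; must sum to 1

def p_interpolated(p_trigram, p_bigram, p_unigram):
    """P̂(w | u v) = λ1·P(w|u v) + λ2·P(w|v) + λ3·P(w)."""
    l1, l2, l3 = LAMBDAS
    return l1 * p_trigram + l2 * p_bigram + l3 * p_unigram

# Even if the trigram was never seen (P = 0), the estimate stays nonzero:
print(p_interpolated(0.0, 0.2, 0.01))  # 0.061
```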
Smoothing algorithms
• Smoothing (discounting)
  • remove some probability mass from previously seen events and reassign it to unseen events, so that the distribution still sums to 1
• Types of smoothing
• Laplace (add-one) smoothing
• Add-k smoothing
• Kneser-Ney smoothing
• Stupid backoff
• etc.
Laplace Smoothing
• Add one to all counts before normalizing:
  P_Laplace(wn | wn−1) = (C(wn−1 wn) + 1) / (C(wn−1) + V)
• Makes a very big change to the counts!
  • too much probability mass is moved to all the zeros
  • tends to reassign too much mass to events not seen in training
Add-k smoothing
• Instead of adding 1 to each count, add a fractional count k (0.5? 0.05? 0.01?):
  P_Add-k(wn | wn−1) = (C(wn−1 wn) + k) / (C(wn−1) + kV)
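A minimal sketch of add-k smoothing (k = 1 gives Laplace smoothing); the counts and vocabulary size are illustrative, loosely modeled on the BeRP bigram table:

```python
from collections import Counter

bigrams = Counter({("i", "want"): 827, ("i", "would"): 10})  # illustrative counts
unigrams = Counter({"i": 2533})
V = 1446  # assumed vocabulary size

def p_add_k(prev, word, k=0.5):
    """P(word | prev) = (C(prev word) + k) / (C(prev) + k*V)."""
    return (bigrams[(prev, word)] + k) / (unigrams[prev] + k * V)

print(p_add_k("i", "want"))       # smoothed estimate for a seen bigram
print(p_add_k("i", "quadruped"))  # unseen bigram still gets mass > 0
```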
Kneser-Ney smoothing
• One of the most commonly used and best-performing n-gram smoothing methods
• Based on the number of different contexts a word w has appeared in
  • Hypothesis: words that have appeared in more contexts in the past are more likely to appear in some new context as well
  • continuation probability: P_continuation(w) ∝ |{v : C(v w) > 0}|
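A minimal sketch of the continuation count at the heart of Kneser-Ney (not the full discounted recursion); the toy bigram set is hypothetical:

```python
from collections import Counter

# Distinct bigram types observed in training (a toy set):
bigrams = {("san", "francisco"), ("in", "francisco"),
           ("this", "glasses"), ("my", "glasses"), ("new", "glasses")}

contexts = Counter(w for (_, w) in bigrams)  # distinct left contexts per word
total = len(bigrams)                          # all distinct bigram types

def p_continuation(word):
    """|{v : C(v word) > 0}| / |{(u, v) : C(u v) > 0}|."""
    return contexts[word] / total

print(p_continuation("francisco"))  # 2/5: appears after few distinct contexts
print(p_continuation("glasses"))    # 3/5: appears after more distinct contexts
```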
• Google Books Ngram Viewer
• https://books.google.com/ngrams/
Why should we care about Language Modeling?
Language Modeling Toolkits
• SRILM
• http://www.speech.sri.com/projects/srilm/
• IRSTLM
• https://sourceforge.net/projects/irstlm/
• KenLM
• https://kheafield.com/code/kenlm/
End of Chapter 6