NLP-Ch-2 Introduction to Language Models
AAiT
SiTE
Office: NB #
More variables:
P(A,B,C,D) = P(A)P(B|A)P(C|A,B)P(D|A,B,C)
P(w1w2…wn) = ∏i P(wi | w1w2…wi-1)
Example
P(“its water is so transparent”) =
P(its) × P(water|its) × P(is|its water) × P(so|its water is)
x P(transparent|its water is so)
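As a sketch of how this chain-rule decomposition is applied, the toy Python snippet below multiplies made-up conditional probabilities for the example sentence (the values are illustrative assumptions, not estimates from any corpus):

```python
# Minimal sketch of the chain rule for the example sentence.
# The conditional probabilities below are made-up illustrative values,
# not estimates from a real corpus.
cond_probs = {
    "its": 0.05,                            # P(its)
    "water | its": 0.10,                    # P(water | its)
    "is | its water": 0.40,                 # P(is | its water)
    "so | its water is": 0.05,              # P(so | its water is)
    "transparent | its water is so": 0.02,  # P(transparent | its water is so)
}

p_sentence = 1.0
for p in cond_probs.values():
    p_sentence *= p                         # chain rule: multiply the factors

print(p_sentence)  # P("its water is so transparent")
```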
Or maybe we can simplify these conditional probabilities using the Markov assumption:
Andrei Andreyevich Markov (1856 - 1922)
Markov models are the class of probabilistic models that assume we can
predict the probability of some future unit without looking too far into the past.
We can generalize the bigram (which looks one word into the past) to the
trigram (which looks two words into the past) and thus to the n-gram (which
looks n-1 words into the past).
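As a small sketch of the Markov assumption, the snippet below shows which part of the history an n-gram model actually conditions on (the function name markov_history is hypothetical, used only for illustration):

```python
# Sketch: under the n-gram (Markov) assumption, only the last n-1 words
# of the history are used to predict the next word.
def markov_history(history, n):
    """Return the part of the history an n-gram model conditions on."""
    return history[-(n - 1):] if n > 1 else []

history = ["its", "water", "is", "so"]
print(markov_history(history, 2))  # bigram : ['so']
print(markov_history(history, 3))  # trigram: ['is', 'so']
```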
Bigram counts for eight of the words (out of V = 1446) in the Berkeley
Restaurant Project corpus of 9332 sentences. Zero counts are in blue.
Estimating N-gram Probabilities: Bi-gram estimation
Raw bigram probabilities: normalize each bigram count by the unigram count of the preceding word.
Bigram probabilities for eight of the words (out of V = 1446) in the Berkeley
Restaurant Project corpus of 9332 sentences.
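A minimal sketch of this estimation on a toy two-sentence corpus (not the Berkeley Restaurant Project data) might look like the following:

```python
from collections import Counter

# Sketch of MLE bigram estimation on a toy corpus:
# P(w_i | w_{i-1}) = c(w_{i-1}, w_i) / c(w_{i-1}).
sentences = [
    "<s> i want to eat chinese food </s>",
    "<s> i want to eat </s>",
]

unigram_counts = Counter()
bigram_counts = Counter()
for sent in sentences:
    tokens = sent.split()
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))

def bigram_prob(prev, word):
    # Normalize the bigram count by the unigram count of the first word.
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(bigram_prob("i", "want"))       # 1.0 in this toy corpus
print(bigram_prob("to", "eat"))       # 1.0
print(bigram_prob("eat", "chinese"))  # 0.5
```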
http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html
Bigram model
Figure 3.3 (TextBook) A visualization of the sampling distribution for sampling sentences by repeatedly sampling unigrams. The blue
bar represents the frequency of each word. The number line shows the cumulative probabilities. If we choose a random number
between 0 and 1, it will fall in an interval corresponding to some word. The expectation for the random number to fall in the larger
intervals of one of the frequent words (the, of, a) is much higher than in the smaller interval of one of the rare words (polyphonic).
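A minimal sketch of this sampling idea, with illustrative unigram probabilities rather than real corpus estimates, could look like this:

```python
import random

# Sketch of the idea in Figure 3.3: draw a random number in [0, 1) and
# pick the word whose interval of the cumulative distribution it falls in.
# The unigram probabilities below are illustrative, not corpus estimates.
unigram_probs = {"the": 0.30, "of": 0.20, "a": 0.15, "to": 0.15,
                 "in": 0.10, "polyphonic": 0.10}

def sample_unigram(probs):
    r = random.random()          # uniform in [0, 1)
    cumulative = 0.0
    for word, p in probs.items():
        cumulative += p
        if r < cumulative:       # r landed in this word's interval
            return word
    return word                  # guard against floating-point round-off

print(" ".join(sample_unigram(unigram_probs) for _ in range(10)))
```

Frequent words such as "the" own larger intervals of [0, 1), so they are sampled much more often than rare words such as "polyphonic".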
Sampling Sentences from a Language Model
Shannon visualization technique for bigrams
Choose a random bigram (<s>, w) according to its probability.
Then choose a random bigram (w, x) according to its probability, and so on,
until we choose </s>.
Then string the words together.
<s> I
I want
want to
to eat
eat Chinese
Chinese food
food </s>
I want to eat Chinese food
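A minimal sketch of this generation procedure, using toy bigram probabilities rather than counts from a real corpus, might look like the following:

```python
import random

# Sketch of the Shannon-style generation procedure for a bigram model:
# start from <s>, repeatedly sample the next word given the previous one,
# and stop at </s>.  The bigram distributions below are toy values.
bigram_model = {
    "<s>":     {"i": 1.0},
    "i":       {"want": 1.0},
    "want":    {"to": 1.0},
    "to":      {"eat": 1.0},
    "eat":     {"chinese": 0.5, "food": 0.5},
    "chinese": {"food": 1.0},
    "food":    {"</s>": 1.0},
}

def generate_sentence(model):
    word, words = "<s>", []
    while True:
        next_words = list(model[word])
        weights = [model[word][w] for w in next_words]
        word = random.choices(next_words, weights=weights)[0]
        if word == "</s>":
            return " ".join(words)
        words.append(word)

print(generate_sentence(bigram_model))  # e.g. "i want to eat chinese food"
```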
Generalization and Zeros
Figure 3.4 Eight sentences randomly generated from four n-grams computed from Shakespeare’s works. All
characters were mapped to lower-case and punctuation marks were treated as words. Output is hand-
corrected for capitalization to improve readability.
Generalization and Zeros
Approximating Shakespeare
Figure 3.5 Three sentences randomly generated from three n-gram models computed from 40 million
words of the Wall Street Journal, lower-casing all characters and treating punctuation as words. Output
was then hand-corrected for capitalization to improve readability.
P(w | denied the), observed bigram counts:
3 allegations
2 reports
1 claims
1 request
7 total
(other words in the chart, such as outcome, attack, and man, have zero counts)

Steal probability mass to generalize better
P(w | denied the), after shaving off some probability mass for unseen words:
2.5 allegations
1.5 reports
0.5 claims
0.5 request
2 other
7 total
Generalization and Zeros: Zeros – Smoothing
We use the term smoothing for such modifications that address the
poor estimates that are due to variability in small data sets.
The name comes from the fact that we will be shaving a little bit of
probability mass from the higher counts, and piling it instead on the
zero counts, making the distribution a little less jagged.
One simple way to do smoothing might be just to take our matrix of
bigram counts, before we normalize them into probabilities, and
add one to all the counts.
This algorithm is called Laplace smoothing, or Laplace's Law
(Lidstone, 1920; Johnson, 1932; Jeffreys, 1948).
MLE estimate: PMLE(wi | wi-1) = c(wi-1, wi) / c(wi-1)
Add-1 estimate: PAdd-1(wi | wi-1) = (c(wi-1, wi) + 1) / (c(wi-1) + V)
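A minimal sketch of the add-1 estimate on a toy corpus (assumed sentences, not the Berkeley Restaurant Project data):

```python
from collections import Counter

# Sketch of add-1 (Laplace) smoothing for bigrams on a toy corpus:
# P_Add-1(w_i | w_{i-1}) = (c(w_{i-1}, w_i) + 1) / (c(w_{i-1}) + V).
sentences = [
    "<s> i want chinese food </s>",
    "<s> i want to eat </s>",
]

unigram_counts = Counter()
bigram_counts = Counter()
for sent in sentences:
    tokens = sent.split()
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))

V = len(unigram_counts)  # vocabulary size (includes <s> and </s> here)

def p_add1(prev, word):
    return (bigram_counts[(prev, word)] + 1) / (unigram_counts[prev] + V)

print(p_add1("i", "want"))  # seen bigram: count 2
print(p_add1("i", "eat"))   # unseen bigram: still gets non-zero probability
```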
Data split: Training Data | Held-Out Data | Test Data
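A minimal sketch of such a split on a placeholder corpus; the 80/10/10 proportions are a common convention assumed here, not prescribed by the slides:

```python
import random

# Sketch of the usual data split for building and evaluating a language
# model: training data, held-out (development) data, and test data.
sentences = [f"sentence {i}" for i in range(100)]  # placeholder corpus
random.shuffle(sentences)

n = len(sentences)
train    = sentences[: int(0.8 * n)]
held_out = sentences[int(0.8 * n): int(0.9 * n)]
test     = sentences[int(0.9 * n):]

print(len(train), len(held_out), len(test))  # 80 10 10
```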