
AAU

AAiT
SiTE

Course Title: Natural Language Processing (NLP) (SiTE 5264)


Credit Hour: 3
Instructor: Fantahun B. (PhD)  [email protected]

Office: NB #

Ch-2 Language Models


October 2023, Addis Ababa, Ethiopia
N-gram Language Models
Contents
 The role of language models.
 Simple N-gram models.
 Estimating parameters and smoothing.
 Evaluating language models.

10/27/2023 NLP Fantahun B (PhD) 2


N-gram Language Models
Objectives:
After completing this chapter, students will be able to:
 Identify and discuss the role of language models.
 Explain and implement simple n-gram models.
 Identify techniques that could improve n-gram models
 Evaluate language models.

10/27/2023 NLP Fantahun B (PhD) 3


N-gram Language Models
Brain storming
1. What is a language model?
2. Which types of language models do you know?
3. What are n-grams?
4. What is the role of n-grams?
5. Why would you want to predict upcoming words, or assign
probabilities to sentences?

10/27/2023 NLP Fantahun B (PhD) 4


Language Models: Problem of Modeling Language
 Formal languages, like programming languages, can be fully
specified.
 All the reserved words can be defined and the valid ways that they
can be used can be precisely defined.
 We cannot do this with natural/human language.
 Natural languages are not designed; they emerge, and hence there is
no formal specification. There may be formal rules for parts of the
language, and heuristics, but natural language that does not conform
to them is often used.
 Natural languages are ambiguous;
 Languages change, word usages change: it is a moving target.
 An alternative approach to specifying the model of the language
is to learn it from examples.
10/27/2023 NLP Fantahun B (PhD) 5
Language Models: What is a language model?
 Language modeling, or LM, is the use of various statistical and
probabilistic techniques to determine the probability of a given
sequence of words occurring in a sentence.

 A language model is a probabilistic model of a natural language that can
generate probabilities of a series of words, based on the text corpora in
one or multiple languages it was trained on.

10/27/2023 NLP Fantahun B (PhD) 6


Language Models: Types - Statistical Language Models
 Statistical Language Modeling, or Language Modeling and LM for
short, is the development of probabilistic models that are able to
predict the next word in the sequence given the words that
precede it.
 Language modeling is the task of assigning a probability to sentences
in a language.
 assigning a probability to each sequence of words,
 assigns a probability for the likelihood of a given word (or a sequence
of words) to follow a sequence of words
 A language model is a function that puts a probability measure over
strings drawn from some vocabulary.

10/27/2023 NLP Fantahun B (PhD) 7


Language Models: Types - Statistical Language Models
 Statistical Language Modeling Examples:
 N-gram: this model conditions each word on the words that precede it,
creating a probability distribution over word sequences of length n.

 Bidirectional: a type of statistical language modeling that analyzes
text in both directions, backward and forward.

 Exponential: in this case, the evaluation of the text is performed by
means of an equation combining n-grams and other parameters. It is
more accurate than the n-gram model.

 Continuous space: this is based on weighting each word (word
embedding). It is very useful when dealing with very large texts or
data sets.

10/27/2023 NLP Fantahun B (PhD) 8


Language Models: Types - Neural Language Models
 The use of neural networks in language modeling is often called
Neural Language Modeling, or NLM for short.
 Nonlinear neural network models solve some of the shortcomings
of traditional language models:
 they allow conditioning on increasingly large context sizes with only a linear
increase in the number of parameters, they alleviate the need for manually
designing backoff orders, and they support generalization across different
contexts.
 Neural Language Models (NLM) address the n-gram data sparsity issue through
parameterization of words as vectors (word embeddings) and using them as
inputs to a neural network. The parameters are learned as part of the training
process. Word embeddings obtained through NLMs exhibit the property
whereby semantically close words are likewise close in the induced vector
space.

10/27/2023 NLP Fantahun B (PhD) 9


Simple N-gram Models

10/27/2023 NLP Fantahun B (PhD) 10


N-gram Language Models
 Predicting is difficult—especially about the future, as the old
quip goes. But how about predicting something that seems
much easier, like the next few words someone is going to say?
 What word, for example, is likely to follow
Please turn off your cell ...
 Hopefully, most of you concluded that a very likely word is
phone, but probably not refrigerator or the.
 This intuition can be formalized by introducing models that
assign a probability to each possible next word.

10/27/2023 NLP Fantahun B (PhD) 11


N-gram Language Models
 The same models will also serve to assign a probability to an
entire sentence.
 Such a model, for example, could predict that the following
sequence has a much higher probability of appearing in a text:

all of a sudden I notice three guys standing on the sidewalk


than does this same set of words in a different order:
on guys all I of notice sidewalk three a sudden standing the
Qs:
 Why would you want to predict upcoming words, or assign
probabilities to sentences?
 How do you think a spelling corrector or a grammar corrector works?
10/27/2023 NLP Fantahun B (PhD) 12
N-gram Language Models
 Probabilities are essential in any task in which we have to
identify words in noisy, ambiguous inputs (eg. Speech
recognition).
 Speech recognition. For a speech recognizer to realize that you said I
will be back soonish and not I will be bassoon dish, it helps to know
that P(back soonish) >> P(bassoon dish).

 For writing tools like spelling correction or grammatical error
correction, we need to find and correct errors in writing like Their are
two midterms, in which There was mistyped as Their, or Everything has
improve, in which “improve” should have been “improved”.
P(There are) >> P(Their are), and
P(has improved) >> P(has improve),
allowing us to help users by detecting and correcting these errors.
10/27/2023 NLP Fantahun B (PhD) 13
N-gram Language Models
 Assigning probabilities to sequences of words is also essential in machine
translation.
 Suppose we are translating a Chinese source sentence:
他 向 记者 介绍了 主要 内容
He to reporters introduced main content
 We might have built the below set of potential rough English translations:
he introduced reporters to the main contents of the statement
he briefed to reporters the main contents of the statement
he briefed reporters on the main contents of the statement
 A probabilistic model of word sequences could suggest that
 P(briefed reporters on) > P(briefed to reporters) (which has an awkward “to”
after briefed) or
 P(briefed reporters on) > P(introduced reporters to) (which uses a verb that is
less fluent English in this context), allowing us to correctly select the
boldfaced sentence above.
10/27/2023 NLP Fantahun B (PhD) 14
N-gram Language Models
 Probabilities are also important for augmentative and alternative
communication (AAC) systems (Trnka et al. 2007, Kane et al. 2017).
 People often use such AAC devices if they are physically unable to
speak or sign but can instead use eye gaze or other specific movements
to select words from a menu to be spoken by the system. Word
prediction can be used to suggest likely words for the menu.
 More NLP tasks where N-grams are applicable
 part-of-speech tagging,
 natural language generation,
 word similarity,
 authorship identification
 sentiment extraction
 predictive text input.
 etc.
10/27/2023 NLP Fantahun B (PhD) 15
N-gram Language Models
Language models
 Models that assign probabilities to sequences of words are called
language models or LMs.
 The n-gram is the simplest language model that assigns probabilities to
sentences and sequences of words.
 An n-gram is a sequence of n words:
 a 2-gram (bigram) is a two-word sequence of words like
“please turn”, “turn your”, or ”your homework”, and
 a 3-gram (a trigram) is a three-word sequence of words like
“please turn your”, or
“turn your homework”.
 etc.
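
To make the definition concrete, here is a minimal Python sketch (illustrative only; the ngrams helper is not from any particular library) that extracts bigrams and trigrams from a whitespace-tokenized sentence:

def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "please turn your homework".split()
print(ngrams(tokens, 2))  # [('please', 'turn'), ('turn', 'your'), ('your', 'homework')]
print(ngrams(tokens, 3))  # [('please', 'turn', 'your'), ('turn', 'your', 'homework')]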
10/27/2023 NLP Fantahun B (PhD) 16
N-gram Language Models
 Goal: compute the probability of a sentence or sequence of
words:
P(W) = P(w1,w2,w3,w4,w5…wn)
 Related task: probability of an upcoming word:
P(w5|w1,w2,w3,w4)
 A model that computes either of these:
P(W) or P(wn|w1,w2…wn-1) is called a language model.
 A better name might be “the grammar”, but language model or LM is standard

10/27/2023 NLP Fantahun B (PhD) 17


N-gram Language Models
 Let’s begin with the task of computing P(w|h), the probability
of a word w given some history h.
 Suppose the history h is “its water is so transparent that” and
we want to know the probability that the next word is “the”:
P(the | its water is so transparent that)
 How can we compute this probability? One way is to estimate
it from relative frequency counts.

10/27/2023 NLP Fantahun B (PhD) 18


N-gram Language Models
 For example, we could take a very large corpus, count the number
of times we see “the water is so transparent that”, and count the
number of times this is followed by “the”.
 This would be answering the question “Out of the times we saw the
history h, how many times was it followed by the word w”:
P(the | its water is so transparent that) =
C(its water is so transparent that the) / C(its water is so transparent that)
 While this method of estimating probabilities directly from counts
works fine in many cases, it turns out that even the web isn’t big
enough to give us good estimates in most cases.
 This is because language is creative; new sentences are created
all the time, and we won’t always be able to count entire
sentences.
10/27/2023 NLP Fantahun B (PhD) 19
N-gram Language Models
 Similarly, if we wanted to know the joint probability of an entire
sequence of words like “its water is so transparent”, we could do it
by asking
 “out of all possible sequences of 5 words, how many of them are its
water is so transparent?”
 We would have to get the count of “its water is so transparent”, and
divide by the sum of the counts of “all possible 5 word sequences”.
That seems rather a lot to estimate!
 For this reason, we’ll need to introduce cleverer ways of estimating
the probability of a word w given a history h, or the probability of
an entire word sequence W.

10/27/2023 NLP Fantahun B (PhD) 20


N-gram Language Models
 Recall the definition of conditional probabilities
P(B|A) = P(A,B)/P(A) Rewriting: P(A,B) = P(A)P(B|A)

 More variables:
P(A,B,C,D) = P(A)P(B|A)P(C|A,B)P(D|A,B,C)

 The Chain Rule, generalized:


P(x1,x2,x3,…,xn) = P(x1)P(x2|x1)P(x3|x1,x2)…P(xn|x1,…,xn-1)

10/27/2023 NLP Fantahun B (PhD) 21


N-gram Language Models
 The Chain Rule applied to compute joint probability of words
in sentence

P(w1 w2 … wn) = ∏k=1..n P(wk | w1 … wk-1)

 Example
P(“its water is so transparent”) =
P(its) × P(water|its) × P(is|its water) × P(so|its water is)
x P(transparent|its water is so)

10/27/2023 NLP Fantahun B (PhD) 22


N-gram Language Models
 But using the chain rule doesn’t really seem to help us!
 As we said above, we can’t just estimate by counting the
number of times every word occurs following every long string,
because language is creative and any particular context
might have never occurred before!
 The intuition of the N-gram model is that instead of computing
the probability of a word given its entire history, we will
approximate the history by just the last few words.

10/27/2023 NLP Fantahun B (PhD) 23


N-gram Language Models
 Markov’s simplifying assumption:
P(wn | w1 … wn-1) ≈ P(wn | wn-1)
Or maybe:
P(wn | w1 … wn-1) ≈ P(wn | wn-2 wn-1)
Andrei Andreyevich Markov (1856 - 1922)
 Markov models are the class of probabilistic models that assume we can
predict the probability of some future unit without looking too far into the past.
 We can generalize the bigram (which looks one word into the past) to the trigram
(which looks two words into the past) and thus to the n-gram (which looks n-1
words into the past).

10/27/2023 NLP Fantahun B (PhD) 24


N-gram Language Models
 Simplest case: Unigram model
P(w1 w2 … wn) ≈ ∏i P(wi)

 Bigram model: condition on the previous word
P(wi | w1 w2 … wi-1) ≈ P(wi | wi-1)

 We can extend to trigrams, 4-grams, 5-grams

10/27/2023 NLP Fantahun B (PhD) 25


Estimating N-gram Probabilities

10/27/2023 NLP Fantahun B (PhD) 26


Estimating N-gram Probabilities: Bi-gram estimation
 The Maximum Likelihood Estimate:
PMLE(wi | wi-1) = C(wi-1 wi) / C(wi-1)
i.e., the bigram count normalized by the count of the preceding word.
10/27/2023 NLP Fantahun B (PhD) 27


Estimating N-gram Probabilities: Bi-gram estimation
 The Maximum Likelihood Estimate: Example

<s> I am Sam </s>


<s> Sam I am </s>
<s> I do not like green eggs and ham </s>
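
A minimal Python sketch of maximum likelihood bigram estimation on this three-sentence corpus, treating <s> and </s> as ordinary tokens (the function names are illustrative):

from collections import Counter

corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))

def p_mle(w, prev):
    """P(w | prev) = C(prev, w) / C(prev)."""
    return bigram_counts[(prev, w)] / unigram_counts[prev]

print(p_mle("I", "<s>"))     # 2/3: 'I' follows <s> in 2 of the 3 sentences
print(p_mle("Sam", "am"))    # 1/2
print(p_mle("</s>", "Sam"))  # 1/2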

10/27/2023 NLP Fantahun B (PhD) 28


Estimating N-gram Probabilities: Bi-gram estimation

Bigram counts for eight of the words (out of V = 1446) in the Berkeley
Restaurant Project corpus of 9332 sentences. Zero counts are in blue.
10/27/2023 NLP Fantahun B (PhD) 29
Estimating N-gram Probabilities: Bi-gram estimation
 Raw bigram probabilities, Normalize by unigrams:

Bigram probabilities for eight of the words (out of V = 1446) in the Berkeley
Restaurant Project corpus of 9332 sentences, normalized by the unigram counts.

10/27/2023 NLP Fantahun B (PhD) 30


Estimating N-gram Probabilities: Bi-gram estimation
 Other bigram probability estimates from the same corpus:

P(i|<s>) = 0.25      P(english|want) = 0.0011
P(food|english) = 0.5   P(</s>|food) = 0.68

10/27/2023 NLP Fantahun B (PhD) 31


Estimating N-gram Probabilities: Bi-gram estimation
 Now we can compute the probability of sentences like I want
English food or I want Chinese food by simply multiplying the
appropriate bigram probabilities together, as follows:

P(<s> i want english food </s>) =


P(i|<s>) × P(want|i) × P(english|want) × P(food|english) × P(</s>|food)
= 0.25 x 0.33 x .0011 x 0.5 x 0.68
= 0.000031
 Class activity: compute the probability of i want chinese food.
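
A short Python sketch of the multiplication above, using only the bigram probabilities quoted on these slides; the bigram_p table is a hand-filled stand-in for a full model, so the class-activity sentence would need its own entries added:

bigram_p = {
    ("<s>", "i"): 0.25,
    ("i", "want"): 0.33,
    ("want", "english"): 0.0011,
    ("english", "food"): 0.5,
    ("food", "</s>"): 0.68,
}

def sentence_probability(tokens):
    """Multiply the bigram probabilities along the sentence."""
    p = 1.0
    for bigram in zip(tokens, tokens[1:]):
        p *= bigram_p[bigram]
    return p

tokens = "<s> i want english food </s>".split()
print(sentence_probability(tokens))  # about 3.1e-05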

10/27/2023 NLP Fantahun B (PhD) 32


Estimating N-gram Probabilities: Bi-gram estimation
 What kinds of knowledge are captured in these bigram statistics?
P(english|want) = .0011
P(chinese|want) = .0065
P(to|want) = .66
P(eat | to) = .28
P(food | to) = 0
P(want | spend) = 0
P(i | <s>) = .25
 Some of the bigram probabilities above encode some facts that we
think of as strictly syntactic in nature, like the fact that what comes after
eat is usually a noun or an adjective, or that what comes after to is
usually a verb.
 Others might be a fact about the personal assistant task, like the high
probability of sentences beginning with the word I.
 And some might even be cultural rather than linguistic, like the higher
probability that people are looking for Chinese versus English food.

10/27/2023 NLP Fantahun B (PhD) 33


Estimating n-gram probabilities: Some practical issues
 In practice we use trigrams, 4-grams or 5-grams rather than bigrams.
 We always represent and compute language model probabilities in log
format as log probabilities, since probabilities are (by definition) less than
or equal to 1: the more probabilities we multiply together, the smaller the
product becomes.
 Avoid numerical underflow
 (also adding is faster than multiplying)

 We only need to convert back into probabilities if we need to report
them at the end; then we can just take the exp of the logprob:
p1 × p2 × p3 × p4 = exp(log p1 + log p2 + log p3 + log p4)
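
A small Python sketch of the same product computed in log space (natural logs assumed; any base works):

import math

probs = [0.25, 0.33, 0.0011, 0.5, 0.68]

logprob = sum(math.log(p) for p in probs)  # add log-probabilities instead of multiplying
print(logprob)            # about -10.39
print(math.exp(logprob))  # about 3.1e-05, converting back only for reporting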
10/27/2023 NLP Fantahun B (PhD) 34
Estimating n-gram probabilities: Some practical issues
Google N-gram Release
File sizes: approx. 24 GB compressed (gzip'ed) text files
Number of tokens: 1,024,908,267,229
Number of sentences: 95,119,665,584
Number of unigrams: 13,588,391
Number of bigrams: 314,843,401
Number of trigrams: 977,069,902
Number of fourgrams: 1,313,818,354
Number of fivegrams: 1,176,470,663

Example of the 4-gram data in this corpus:
serve as the incoming 92
serve as the incubator 99
serve as the independent 794
serve as the index 223
serve as the indication 72
serve as the indicator 120
serve as the indicators 45
serve as the indispensable 111
serve as the indispensible 40
serve as the individual 234
serve as the industrial 52
serve as the industry 607

http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html

10/27/2023 NLP Fantahun B (PhD) 35


Evaluating Language Models

10/27/2023 NLP Fantahun B (PhD) 36


Evaluating Language Models: How good is our model?
 Does our language model prefer good sentences to bad
ones?
 Assign higher probability to “real” or “frequently observed” sentences
than to “ungrammatical” or “rarely observed” sentences?
 We train parameters of our model on a training set.
 We test the model’s performance on data we haven’t seen.
 A test set is an unseen dataset that is different from our training set,
totally unused.
 An evaluation metric tells us how well our model does on the test set.

10/27/2023 NLP Fantahun B (PhD) 37


Evaluating Language Models: Extrinsic evaluation
 Extrinsic evaluation is the only way to know if a particular
improvement in a component is really going to help the task
at hand.
 Best evaluation for comparing two models A and B: Put each
model in a task
 E.g. tasks: spelling corrector, speech recognizer, MT system
 Run the task, get an accuracy for A and for B
 How many misspelled words corrected properly
 How many words translated correctly
 Compare accuracy of models A and B

10/27/2023 NLP Fantahun B (PhD) 38


Evaluating Language Models: Extrinsic evaluation
 Extrinsic evaluation pitfalls
 it may be expensive
 Why?
 Solution
 Intrinsic evaluation

10/27/2023 NLP Fantahun B (PhD) 39


Evaluating Language Models : Intrinsic evaluation
 We need a metric that can be used to quickly evaluate
potential intrinsic improvements in a language model.
 In an intrinsic evaluation, the quality of an NLP system’s outputs is
evaluated against a pre-determined ground truth, whereas an
extrinsic evaluation is aimed at evaluating system outputs
based on their impact on the performance of other NLP
systems.
 An intrinsic evaluation metric is one that measures the quality
of a model independent of any application.

10/27/2023 NLP Fantahun B (PhD) 40


Evaluating Language Models : Intrinsic evaluation
 For instance, in an intrinsic evaluation of a paraphrase generation
system, we would ask the following question:
 Does the generated paraphrase convey the meaning of the original text?
 On the other hand, extrinsic evaluation of generated
paraphrases deals with their impact on the performance of
other NLP systems. In this context, we might ask:
 Do the incorporated paraphrases significantly improve the performance
of a question-answering model?
 Can the generated paraphrases be used as a surrogate of original
text to train text classification models?
 If so, it can be concluded that, extrinsically, the considered
paraphrases are useful.

10/27/2023 NLP Fantahun B (PhD) 41


Evaluating Language Models
 Since our evaluation metric is based on test set probability, it’s
important not to let the test sentences into the training set. We
call this situation training on the test set.
 Training on the test set introduces a bias that makes the
probabilities all look too high, and causes huge inaccuracies
in perplexity (a probability-based metric).
 Sometimes we use a particular test set so often that we
implicitly tune to its characteristics. We then need a fresh test
set that is truly unseen.
 In such cases, we call the initial test set the development test
set, devset, or validation set.

10/27/2023 NLP Fantahun B (PhD) 42


Evaluating Language Models
 At the minimum, we would want to pick the smallest test set
that gives us enough statistical power to measure a
statistically significant difference between two potential
models.
 In practice, we often just divide our data into 80% training,
10% development, and 10% test.

10/27/2023 NLP Fantahun B (PhD) 43


Evaluating Language Models: Perplexity
 In practice we don’t use raw probability as our metric for
evaluating language models, but a variant called perplexity.
 The perplexity (PP for short) of a language model on a test set
is the inverse probability of the test set, normalized by the
number of words.
 Note that because of the inverse, the higher the conditional
probability of the word sequence, the lower the perplexity.
 Thus, minimizing perplexity is equivalent to maximizing the test
set probability according to the language model.

10/27/2023 NLP Fantahun B (PhD) 44


Evaluating Language Models: Perplexity

 For a test set W = w1 w2 … wN:
PP(W) = P(w1 w2 … wN)^(-1/N)

 Chain rule, expand PP of W:
PP(W) = ( ∏i=1..N 1 / P(wi | w1 … wi-1) )^(1/N)

 Bigram model:
PP(W) = ( ∏i=1..N 1 / P(wi | wi-1) )^(1/N)
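
A possible Python sketch of bigram perplexity, computed in log space to avoid underflow; bigram_prob stands in for whatever (smoothed) estimate a trained model provides, since an unsmoothed model would fail on any zero-probability bigram, and N here counts </s> but not <s>, as noted on the next slide:

import math

def perplexity(test_tokens, bigram_prob):
    """PP(W) = P(w1..wN)^(-1/N), computed as exp of the average negative log probability."""
    log_p = 0.0
    n = 0
    for prev, w in zip(test_tokens, test_tokens[1:]):
        if prev == "</s>":   # don't predict across a sentence boundary
            continue
        log_p += math.log(bigram_prob(w, prev))
        n += 1               # counts every predicted token, including </s> but not <s>
    return math.exp(-log_p / n)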

10/27/2023 NLP Fantahun B (PhD) 45


Evaluating Language Models: Perplexity
 What we generally use for word sequence in the above
equations is the entire sequence of words in some test set.
 Since this sequence will cross many sentence boundaries,
we need to include the begin- and end-sentence markers
<s> and </s> in the probability computation.
 We also need to include the </s> (but not <s>) in the total
count of word tokens N.

10/27/2023 NLP Fantahun B (PhD) 46


Evaluating Language Models: Perplexity as branching factor
 There is another way to think about perplexity: as the weighted
average branching factor of a language (the number of possible
next words that can follow any word.)
 Consider the task of recognizing the digits in English (0, 1, 2, ….9),
given that (both in some training set and in some test set) all digits
occur with equal probability P= 1/10.
 The perplexity of this mini-language is ?
 Answer: 10, since PP(W) = ((1/10)^N)^(-1/N) = 10.

 How hard is recognizing one of 30,000 names at Microsoft, if each
name is equally likely?
 Perplexity = 30,000
10/27/2023 NLP Fantahun B (PhD) 47
Evaluating Language Models: Perplexity as branching factor
 Josh Goodman: a call-routing phone system gets 120K calls and
has to recognize
 "Operator" (let's say this occurs 1 in 4 calls)
 "Sales" (1in 4)
 "Technical Support" (1 in 4)
 30,000 different names (each name occurring 1 time in the 120K calls)

 We get the perplexity of this sequence of length 120K by first


multiplying 120K probabilities (90K of which are 1/4 and 30K of
which are 1/120K), and then taking the inverse 120,000th root:
 Perp = (¼ × ¼ × ¼ × ¼ × ¼ × …. × 1/120K × 1/120K × ….)^(-1/120K)
 Arithmetically this simplifies to just N = 4 outcomes: the operator (1/4), sales
(1/4), tech support (1/4), and one of the 30,000 names (1/120,000):
 Perplexity = (¼ × ¼ × ¼ × 1/120K)^(-1/4) ≈ 52.6
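
A quick Python check of this arithmetic, computing both the full 120K-term product (in log space) and the simplified 4-outcome version:

import math

# Full sequence: 90,000 outcomes with P = 1/4 and 30,000 names with P = 1/120,000.
log_p = 90_000 * math.log(1/4) + 30_000 * math.log(1/120_000)
print(math.exp(-log_p / 120_000))               # about 52.6

# Simplified 4-outcome version from the slide.
print((1/4 * 1/4 * 1/4 * 1/120_000) ** (-1/4))  # about 52.6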
10/27/2023 NLP Fantahun B (PhD) 48
Evaluating Language Models: Perplexity as branching factor
 Training:
38 million words,
test 1.5 million words,
on Wall Street Journal,
using a 19,979 word vocabulary.

N-gram Order:   Unigram   Bigram   Trigram
Perplexity:         962      170       109

10/27/2023 NLP Fantahun B (PhD) 49


Evaluating Language Models: Perplexity as branching factor
 An (intrinsic) improvement in perplexity does not guarantee an
(extrinsic) improvement in the performance of a language
processing task like speech recognition or machine translation.
 Nonetheless, because perplexity often correlates with such
improvements, it is commonly used as a quick check on an
algorithm.
 But a model’s improvement in perplexity should always be
confirmed by an end-to-end evaluation of a real task before
concluding the evaluation of the model.

10/27/2023 NLP Fantahun B (PhD) 50


Sampling Sentences from a Language Model

10/27/2023 NLP Fantahun B (PhD) 51


Sampling Sentences from a Language Model
 One important way to visualize what kind of knowledge a
language model embodies is to sample from it.
 Sampling from a distribution means to choose random points
according to their likelihood.
 Thus sampling from a language model — which represents a
distribution over sentences — means to generate some
sentences, choosing each sentence according to its likelihood
as defined by the model.
 Thus, we are
 more likely to generate sentences that the model thinks have a high
probability and
 less likely to generate sentences that the model thinks have a low
probability.
10/27/2023 NLP Fantahun B (PhD) 52
Sampling Sentences from a Language Model
 This technique of visualizing a language model by sampling was first
suggested by Shannon (1951) and Miller & Selfridge (1950). It’s simplest to
visualize for the unigram case.

Figure 3.3 (TextBook) A visualization of the sampling distribution for sampling sentences by repeatedly sampling unigrams. The blue
bar represents the frequency of each word. The number line shows the cumulative probabilities. If we choose a random number
between 0 and 1, it will fall in an interval corresponding to some word. The expectation for the random number to fall in the larger
intervals of one of the frequent words (the, of, a) is much higher than in the smaller interval of one of the rare words (polyphonic).
10/27/2023 NLP Fantahun B (PhD) 53
Sampling Sentences from a Language Model
 Shannon visualization technique for bigrams
 Choose a random bigram (<s>, w) according to its probability
 Now choose a random bigram (w, x) according to its probability, and so on
until we choose </s>
 Then string the words together

<s> I
I want
want to
to eat
eat Chinese
Chinese food
food </s>
I want to eat Chinese food
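
A minimal Python sketch of this sampling procedure; bigram_probs is assumed to map each context word to a dictionary of next-word probabilities produced by a trained model, and the toy table below exists only to make the sketch runnable:

import random

def sample_sentence(bigram_probs, max_len=30):
    """Generate a sentence by repeatedly sampling the next word given the previous one."""
    word, sentence = "<s>", []
    while word != "</s>" and len(sentence) < max_len:
        next_words = list(bigram_probs[word].keys())
        weights = list(bigram_probs[word].values())
        word = random.choices(next_words, weights=weights)[0]
        if word != "</s>":
            sentence.append(word)
    return " ".join(sentence)

toy = {
    "<s>": {"i": 1.0},
    "i": {"want": 1.0},
    "want": {"to": 0.5, "chinese": 0.5},
    "to": {"eat": 1.0},
    "eat": {"chinese": 1.0},
    "chinese": {"food": 1.0},
    "food": {"</s>": 1.0},
}
print(sample_sentence(toy))  # e.g. "i want to eat chinese food"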
10/27/2023 NLP Fantahun B (PhD) 54
Generalization and Zeros

10/27/2023 NLP Fantahun B (PhD) 55


Generalization and Zeros
 The n-gram model, like many statistical models, is dependent on
the training corpus.
 Implications:
 The probabilities often encode specific facts about a given training
corpus.
 n-grams do a better and better job of modeling the training corpus as
we increase the value of N.
 We can use the sampling method from the prior section to visualize
both of these facts!
 To give an intuition for the increasing power of higher-order n-
grams, Fig. 3.4 shows random sentences generated from unigram,
bigram, trigram, and 4-gram models trained on Shakespeare’s
works.

10/27/2023 NLP Fantahun B (PhD) 56


Generalization and Zeros
Approximating Shakespeare

Figure 3.4 Eight sentences randomly generated from four n-grams computed from Shakespeare’s works. All
characters were mapped to lower-case and punctuation marks were treated as words. Output is hand-
corrected for capitalization to improve readability.
10/27/2023 NLP Fantahun B (PhD) 57
Generalization and Zeros
Approximating Shakespeare

 N=884,647 tokens, V=29,066


 Shakespeare produced 300,000 bigram types out of V^2 = 844 million
possible bigrams.
 So 99.96% of the possible bigrams were never seen (have zero
entries in the table)

 Quadrigrams worse: What's coming out looks like Shakespeare


because it is Shakespeare

10/27/2023 NLP Fantahun B (PhD) 58


Generalization and Zeros
Approximating the WSJ

 To get an idea of the dependence of a grammar on its training set,


let’s look at an n-gram grammar trained on a completely different
corpus: the Wall Street Journal (WSJ) newspaper.
 Shakespeare and the Wall Street Journal are both English, so we
might expect some overlap between our n-grams for the two
genres.
 Fig. 3.5 shows sentences generated by unigram, bigram, and
trigram grammars trained on 40 million words from WSJ.

10/27/2023 NLP Fantahun B (PhD) 59


N-gram Language Models: Generalization and zeros
Approximating the WSJ

Figure 3.5 Three sentences randomly generated from three n-gram models computed from 40 million
words of the Wall Street Journal, lower-casing all characters and treating punctuation as words. Output
was then hand-corrected for capitalization to improve readability.

10/27/2023 NLP Fantahun B (PhD) 60


Generalization and Zeros
Approximating the WSJ
 Compare these examples to the pseudo-Shakespeare in Fig. 3.4.
 While they both model “English-like sentences”, there is clearly no
overlap in generated sentences, and little overlap even in small
phrases.
 Conclusion:
 Statistical models are likely to be pretty useless as predictors if the
training sets and the test sets are as different as Shakespeare and WSJ.
 Question:
 How should we deal with this problem (overfitting) when we build n-
gram models?

10/27/2023 NLP Fantahun B (PhD) 61


Generalization and Zeros: The perils of overfitting
 N-grams only work well for word prediction if the test corpus looks
like the training corpus.
 In real life, it often doesn’t
 We need to train robust models that generalize well!
1) One step is to be sure to use a training corpus that has a similar
genre to whatever task we are trying to accomplish. Examples:
 To build a language model for translating legal documents, we need
a training corpus of legal documents.
 To build a language model for a question-answering system, we need
a training corpus of questions.

10/27/2023 NLP Fantahun B (PhD) 62


Generalization and Zeros: The perils of overfitting
2) It is equally important to get training data in the appropriate
dialect or variety, especially when processing social media posts
or spoken transcripts.
 For example some tweets will use features of African American
Language (AAL)— the name for the many variations of language
used in African American communities (King, 2020).
 Such features include words like finna—an auxiliary verb that marks
immediate future tense —that don’t occur in other varieties, or
spellings like den for then, in tweets like this one (Blodgett and
O’Connor, 2017):
Bored af den my phone finna die!!!

10/27/2023 NLP Fantahun B (PhD) 63


Generalization and Zeros: The perils of overfitting
2) … get training data in the appropriate dialect or variety….
 Tweets from varieties like Nigerian English have markedly different
vocabulary and n-gram patterns from American English (Jurgens et
al., 2017):
 Eg.
@username R u a wizard or wat gan sef: in d mornin - u tweet, afternoon
– u tweet, nyt gan u dey tweet. beta get ur IT placement wiv twitter

10/27/2023 NLP Fantahun B (PhD) 64


Generalization and Zeros: Zeros
3) Matching genres and dialects is still not sufficient. Our models may
still be subject to the problem of sparsity.
 For any n-gram that occurred a sufficient number of times, we might
have a good estimate of its probability.
 But because any corpus is limited, some perfectly acceptable English
word sequences are bound to be missing from it.
 i.e., we’ll have many cases of putative “zero probability n-grams” that
should really have some non-zero probability.
 Eg.
• Consider the words that follow the bigram “denied the” in the WSJ
Treebank3 corpus, together with their counts (next slide):

10/27/2023 NLP Fantahun B (PhD) 65


Generalization and Zeros: Zeros
 Consider the words that follow the bigram denied the in the WSJ
Treebank3 corpus, together with their counts:
o denied the allegations: 5
o denied the speculation: 2
o denied the rumors: 1
o denied the report: 1
 But suppose our test set has phrases like:
o denied the offer
o denied the loan
 Our model will incorrectly estimate that the P(offer|denied the) is 0!
 And hence we cannot compute perplexity (can’t divide by 0)!
 In order to evaluate our language models, we need to modify the MLE
method to assign some non-zero probability to any N-gram, even one
that was never observed in training.
10/27/2023 NLP Fantahun B (PhD) 66
Generalization and Zeros: Zeros – Smoothing
 The intuition of smoothing (from Dan Klein)
 When we have sparse statistics:
P(w | denied the):
3 allegations
2 reports
1 claims
1 request
7 total
 Steal probability mass to generalize better:
P(w | denied the):
2.5 allegations
1.5 reports
0.5 claims
0.5 request
2 other
7 total
10/27/2023 NLP Fantahun B (PhD) 67
Generalization and Zeros: Zeros – Smoothing
 We use the term smoothing for such modifications that address the
poor estimates that are due to variability in small data sets.
 The name comes from the fact that we will be shaving a little bit of
probability mass from the higher counts, and piling it instead on the
zero counts, making the distribution a little less jagged.
 One simple way to do smoothing might be just to take our matrix of
bigram counts, before we normalize them into probabilities, and
add one to all the counts.
 This algorithm is called Laplace smoothing, or Laplace’s Law
(Lidstone, 1920; Johnson, 1932; Jeffreys, 1948).

10/27/2023 NLP Fantahun B (PhD) 68


Generalization and Zeros: Zeros – Smoothing
Laplace smoothing (Add-one estimation)
 Pretend we saw each word one more time than we did
 Just add one to all the counts!

 MLE estimate: PMLE(wi | wi-1) = c(wi-1, wi) / c(wi-1)

 Add-1 estimate: PAdd-1(wi | wi-1) = (c(wi-1, wi) + 1) / (c(wi-1) + V)
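
A minimal Python sketch of the two estimates side by side, assuming the counts are stored in Counter objects (so unseen bigrams simply count as zero):

from collections import Counter

def p_mle(w, prev, bigram_counts: Counter, unigram_counts: Counter):
    """Unsmoothed MLE estimate: zero for any unseen bigram."""
    return bigram_counts[(prev, w)] / unigram_counts[prev]

def p_add1(w, prev, bigram_counts: Counter, unigram_counts: Counter, V: int):
    """Laplace (add-one) estimate: every bigram count is incremented by 1,
    so the denominator grows by V to keep the probabilities summing to 1."""
    return (bigram_counts[(prev, w)] + 1) / (unigram_counts[prev] + V)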

10/27/2023 NLP Fantahun B (PhD) 69


Generalization and Zeros: Zeros – Smoothing
Laplace smoothing
 Berkeley Restaurant Corpus:
Laplace smoothed bigram counts

10/27/2023 NLP Fantahun B (PhD) 70


Generalization and Zeros: Zeros – Smoothing
Laplace smoothing
 Laplace smoothed bigram probabilities

10/27/2023 NLP Fantahun B (PhD) 71


Generalization and Zeros: Zeros – Smoothing
 It is often convenient to reconstruct the count matrix so we can
see how much a smoothing algorithm has changed the
original counts.

10/27/2023 NLP Fantahun B (PhD) 72


Generalization and Zeros: Zeros – Smoothing
 Note that add-one smoothing has made a very big change to
the counts.
 C(want to) changed from 608 to 238!
 We can see this in probability space as well:
P(to|want) decreases from .66 in the unsmoothed case to .26 in the
smoothed case.

10/27/2023 NLP Fantahun B (PhD) 73


Generalization and Zeros: Zeros – Smoothing
 Compare with raw bigram counts

10/27/2023 NLP Fantahun B (PhD) 74


Generalization and Zeros: Zeros – Smoothing
 Add-1 estimation is a blunt instrument

 So add-1 isn’t used for N-grams:


 Other better methods can be used
 But add-1 is used to smooth other NLP models
 For text classification
 In domains where the number of zeros isn’t so huge.

10/27/2023 NLP Fantahun B (PhD) 75


Generalization and Zeros: Zeros – Backoff and Interpolation
 Sometimes it helps to use less context
 Condition on less context for contexts you haven’t learned much
about
 Backoff:
 use trigram if you have good evidence,
 otherwise bigram, otherwise unigram
 Interpolation:
 mix unigram, bigram, trigram

 Interpolation works better

10/27/2023 NLP Fantahun B (PhD) 76


Generalization and Zeros: Zeros – Backoff and Interpolation
Linear Interpolation
 Simple interpolation:
Pinterp(wn | wn-2 wn-1) = λ1 P(wn) + λ2 P(wn | wn-1) + λ3 P(wn | wn-2 wn-1),
with the λs summing to 1.

 Lambdas conditional on context:
Pinterp(wn | wn-2 wn-1) = λ1(wn-2 wn-1) P(wn) + λ2(wn-2 wn-1) P(wn | wn-1)
+ λ3(wn-2 wn-1) P(wn | wn-2 wn-1)
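
A possible Python sketch of simple linear interpolation with fixed lambdas; p_uni, p_bi, and p_tri are assumed to be previously trained component models, and the lambda values shown are placeholders:

def p_interp(w, w1, w2, p_uni, p_bi, p_tri, lambdas=(0.1, 0.3, 0.6)):
    """Interpolate unigram, bigram, and trigram estimates of P(w | w1 w2).

    p_uni, p_bi, p_tri are callables returning the component probabilities;
    the lambdas are fixed here and must sum to 1.
    """
    l1, l2, l3 = lambdas
    return l1 * p_uni(w) + l2 * p_bi(w, w2) + l3 * p_tri(w, w1, w2)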

10/27/2023 NLP Fantahun B (PhD) 77


Generalization and Zeros: Zeros – Backoff and Interpolation
Linear Interpolation: How to set the lambdas?
 Use a held-out corpus

Training Data | Held-Out Data | Test Data

 Choose λs to maximize the probability of held-out data:


 Fix the N-gram probabilities (on the training data)
 Then search for λs that give largest probability to held-out set:

log P(w1 … wn | M(λ1 … λk)) = Σi log PM(λ1…λk)(wi | wi-1)

10/27/2023 NLP Fantahun B (PhD) 78


Generalization and Zeros: Unknown words
4) The previous section discussed the problem of words whose
bigram probability is zero.
But what about words we simply have never seen before?
 Sometimes we have a language task in which this can’t happen
because we know all the words that can occur. In such a closed
vocabulary system the test set can only contain words from this
lexicon, and there will be no unknown words.
 This is a reasonable assumption in some domains, such as speech
recognition or machine translation, where we have a pronunciation
dictionary or a phrase table that are fixed in advance, and so the
language model can only use the words in that dictionary or phrase
table.

10/27/2023 NLP Fantahun B (PhD) 79


Generalization and Zeros: Unknown words
 If we know all the words in advance,
 Vocabulary V is fixed
 Closed vocabulary task
 Often we don’t know this
 Out Of Vocabulary = OOV words
 Open vocabulary task
 Instead: create an unknown word token <UNK>
 Training of <UNK> probabilities
o Create a fixed lexicon L of size V
o At text normalization phase, any training word not in L changed to <UNK>
o Now we train its probabilities like a normal word
 At decoding time
o If text input: Use UNK probabilities for any word not in training
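
A minimal Python sketch of this normalization step, with build_lexicon and normalize as illustrative helper names:

from collections import Counter

def build_lexicon(training_tokens, max_size):
    """Keep the max_size most frequent training words as the fixed lexicon L."""
    counts = Counter(training_tokens)
    return {w for w, _ in counts.most_common(max_size)}

def normalize(tokens, lexicon):
    """Replace any token outside the lexicon with the <UNK> symbol."""
    return [w if w in lexicon else "<UNK>" for w in tokens]

# Usage: normalize the training data, train the n-gram model as usual,
# then apply the same mapping to test input before scoring it.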
10/27/2023 NLP Fantahun B (PhD) 80
N-gram Language Models
Summary
 Language models assign a probability that a sentence is a legal
string in a language.
 They are useful as a component of many NLP systems, such as ASR,
OCR, and MT.
 Simple N-gram models are easy to train on unsupervised corpora
and can provide useful estimates of sentence likelihood.
 MLE gives inaccurate parameters for models trained on sparse
data.
 Smoothing techniques adjust parameter estimates to account for
unseen (but not impossible) events.

10/27/2023 NLP Fantahun B (PhD) 81


N-gram Language Models
Bibliography
D. Jurafsky and J. Martin. Speech and Language Processing: An Introduction to Natural
Language Processing, Computational Linguistics, and Speech Recognition (2nd edition).
D. Jurafsky and J. Martin. Speech and Language Processing: An Introduction to Natural
Language Processing, Computational Linguistics, and Speech Recognition (3rd edition draft).
R. J. Mooney. Natural Language Processing slides, University of Texas at Austin.
C. Manning and H. Schütze. Foundations of Statistical Natural Language Processing.

10/27/2023 NLP Fantahun B (PhD) 82
