
Natural Language Processing

AC3110E

1
Chapter 6: N-gram Language Models

Lecturer: Dr. DO Thi Ngoc Diep


SCHOOL OF ELECTRICAL AND ELECTRONIC ENGINEERING
HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY
Approaches of NLP

• Symbolic NLP (1950s – early 1990s)
• collections of complex, hand-written rules
• the computer emulates NLP tasks by applying those rules to the data it is given
• hand-coding of a set of rules for manipulating symbols, coupled with a dictionary lookup
• Statistical (probabilistic) NLP (1990s–2010s)
• introduction of machine learning algorithms for language processing
• enabled by the increase in computational power and the enormous amount of data available
• Neural NLP (2010s–present)
• deep neural network-style machine learning methods (featuring many hidden layers)
• word embedding algorithms: trained on a large text corpus, they produce a vector space in which word embeddings capture semantic properties of words
• achieve state-of-the-art results in many natural language tasks

4
Introduction

• Which is correct?
• Output of an ASR system
• “Their are two terms” vs. “There are two terms”?
• “Here you are!” vs. “Hear you are!”?
• Output of an MT system
• “Strong wind tonight” vs. “Big wind tonight”?
• Output of a handwriting recognition system

• HOW to assign a probability to a sentence?

• HOW to predict upcoming words?

5
Introduction

• Predict how frequently a phrase occurs within natural use of a language
• Useful for NLP tasks such as spelling correction, grammatical error correction, etc.

6
Probabilistic Language Model

• Language Models: assign probabilities to sequences of words


• P(S) = P(w1, w2, w3, ..., wn)
• To predict the next word
• Conditional probability: P(wn|w1, w2, w3, ..., wn-1)
• Target
• In recognition: P(“recognize speech”) ? P(“wreck a nice beach”)
• P(“recognize speech”) > P(“wreck a nice beach”)
• In text generation: P(“three houses”) ? P(“three house”)
• P(“three houses”) > P(“three house”)
• In spelling correction: P(“my cat eats fish”) ? P(“my xat eats fish”)
• P(“my cat eats fish”) > P(“my xat eats fish”)
• In machine translation: P(“strong wind”) ? P(“big wind”)
• P(“strong wind”) > P(“big wind”)
• etc.

• => Needs to be learned from a high-quality training corpus
• one containing perfectly acceptable word sequences/sentences of the language

7
Probabilistic Language Model

• How to calculate the probabilities from a training corpus


(" ")
• P(“we wanted to know”) = ? = ( )

(" ")
• P(“know”|“we wanted to”) = ? =
(" ")

+ Needs a large “enough” corpus
+ This is only an “estimation”: we never see enough data

• Chain rule of probability:

P(S) = P(w1, w2, w3, ..., wn) = P(w1:n)
     = P(w1) P(w2|w1) P(w3|w1:2) ... P(wn|w1:n-1)
     = ∏k=1..n P(wk|w1:k-1)

=> How to model P(wk|w1:k-1) ?
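For instance, applying the chain rule to the example phrase above gives:

```latex
P(\text{we wanted to know}) =
  P(\text{we}) \cdot P(\text{wanted} \mid \text{we}) \cdot
  P(\text{to} \mid \text{we wanted}) \cdot P(\text{know} \mid \text{we wanted to})
```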

8
6.1 n-gram Language models

• The simplest model: n-gram LM


• n-gram: a sequence of n words
• 2-gram (bigram), 3-gram (trigram), etc.

"The big brown fox jumped over the fence"

Unigrams: "The", "big", "brown", "fox", "jumped", "over", "the", "fence"

Bigrams: "The big", "big brown", "brown fox", "fox jumped", "jumped over", "over the", "the fence"
Trigrams: "The big brown", "big brown fox", "brown fox jumped", "fox jumped over", "jumped over the", "over the fence"

6-grams (hexagrams): "The big brown fox jumped over", "big brown fox jumped over the", "brown fox jumped over the fence"

• Longer n-grams suffer from low density
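A minimal sketch of extracting these n-grams from a whitespace-tokenized sentence (illustrative only):

```python
def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) of a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "The big brown fox jumped over the fence".split()
print(ngrams(sentence, 2))  # bigrams: ('The', 'big'), ('big', 'brown'), ...
print(ngrams(sentence, 3))  # trigrams: ('The', 'big', 'brown'), ...
```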

9
6.1 n-gram Language models

• The simplest model: n-gram LM


• n-gram: a sequence of n words
• 2-gram (bigram), 3-gram (trigram), etc.
• Longer n-grams suffer from low density
• Markov assumption:
• Approximation: the probability of a word depends only on a few previous words
• The n-gram model looks only n−1 words into the past (see the formula sketch below)

• P(“know”|“we wanted to”)


• unigram: = P(“know”)
• bigram: = P(“know”|“to”)
• trigram: = P(“know”|“wanted to”)
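In standard notation, the Markov approximation used by an N-gram model (N = 2 gives the bigram case) is:

```latex
P(w_n \mid w_{1:n-1}) \approx P(w_n \mid w_{n-N+1:n-1})
```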

10
6.1 n-gram Language models

• Estimate the n-gram probabilities


• Maximum likelihood estimation (MLE)
• Unigram probability: P(wi) = C(wi) / N, where N is the total number of word tokens

• Bigram probability: P(wn | wn-1) = C(wn-1 wn) / C(wn-1)

• Example: Bigram probabilities for the corpus

<s> I am Sam </s>
<s> Sam I am </s>
<s> I do not like green eggs and ham </s>

P(I|<s>) = ?   P(Sam|<s>) = ?   P(am|I) = ?
P(</s>|Sam) = ?   P(Sam|am) = ?   P(do|I) = ?
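A minimal sketch that computes the MLE bigram estimates asked for above by counting over the three-sentence corpus (the markers <s> and </s> are treated as ordinary tokens):

```python
from collections import Counter

corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))

def p_mle(word, prev):
    """MLE bigram probability P(word | prev) = C(prev word) / C(prev)."""
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(p_mle("I", "<s>"))     # 2/3
print(p_mle("Sam", "<s>"))   # 1/3
print(p_mle("am", "I"))      # 2/3
print(p_mle("</s>", "Sam"))  # 1/2
print(p_mle("Sam", "am"))    # 1/2
print(p_mle("do", "I"))      # 1/3
```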

11
6.1 n-gram Language models

• Estimate the n-gram probabilities


• Maximum likelihood estimation (MLE)
• n-gram LM from relative frequency
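The standard relative-frequency (MLE) estimate for an N-gram probability:

```latex
P(w_n \mid w_{n-N+1:n-1}) = \frac{C(w_{n-N+1:n-1}\,w_n)}{C(w_{n-N+1:n-1})}
```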

12
6.1 n-gram Language models

• Chain rule of probability: P(S) = P(w1, w2, w3, ..., wn) = P(w1:n) = ∏k P(wk|w1:k-1)

• Log probabilities:
• Avoid underflow
• Faster
p1 × p2 × p3 × p4 = exp(log p1 +log p2 +log p3 +log p4)

• In practice it is more common to use trigram models, or 4-gram or even 5-gram models, when there is sufficient training data
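A minimal sketch of scoring a sentence in log space with a bigram model; it assumes the p_mle function from the earlier counting sketch and ignores smoothing:

```python
import math

def log_prob(tokens, p_bigram):
    """Sum of log bigram probabilities; avoids underflow from long products."""
    return sum(math.log(p_bigram(w, prev)) for prev, w in zip(tokens, tokens[1:]))

tokens = "<s> I am Sam </s>".split()
lp = log_prob(tokens, p_mle)   # assumes p_mle from the earlier sketch
print(lp, math.exp(lp))        # exp() recovers the (small) probability
```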

13
Building n-gram Language models

• Preprocess the text:


• word normalization, sentence segmentation, etc.
• depends on the task
• Build the model’s vocabulary
• All the words recognized by the language model
• All other words are considered “unknown”: <UNK>, <OOV>
• Include words whose frequency is greater than a certain threshold
• Add <UNK>, [<s>], </s>
• Types: the number of distinct words in a corpus
• Tokens: the total number of running words

<s> I am Sam </s>


<s> Sam I am </s>
<s> I do not like green eggs and ham </s>
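A small sketch counting types and tokens for this corpus (here <s> and </s> are counted as tokens; whether to include them is a modeling choice):

```python
corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

tokens = [w for sentence in corpus for w in sentence.split()]
types = set(tokens)
print(len(tokens))  # 20 tokens (running words, including <s> and </s>)
print(len(types))   # 12 types (distinct words)
```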

14
Building n-gram Language models

• Example
• Berkeley Restaurant Project
• Unigram counts (count table omitted)

• Bigram counts C(wn-1, wn) (count table omitted)

• Bigram probabilities P(wn | wn-1) = ?

https://github.com/wooters/berp-trans
Jurafsky, D., et al. 1994. The Berkeley restaurant project. ICSLP.
15
Building n-gram Language models

• Example
• Berkeley Restaurant Project
• Estimation of the bigram probability of the statement “<s> I want chinese food </s>”
• Given P(I | <s>) = 0.25, P(</s> | food) = 0.68

P(<s> I want chinese food </s>)


= P(I | <s>) × P(want | I) × P(chinese | want) × P(food | chinese) × P(</s> | food)
= 0.25 × 0.33 × 0.0065 × 0.52 × 0.68
= 0.000189618


16
6.2. Evaluating Language Models

• How good is a model?


• How much does the application it is embedded in improve?
• How does the model perform on unseen data?
• Does the language model prefer good sentences over bad ones?
• Does it assign higher probability to “grammatical” or “frequently observed” sentences than to “ungrammatical” or “rarely observed” utterances?
• Does it accurately predict unseen data?
• Does it assign a higher probability to unseen data?

17
6.2. Evaluating Language Models

• Extrinsic evaluation - end-to-end evaluation


• A good way to compare models A and B
• Implement each model in a task
• e.g. spelling correction, speech recognition, machine translation
• Execute the task and measure the accuracy of A and of B
• How many misspelled words were corrected correctly?
• How many words were translated correctly?
• Compare the accuracy of A and B
• Unfortunately:
• a very time-consuming assessment

18
6.2. Evaluating Language Models

• Intrinsic evaluation
• Using a measure: perplexity
• Measures the quality of a model independent of any application
• A training set
• for training language model parameters
• A development set
• for tuning the model parameters to fit the characteristics of the development set
• A test set
• data truly not seen during model training
• Perplexity indicates how well the model fits the test set
• i.e., whether it assigns a higher probability to the test data

19
6.2. Evaluating Language Models

• Perplexity Intuition
• A better model is one that assigns a higher probability to the word that actually occurs
• The perplexity (PP) of a language model on a test set W = w1, w2, w3, ..., wn:
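In standard form, with n the number of words in the test set:

```latex
PP(W) = P(w_1 w_2 \ldots w_n)^{-\frac{1}{n}}
      = \sqrt[n]{\frac{1}{P(w_1 w_2 \ldots w_n)}}
```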

The higher the conditional probability of the word sequence, the lower the perplexity => the model predicts the test data better

• Perplexity is a normalized version of the probability of the test set
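A minimal sketch of computing perplexity directly from bigram probabilities in log space (again assuming the p_mle function from the earlier sketch; real evaluation would use a smoothed model so that no probability is zero):

```python
import math

def perplexity(tokens, p_bigram):
    """Perplexity of a token sequence under a bigram model."""
    n_predictions = len(tokens) - 1   # one prediction per bigram
    lp = sum(math.log(p_bigram(w, prev)) for prev, w in zip(tokens, tokens[1:]))
    return math.exp(-lp / n_predictions)

tokens = "<s> I am Sam </s>".split()
print(perplexity(tokens, p_mle))  # assumes p_mle from the earlier MLE sketch
```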

20
6.2. Evaluating Language Models

• Goodman (2001) A bit of progress in language modelling


• Vocabulary size 50,000
• Trigram model: perplexity = 74
• Bigram model: perplexity = 137
• Unigram model: perplexity = 995
• Jurafsky and Martin (2009),
• Wall Street Journal
• 38 million words, Vocabulary size 19,979
• Trigram model: perplexity = 109
• Bigram model: perplexity = 170
• Unigram model: perplexity = 962
• The perplexity of two language models is only comparable if they use
identical vocabularies
• The perplexity is typically between 40 and 400

21
6.2. Evaluating Language Models

• Perplexity
• A measure independent of any specific application
• But it is difficult to know whether a gain in perplexity translates into a real gain in practice
• A poor approximation unless the test data resembles the training data
• Generally useful in pilot experiments, before moving on to extrinsic evaluation

22
6.3. Smoothing algorithms

• Training data
• the n-gram model is dependent on the training corpus
• be sure to use a training corpus with a genre/domain similar to the task
• Problem of sparsity
• Unknown words
• out-of-vocabulary (OOV) words can appear in an open-vocabulary system
• convert OOV words to the <UNK> token (requires a vocabulary fixed in advance)
• or replace words in the training data by <UNK> based on their frequency (< threshold)
• some perfectly acceptable word sequences may be missing from the training corpus → “zero-probability n-grams”
• but they could appear in the test set!
• For example, in the works of Shakespeare:
• N = 884,647 tokens, V = 29,066 => number of possible bigrams = V² ≈ 844 million
• Shakespeare produced about 300,000 different bigrams
• Thus, 99.96% of the possible bigrams have never been seen

23
6.3. Smoothing algorithms

• Problem:
• Some of these zeros are really zeros... but some of them are only rare events

What if “students opened their ___” never occurred in the data? → zero-frequency n-grams?
→ Smoothing algorithms

What if “students opened their” never occurred in the data? → division by zero?
→ Backoff solutions

24
Backoff solutions

• Backoff
• “back off” to a lower-order n-gram when there is zero evidence for the higher-order n-gram
P(wn|wn−2 wn−1) → P(wn|wn−1) → P(wn) → ...
• Katz backoff
• Good-Turing backoff
• Interpolation
• weight and combine the trigram, bigram, unigram, etc. probabilities, with weights that sum to 1 (see the sketch below)
• Simple interpolation
• Conditional interpolation
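Simple linear interpolation in standard notation (conditional interpolation additionally makes the λ's depend on the context); the λ weights sum to 1:

```latex
\hat{P}(w_n \mid w_{n-2} w_{n-1}) =
  \lambda_1 P(w_n \mid w_{n-2} w_{n-1}) +
  \lambda_2 P(w_n \mid w_{n-1}) +
  \lambda_3 P(w_n),
\qquad \sum_i \lambda_i = 1
```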

25
Smoothing algorithms

• Smoothing (discounting)
• Add probability mass to unseen events by removing probability mass from previously seen events, so that the distribution still sums to 1

• Type of smoothing
• Laplace (add-one) smoothing
• Add-k smoothing
• Kneser-Ney smoothing
• Stupid backoff
• etc.

26
Laplace Smoothing

• Add 1 to all the n-gram counts, before normalizing into probabilities


• For unigram probabilities: P(wi) = (C(wi) + 1) / (N + V), where N = number of tokens, V = vocabulary size

• For bigram probabilities: P(wn | wn-1) = (C(wn-1 wn) + 1) / (C(wn-1) + V)

• etc.
• Makes a very big change to the counts!
• too much probability mass is moved to all the zeros
• tends to reassign too much mass to events not seen in training
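A minimal sketch of the add-one estimate, reusing the unigram_counts and bigram_counts from the earlier MLE sketch (V is the number of word types):

```python
def p_laplace(word, prev, unigram_counts, bigram_counts):
    """Add-one (Laplace) smoothed bigram probability P(word | prev)."""
    V = len(unigram_counts)  # vocabulary size (number of types)
    return (bigram_counts[(prev, word)] + 1) / (unigram_counts[prev] + V)

# An unseen bigram now gets a small non-zero probability, e.g. P(ham | I):
print(p_laplace("ham", "I", unigram_counts, bigram_counts))  # (0 + 1) / (3 + 12)
```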

27
Add-k smoothing

• Instead of adding 1 to each count, add a fractional count k (0.5? 0.05? 0.01?)

• How to choose k? Is it an appropriate discount? (see the add-k formula below)
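The standard add-k estimate for bigrams (a direct generalization of the Laplace formula above):

```latex
P_{\text{add-}k}(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n) + k}{C(w_{n-1}) + kV}
```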

28
Kneser-Ney smoothing

• One of the most commonly used and best-performing n-gram smoothing methods
• Takes into account the number of different contexts a word w has appeared in (its continuation count)
• Hypothesis: words that have appeared in more contexts in the past are more likely to appear in some new context as well

I can’t see without my reading ________

• “Angeles” (as in “Los Angeles”) and “glasses” are both very frequent words
• But “Angeles” occurs in essentially only one context (after “Los”) → it gets a low continuation probability
• The word “glasses” appears in many different contexts → a better choice (see the counting sketch below)
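A small sketch of the continuation-count idea that Kneser-Ney builds on (counting only; not the full discounted formula):

```python
from collections import defaultdict

def continuation_counts(bigrams):
    """For each word, count the number of distinct words that precede it."""
    preceding = defaultdict(set)
    for prev, word in bigrams:
        preceding[word].add(prev)
    return {word: len(prevs) for word, prevs in preceding.items()}

# Toy illustration: "Angeles" follows only "Los"; "glasses" follows many words
bigrams = [("Los", "Angeles"), ("Los", "Angeles"),
           ("reading", "glasses"), ("my", "glasses"), ("new", "glasses")]
counts = continuation_counts(bigrams)
print(counts["Angeles"])  # 1  -> low continuation probability
print(counts["glasses"])  # 3  -> higher continuation probability
```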

29
• Google Books Ngram Viewer
• https://books.google.com/ngrams/

30
Why should we care about Language Modeling?

• Language Modeling is a subcomponent of many NLP tasks


• Especially those involving generating text or estimating the probability of
text:
• Predictive typing
• Speech recognition
• Handwriting recognition
• Spelling/grammar correction
• Machine translation
• Summarization
• Dialogue
• etc.
• Everything else in NLP has now been rebuilt upon Language Modeling:
• GPT-3 is an LM!

31
Language Modeling Toolkits

• SRILM
• http://www.speech.sri.com/projects/srilm/
• IRSTLM
• https://sourceforge.net/projects/irstlm/
• KenLM
• https://kheafield.com/code/kenlm/

32
• end of Chapter 6

33
