
Natural Language Processing

AC3110E

1
Chapter 6: N-gram Language Models

Lecturer: Dr. DO Thi Ngoc Diep


SCHOOL OF ELECTRICAL AND ELECTRONIC ENGINEERING
HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY
Approaches of NLP

• Symbolic NLP (1950s – early 1990s)
• collections of complex, hand-written rules
• the computer emulates NLP tasks by applying those rules to the data it is given
• hand-coding of a set of rules for manipulating symbols, coupled with a dictionary lookup
• Statistical (probabilistic) NLP (1990s–2010s)
• introduction of machine learning algorithms for language processing
• enabled by the increase in computational power and the enormous amount of data available
• Neural NLP (2010s–present)
• deep neural network-style machine learning methods (featuring many hidden layers)
• word embedding algorithms: trained on a large text corpus, they produce a vector space in which word embeddings capture semantic properties of words
• achieve state-of-the-art results in many natural language tasks

4
Introduction

• Which is correct?
• Output of an ASR system
• “Their are two terms” vs. “There are two terms”?
• “Here you are!” vs. “Hear you are!”?
• Output of an MT system
• “Strong wind tonight” vs. “Big wind tonight”?
• Output of a handwriting recognition system

• HOW to assign a probability to a sentence?

• HOW to predict upcoming words?

5
Introduction

• Predict how frequently a phrase occurs within natural use of a language
• Useful for NLP tasks such as spelling correction, grammatical error correction, etc.

6
Probabilistic Language Model

• Language Models: assign probabilities to sequences of words


• P(S) = P(w1, w2, w3, ..., wn)
• To predict the next word
• Conditional probability: P(wn|w1, w2, w3, ..., wn-1)
• Target
• In recognition: P(“recognize speech”) ? P(“wreck a nice beach”)
• P(“recognize speech”) > P(“wreck a nice beach”)
• In text generation: P(“three houses”) ? P(“three house”)
• P(“three houses”) > P(“three house”)
• In spelling correction: P(“my cat eats fish”) ? P(“my xat eats fish”)
• P(“my cat eats fish”) > P(“my xat eats fish”)
• In machine translation: P(“strong wind”) ? P(“big wind”)
• P(“strong wind”) > P(“big wind”)
• etc.

• => Needs to be learned from a high-quality training corpus
• one containing perfectly acceptable word sequences/sentences of the language

7
Probabilistic Language Model

• How to calculate the probabilities from a training corpus


(" ")
• P(“we wanted to know”) = ? = ( )

(" ")
• P(“know”|“we wanted to”) = ? =
(" ")

+ Needs a large “enough” corpus
+ This is only an “estimation”: we never see enough data

• Chain rule of probability:

P(S) = P(w1, w2, w3, ..., wn) = P(w1:n)
     = P(w1) P(w2|w1) P(w3|w1:2) ... P(wn|w1:n-1)
     = ∏k=1..n P(wk|w1:k-1)

=> How to model P(wk|w1:k-1) ?
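For instance, applying the chain rule to the example phrase above gives:

```latex
P(\text{we wanted to know}) =
  P(\text{we}) \cdot P(\text{wanted} \mid \text{we}) \cdot
  P(\text{to} \mid \text{we wanted}) \cdot P(\text{know} \mid \text{we wanted to})
```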

8
6.1 n-gram Language models

• The simplest model: n-gram LM


• n-gram: a sequence of n words
• 2-gram (bigram), 3-gram (trigram), etc.

"The big brown fox jumped over the fence"

Unigrams: "The", "big", "brown", "fox", "jumped", "over", "the", "fence"

Bigrams: "The big", "big brown", "brown fox", "fox jumped", "jumped over", "over the", "the fence"
Trigrams: "The big brown", "big brown fox", "brown fox jumped", "fox jumped over", "jumped over the", "over the fence"

6-grams (hexagrams): "The big brown fox jumped over", "big brown fox jumped over the", "brown fox jumped over the fence"

• Longer n-grams suffer from low density
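A minimal sketch of extracting these n-grams from a whitespace-tokenized sentence (illustrative only):

```python
def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) of a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "The big brown fox jumped over the fence".split()
print(ngrams(sentence, 2))  # bigrams: ('The', 'big'), ('big', 'brown'), ...
print(ngrams(sentence, 3))  # trigrams: ('The', 'big', 'brown'), ...
```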

9
6.1 n-gram Language models

• The simplest model: n-gram LM


• n-gram: a sequence of n words
• 2-gram (bigram), 3-gram (trigram), etc.
• Longer n-grams suffer from low density
• Markov assumption:
• Approximation: the probability of a word depends only on a few previous words
• The n-gram model looks only n−1 words into the past (see the formula sketch below)

• P(“know”|“we wanted to”)


• unigram: = P(“know”)
• bigram: = P(“know”|“to”)
• trigram: = P(“know”|“wanted to”)
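In standard notation, the Markov approximation used by an N-gram model (N = 2 gives the bigram case) is:

```latex
P(w_n \mid w_{1:n-1}) \approx P(w_n \mid w_{n-N+1:n-1})
```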

10
6.1 n-gram Language models

• Estimate the n-gram probabilities


• Maximum likelihood estimation (MLE)
• Unigram probability: P(wi) = C(wi) / N, where N is the total number of word tokens

• Bigram probability: P(wn | wn-1) = C(wn-1 wn) / C(wn-1)

• Example: Bigram probabilities for the corpus

<s> I am Sam </s>
<s> Sam I am </s>
<s> I do not like green eggs and ham </s>

P(I|<s>) = ?   P(Sam|<s>) = ?   P(am|I) = ?
P(</s>|Sam) = ?   P(Sam|am) = ?   P(do|I) = ?
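A minimal sketch that computes the MLE bigram estimates asked for above by counting over the three-sentence corpus (the markers <s> and </s> are treated as ordinary tokens):

```python
from collections import Counter

corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))

def p_mle(word, prev):
    """MLE bigram probability P(word | prev) = C(prev word) / C(prev)."""
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(p_mle("I", "<s>"))     # 2/3
print(p_mle("Sam", "<s>"))   # 1/3
print(p_mle("am", "I"))      # 2/3
print(p_mle("</s>", "Sam"))  # 1/2
print(p_mle("Sam", "am"))    # 1/2
print(p_mle("do", "I"))      # 1/3
```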

11
6.1 n-gram Language models

• Estimate the n-gram probabilities


• Maximum likelihood estimation (MLE)
• n-gram LM from relative frequency
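The standard relative-frequency (MLE) estimate for an N-gram probability:

```latex
P(w_n \mid w_{n-N+1:n-1}) = \frac{C(w_{n-N+1:n-1}\,w_n)}{C(w_{n-N+1:n-1})}
```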

12
6.1 n-gram Language models

• Chain rule of probability: P(S) = P(w1, w2, w3, ..., wn) = P(w1:n) = ∏k P(wk|w1:k-1)

• Log probabilities:
• Avoid underflow
• Faster
p1 × p2 × p3 × p4 = exp(log p1 +log p2 +log p3 +log p4)

• In practice it is more common to use trigram models, or 4-gram or even 5-gram models, when there is sufficient training data
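A minimal sketch of scoring a sentence in log space with a bigram model; it assumes the p_mle function from the earlier counting sketch and ignores smoothing:

```python
import math

def log_prob(tokens, p_bigram):
    """Sum of log bigram probabilities; avoids underflow from long products."""
    return sum(math.log(p_bigram(w, prev)) for prev, w in zip(tokens, tokens[1:]))

tokens = "<s> I am Sam </s>".split()
lp = log_prob(tokens, p_mle)   # assumes p_mle from the earlier sketch
print(lp, math.exp(lp))        # exp() recovers the (small) probability
```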

13
Building n-gram Language models

• Preprocess the text:


• word normalization, sentence segmentation, etc.
• depends on the task
• Build the model’s vocabulary
• All the words recognized by the language model
• All other words are considered “unknown”: <UNK>, <OOV>
• Include words whose frequency is greater than a certain threshold
• Add <UNK>, [<s>], </s>
• Types: the number of distinct words in a corpus
• Tokens: the total number of running words

<s> I am Sam </s>


<s> Sam I am </s>
<s> I do not like green eggs and ham </s>
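A small sketch counting types and tokens for this corpus (here <s> and </s> are counted as tokens; whether to include them is a modeling choice):

```python
corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

tokens = [w for sentence in corpus for w in sentence.split()]
types = set(tokens)
print(len(tokens))  # 20 tokens (running words, including <s> and </s>)
print(len(types))   # 12 types (distinct words)
```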

14
Building n-gram Language models

• Example
• Berkeley Restaurant Project
• Unigram counts (count table omitted)

• Bigram counts C(wn-1, wn) (count table omitted)

• Bigram probabilities P(wn | wn-1) = ?

https://github.com/wooters/berp-trans
Jurafsky, D., et al. 1994. The Berkeley restaurant project. ICSLP.
15
Building n-gram Language models

• Example
• Berkeley Restaurant Project
• Estimation of the bigram probability of the statement “<s> I want chinese food </s>”
• Given P(I | <s>) = 0.25, P(</s> | food) = 0.68

P(<s> I want chinese food </s>)


= P(I | <s>) × P(want | I) × P(chinese | want) × P(food | chinese) × P(</s> | food)
= 0.25 × 0.33 × 0.0065 × 0.52 × 0.68
= 0.000189618


16
6.2. Evaluating Language Models

• How good is a model?


• How much does the application it is embedded in improve?
• How does the model perform on unseen data?
• Does the language model prefer good sentences over bad ones?
• Does it assign higher probability to “grammatical” or “frequently observed” sentences than to “ungrammatical” or “rarely observed” utterances?
• Does it accurately predict unseen data?
• Does it assign a higher probability to unseen data?

17
6.2. Evaluating Language Models

• Extrinsic evaluation - end-to-end evaluation


• A good way to compare models A and B
• Implement each model in a task
• e.g. spelling correction, speech recognition, machine translation
• Execute the task and measure the accuracy of A and of B
• How many misspelled words were corrected correctly?
• How many words were translated correctly?
• Compare the accuracy of A and B
• Unfortunately:
• a very time-consuming assessment

18
6.2. Evaluating Language Models

• Intrinsic evaluation
• Using a measure: perplexity
• Measures the quality of a model independent of any application
• A training set
• for training language model parameters
• A development set
• for tuning the model parameters to fit the characteristics of the development set
• A test set
• data truly not seen during model training
• Perplexity indicates how well the model fits the test set
• i.e., whether it assigns a higher probability to the test data

19
6.2. Evaluating Language Models

• Perplexity Intuition
• A better model is one that assigns a higher probability to the word that actually occurs
• The perplexity (PP) of a language model on a test set W = w1, w2, w3, ..., wn:
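In standard form, with n the number of words in the test set:

```latex
PP(W) = P(w_1 w_2 \ldots w_n)^{-\frac{1}{n}}
      = \sqrt[n]{\frac{1}{P(w_1 w_2 \ldots w_n)}}
```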

The higher the conditional probability of the word sequence, the lower the perplexity => the model predicts the test data better

• Perplexity is a normalized version of the probability of the test set
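A minimal sketch of computing perplexity directly from bigram probabilities in log space (again assuming the p_mle function from the earlier sketch; real evaluation would use a smoothed model so that no probability is zero):

```python
import math

def perplexity(tokens, p_bigram):
    """Perplexity of a token sequence under a bigram model."""
    n_predictions = len(tokens) - 1   # one prediction per bigram
    lp = sum(math.log(p_bigram(w, prev)) for prev, w in zip(tokens, tokens[1:]))
    return math.exp(-lp / n_predictions)

tokens = "<s> I am Sam </s>".split()
print(perplexity(tokens, p_mle))  # assumes p_mle from the earlier MLE sketch
```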

20
6.2. Evaluating Language Models

• Goodman (2001) A bit of progress in language modelling


• Vocabulary size 50,000
• Trigram model: perplexity = 74
• Bigram model: perplexity = 137
• Unigram model: perplexity = 995
• Jurafsky and Martin (2009),
• Wall Street Journal
• 38 million words, Vocabulary size 19,979
• Trigram model: perplexity = 109
• Bigram model: perplexity = 170
• Unigram model: perplexity = 962
• The perplexity of two language models is only comparable if they use
identical vocabularies
• The perplexity is typically between 40 and 400

21
6.2. Evaluating Language Models

• Perplexity
• A measure independent of any specific application
• But it is difficult to know whether a gain in perplexity translates into a real gain in practice
• A poor approximation unless the test data resembles the training data
• Generally useful in pilot experiments, before moving on to extrinsic evaluation

22
6.3. Smoothing algorithms

• Training data
• the n-gram model is dependent on the training corpus
• be sure to use a training corpus with a genre/domain similar to the task
• Problem of sparsity
• Unknown words
• out-of-vocabulary (OOV) words can appear in an open-vocabulary system
• convert OOV words to the <UNK> token (requires a vocabulary fixed in advance)
• or replace words in the training data by <UNK> based on their frequency (< threshold)
• some perfectly acceptable word sequences may be missing from the training corpus → “zero-probability n-grams”
• but they could appear in the test set!
• For example, in the works of Shakespeare:
• N = 884,647 tokens, V = 29,066 => number of possible bigrams = V² ≈ 844 million
• Shakespeare produced about 300,000 different bigrams
• Thus, 99.96% of the possible bigrams have never been seen

23
6.3. Smoothing algorithms

• Problem:
• Some of these zeros are really zeros... but some of them are only rare events

What if “students opened their ___” never occurred in the data? → zero-frequency n-grams?
→ Smoothing algorithms

What if “students opened their” never occurred in the data? → division by zero?
→ Backoff solutions

24
Backoff solutions

• Backoff
• “back off” to a lower-order n-gram when there is zero evidence for the higher-order n-gram
P(wn|wn−2 wn−1) → P(wn|wn−1) → P(wn) → ...
• Katz backoff
• Good-Turing backoff
• Interpolation
• weight and combine the trigram, bigram, unigram, etc. probabilities, with weights that sum to 1 (see the sketch below)
• Simple interpolation
• Conditional interpolation
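Simple linear interpolation in standard notation (conditional interpolation additionally makes the λ's depend on the context); the λ weights sum to 1:

```latex
\hat{P}(w_n \mid w_{n-2} w_{n-1}) =
  \lambda_1 P(w_n \mid w_{n-2} w_{n-1}) +
  \lambda_2 P(w_n \mid w_{n-1}) +
  \lambda_3 P(w_n),
\qquad \sum_i \lambda_i = 1
```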

25
Smoothing algorithms

• Smoothing (discounting)
• Add probability mass to unseen events by removing probability mass from previously seen events, so that the distribution still sums to 1

• Type of smoothing
• Laplace (add-one) smoothing
• Add-k smoothing
• Kneser-Ney smoothing
• Stupid backoff
• etc.

26
Laplace Smoothing

• Add 1 to all the n-gram counts, before normalizing into probabilities


• For unigram probabilities: P(wi) = (C(wi) + 1) / (N + V), where N = number of tokens, V = vocabulary size

• For bigram probabilities: P(wn | wn-1) = (C(wn-1 wn) + 1) / (C(wn-1) + V)

• etc.
• Makes a very big change to the counts!
• too much probability mass is moved to all the zeros
• tends to reassign too much mass to events not seen in training
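A minimal sketch of the add-one estimate, reusing the unigram_counts and bigram_counts from the earlier MLE sketch (V is the number of word types):

```python
def p_laplace(word, prev, unigram_counts, bigram_counts):
    """Add-one (Laplace) smoothed bigram probability P(word | prev)."""
    V = len(unigram_counts)  # vocabulary size (number of types)
    return (bigram_counts[(prev, word)] + 1) / (unigram_counts[prev] + V)

# An unseen bigram now gets a small non-zero probability, e.g. P(ham | I):
print(p_laplace("ham", "I", unigram_counts, bigram_counts))  # (0 + 1) / (3 + 12)
```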

27
Add-k smoothing

• Instead of adding 1 to each count, add a fractional count k (0.5? 0.05? 0.01?)

• How to choose k? Is it an appropriate discount? (see the add-k formula below)
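The standard add-k estimate for bigrams (a direct generalization of the Laplace formula above):

```latex
P_{\text{add-}k}(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n) + k}{C(w_{n-1}) + kV}
```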

28
Kneser-Ney smoothing

• One of the most commonly used and best-performing n-gram smoothing methods
• Takes into account the number of different contexts a word w has appeared in (its continuation count)
• Hypothesis: words that have appeared in more contexts in the past are more likely to appear in some new context as well

I can’t see without my reading ________

• “Angeles” (as in “Los Angeles”) and “glasses” are both very frequent words
• But “Angeles” occurs in essentially only one context (after “Los”) → it gets a low continuation probability
• The word “glasses” appears in many different contexts → a better choice (see the counting sketch below)
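A small sketch of the continuation-count idea that Kneser-Ney builds on (counting only; not the full discounted formula):

```python
from collections import defaultdict

def continuation_counts(bigrams):
    """For each word, count the number of distinct words that precede it."""
    preceding = defaultdict(set)
    for prev, word in bigrams:
        preceding[word].add(prev)
    return {word: len(prevs) for word, prevs in preceding.items()}

# Toy illustration: "Angeles" follows only "Los"; "glasses" follows many words
bigrams = [("Los", "Angeles"), ("Los", "Angeles"),
           ("reading", "glasses"), ("my", "glasses"), ("new", "glasses")]
counts = continuation_counts(bigrams)
print(counts["Angeles"])  # 1  -> low continuation probability
print(counts["glasses"])  # 3  -> higher continuation probability
```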

29
• Google Books Ngram Viewer
• https://books.google.com/ngrams/

30
Why should we care about Language Modeling?

• Language Modeling is a subcomponent of many NLP tasks


• Especially those involving generating text or estimating the probability of
text:
• Predictive typing
• Speech recognition
• Handwriting recognition
• Spelling/grammar correction
• Machine translation
• Summarization
• Dialogue
• etc.
• Everything else in NLP has now been rebuilt upon Language Modeling:
• GPT-3 is an LM!

31
Language Modeling Toolkits

• SRILM
• http://www.speech.sri.com/projects/srilm/
• IRSTLM
• https://sourceforge.net/projects/irstlm/
• KenLM
• https://kheafield.com/code/kenlm/

32
• end of Chapter 6

33
