
UNIT 5: LANGUAGE MODELING

1. Introduction to Language Modeling

A Language Model (LM) in NLP is a probabilistic model that estimates the likelihood of a sequence of words. It predicts the next word in a sentence from the context provided by the preceding words.

Applications:

- Predictive text input

- Speech recognition

- Spelling correction

- Machine translation

- Chatbots

Example: "I love reading history..." -> next word: "books"

2. N-Gram Models

N-gram = sequence of N words.

- Unigram: "I", "love", "reading"

- Bigram: "I love", "love reading"

- Trigram: "I love reading"
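A minimal Python sketch of extracting n-grams (the whitespace tokenizer and example sentence are illustrative assumptions, not a fixed method):

    def ngrams(tokens, n):
        # Slide a window of size n over the token list.
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    tokens = "I love reading".split()
    print(ngrams(tokens, 1))  # [('I',), ('love',), ('reading',)]
    print(ngrams(tokens, 2))  # [('I', 'love'), ('love', 'reading')]
    print(ngrams(tokens, 3))  # [('I', 'love', 'reading')]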

Formula using Chain Rule:

P(W) = P(w1) * P(w2|w1) * P(w3|w1, w2) * ... * P(wn|w1, ..., wn-1)

Approximation (Markov assumption): P(wn|w1, ..., wn-1) ≈ P(wn|wn-k, ..., wn-1)
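Worked example for the sentence "I love reading":

P(I love reading) = P(I) * P(love|I) * P(reading|I, love)

With a bigram approximation (k = 1), the last factor shortens to P(reading|love).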

3. Language Model Evaluation

i. Coverage Rate: Percentage of n-grams in the test data that were seen in training.

ii. Perplexity: Measures the model's predictive power (lower is better).

Perplexity = 2^H, where H is the cross-entropy on the test data, or equivalently PP(W) = (1/P(w1 ... wt))^(1/t) for a t-word test set
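A short sketch of computing perplexity from per-word probabilities (the probability values are assumed for illustration, not from a real model):

    import math

    # Hypothetical per-word probabilities a model assigns to a
    # 3-word test sentence (assumed values for illustration).
    word_probs = [0.2, 0.1, 0.05]

    # PP(W) = P(w1 ... wt)^(-1/t), computed in log space for stability.
    t = len(word_probs)
    log_prob = sum(math.log2(p) for p in word_probs)
    perplexity = 2 ** (-log_prob / t)
    print(round(perplexity, 2))  # 10.0

A lower perplexity means the model is less "surprised" by the test data.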

4. Parameter Estimation

i. MLE: P(wi | wi-2, wi-1) = count(wi-2, wi-1, wi) / count(wi-2, wi-1)

ii. Smoothing: Assigns small probabilities to unseen n-grams.

Backoff: Uses lower-order n-grams when data is sparse.
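A minimal sketch of MLE and add-one (Laplace) smoothing, one common smoothing technique; it uses bigrams for brevity rather than the trigram formula above, and the toy corpus is an assumption for illustration:

    from collections import Counter

    corpus = "I love reading I love books".split()
    unigrams = Counter(corpus)
    bigrams = Counter(zip(corpus, corpus[1:]))
    V = len(unigrams)  # vocabulary size

    def p_mle(w, prev):
        # MLE: count(prev, w) / count(prev); zero for unseen bigrams.
        return bigrams[(prev, w)] / unigrams[prev]

    def p_laplace(w, prev):
        # Add-one smoothing: every possible bigram gets a pseudo-count of 1.
        return (bigrams[(prev, w)] + 1) / (unigrams[prev] + V)

    print(p_mle("reading", "love"))      # 0.5
    print(p_laplace("books", "reading")) # 0.2, even though the bigram is unseen

Backoff would instead consult the lower-order estimate P(w) when count(prev, w) is zero.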

5. Language Model Adaptation

Used when applying models to new domains.

Techniques:

- Interpolation: Mix in-domain and general models (see the sketch after this list)

- Topic-based adaptation: Cluster documents into topics
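A hedged sketch of linear interpolation, P_adapted = lam * P_in + (1 - lam) * P_gen (the stand-in models and the mixing weight are assumptions for illustration):

    def p_interpolated(w, history, p_in, p_gen, lam=0.7):
        # Mix the in-domain and general models; in practice lam is
        # tuned on held-out in-domain data.
        return lam * p_in(w, history) + (1 - lam) * p_gen(w, history)

    # Toy stand-in models returning fixed probabilities:
    p_in = lambda w, h: 0.3
    p_gen = lambda w, h: 0.1
    print(p_interpolated("books", ("reading",), p_in, p_gen))  # 0.24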

6. Types of Language Models

i. Class-Based: Group words (e.g., cities, animals)

ii. Variable-Length: Use contexts of varying length rather than a fixed n

iii. Discriminative: Focus on classification tasks

iv. Topic-Based (LDA): Discover hidden topics in docs

v. Neural Network Models: Use deep learning (Word2Vec, BERT)

7. Language-Specific Modeling Problems

i. Morphologically Rich: Use morphemes instead of full words

ii. No Word Segmentation: Needed in Chinese, Japanese

iii. Spoken vs Written: Spoken-language modeling requires transcribed speech, often produced manually

8. Multilingual and Crosslingual Modeling

i. Multilingual: Handle multiple languages & code-switching


Example: "I need to tell her que no voy a poder ir."

ii. Crosslingual: Use one language's data to build a model for another

(by translating data or sharing representations, e.g., crosslingual LSA)

Conclusion:

Language modeling is essential in NLP for understanding and generating human language. It ranges from simple n-gram models to advanced neural models.
