01-Transformer Based NLP Applications
January 2025
Outline
1. Introduction to NLP
2. Word representation
3. Language modeling
4. LMs + RNNs
5. Better RNNs
[Content adapted from CS224N: Natural Language Processing with Deep Learning,
The Stanford NLP Group, Stanford]
NLP progress
● 1940–1969
● Early Explorations
● 1970–1992
● Hand-built systems, of increasing formalization
● 1993–2012
● Statistical or Probabilistic NLP
● then, more general Supervised ML
● 2013–now
● Deep Artificial Neural Networks
● Unsupervised
● Self-supervised
● Reinforcement Learning
Machine Translation (1950s)
● The origin of NLP/Computational Linguistics
Machine Translation (today)
Question-Answering
● BASEBALL (Green et al., 1961)
● handled questions about a database of baseball games and statistics
● LUNAR (Woods 1973)
● answered questions about the rocks brought back from
the moon by the Apollo program.
● ChatGPT
Main NLP applications
● Text Classification
● Sentiment Analysis
● Named Entity Recognition (NER)
● Topic Modeling
● Text Summarization
● Text Generation
● Speech Recognition (ASR / Speech-to-Text, STT)
● Text-to-Speech (TTS)
How do machines understand us?
● Machines need to understand (at least partially) the
ambiguous, messy languages that humans use.
Word meaning
● Definition: meaning (Webster dictionary)
● the idea that is represented by a word, phrase, etc.
● the idea that a person wants to express by using words,
signs, etc.
● the idea that is expressed in a work of writing, art, etc.
Word ⇔ Idea
Hi, Hello, Howdy, Greetings ⇔ Greeting
Word representation
Word representation as discrete symbols
● Traditional NLP used one-hot vectors
Sentence: “AI is the future”   (vocabulary of size 4: AI, is, the, future)
The    → [0 0 1 0]
future → [0 0 0 1]
is     → [0 1 0 0]
AI     → [1 0 0 0]
Each vector has one dimension per word in the vocabulary (the vocabulary size).
Word representation as discrete symbols
● Is “Hello” similar to “Hi” with one-hot vectors?
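A minimal numpy sketch of this limitation (the toy vocabulary and its ordering are assumptions for illustration): one-hot vectors of two different words always have a dot product of 0, so “Hi” looks no more similar to “Hello” than to any other word.

import numpy as np

# Toy vocabulary; the ordering is arbitrary and only for illustration
vocab = ["Hi", "Hello", "AI", "future"]

def one_hot(word):
    # Vector of length |vocab| with a single 1 at the word's index
    v = np.zeros(len(vocab))
    v[vocab.index(word)] = 1.0
    return v

hi, hello = one_hot("Hi"), one_hot("Hello")
print(np.dot(hi, hello))  # 0.0: distinct one-hot vectors are always orthogonal
print(np.dot(hi, hi))     # 1.0: each word is only "similar" to itself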
Word Embedding
● Represent a word by considering the contexts in
which it is found.
● Construct a dense vector for each word, chosen to be similar to the vectors of words that appear in comparable contexts.
● Similarity is assessed via the dot (scalar) product of the vectors.
Hi    = [ 0.186, 0.685, −0.209, −0.107, 0.117, −0.352]
Hello = [ 0.228, 0.815, −0.106, −0.201, 0.205, −0.288]
dot(Hi, Hello) ≈ 0.77
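The same similarity check can be reproduced with a short numpy sketch using the vectors above:

import numpy as np

hi    = np.array([0.186, 0.685, -0.209, -0.107, 0.117, -0.352])
hello = np.array([0.228, 0.815, -0.106, -0.201, 0.205, -0.288])

# Dense vectors of words used in similar contexts point in similar directions
print(round(float(np.dot(hi, hello)), 2))  # 0.77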
Word Embedding
● word2vec [Mikolov et al., 2013]
● GloVe [Pennington et al., 2014]
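A hedged sketch of querying pretrained embeddings with the gensim library; the model name below is an assumption, and any pretrained KeyedVectors model can be substituted.

import gensim.downloader as api

# Downloads a small pretrained GloVe model on first use (model name assumed for illustration)
model = api.load("glove-wiki-gigaword-50")

print(model.similarity("hello", "hi"))      # cosine similarity between the two embeddings
print(model.most_similar("hello", topn=3))  # nearest neighbours in the embedding space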
Language Modeling
Language Modeling
● The task of predicting what word comes next.
The students came ____ early/late/sleepy/motivated
● P(w_{n+1} | w_n, …, w_1), where w_{n+1} is a word from the vocabulary V.
Language Modeling
● Word embedding focuses on representing individual
words as vectors based on their contextual usage.
Language Modeling
● A Language Model (LM) assigns a probability to a
sequence of words.
● An LM assigns the following probability to the sequence of words w_1, …, w_N:
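In its standard chain-rule form:

P(w_1, \dots, w_N) = \prod_{i=1}^{N} P(w_i \mid w_1, \dots, w_{i-1})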
n-gram Language Models
● For a sequence of tokens (e.g., words or subwords)
S=[w1, …, wN ], an n-gram is a subsequence of n
consecutive elements from S.
● Unigram (n=1): single items
■ "I love NLP" ⇒ "I","love","NLP"
● Bigram (n=2): pairs of consecutive items
■ "I love NLP" ⇒ "I love","love NLP"
● Trigram (n=3): triplets of consecutive items
■ "I love NLP" ⇒ "I love NLP"
n-gram Language Models
● Idea:
● collect statistics about the occurrences of different
n-grams
● and use these statistics to predict the next word.
● Example: learn a 4-gram LM
● The book that the teacher recommended was _____
● P(w | teacher recommended was) = count(teacher recommended was w) / count(teacher recommended was)
● Problems
■ Zero occurrences → zero probability (zero numerator) or undefined probability (zero denominator)
– sparsity problem
■ Increasing n
– worsens the sparsity problem
– increases the model size
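A minimal count-based sketch of the 4-gram estimate above (the toy corpus and names are illustrative); note how an unseen continuation gives probability 0 and an unseen context leaves the probability undefined, which is exactly the sparsity problem:

from collections import Counter

corpus = "the book that the teacher recommended was great".split()

# Counts of 4-grams and of their 3-word contexts
four_grams = Counter(tuple(corpus[i:i + 4]) for i in range(len(corpus) - 3))
contexts   = Counter(tuple(corpus[i:i + 3]) for i in range(len(corpus) - 2))

def p_next(word, context):
    # P(word | context) = count(context + word) / count(context)
    if contexts[context] == 0:
        return None  # context never seen: probability undefined
    return four_grams[context + (word,)] / contexts[context]

print(p_next("great", ("teacher", "recommended", "was")))   # 1.0 in this tiny corpus
print(p_next("boring", ("teacher", "recommended", "was")))  # 0.0: sparsity problem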
Neural LMs & RNNs
Fixed-window Neural LM
Fixed-window Neural LM
● Improvements over n-gram LM
● No sparsity problem
● Don’t need to store all observed n-grams
[Figure: fixed-window neural LM predicting likely next words such as “awesome” and “amazing” from inputs x(1) … x(4)]
Fixed-window Neural LM
● Remaining problems
● Fixed window is too small
● Enlarging window enlarges 𝑊
● Window can never be large enough
● x(1) and x(2) are multiplied by completely different weights in 𝑊.
→ No symmetry in how the inputs are processed.
[Figure: fixed-window neural LM over inputs x(1) … x(4)]
RNN LM
Core idea: Apply the same weights 𝑊 recurrently
[Figure: RNN LM applying the same weights 𝑊 at every timestep over inputs x(1) … x(4)]
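A minimal numpy sketch of the recurrence, assuming the common formulation h(t) = tanh(W h(t−1) + U x(t) + b); the point is that one set of weights is reused at every timestep, so the model size does not depend on the sequence length.

import numpy as np

d_hidden, d_embed, steps = 8, 4, 5
rng = np.random.default_rng(0)

# One set of weights, shared across all timesteps
W = 0.1 * rng.normal(size=(d_hidden, d_hidden))
U = 0.1 * rng.normal(size=(d_hidden, d_embed))
b = np.zeros(d_hidden)

h = np.zeros(d_hidden)                # initial hidden state h(0)
for t in range(steps):
    x_t = rng.normal(size=d_embed)    # stand-in for the embedding of the word at step t
    h = np.tanh(W @ h + U @ x_t + b)  # the same W is applied recurrently

print(h.shape)  # (8,): the hidden state size does not grow with the input length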
RNN LM
● RNN Advantages
● Can process any length input
● Computation for step t can (in theory) use information
from many steps back
● Model size doesn’t increase for longer input context
● Same weights applied on every timestep, so there is
symmetry in how inputs are processed.
● RNN Disadvantages
● Recurrent computation is slow
● In practice, difficult to access information from many
steps back
How to train an RNN LM?
● Build a large corpus of text (sequences of words)
● For every step t:
● input: feed the sequence of words into the RNN LM
● output: compute ŷ(t), the probability distribution over every word in the vocabulary
How to train an RNN LM?
● Loss at step t: the negative log-probability the model assigns to the true next word, e.g. −log(prob(“teacher”))
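In symbols, this is the standard cross-entropy loss:

J^{(t)}(\theta) = -\log \hat{y}^{(t)}_{x_{t+1}}, \qquad J(\theta) = \frac{1}{T} \sum_{t=1}^{T} J^{(t)}(\theta)

where \hat{y}^{(t)}_{x_{t+1}} is the probability the model assigns to the true next word x_{t+1}.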
How to train an RNN LM?
● Computing loss and gradients across entire corpus
x(1), x(2), … , x(T) at once is too expensive
● In practice,
● x(1), x(2), … , x(T) is a sentence (or a document)
● Use Stochastic Gradient Descent to compute loss and
gradients for small chunk of data, and update.
● Compute loss J(𝛉) for a batch of sentences, compute
gradients and update weights.
■ Repeat on a new batch of sentences.
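A hedged PyTorch sketch of one such update (the vocabulary size, dimensions, and toy batch are assumptions; a real setup adds tokenization, padding, and iteration over batches):

import torch
import torch.nn as nn

vocab_size, d_embed, d_hidden = 1000, 64, 128

class RNNLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_embed)
        self.rnn = nn.RNN(d_embed, d_hidden, batch_first=True)
        self.out = nn.Linear(d_hidden, vocab_size)

    def forward(self, tokens):               # tokens: (batch, seq_len)
        h, _ = self.rnn(self.embed(tokens))  # (batch, seq_len, d_hidden)
        return self.out(h)                   # logits over the vocabulary at every step

model = RNNLM()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

batch = torch.randint(0, vocab_size, (32, 20))  # toy batch of token ids
inputs, targets = batch[:, :-1], batch[:, 1:]   # predict each next token

logits = model(inputs)
loss = loss_fn(logits.reshape(-1, vocab_size), targets.reshape(-1))  # average -log P(next word)
loss.backward()
optimizer.step()
optimizer.zero_grad()
print(float(loss))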
How to generate text with an RNN LM?
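The usual procedure: start from a begin-of-sequence token, sample the next word from ŷ(t), feed the sample back in as the next input, and repeat. A minimal, self-contained sketch (the toy vocabulary and dimensions are assumptions, and the model here is untrained, so its output is gibberish):

import torch
import torch.nn as nn

vocab = ["<s>", "the", "students", "came", "late", "."]
embed = nn.Embedding(len(vocab), 16)
rnn = nn.RNN(16, 32, batch_first=True)
out = nn.Linear(32, len(vocab))

token = torch.tensor([[0]])   # start-of-sequence id
h = None
generated = []
with torch.no_grad():
    for _ in range(5):
        o, h = rnn(embed(token), h)               # one step, carrying the hidden state forward
        probs = torch.softmax(out(o[:, -1]), -1)  # distribution over the next word
        token = torch.multinomial(probs, 1)       # sample, then feed it back in
        generated.append(vocab[token.item()])
print(" ".join(generated))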
Evaluating Language Models
● The standard evaluation metric for LMs is perplexity.
● The inverse probability of the corpus, according to the LM
● Normalized by the number of words (exponent 1/T)
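In its standard form:

\text{perplexity} = \prod_{t=1}^{T} \left( \frac{1}{P_{\text{LM}}\big(x^{(t+1)} \mid x^{(t)}, \dots, x^{(1)}\big)} \right)^{1/T} = \exp\big(J(\theta)\big)

i.e. the exponential of the average cross-entropy loss; lower perplexity is better.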
RNNs: Vanishing Gradient
When the derivatives are small, the gradient signal gets smaller and smaller as it backpropagates further.
RNNs: Vanishing Gradient
Gradient signal from far away is lost because it’s much smaller than gradient
signal from close-by.
⇒ Model weights are updated only with respect to near effects, not long-term
effects.
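A standard sketch of why this happens: by the chain rule, the gradient of the loss at step t with respect to an early hidden state is a product of many Jacobians,

\frac{\partial J^{(t)}}{\partial \boldsymbol{h}^{(1)}} = \frac{\partial J^{(t)}}{\partial \boldsymbol{h}^{(t)}} \prod_{k=2}^{t} \frac{\partial \boldsymbol{h}^{(k)}}{\partial \boldsymbol{h}^{(k-1)}}

and when each factor has norm smaller than 1, the product shrinks roughly exponentially with the distance t.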
How to fix Vanishing Gradient problem?
● The main problem
● RNNs are unable to preserve information over many
timesteps.
● Memorize
● Add separate memory: LSTM
● And then
● Create more direct, linear pass-through connections in the model
■ Attention, residual connections, etc.
Recap
● Word representation
● Focus
■ Individual words and their semantic relationships.
● Applications
■ Similarity measurement (e.g., finding synonyms).
■ Feature inputs for downstream NLP tasks like sentiment
analysis or text classification.
● Language modeling
● Focus
■ Understanding and predicting sequences of words or sentences.
● Applications
■ Text generation (e.g., chatbots, auto-completion).
■ Machine translation.
■ Speech recognition.
Recap
● Language Model
● A system that predicts the next word
Better RNNs
Long Short-Term Memory RNNs (LSTMs)
● On step t, there is a hidden state 𝒉(t) and a cell state 𝒄(t)
● Both are vectors of length n
● The cell stores long-term information
● LSTM can read, erase, and write information from the cell
■ The cell is like RAM in a computer
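For reference, the usual gate equations (stated here in their standard form):

f^{(t)} = \sigma(W_f h^{(t-1)} + U_f x^{(t)} + b_f)                 (forget gate: what to keep in the cell)
i^{(t)} = \sigma(W_i h^{(t-1)} + U_i x^{(t)} + b_i)                 (input gate: what to write)
o^{(t)} = \sigma(W_o h^{(t-1)} + U_o x^{(t)} + b_o)                 (output gate: what to read out)
\tilde{c}^{(t)} = \tanh(W_c h^{(t-1)} + U_c x^{(t)} + b_c)          (new cell content)
c^{(t)} = f^{(t)} \odot c^{(t-1)} + i^{(t)} \odot \tilde{c}^{(t)}   (erase/keep, then write)
h^{(t)} = o^{(t)} \odot \tanh(c^{(t)})                              (read from the cell)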
LSTMs
● The LSTM architecture addresses the problem of preserving information over many timesteps (long distances) within an RNN
● e.g., to preserve the information contained in a cell indefinitely
■ set the forget gate to 1
■ and set the input gate to 0
● In practice
■ an LSTM preserves information over roughly 100 timesteps
■ a vanilla RNN preserves information over only about 7 timesteps
Sentiment classification with RNNs
Sentiment classification with RNNs
● The hidden state is a
contextual representation of
the word “terribly”.
● It contains information about
the left context
● “the movie was”
● Sentiment
● “terribly” on its own suggests negative
● “terribly exciting” is positive, so the right context matters too
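A hedged PyTorch sketch of this setup (dimensions, class count, and the toy input are assumptions): run an RNN over the sentence and classify from the final hidden state, which summarizes the left context.

import torch
import torch.nn as nn

vocab_size, d_embed, d_hidden, n_classes = 1000, 64, 128, 2  # 2 classes: neg / pos

embed = nn.Embedding(vocab_size, d_embed)
rnn = nn.RNN(d_embed, d_hidden, batch_first=True)
classifier = nn.Linear(d_hidden, n_classes)

sentence = torch.randint(0, vocab_size, (1, 6))  # toy ids standing in for "the movie was terribly exciting !"
_, h_n = rnn(embed(sentence))                    # h_n: final hidden state
logits = classifier(h_n[-1])                     # sentiment scores from the last hidden state
print(torch.softmax(logits, dim=-1))             # P(neg), P(pos)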
Bidirectional RNNs
This contextual representation of “terribly” has both left and right context.
Bidirectional RNNs
● Bidirectional RNNs are only applicable if access to the
entire input sequence is available
● They are not applicable to Language Modeling, because
in LM only left context is available.
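In frameworks such as PyTorch, bidirectionality is typically a flag on the recurrent layer; a minimal sketch (dimensions and the toy input are assumptions). Each position's output concatenates the forward state (left context) and the backward state (right context).

import torch
import torch.nn as nn

embed = nn.Embedding(1000, 64)
bi_rnn = nn.LSTM(64, 128, batch_first=True, bidirectional=True)

sentence = torch.randint(0, 1000, (1, 5))  # toy token ids
output, _ = bi_rnn(embed(sentence))
print(output.shape)  # (1, 5, 256): forward and backward hidden states concatenated per word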
Multi-layer RNNs
● RNNs are already “deep” on one dimension (they unroll
over many timesteps)
Multi-layer RNNs
Multi-layer RNNs
● RNNs usually perform better with more layers
● For Neural Machine Translation
● 2 to 4 layers is best for the encoder RNN
● and, 4 layers is best for the decoder RNN
● Britz et al. (2017). Massive Exploration of Neural Machine Translation Architectures.
Statistical Machine Translation (SMT)
● Core idea: Learn a probabilistic model from data
● Example: French → English
● Find best English sentence y, given French sentence x
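The objective, in its usual form (with the Bayes-rule decomposition used by classical SMT):

\hat{y} = \arg\max_{y} P(y \mid x) = \arg\max_{y} P(x \mid y)\, P(y)

where P(x | y) is a translation model learned from parallel data and P(y) is a language model of English.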
Neural Machine Translation
● Since 2014
Neural Machine Translation (NMT)
Sequence-to-sequence is versatile
● The general notion here is an encoder-decoder model
● One NN takes input and produces a neural representation
● Another NN produces output based on that neural
representation
● If the input and output are sequences, we call it a
seq2seq model
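A hedged PyTorch skeleton of the encoder-decoder idea (all names, dimensions, and toy inputs are assumptions; real systems add attention, beam search, etc.): the encoder compresses the source into a vector, and the decoder is a language model conditioned on that vector.

import torch
import torch.nn as nn

src_vocab, tgt_vocab, d = 1000, 1200, 128

src_embed, tgt_embed = nn.Embedding(src_vocab, d), nn.Embedding(tgt_vocab, d)
encoder = nn.GRU(d, d, batch_first=True)
decoder = nn.GRU(d, d, batch_first=True)
out = nn.Linear(d, tgt_vocab)

src = torch.randint(0, src_vocab, (1, 7))      # toy source sentence (e.g. French token ids)
tgt_in = torch.randint(0, tgt_vocab, (1, 6))   # toy target prefix (e.g. English token ids)

_, h = encoder(src_embed(src))                 # h: the encoder's final state, the "neural representation"
dec_states, _ = decoder(tgt_embed(tgt_in), h)  # decoder conditioned on the source via h
logits = out(dec_states)                       # next-word scores at each target position
print(logits.shape)                            # (1, 6, tgt_vocab)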
Neural Machine Translation (NMT)
● The seq2seq model is an example of a Conditional Language Model
● Language Model because the decoder is predicting the
next word of the target sentence y
● Conditional because its predictions are also conditioned
on the source sentence x
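In formula form, the decoder factorizes the conditional probability as:

P(y \mid x) = \prod_{t=1}^{T} P(y_t \mid y_1, \dots, y_{t-1}, x)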
Multi-layer deep encoder-decoder MT Net
[Figure: multi-layer encoder-decoder MT network; the encoder’s final hidden state is the conditioning bottleneck]
Evaluate Machine Translation
The most common metric: BLEU (Bilingual Evaluation Understudy)
Papineni et al. (2002). BLEU: a Method for Automatic Evaluation of Machine Translation. ACL 2002: 40th Annual Meeting of the Association for Computational Linguistics, pp. 311–318.
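For reference, the score in its usual form: BLEU combines modified n-gram precisions p_n (typically up to n = 4) with a brevity penalty BP that penalizes outputs shorter than the reference,

\text{BLEU} = \text{BP} \cdot \exp\!\left( \sum_{n=1}^{N} w_n \log p_n \right), \qquad \text{BP} = \min\!\left(1,\ \exp\!\left(1 - \frac{\text{reference length}}{\text{output length}}\right)\right)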