01-Transformer Based NLP Applications

The document outlines a course on Transformer Based NLP applications, detailing the structure, grading policy, and key topics including word representation, language modeling, and recurrent neural networks (RNNs). It discusses the evolution of NLP from early explorations to modern deep learning techniques, emphasizing the importance of word embeddings and language models in understanding and generating natural language. The document also highlights advancements such as Long Short-Term Memory (LSTM) networks and bidirectional RNNs for improved contextual understanding.

Mastère TICV

Transformer Based NLP Applications
Lecture 1

Moez BEN HAJ HMIDA


[email protected]
January, 2025
Course hours and grading policy
● 3 × 3-hour lectures
● foundations of modern methods for DL applied to NLP
● grading: regular exam (60%)

● 6 hours of hands-on labs
● Nvidia certification
● grading: online certification exam (40%)
Outline

1. Introduction to NLP

2. Word representation

3. Language modeling

4. LMs + RNNs

5. Better RNNs

[Content adapted from CS224N: Natural Language Processing with Deep Learning,
The Stanford NLP Group, Stanford]

NLP progress
● 1940–1969
● Early Explorations
● 1970–1992
● Hand-built systems, of increasing formalization
● 1993–2012
● Statistical or Probabilistic NLP
● then, more general Supervised ML
● 2013–now
● Deep Artificial Neural Networks
● Unsupervised
● Self-supervised
● Reinforcement Learning

Machine Translation (1950s)
● The origin of NLP/Computational Linguistics

Machine Translation (today)

Question-Answering
● BASEBALL (Green et al., 1961)
● handled questions about a database of baseball games.
● LUNAR (Woods 1973)
● answered questions about the rocks brought back from
the moon by the Apollo program.
● ChatGPT

Main NLP applications
● Text Classification
● Sentiment Analysis
● Named Entity Recognition (NER)
● Topic Modeling
● Text Summarization
● Text Generation
● Speech Recognition (ASR, Speech-to-Text STT)
● Text-to-Speech (TTS)

How do machines understand us?
● Machines need to understand (at least partially) the
ambiguous, messy languages that humans use.

● Machines need to understand word meanings

● a single word vs. a word in context

● It is all about word meaning!

Word meaning
● Definition: meaning (Webster dictionary)
● the idea that is represented by a word, phrase, etc.
● the idea that a person wants to express by using words,
signs, etc.
● the idea that is expressed in a work of writing, art, etc.

Word ⇔ Idea
Hi / Hello / Howdy / Greetings ⇔ Greeting

⇒ We need better techniques to represent words

Word representation

Word representation as discrete symbols
● Traditional NLP used one-hot vectors
Example sentence: “AI is the future” (vocabulary size = 4)

The    = [0 0 1 0]
future = [0 0 0 1]
is     = [0 1 0 0]
AI     = [1 0 0 0]

● Vector dimension = number of words in vocabulary


● “one-hot” encoding
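As an illustration (not from the slides), a minimal Python sketch of one-hot encoding for the toy vocabulary above; the word-to-index assignment is an arbitrary assumption.

```python
import numpy as np

# Toy vocabulary from the slide; the word-to-index assignment is arbitrary.
vocab = ["AI", "is", "the", "future"]
word_to_idx = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Return a one-hot vector whose dimension equals the vocabulary size."""
    vec = np.zeros(len(vocab))
    vec[word_to_idx[word]] = 1.0
    return vec

print(one_hot("AI"))      # [1. 0. 0. 0.]
print(one_hot("future"))  # [0. 0. 0. 1.]
```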

Word representation as discrete symbols
● Is “Hello” similar to “Hi” with one-hot vectors?

        Hello  word1  word2  Hi
Hello     1      0      0     0
Hi        0      0      0     1

● These two vectors are orthogonal ⇒ not similar
■ cosine similarity = 0

⇒ Can we encode similarity in the vectors themselves?
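A small sketch (illustrative values, not from the slides) of why one-hot vectors cannot express similarity: the cosine similarity of any two distinct one-hot vectors is 0.

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

hello = np.array([1.0, 0.0, 0.0, 0.0])  # one-hot vector for "Hello"
hi    = np.array([0.0, 0.0, 0.0, 1.0])  # one-hot vector for "Hi"

print(cosine(hello, hi))  # 0.0: orthogonal, so "not similar"
```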

Word Embedding
● Represent a word by considering the contexts in
which it is found.
● Construct dense vectors for each word, aiming for
similarity with vectors of words found in
comparable contexts.
● Similarity will be assessed through the scalar
product of the vectors.

Hi    = [ 0.186,  0.685, −0.209, −0.107,  0.117, −0.352]
Hello = [ 0.228,  0.815, −0.106, −0.201,  0.205, −0.288]

dot(Hi, Hello) = 0.77
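Using the illustrative embedding values above, the similarity score is simply the dot product of the two vectors:

```python
import numpy as np

hi    = np.array([0.186, 0.685, -0.209, -0.107, 0.117, -0.352])
hello = np.array([0.228, 0.815, -0.106, -0.201, 0.205, -0.288])

print(round(float(np.dot(hi, hello)), 2))  # 0.77
```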

Word Embedding
● Word2vec [Mikolov, 2013]
● GLOVE [Pennington, 2014]

● Word embeddings provide a way to numerically represent individual words, making it easier for machine learning models to process and understand natural language.
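As a hedged example (not part of the course material), pretrained embeddings such as GloVe can be queried through gensim's downloader; the model name is an assumption and the first call downloads the vectors.

```python
# Illustrative only: assumes gensim is installed and the model can be downloaded.
import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-100")    # pretrained GloVe word vectors
print(wv.similarity("hello", "hi"))         # cosine similarity between two words
print(wv.most_similar("teacher", topn=3))   # nearest neighbours in embedding space
```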
● Bad vs. Not Bad ?
● like vs. don’t like?

⇒ We need to represent a sequence of words.

Language Modeling

Language Modeling
● The task of predicting what word comes next.
The students came ____ early/late/sleepy/motivated
● P(w_{n+1} | w_n, …, w_1),
● where w_{n+1} is a word from the vocabulary V.

Language Modeling
● Word embedding focuses on representing individual
words as vectors based on their contextual usage.

● A LM, in contrast, predicts/generates the next word based on statistical patterns observed in a given language.
● Applied recurrently, a LM can generate a whole text.

Language Modeling
● A Language Model (LM) assigns a probability to a
sequence of words.
● A LM assigns the following probability to the sequence of words w_1, …, w_N:

P(w_1, …, w_N) = P(w_1) · P(w_2 | w_1) · … · P(w_N | w_1, …, w_{N−1}) = ∏_{t=1..N} P(w_t | w_1, …, w_{t−1})
n-gram Language Models
● For a sequence of tokens (e.g., words or subwords)
S=[w1, …, wN ], an n-gram is a subsequence of n
consecutive elements from S.
● Unigram (n=1): single items
■ "I love NLP" ⇒ "I","love","NLP"
● Bigram (n=2): pairs of consecutive items
■ "I love NLP" ⇒ "I love","love NLP"
● Trigram (n=3): triplets of consecutive items
■ "I love NLP" ⇒ "I love NLP"

n-gram Language Models
● Idea:
● collect statistics about the occurrences of different
n-grams
● and, use these statistics to predict next word.
● Example: learn a 4-gram LM
● The book that the teacher recommended was _____
● P(w | teacher recommended was) = count(teacher recommended was w) / count(teacher recommended was)

● Problems
■ Zero count in the numerator → zero probability; zero count in the denominator → undefined estimate
– sparsity problem
■ Increasing n increases
– the sparsity problem
– the model size
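A minimal, illustrative count-based 4-gram LM in Python (the toy corpus and helper names are assumptions); it implements the count ratio above and shows how unseen n-grams lead to zero probabilities.

```python
from collections import Counter, defaultdict

def train_ngram_lm(tokens, n=4):
    """P(w | context) = count(context + w) / count(context), estimated by counting."""
    context_counts = Counter()
    ngram_counts = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])
        word = tokens[i + n - 1]
        context_counts[context] += 1
        ngram_counts[context][word] += 1
    return context_counts, ngram_counts

def prob(word, context, context_counts, ngram_counts):
    if context_counts[context] == 0:   # unseen context: the sparsity problem
        return 0.0
    return ngram_counts[context][word] / context_counts[context]

tokens = "the book that the teacher recommended was excellent".split()
cc, nc = train_ngram_lm(tokens, n=4)
print(prob("excellent", ("teacher", "recommended", "was"), cc, nc))  # 1.0 in this toy corpus
print(prob("boring", ("teacher", "recommended", "was"), cc, nc))     # 0.0: never observed
```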
Neural LMs & RNNs

Fixed-window Neural LM

Fixed-window Neural LM
● Improvements over n-gram LM
● No sparsity problem
● Don’t need to store all observed n-grams

[Figure: fixed-window neural LM predicting the next word ("amazing", "awesome", …) for the input window x1 … x4 = "the teacher recommended was"]
Fixed-window Neural LM
● Remaining problems
● Fixed window is too small
● Enlarging window enlarges 𝑊
● Window can never be large enough
● x(1) and x(2) are multiplied by completely different weights in 𝑊.
→ No symmetry in how the inputs are processed.

[Figure: fixed-window neural LM over the input window x(1) … x(4) = "the teacher recommended was"]
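A minimal PyTorch sketch of such a fixed-window neural LM (framework, layer sizes and names are assumptions, not the course's code): the window of embeddings is concatenated and passed through one hidden layer and an output layer over the vocabulary.

```python
import torch
import torch.nn as nn

class FixedWindowLM(nn.Module):
    def __init__(self, vocab_size, embed_dim=64, window=4, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # W grows with the window, and each position gets its own slice of weights.
        self.hidden = nn.Linear(window * embed_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, window_ids):                        # (batch, window)
        e = self.embed(window_ids).flatten(start_dim=1)   # concatenate the window embeddings
        h = torch.tanh(self.hidden(e))
        return self.out(h)                                # logits over the next word

model = FixedWindowLM(vocab_size=10_000)
logits = model(torch.randint(0, 10_000, (2, 4)))  # e.g. ids of "the teacher recommended was"
print(logits.shape)                               # torch.Size([2, 10000])
```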

⇒ We need a neural architecture that can process any length input

RNN LM
Core idea: Apply the same weights 𝑊 recurrently

[Figure: RNN LM unrolled over the inputs x(1) … x(4) = "the teacher recommended was"]
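A minimal PyTorch RNN LM sketch (illustrative, not the course's implementation): nn.RNN applies the same weight matrices at every timestep, so the input can be of any length.

```python
import torch
import torch.nn as nn

class RNNLM(nn.Module):
    def __init__(self, vocab_size, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.RNN(embed_dim, hidden_dim, batch_first=True)  # same W reused at every step
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, token_ids):                # (batch, seq_len), any seq_len
        h, _ = self.rnn(self.embed(token_ids))   # hidden state at every timestep
        return self.out(h)                       # next-word logits at every timestep

model = RNNLM(vocab_size=10_000)
print(model(torch.randint(0, 10_000, (2, 7))).shape)  # torch.Size([2, 7, 10000])
```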
RNN LM
● RNN Advantages
● Can process any length input
● Computation for step t can (in theory) use information
from many steps back
● Model size doesn’t increase for longer input context
● Same weights applied on every timestep, so there is
symmetry in how inputs are processed.

● RNN Disadvantages
● Recurrent computation is slow
● In practice, difficult to access information from many
steps back

How to train an RNN LM?
● Build a large corpus of text (sequences of words)
● For every step t:
● input: feed a sequence of words into the RNN LM
● output: compute ŷ(t), the probability distribution over the next word

● The loss on step t is the cross-entropy between the predicted distribution ŷ(t) and the true next word y(t) (the one-hot encoding of x(t+1)):
J(t)(𝛉) = −log ŷ(t)[x(t+1)]  (the probability assigned to the true next word)

● Overall loss for the entire training set: the average over all steps, J(𝛉) = (1/T) Σ_{t=1..T} J(t)(𝛉)
How to train an RNN LM?
[Figure: RNN LM unrolled over "the teacher recommended was amazing"; at each step the loss is −log(prob of the true next word), e.g. −log(prob("teacher")) after "the"]
How to train an RNN LM?
● Computing loss and gradients across entire corpus
x(1), x(2), … , x(T) at once is too expensive

● needs to keep in memory J(t)(𝛉) for each t in T

● In practice,
● x(1), x(2), … , x(T) is a sentence (or a document)
● Use Stochastic Gradient Descent to compute loss and
gradients for small chunk of data, and update.
● Compute loss J(𝛉) for a batch of sentences, compute
gradients and update weights.
■ Repeat on a new batch of sentences.
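A hedged sketch of one such training step (optimiser, sizes and the tiny model are assumptions; a real setup would iterate over batches of real sentences):

```python
import torch
import torch.nn as nn

class TinyRNNLM(nn.Module):
    """Same shape as the RNN LM sketched earlier; redefined so this snippet runs on its own."""
    def __init__(self, vocab_size=10_000, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.RNN(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x):
        h, _ = self.rnn(self.embed(x))
        return self.out(h)

model = TinyRNNLM()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)  # Stochastic Gradient Descent
loss_fn = nn.CrossEntropyLoss()                          # per-step cross-entropy J(t)(theta)

batch = torch.randint(0, 10_000, (32, 20))               # a (toy) batch of sentences
inputs, targets = batch[:, :-1], batch[:, 1:]            # the true next word y(t) is x(t+1)

logits = model(inputs)                                            # predicted distributions for every step
loss = loss_fn(logits.reshape(-1, 10_000), targets.reshape(-1))   # averaged over steps and batch
loss.backward()                                                   # backpropagation through time
optimizer.step()                                                  # update the weights
optimizer.zero_grad()                                             # then repeat on a new batch
```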

How to generate text with an RNN LM?
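The generation figure is not reproduced here. As a hedged sketch, text is generated by repeatedly sampling from the LM's predicted distribution and feeding the sample back in as the next input (the `model`, `bos_id` and `itos` arguments are assumed to come from a trained RNN LM and its vocabulary):

```python
import torch

def generate(model, bos_id, itos, max_len=20, temperature=1.0):
    """Autoregressive sampling from an RNN LM (assumed trained, e.g. the sketch above)."""
    tokens = [bos_id]
    for _ in range(max_len):
        logits = model(torch.tensor([tokens]))[0, -1]           # distribution over the next word
        probs = torch.softmax(logits / temperature, dim=-1)
        next_id = int(torch.multinomial(probs, num_samples=1))  # sample the next word
        tokens.append(next_id)                                  # feed the sample back in
    return " ".join(itos[t] for t in tokens[1:])
```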

Evaluating Language Models
● The standard evaluation metric for LMs is perplexity.
● Inverse probability of the corpus, according to the LM
● Normalized by the number of words (1/T):

perplexity = ( ∏_{t=1..T} 1 / P(x(t+1) | x(t), …, x(1)) )^(1/T)

● Perplexity: inability to deal with or understand something.


⇒ Lower perplexity is better
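Equivalently (a standard identity, noted here as an aside), perplexity is the exponential of the average negative log-probability the LM assigns to the corpus words:

```python
import math

# Toy per-word probabilities assigned by a LM to the actual next words of a corpus.
probs = [0.20, 0.05, 0.40, 0.10]
avg_nll = -sum(math.log(p) for p in probs) / len(probs)  # average cross-entropy loss
perplexity = math.exp(avg_nll)
print(round(perplexity, 2))  # ~7.07; lower is better
```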
RNNs: Vanishing Gradient
When the derivatives (of each hidden state with respect to the previous one) are small, the gradient signal gets smaller and smaller as it backpropagates further.
RNNs: Vanishing Gradient

Gradient signal from far away is lost because it’s much smaller than gradient
signal from close-by.
⇒ Model weights are updated only with respect to near effects, not long-term
effects.

How to fix Vanishing Gradient problem?
● The main problem
● RNNs are unable to preserve information over many
timesteps.

● Memorize
● Add separate memory: LSTM

● And then
● Create more direct and linear pass-through connections
in model
■ Attention, residual connections, etc.

Recap
● Word representation
● Focus
■ Individual words and their semantic relationships.
● Applications
■ Similarity measurement (e.g., finding synonyms).
■ Feature inputs for downstream NLP tasks like sentiment
analysis or text classification.
● Language modeling
● Focus
■ Sequences of words or sentences understanding and
prediction.
● Applications
■ Text generation (e.g., chatbots, auto-completion).
■ Machine translation.
■ Speech recognition.

Recap
● Language Model
● A system that predicts the next word

● Recurrent Neural Network


● A family of neural networks that:
■ Take sequential input of any length; apply the same
weights on each step
■ Can (optionally) produce output on each step

● Recurrent Neural Network ≠ Language Model


● Not all LMs use RNNs; most modern LMs are Transformer-based.

● NLP tasks that involve generating text or estimating the probability of text use LMs.
● Now everything in NLP is being rebuilt upon LMs.

Better RNNs

Long Short-Term Memory RNNs (LSTMs)
● On step t, there is a hidden state 𝒉(t) and a cell state 𝒄(t)
● Both are vectors of length n
● The cell stores long-term information
● The LSTM can read, erase, and write information from/to the cell
■ The cell is like RAM in a computer
LSTMs
● The LSTM architecture solves the problem of preserving information over many timesteps (long distances) within an RNN
● e.g., to preserve the information contained in a cell indefinitely
■ set the forget gate to 1
■ and set the input gate to 0 (see the cell-step sketch below)
● In practice
■ an LSTM preserves info over about 100 timesteps
■ a vanilla RNN preserves info over about 7 timesteps
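A hedged sketch of a single LSTM cell step using the standard gate equations (the textbook formulation, not code from the course); it makes explicit how a forget gate near 1 and an input gate near 0 keep the cell content unchanged.

```python
import torch

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM timestep; W, U, b stack the parameters of the input (i), forget (f),
    output (o) and candidate (g) gates."""
    z = x @ W + h_prev @ U + b          # (batch, 4 * hidden)
    i, f, o, g = z.chunk(4, dim=-1)
    i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
    g = torch.tanh(g)
    c = f * c_prev + i * g              # forget ~1 and input ~0 => cell content is preserved
    h = o * torch.tanh(c)               # the hidden state reads from the cell
    return h, c

hidden = 8
x, h0, c0 = torch.randn(1, 4), torch.zeros(1, hidden), torch.zeros(1, hidden)
W, U, b = torch.randn(4, 4 * hidden), torch.randn(hidden, 4 * hidden), torch.zeros(4 * hidden)
h1, c1 = lstm_step(x, h0, c0, W, U, b)
print(h1.shape, c1.shape)  # torch.Size([1, 8]) torch.Size([1, 8])
```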

● LSTM is not the best architecture


● there are alternative ways of creating more direct and
linear pass-through connections in models for long
distance dependencies

Sentiment classification with RNNs

Sentiment classification with RNNs

Sentiment classification with RNNs
● The hidden state is a
contextual representation of
the word “terribly”.
● It contains information about
the left context
● “the movie was”
● Sentiment
● neg: terribly
● pos: terribly exciting

Bidirectional RNNs
This contextual representation of “terribly” has both left and right context.
Bidirectional RNNs
● Bidirectional RNNs are only applicable if access to the
entire input sequence is available
● They are not applicable to Language Modeling, because
in LM only left context is available.

● With entire input sequence, bidirectionality is powerful.


● BERT (Bidirectional Encoder Representations from
Transformers) is a powerful pretrained contextual
representation system built on bidirectionality.
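A hedged PyTorch illustration (framework choice assumed): with the whole sequence available, a bidirectional RNN simply concatenates a left-to-right and a right-to-left hidden state at every position.

```python
import torch
import torch.nn as nn

rnn = nn.LSTM(input_size=64, hidden_size=128, batch_first=True, bidirectional=True)
x = torch.randn(2, 10, 64)   # (batch, seq_len, embed_dim); the full sequence is available
out, _ = rnn(x)
print(out.shape)             # torch.Size([2, 10, 256]): forward and backward states concatenated
```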

Multi-layer RNNs
● RNNs are already “deep” on one dimension (they unroll
over many timesteps)

● A multi-layer RNN applies multiple RNNs on top of each other
● making the network “deep” in another dimension.

● This allows the network to compute more complex representations
● The lower RNNs should compute lower-level features
● and, the higher RNNs should compute higher-level features.

Multi-layer RNNs

Multi-layer RNNs
● RNNs perform better with more layers
● For Neural Machine Translation
● 2 to 4 layers is best for the encoder RNN
● and, 4 layers is best for the decoder RNN
● Britz et al, 2017. Massive Exploration of Neural Machine Translation Architectures.

● Usually, skip-connections/dense-connections are needed to train deeper RNNs (e.g., 8 layers).

● Transformer-based networks (e.g., BERT) are usually deeper, e.g. 12 or 24 layers
● they have a lot of skip-like (residual) connections.
Statistical Machine Translation (SMT)
● Core idea: Learn a probabilistic model from data
● Example: French → English
● Find best English sentence y, given French sentence x

argmax_y P(x|y) P(y)

● P(x|y) → Translation Model


● Models how phrases should be translated (fidelity).
● Learned from parallel data.

● P(y) → Language Model


● Models how to write good English (fluency).
● Learned from monolingual data.

Neural Machine Translation
● Since 2014

● Neural Machine Translation (NMT) is a way to do Machine Translation with a single end-to-end neural network.

● The neural network architecture is called a sequence-to-sequence model (seq2seq) and it involves two RNNs.
Neural Machine Translation (NMT)

Encoder RNN produces an encoding of the source sentence.
Decoder RNN is a Language Model that generates the target sentence, conditioned on the encoding.
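A minimal sketch of this two-RNN seq2seq setup (illustrative PyTorch with assumed sizes, not the course's implementation): the encoder's final hidden state initialises, and thereby conditions, the decoder LM.

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, embed_dim)
        self.tgt_embed = nn.Embedding(tgt_vocab, embed_dim)
        self.encoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.decoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        _, h = self.encoder(self.src_embed(src_ids))            # encoding of the source sentence
        dec_out, _ = self.decoder(self.tgt_embed(tgt_ids), h)   # decoder LM conditioned on the encoding
        return self.out(dec_out)                                # next-word logits for the target sentence

model = Seq2Seq(src_vocab=8_000, tgt_vocab=8_000)
logits = model(torch.randint(0, 8_000, (2, 12)), torch.randint(0, 8_000, (2, 9)))
print(logits.shape)  # torch.Size([2, 9, 8000])
```

Note that the whole source sentence has to be squeezed into the single hidden state h passed to the decoder; this is the conditioning bottleneck flagged on a later slide.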

Sequence-to-sequence is versatile
● The general notion here is an encoder-decoder model
● One NN takes input and produces a neural representation
● Another NN produces output based on that neural
representation
● If the input and output are sequences, we call it a
seq2seq model

● Many NLP tasks can be phrased as seq-2-seq


● Summarization (long text → short text)
● Dialogue (previous utterances → next utterance)
● Parsing (input text → output parse as sequence)
● Code generation (natural language → Python code)

Neural Machine Translation (NMT)
● The seq-2-seq model is an example of a Conditional
Language Model
● Language Model because the decoder is predicting the
next word of the target sentence y
● Conditional because its predictions are also conditioned
on the source sentence x

● Training an NMT system needs a big parallel corpus

Multi-layer deep encoder-decoder MT Net

[Figure: multi-layer deep encoder-decoder MT network; the single encoding of the source sentence is a conditioning bottleneck]
Evaluating Machine Translation
The most common metric: BLEU (Bilingual Evaluation Understudy)
Papineni et al. (2002). BLEU: a method for automatic evaluation of machine translation. ACL 2002: 40th Annual Meeting of the Association for Computational Linguistics, pp. 311–318.

● BLEU compares the machine-written translation to one or several human-written translation(s), and computes a similarity score based on:
● geometric mean of n-gram precisions (1-, 2-, 3-, and 4-grams)
● plus a penalty for too-short system translations

● BLEU is useful but imperfect
● There are many valid ways to translate a sentence
● Therefore, a good translation can get a poor BLEU score because it has low n-gram overlap with the human translation
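As a hedged, illustrative example, NLTK ships a sentence-level BLEU implementation (smoothing settings and the toy sentences are assumptions; corpus-level BLEU is what is usually reported):

```python
# Illustrative only; assumes nltk is installed.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [["the", "cat", "is", "on", "the", "mat"]]   # one or several human translations
hypothesis = ["the", "cat", "sat", "on", "the", "mat"]    # machine-written translation

score = sentence_bleu(references, hypothesis,
                      smoothing_function=SmoothingFunction().method1)  # 1-4 gram precisions + brevity penalty
print(score)
```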
