Deep learning notes

The document provides an overview of deep learning, contrasting it with traditional machine learning, and discusses various concepts such as neural networks, perceptrons, multilayer perceptrons, loss functions, gradient descent, and optimization techniques. It highlights the importance of deep learning in achieving high accuracy with large datasets and powerful hardware, while also addressing challenges like overfitting and the vanishing gradient problem. Additionally, it outlines methods to improve models, including hyperparameter tuning, regularization techniques, and the use of advanced optimizers.


What is Deep Learning

Deep Learning is a subfield of AI and ML that is inspired by the structure of the human brain. Deep learning algorithms attempt to draw conclusions similar to those a human would reach by continually analyzing data with a given logical structure called a Neural Network.

Or

Deep learning is part of a broader family of machine learning methods based on artificial neural networks with representation learning.

Deep learning algorithms use multiple layers to progressively extract higher-level features from the raw input. For example, in image processing, lower layers may identify edges, while higher layers may identify concepts relevant to a human, such as digits, letters, or faces.

Why is DL getting so famous?

Deep Learning (DL) is becoming famous due to:

1. High Accuracy – Achieves state-of-the-art results in vision, NLP, and speech tasks.
2. Big Data Availability – Thrives on large datasets, which are now more accessible.
3. Powerful Hardware – GPUs and TPUs enable efficient training of deep networks.
4. Automation & Adaptability – Learns complex patterns without manual feature
engineering.
5. Real-World Applications – Used in AI assistants, self-driving cars, medical
diagnosis, and more.

Its success is driven by improved algorithms, computing power, and vast amounts of data.

When to Use

 ML → Small data, structured problems, fast training (e.g., decision trees, SVMs).
 DL → Large data, complex unstructured tasks like image recognition and NLP.

Deep Learning (DL) vs. Machine Learning (ML) – Summary

Aspect | Machine Learning (ML) | Deep Learning (DL)
Definition | Uses algorithms to learn patterns from data | Uses neural networks with multiple layers to learn complex patterns
Feature Engineering | Requires manual feature selection | Learns features automatically
Data Requirement | Works well with small to medium datasets | Requires large amounts of data
Computational Power | Runs efficiently on CPUs | Needs GPUs/TPUs for training
Interpretability | Easier to interpret results | Acts like a "black box" (harder to interpret)
Performance | Good for structured/tabular data | Excels in images, text, and speech processing

1. What is deep learning? How is it different from machine learning?

Answer:
Deep learning is a subset of machine learning that uses neural networks with multiple layers
(deep neural networks) to learn from large amounts of data.

 Machine Learning involves feature engineering, where humans select relevant features for models.
 Deep Learning automatically extracts features using neural networks, making it more powerful for tasks like image recognition, NLP, and speech processing.

What is the Perceptron Model?

A Perceptron is the simplest type of artificial neural network, primarily used for
binary classification. It is a fundamental building block in deep learning and serves as the
basis for more complex neural networks.

Key Points:

 Structure: A perceptron consists of input neurons, weights, a bias term, an activation function (typically a step function), and an output neuron.
 Working:
1. Takes input features.
2. Multiplies them by weights and adds a bias.
3. Passes the sum through an activation function (like step or sigmoid) to produce an output.
 Types:
o Single-layer Perceptron (SLP): Can only solve linearly separable problems.
o Multi-layer Perceptron (MLP): Uses hidden layers and activation functions like ReLU or sigmoid, allowing it to solve non-linear problems.
 Limitations: A single-layer perceptron cannot solve the XOR problem, but MLPs (with backpropagation) can.

 The perceptron is a single-layer neural network used for binary classification.
 It learns by adjusting weights using supervised learning.
 It can only solve linearly separable problems (like AND, OR) but not XOR.
 More complex problems require Multi-Layer Perceptrons (MLPs) with hidden layers and non-linear activations.
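As a sketch of the perceptron rule described above, here is a minimal NumPy example (illustrative data and names) that learns the linearly separable AND function; the same loop would never converge on XOR:

```python
# Minimal perceptron sketch: learns AND with the classic perceptron update rule.
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])  # inputs
y = np.array([0, 0, 0, 1])                      # AND labels

w = np.zeros(2)   # weights
b = 0.0           # bias
lr = 0.1          # learning rate

for epoch in range(10):
    for xi, target in zip(X, y):
        pred = 1 if np.dot(w, xi) + b > 0 else 0   # step activation
        error = target - pred
        w += lr * error * xi                       # perceptron update rule
        b += lr * error

print(w, b)  # a weight/bias pair that separates AND
```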

What is Multilayer Perceptron (MLP)

A Multilayer Perceptron (MLP) is a type of artificial neural network composed of multiple layers of
neurons. Unlike a simple perceptron, which has a single layer of neurons, an MLP includes input,
hidden, and output layers. Each layer is fully connected to the next one.

1. Structure of MLP:
 Input Layer: Takes the input features of the dataset.
 Hidden Layers: Layers between the input and output where neurons process inputs. These layers use activation functions (e.g., ReLU, Sigmoid) to introduce non-linearity.
 Output Layer: Produces the final output or prediction.

2. How It Works:

 Forward Propagation:
o The input data is passed through the network layer by layer.
o Each neuron in a layer calculates a weighted sum of its inputs and applies an
activation function.
o The result of one layer is passed as input to the next layer until the output layer
provides the final prediction.
 Backward Propagation:
o After forward propagation, the loss (error) between the predicted output and
the actual target is calculated.
o The error is propagated backward through the network to compute the
gradients of the loss function with respect to the weights.
o These gradients are used to adjust the weights using an optimization algorithm
like gradient descent.
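A minimal Keras sketch of an MLP with the layers described above (the layer sizes and the binary task are assumptions); compile() and fit() handle forward propagation, loss computation, backpropagation, and weight updates:

```python
# Minimal MLP sketch in Keras (assumed sizes; adapt to your data).
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(20,)),             # input layer: 20 features
    tf.keras.layers.Dense(64, activation="relu"),   # hidden layer 1
    tf.keras.layers.Dense(32, activation="relu"),   # hidden layer 2
    tf.keras.layers.Dense(1, activation="sigmoid"), # output layer (binary prediction)
])

model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(X_train, y_train, epochs=10, batch_size=32)  # X_train / y_train assumed
```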

Loss function in deep learning

A loss function is a method of evaluating how well your algorithm models your dataset.

Regression losses:

MSE (Mean Squared Error)
Advantages: easy to interpret, differentiable (works with gradient descent), has a single local minimum.
Disadvantages: the error unit is squared (not in the original units), not robust to outliers.

MAE (Mean Absolute Error)
Advantages: intuitive and easy, error is in the same unit as the target, robust to outliers.
Disadvantages: not differentiable at zero.

Huber loss
Combines MSE and MAE: if outliers are present in the data it behaves like MAE, and if outliers are not present it behaves like MSE.

Classification losses:

Binary Cross Entropy – two classes (e.g., yes or no).
Categorical Cross Entropy – multi-class classification (e.g., yes, no, maybe).
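Minimal NumPy sketches of these losses (illustrative implementations, not taken from any particular library):

```python
import numpy as np

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)        # squared error units, sensitive to outliers

def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))       # same units as target, robust to outliers

def huber(y_true, y_pred, delta=1.0):
    err = y_true - y_pred
    quad = 0.5 * err ** 2                          # MSE-like for small errors
    lin = delta * (np.abs(err) - 0.5 * delta)      # MAE-like for large errors (outliers)
    return np.mean(np.where(np.abs(err) <= delta, quad, lin))

def binary_cross_entropy(y_true, p, eps=1e-7):
    p = np.clip(p, eps, 1 - eps)                   # avoid log(0)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
```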

Gradient Descent in Deep Learning

What is Gradient Descent?

Gradient Descent is an optimization algorithm used in deep learning to minimize the error
(loss function) by adjusting the model’s parameters (weights & biases). It helps neural
networks learn by iteratively improving predictions.

How It Works:
o Forward Pass: Compute predictions using the current
parameters.
o Loss Calculation: Measure the error between predictions and
true labels.
o Backpropagation: Compute gradients of the loss with
respect to each parameter.
o Parameter Update: Adjust parameters in the direction of the
negative gradient.
o Repeat: Iterate until the loss converges.

🔹 Types of Gradient Descent


Type | Description | Pros | Cons
Batch Gradient Descent (BGD) | Uses the entire dataset to compute gradients. | More stable convergence. | Slow for large datasets.
Stochastic Gradient Descent (SGD) | Updates weights after each training example. | Faster updates, good for online learning. | Noisy updates can lead to instability.
Mini-Batch Gradient Descent | Uses a small batch of data for updates. | Balances speed & stability. | Requires tuning batch size.

📌 Example: SGD is widely used in deep learning frameworks like TensorFlow & PyTorch for training deep networks efficiently.
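A minimal mini-batch gradient descent sketch on a toy linear regression problem (hypothetical data; the update rule is w := w − lr · ∂L/∂w):

```python
# Mini-batch gradient descent for linear regression (NumPy sketch).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=1000)

w = np.zeros(3)
lr, batch_size = 0.1, 32

for epoch in range(20):
    idx = rng.permutation(len(X))                     # shuffle each epoch
    for start in range(0, len(X), batch_size):
        b = idx[start:start + batch_size]             # one mini-batch
        pred = X[b] @ w                               # forward pass
        grad = 2 * X[b].T @ (pred - y[b]) / len(b)    # gradient of MSE w.r.t. w
        w -= lr * grad                                # parameter update

print(w)  # close to the true coefficients [2.0, -1.0, 0.5]
```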

Challenges of Gradient Descent in Deep Learning

 Gradient Descent is essential for training deep learning models but faces challenges like vanishing/exploding gradients, slow convergence, and sensitivity to hyperparameters.
 Advanced techniques (e.g., adaptive optimizers, normalization, and regularization) help address these challenges and improve training efficiency and stability.

What is the Vanishing Gradient Problem?

The vanishing gradient problem occurs when deep neural networks struggle to update
weights in the earlier layers due to very small gradients during backpropagation. This
slows down or even stops learning.

How to Handle the Vanishing Gradient Problem

The vanishing gradient problem occurs in deep neural networks when gradients become too
small, preventing effective learning. Here are key solutions:

1. Use Activation Functions Like ReLU


o ReLU (Rectified Linear Unit) avoids small gradients by setting negative
values to zero.
o Variants like Leaky ReLU and ELU help prevent dead neurons.
2. Weight Initialization Techniques
o Xavier/Glorot Initialization: Balances variance across layers.
o He Initialization: Works well with ReLU-based activations.
3. Batch Normalization
o Normalizes inputs at each layer, stabilizing gradients.
4. Gradient Clipping
o Limits extreme gradient values to prevent exploding and vanishing gradients.
5. Residual Connections (Skip Connections)
o Used in ResNets to allow gradients to bypass deep layers, improving flow.
6. Use Smaller Learning Rates
o Helps prevent weight updates from becoming too small.
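A Keras sketch combining several of the remedies above: ReLU activations, He initialization, batch normalization, and gradient clipping (layer sizes and hyperparameters are assumptions):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(100,)),
    tf.keras.layers.Dense(128, kernel_initializer="he_normal"),  # He init suits ReLU
    tf.keras.layers.BatchNormalization(),                        # stabilizes activations
    tf.keras.layers.Activation("relu"),                          # non-saturating activation
    tf.keras.layers.Dense(128, kernel_initializer="he_normal"),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Activation("relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])

optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3, clipnorm=1.0)  # gradient clipping
model.compile(optimizer=optimizer, loss="sparse_categorical_crossentropy")
```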

How to improve a neural network

1. Fine-tuning NN hyperparameters:
. number of hidden layers
. neurons per layer
. learning rate
. optimizer
. batch size
. activation function

2. By solving problems:
-> Vanishing / exploding gradients:
. weight initialization
. activation function
. gradient clipping (exploding)
-> Not enough data:
. transfer learning
-> Slow training:
. optimizers
. learning rate schedulers
-> Overfitting:
. L1 and L2 regularization
. Dropout layers
. Early stopping

Overfitting

Early Stopping in Deep Learning

Early Stopping is a regularization technique used to prevent overfitting by halting training when the validation loss stops improving.

Benefits:

✔ Prevents overfitting
✔ Reduces training time
✔ Improves generalization

Dropout Layers in Deep Learning

(Nodes in a layer are dropped only during training; at test time all nodes are available and their weights are scaled by (1 − p).)

Dropout is a technique in deep learning that helps prevent overfitting by randomly turning off a
percentage of neurons during training. This forces the network to learn more generalized patterns
instead of relying too much on specific neurons. The dropout rate controls how many neurons are
deactivated (e.g., 0.5 means half the neurons are dropped). During testing or inference, all neurons
are active, but their outputs are scaled to maintain consistency with training.

Regularization in Deep Learning

Regularization in deep learning refers to techniques used to prevent overfitting, where a model performs well on training data but poorly on unseen data. The goal of regularization is to improve the generalization of the model by adding constraints or penalties to the learning process. Here’s a summary of key regularization techniques:

1. L1 and L2 Regularization
 L1 Regularization (Lasso):
o Adds the sum of the absolute values of the weights to the loss
function.
o Encourages sparsity by driving some weights to zero,
effectively performing feature selection.
o Loss function: Loss = Original Loss + λ Σ|wᵢ|

 L2 Regularization (Ridge):
o Adds the sum of the squared values of the weights to the loss function.
o Encourages small weights but does not force them to zero.
o Loss function: Loss = Original Loss + λ Σ wᵢ²

(In regularization we add a penalty term to our existing loss function. The penalty term is the sum of the squares of all the weights; we never include the square of the bias in this sum.)
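A Keras sketch tying together the overfitting remedies discussed above (L2 weight penalty, dropout, and early stopping); sizes and hyperparameters are illustrative assumptions:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(50,)),
    tf.keras.layers.Dense(64, activation="relu",
                          kernel_regularizer=tf.keras.regularizers.l2(1e-4)),  # adds λΣw² to the loss
    tf.keras.layers.Dropout(0.5),            # randomly drops half the units during training only
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=3, restore_best_weights=True)  # halt when val loss stops improving
# model.fit(X_train, y_train, validation_split=0.2, epochs=100, callbacks=[early_stop])
```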

Vanishing / Exploding gradient:

What are Activation Functions?

In artificial neural networks, each neuron forms a weighted sum of its inputs and passes the resulting scalar value through a function referred to as an activation function.

Types of Activation Functions

Linear Activation Function

 f(x) = ax
 Issue: Cannot handle non-linear problems, limiting learning.

Non-Linear Activation Functions

1. ReLU (Rectified Linear Unit)


o f(x) = max(0, x)
o Non-linear
o Not saturated in the +ve region
o Pros: Avoids vanishing gradient, computationally efficient.
o Cons: Can cause "dead neurons" (outputs zero for negative inputs)
o Variants: Leaky ReLU, Parametric ReLU (PReLU), ELU.

2. Leaky ReLU
o f(x) = x if x>0, else 0.01x
o Fixes dead neurons issue in ReLU.
3. Sigmoid (Logistic Function)
o f(x) = 1 / (1 + e^(−x))
o Pros: Converts inputs to range (0,1), useful for probability-based tasks.
o Cons: Prone to vanishing gradients, slow learning.
4. Tanh (Hyperbolic Tangent Function)
o Pros: Outputs in range (-1,1), better than sigmoid for deep networks.
o Cons: Still suffers from vanishing gradients.
5. Softmax
o Used in multi-class classification to convert logits into probabilities.
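Minimal NumPy sketches of the activation functions listed above:

```python
import numpy as np

def relu(x):        return np.maximum(0, x)              # f(x) = max(0, x)
def leaky_relu(x):  return np.where(x > 0, x, 0.01 * x)  # small slope for x < 0 (no dead neurons)
def sigmoid(x):     return 1 / (1 + np.exp(-x))          # output in (0, 1)
def tanh(x):        return np.tanh(x)                    # output in (-1, 1)

def softmax(x):
    e = np.exp(x - np.max(x))   # subtract max for numerical stability
    return e / e.sum()          # probabilities that sum to 1
```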

To speed up the training:

Why is Weight Initialization Important in Deep Learning?


Weight initialization is crucial in deep learning because it significantly impacts the training
process, convergence speed, and overall performance of a neural network. Poor weight
initialization can lead to slow learning, vanishing/exploding gradients, or even prevent the
network from learning effectively.

What NOT to Do:


 Do NOT initialize all weights to zero: (no training)
. if all weights are set to zero, all neurons in a layer will receive the same
gradients and update identically.
. This leads to symmetry problems, where all neurons learn the same features
and the network fails to learn effectively.

 Do NOT use constant values for weights:


. Assigning all weights a constant (e.g., 0.1) still leads to symmetry problems, similar
to zero initialization.

 Be careful with plain random initialization:
. Training can be slow when weights are drawn randomly without proper scaling.
. Vanishing gradient problems can arise.

What can be done:

Use activation-specific initialization:

 Xavier for Sigmoid/Tanh


 He for ReLU/Leaky ReLU

What is Batch Normalization?

Batch Normalization is an algorithmic method which makes the training of deep neural networks (DNNs) faster and more stable. It consists of normalizing the activation vectors from hidden layers using the mean and variance of the current batch. This normalization step is applied right before (or right after) the nonlinear function.

Optimizers in Deep Learning

Optimizers are algorithms that adjust the weights of a neural network to minimize the loss function
and improve model performance. They help in faster and more stable training.

Or

Optimizers are algorithms or methods used to change the attributes of your neural network, such as the weights and the learning rate, in order to reduce the losses.

Challenges Solved by Optimizers:

Gradient Descent is essential for training deep learning models but faces challenges like vanishing/exploding gradients, slow convergence, local minima, learning rate selection, and sensitivity to hyperparameters.
Types of Optimizers

1 . Momentum-based (SGD + Momentum, NAG)

. Speeds up training and smooths updates.

2 . Adagrad (Adaptive Gradient Algorithm)

 Adjusts learning rates based on past gradients (larger updates for infrequent
parameters).
 Pros: Works well for sparse data (where most values are zero).
 Cons: Learning rate keeps decreasing, which can slow down training.

3 . RMSprop (Root Mean Square Propagation)

 Modifies Adagrad by introducing a moving average of squared gradients.


 Pros: Works well for non-stationary objectives like deep networks.
 Cons: Learning rate still needs careful tuning.

4 .Adam (Adaptive Moment Estimation)

 Combines momentum and RMSprop to provide adaptive learning rates.


 Pros: Fast convergence, works well for most tasks.
 Cons: Can overfit or lead to unstable training in some cases.
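A minimal NumPy sketch of a single Adam update, showing how it combines a momentum term (m) with an RMSprop-style squared-gradient average (v); parameter names follow the usual Adam notation and the default hyperparameters are assumptions:

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad            # first moment (momentum of the gradient)
    v = beta2 * v + (1 - beta2) * grad ** 2       # second moment (RMSprop-style average)
    m_hat = m / (1 - beta1 ** t)                  # bias correction (t starts at 1)
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)   # adaptive, per-parameter update
    return w, m, v

# usage: w, m, v = adam_step(w, grad, m, v, t) inside the training loop
```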

What is CNN

A CNN (Convolutional Neural Network) is a type of deep learning model primarily used for image processing, computer vision, and pattern recognition tasks.

Key Layers(COMPONENTS):

1. Convolutional Layer: Extracts features by applying filters (kernels) to the input


image.
2. Pooling Layer: Reduces spatial dimensions, making the model efficient.
3. Fully Connected Layer: Processes extracted features and makes final predictions.
4. Padding: Preserves input size or reduces edge information loss.
5. Strides: Controls the step size of the filter, affecting output size.

Applications:

 Image classification (e.g., identifying cats vs. dogs)


 Object detection (e.g., self-driving cars)
 Medical imaging (e.g., detecting tumors in X-rays)
 Face recognition (e.g., unlocking phones)

Advantages:
 Automatically learns spatial hierarchies of features.
 Highly effective for image-related tasks.
 Reduces the need for manual feature engineering

=> CNN architecture is based on the human visual cortex.
=> There are two types of images: 1. grayscale (black/white) images, 2. RGB (red, green, and blue) images.

Limitations:

 Computationally expensive, especially for large datasets.


 Requires significant amounts of labeled data for training.
 Can be prone to overfitting if not properly regularized

Why Not Use ANN for Image Data


CNNs are specifically designed to handle the unique challenges of image data,
such as spatial hierarchies, translation invariance, and high dimensionality. While
ANNs can technically process images, they are inefficient, computationally
expensive, and less effective compared to CNNs for image-related tasks.

 High computational cost
 Overfitting
 Loss of important information, like the spatial arrangement of pixels

CNN Architecture - Summary

A Convolutional Neural Network (CNN) is a deep learning model used for image recognition
and processing. Its key layers include:

1. Input Layer – Takes an image as input.


2. Convolutional Layer – Extracts features using filters (kernels).

3. Activation Function (ReLU) – Introduces non-linearity.


4. Pooling Layer – Reduces spatial dimensions while retaining key features (e.g., Max Pooling).
5. Fully Connected Layer (FC) – Flattens and connects neurons for classification.
6. Output Layer – Produces final predictions using softmax or sigmoid functions.

CNNs are effective in tasks like image classification, object detection, and facial recognition.
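A minimal Keras CNN sketch mirroring the layer order above (assuming 28×28 grayscale inputs and 10 output classes):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(28, 28, 1)),                              # input layer
    tf.keras.layers.Conv2D(32, (3, 3), padding="same", activation="relu"), # convolution + ReLU
    tf.keras.layers.MaxPooling2D((2, 2)),                                  # pooling
    tf.keras.layers.Conv2D(64, (3, 3), padding="same", activation="relu"),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation="relu"),                         # fully connected
    tf.keras.layers.Dense(10, activation="softmax"),                       # output layer
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
```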

CNN Architecture

Backpropagation in CNN

Backpropagation in Convolutional Neural Networks (CNNs) is the process of updating the network’s
weights based on the error computed at the output layer. It follows the same fundamental principles
as backpropagation in traditional neural networks but accounts for the unique layers in CNNs, such
as convolutional and pooling layers.

Steps of Backpropagation in CNN:

1. Forward Propagation:

 Input image is passed through multiple convolutional, activation, pooling, and fully
connected layers.

 The final layer computes the output (predictions).

 Loss is calculated using a loss function (e.g., Cross-Entropy, MSE).

2. Backward Propagation (Gradient Calculation):

 Compute the derivative of the loss with respect to the output layer weights (Fully
Connected Layer).

 Backpropagate the error through the fully connected layers using the chain rule.

 Compute gradients for the convolutional layers by applying the chain rule to the activation
maps and filters.

 Pooling layers (if max pooling) propagate gradients only for the max activation locations.

3. Weight Update (Gradient Descent / Adam Optimizer):

 Update the weights of the filters and fully connected layers using gradient descent:
W = W − α ∂L/∂W

 where α is the learning rate.

4. Repeat Until Convergence:

 The process is repeated over multiple epochs until the loss stabilizes.

24. What are some real-world applications of CNNs?

Answer:

 Facial Recognition (Face ID, surveillance)


 Medical Imaging (Tumor detection)
 Autonomous Vehicles (Object detection, lane detection)
 Robotics & Industrial Automation

Pretrained Models in CNN

Pretrained models in Convolutional Neural Networks (CNNs) are deep learning models that
have already been trained on large datasets (such as ImageNet). These models can be used for
transfer learning, where they are fine-tuned on new tasks with smaller datasets, significantly
reducing training time and computational cost.

1. Why Use Pretrained Models?

✅ Faster training – No need to train from scratch.


✅ Better accuracy – Models have learned general features from large datasets.
✅ Less data required – Useful when labeled data is limited.
✅ State-of-the-art performance – Many models achieve top results on benchmark datasets.

VGG16: A Deep Learning Model for Image Processing

VGG16 is a deep convolutional neural network (CNN) architecture developed by the Visual Geometry Group (VGG) at the University of Oxford. It was introduced in the 2014 ImageNet Large Scale Visual Recognition Challenge (ILSVRC) and became famous for its simplicity and effectiveness in image classification tasks.

Why is VGG16 Used?

. Pretrained on ImageNet:
VGG16 is pretrained on the ImageNet dataset, which contains over 1 million images across 1000 classes. This makes it a powerful feature extractor for many computer vision tasks.

. Transfer Learning:->
It can be fine-tuned for specific tasks, saving time and computational resources.

Good for Feature Extraction:->


The hierarchical feature maps produced by the network capture low-level(edges) and
high-level(object parts) features.

Standard Benchmark Model:->


Due to its popularity and simplicity, it serves as a benchmark in academic and
industrial research.

VGG16 Architecture

VGG16 is a deep convolutional neural network (CNN) with 16 weighted layers (13 convolutional + 3
fully connected). It uses 3×3 convolutional filters, ReLU activation, and 2×2 max-pooling layers for
feature extraction. The final layers consist of three fully connected layers, ending with a Softmax
classifier for image classification.

Applications:

 Image classification (ImageNet)


 Object detection (Faster R-CNN)
 Feature extraction for deep learning tasks
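A sketch of loading pretrained VGG16 through the Keras applications API for feature extraction / transfer learning; the new classifier head and the 5-class target task are hypothetical:

```python
import tensorflow as tf

base = tf.keras.applications.VGG16(weights="imagenet",
                                   include_top=False,        # drop the original FC classifier
                                   input_shape=(224, 224, 3))
base.trainable = False                                        # freeze the pretrained conv layers

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(5, activation="softmax"),           # new head for a 5-class target task
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
# For fine-tuning instead of pure feature extraction, set base.trainable = True
# (or unfreeze only the top blocks) and retrain with a small learning rate.
```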

What is ResNet?

ResNet (Residual Network) is a deep learning architecture designed to train very deep
neural networks effectively. It was introduced by Microsoft in 2015 and became famous for
winning the ImageNet competition due to its superior performance.

Why was ResNet Needed?

As neural networks grow deeper, they face problems like:


❌ Vanishing Gradient Problem – Gradients become very small during training, making it
hard to update weights in deeper layers.
❌ Degradation Problem – Adding more layers sometimes decreases accuracy instead of
improving it.

Residual Block & Skip Connections



Components of the Residual Block

 Input (X): The initial feature map.


 Weight Layers: These are convolutional layers that learn transformations.
 ReLU Activation: Applied after each convolution to introduce non-linearity.
 Skip Connection (Identity Mapping): Directly adds the input X to the output of the
weight layers.
 Summation (+): Combines the transformed output F(X) with the original input X.
 Final ReLU Activation: Helps in non-linearity and better learning.

Skip Connections
. These “shortcut” connections allow information from earlier layers to bypass some layers and directly contribute to later layers, helping with gradient flow.
. They prevent problems like vanishing/exploding gradients and improve regularization.

Experiments:
. ResNet has been successfully tested with 100-1000 layers on datasets like CIFAR-10,
demonstrating its scalability and effectiveness.

2️⃣ Understanding the Formula

The Residual Block computes:

Y=F(X) + X

Where:

 F(X) is the output of the convolutional layers.


 X is the original input (passed through the shortcut connection).
 The final result Y is the sum of both.

3️⃣ Why Use a Skip Connection?

✅ Solves the vanishing gradient problem


✅ Allows training very deep networks
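A minimal residual block sketch in the Keras functional API, assuming matching input/output channels so the identity skip connection can be added directly (Y = F(X) + X):

```python
import tensorflow as tf

def residual_block(x, filters=64):
    shortcut = x                                                   # skip connection (identity)
    y = tf.keras.layers.Conv2D(filters, 3, padding="same")(x)      # weight layer 1
    y = tf.keras.layers.ReLU()(y)
    y = tf.keras.layers.Conv2D(filters, 3, padding="same")(y)      # weight layer 2
    y = tf.keras.layers.Add()([y, shortcut])                       # Y = F(X) + X
    return tf.keras.layers.ReLU()(y)                               # final ReLU

inputs = tf.keras.Input(shape=(32, 32, 64))
outputs = residual_block(inputs)
model = tf.keras.Model(inputs, outputs)
```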

Transfer learning
Transfer learning is a research problem in machine learning that focuses on storing
knowledge gained while solving one problem and applying it to a different but related problem.

Ways of doing Transfer Learning

1. Feature Extraction
2. Fine-tuning

Feature Extraction

✅ Freeze pre-trained model layers and use them as a fixed feature extractor.
✅ Replace the last few layers with a new classifier for the target task.
✅ Best for small datasets and when features from the pre-trained model are already useful.
✅ Fast training but may not generalize well if tasks are too different.

Fine-Tuning

✅ Unfreeze some or all pre-trained layers and retrain them on new data.
✅ Helps the model adapt its features to the new dataset.
✅ Best for larger datasets and when the new task requires more customization.
✅ Better accuracy but requires more training time and data.

ANN -> works on tabular data
CNN -> works on image data
RNN -> works on sequential data

When to Use What?


Type | Best For
ANN | Simple classification/regression problems with no sequential dependency.
CNN | Image/video processing, feature extraction from spatial data.
RNN | Speech recognition, text processing, and time series forecasting.



What is RNN

RNN (Recurrent Neural Network) is a type of neural network designed for sequential data
processing. It retains memory of past inputs using hidden states, making it useful for tasks like
speech recognition, language modelling, and time series prediction. However, RNNs suffer from
vanishing/exploding gradients, which limit their ability to learn long-term dependencies. Variants
like LSTM and GRU help overcome these issues.

Why RNNs?

 Handle variable-length sequences.


 Capture temporal dependencies (order matters).
 Suitable for tasks where context from past inputs is crucial.
 sequence contains some meaning

Architecture of RNN
. Input Layer:
-> Accepts sequential data (e.g., time-series, text).
-> Data is usually represented as vectors (e.g., word embeddings for text).
. Hidden Layers with Recurrent Connections:
-> Contain neurons that not only process the current input but also have “recurrent” connections to past inputs.
-> The hidden state stores information from previous time steps, allowing the network to “remember” past inputs.
. Output Layer:
-> Generates the output at each time step based on the hidden state.
. Time Dependency:
-> The network processes one input per time step and uses the hidden state to link information across time steps.

. Backpropagation Through Time(BPTT):


 The training process where errors are propagated backward through the entire
sequence to adjust weights.
 Special algorithm used to handle the temporal dependencies in the network.
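A minimal NumPy sketch of an RNN forward pass, showing how the same weights are reused at every time step and how the hidden state carries the “memory” (sizes are illustrative):

```python
import numpy as np

T, input_dim, hidden_dim = 10, 8, 16          # 10 time steps, 8-dim inputs, 16-dim hidden state
rng = np.random.default_rng(0)
x_seq = rng.normal(size=(T, input_dim))

W_xh = rng.normal(scale=0.1, size=(hidden_dim, input_dim))   # input-to-hidden weights
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))  # hidden-to-hidden (recurrent) weights
b_h = np.zeros(hidden_dim)

h = np.zeros(hidden_dim)                       # initial hidden state
for t in range(T):
    h = np.tanh(W_xh @ x_seq[t] + W_hh @ h + b_h)   # same weights reused at every time step

print(h.shape)   # final hidden state summarizes the whole sequence
```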

(Figure: an RNN cell with input, hidden state h, output, and a feedback connection that is unfolded across time steps.)

Why This Matters:

 This unfolding process is key to understanding how RNNs handle sequential data and
maintain a "memory" of past inputs.

 It highlights the recurrent nature of RNNs, where the same parameters are reused across
time steps, making them efficient for tasks like time series prediction, NLP, and speech
recognition.

Types of RNN

(Figure: many-to-one, one-to-many, many-to-many, and one-to-one RNN configurations.)

Types of RNN (Summary)

RNNs can be categorized based on how inputs and outputs are structured over time.

1. One-to-One (Vanilla NN)


o Standard neural network (not truly RNN).
o Single input → Single output.
o Example: Image classification.
2. One-to-Many
o Single input → Sequence output.
o Example: Image captioning (one image → multiple words).
3. Many-to-One
o Sequence input → Single output.
o Example: Sentiment analysis (sentence → sentiment score).
4. Many-to-Many (Same Length)
o Sequence input → Sequence output (same length).
o Example: Named Entity Recognition (NER) (word sequence → tag sequence).
5. Many-to-Many (Different Lengths)
o Sequence input → Sequence output (different lengths).
o Example: Machine translation (English sentence → French sentence).

Key Takeaways

 Forward propagation: Computes hidden states and outputs.

 Backward propagation (BPTT): Computes gradients and updates weights.

 Challenges: Vanishing gradient problem (long-term dependencies) / exploding gradient problems, addressed using LSTMs/GRUs.

Why Does the Vanishing Gradient Problem Occur?


Repeated multiplication of small gradient values.
Activation functions like sigmoid/tanh squash values, making gradients even smaller.
RNNs share weights across time steps, worsening the effect over long sequences.

What is LSTM

LSTM (Long Short-Term Memory) is a type of recurrent neural network (RNN) designed to handle sequential data and long-term dependencies by addressing the vanishing gradient problem.

Architecture of LSTM

The architecture of an LSTM (Long Short-Term Memory) network consists of several


components designed to manage and retain information over long sequences. The key
elements are:

1. Cell State: The memory of the LSTM, which carries important information
throughout the sequence and is updated by the gates. It is passed along from one time
step to the next.
2. Forget Gate: Decides what information from the cell state should be discarded. It
uses a sigmoid activation function to output values between 0 (completely forget) and
1 (completely retain).

3. Input Gate: Determines what new information should be added to the cell state. It
has two parts:
o A sigmoid layer, which decides which values to update.
o A tanh layer, which creates a vector of new candidate values that could be
added to the state.
4. Cell State Update: The cell state is updated by combining the old state (after being
filtered by the forget gate) with the new candidate values from the input gate.
5. Output Gate: Decides what the next hidden state should be (i.e., the output for the
current time step). It uses a sigmoid layer to filter information and a tanh layer to
ensure that the values stay between -1 and 1.

The flow can be summarized in the following steps:

1. The forget gate filters out irrelevant information from the previous cell state.

2. The input gate updates the cell state with new relevant information.

3. The cell state is updated and passed on to the next time step.

4. The output gate generates the hidden state (output).

This structure allows LSTMs to effectively capture long-term dependencies in sequential data, solving
issues that arise in traditional RNNs, such as the vanishing gradient problem
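A minimal Keras LSTM sketch for a many-to-one task such as sentiment classification (vocabulary size, sequence length, and the binary output are assumptions; the gates described above are handled inside the LSTM layer):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(100,), dtype="int32"),        # 100 token ids per sequence
    tf.keras.layers.Embedding(input_dim=10000, output_dim=64), # token embeddings
    tf.keras.layers.LSTM(128),                                 # forget/input/output gates inside
    tf.keras.layers.Dense(1, activation="sigmoid"),            # e.g. sentiment score
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```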

LSTM Applications:

LSTM (Long Short-Term Memory) networks are widely used in sequential data processing due to
their ability to retain long-term dependencies. Their key applications include:

1. Natural Language Processing (NLP) – Used in text generation, machine translation, speech
recognition, and chatbots.

2. Time Series Forecasting – Helps predict stock prices, weather patterns, and sales trends.

3. Speech & Audio Processing – Applied in speech recognition, music generation, and speaker
identification.

4. Healthcare & Biomedical – Used for ECG analysis, medical diagnosis, and drug discovery.

5. Anomaly Detection – Detects fraud, cybersecurity threats, and machinery failures.

6. Autonomous Systems – Helps in self-driving cars and robotics.

7. Computer Vision & Video Analysis – Used in action recognition, handwriting recognition,
and image captioning.

8. Reinforcement Learning – Enhances AI decision-making in gaming and automated systems.

LSTMs are essential for tasks requiring memory of past sequences, making them valuable in AI
applications.

What is a Gated Recurrent Unit (GRU)?

A GRU (Gated Recurrent Unit) is a simplified variant of the LSTM that merges the cell and hidden states and uses update and reset gates to control information flow. GRUs are often preferred when computational efficiency and simplicity are priorities, especially for smaller datasets or tasks where long-term dependencies are less critical. However, LSTMs may still be better for tasks requiring the modeling of very long-term dependencies or when working with large datasets.

Deep learning (overview):
. ANN – how to improve performance
. CNN – transfer learning, fine-tuning
. RNN – LSTM/GRU
. Seq2seq models – encoder-decoder, attention, Transformer

Seq2seq model

Encoder and Decoder

High-Level Overview

At a very high level, an encoder-decoder model can be thought of as two blocks, the encoder and the decoder, connected by a vector which we will refer to as the ‘context vector’.

 Encoder: The encoder processes each token in the input-sequence. It


tries to cram all the information about the input-sequence into a vector of
fixed length i.e. the ‘context vector’. After going through all the tokens,
the encoder passes this vector onto the decoder.
 Context vector: The vector is built in such a way that it's expected to
encapsulate the whole meaning of the input-sequence and help the
decoder make accurate predictions. We will see later that this is the final
internal states of our encoder block.
 Decoder: The decoder reads the context vector and tries to predict the
target-sequence token by token.

What’s under the hood?

The internal structure of both blocks would look something like this:

Explanation of the Image

This image represents a sequence-to-sequence (Seq2Seq) model with Long Short-Term


Memory (LSTM) networks, commonly used for tasks like machine translation.

 Encoder: Processes the input sentence and extracts meaningful context.


 Decoder: Uses this context to generate the output sequence step by step.
 Hidden State & Cell State: These states transfer the learned knowledge from the encoder to
the decoder.
 Seq2Seq Model: This type of model is widely used in machine translation, text summarization, and chatbot development.

Challenges of Encoders and Decoders

Encoders and decoders in machine learning, especially in sequence-to-sequence models, face several challenges:

 Encoder Problems: Loss of information, vanishing gradients, inefficiency in handling long sequences, and overfitting.
 Decoder Problems: Exposure bias, error accumulation, slow inference, difficulty with rare words, and alignment issues.

Solutions include attention mechanisms, transformers, beam search, pre-training, and improved
training techniques like scheduled sampling.

Attention Mechanism
The attention mechanism is a technique used in neural networks to focus on the most
relevant parts of the input when making predictions. It is widely used in sequence-to-
sequence (Seq2Seq) models, especially in tasks like machine translation, text
summarization, and speech recognition.

How It Works
Instead of encoding the entire input into a fixed-length vector (which can
lead to information loss), attention dynamically assigns different weights
to different input elements. This allows the model to focus on the most
important parts of the input at each step.
1. Calculate Attention Scores: The model computes similarity scores
between each input token and the current decoding step.
2. Assign Weights: These scores are converted into probabilities (using
Softmax), determining how much attention each input token gets.
3. Weighted Sum: The weighted sum of the input representations is
computed and used for the final prediction.
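A minimal NumPy sketch of these three steps as scaled dot-product attention (shapes are hypothetical):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # 1. similarity scores (scaled dot products)
    weights = softmax(scores)           # 2. attention weights that sum to 1
    return weights @ V                  # 3. weighted sum of value vectors

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))   # 4 decoder/query positions, dimension 8
K = rng.normal(size=(6, 8))   # 6 encoder/input positions
V = rng.normal(size=(6, 8))
print(attention(Q, K, V).shape)   # (4, 8): one context vector per query position
```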

Transformer

A Transformer is a deep learning model introduced in "Attention Is All You Need" (2017). It replaces RNNs and LSTMs with self-attention and multi-head attention, enabling parallel processing of sequences for better efficiency and scalability.

Why Transformers?
They handle long-range dependencies, support parallel processing, and
scale efficiently.

Key Features:

 Self-Attention: Captures dependencies between words in a sequence.


 Multi-Head Attention: Learns multiple contextual representations.
 Positional Encoding: Preserves word order.
 Feedforward Layers & Residual Connections: Improve learning stability.
 Encoder-Decoder Architecture: Used in translation and text generation.

Applications:
 NLP: BERT, GPT, T5 for text tasks.
 Vision: Vision Transformers (ViT) for image processing.
 Speech: Whisper, Wav2Vec for speech-to-text.

Advantages of Transformers

 Handles long-range dependencies with self-attention.


 Highly parallelizable for faster training.
 Performs well on large datasets (e.g., BERT, GPT).
 Versatile applications in NLP, vision, and speech.
 No sequential computation like RNNs, reducing bottlenecks.

Disadvantages of Transformers

 High computational cost and requires powerful hardware.


 Memory-intensive, especially for long sequences (O(n²) complexity).
 Needs large datasets for effective training.
 Difficult to interpret compared to traditional models.
 Not ideal for small-scale tasks, as simpler models may perform sufficiently.

Self Attention
[Self-attention is a mechanism that takes static embeddings as input and generates contextual embeddings, which are much more useful for any kind of NLP application.]

1. Inputs (Tokens): Represented by green blocks at the top, these could be words like
"money," "bank," etc.
2. Query, Key, and Value Vectors (Q, K, V):
o The pink blocks represent Key (K) vectors.
o The green blocks represent Query (Q) vectors.
o The blue blocks at the bottom represent Value (V) vectors.
3. Softmax Layer:
o The yellow box labeled "Softmax" computes attention scores for each token
by comparing the Query with all Keys.
o It assigns importance scores (s11, s12, s13, etc.) based on similarity.
4. Weighted Sum of Values:
o The final output (purple block) is obtained by summing the values (V)
weighted by the attention scores.

How It Works

1. Each token attends to all other tokens in the sequence, computing attention scores using
the dot product of its Query (Q) with all Keys (K).
2. Softmax normalizes the scores, ensuring they sum to 1.
3. Weighted sum of Value (V) vectors gives the final contextualized representation for each
word.

Points to consider:

 This operation is fully parallel.

Problems of plain self-attention:

 There are no learnable parameters involved.

Adding learnable parameter matrices (W_q, W_k, W_v), as in the figure above, produces task-specific contextual embeddings while keeping the computation parallel.

. Scaled dot-product attention (dividing the scores by √d_k) is used so that the gradients and training do not become unstable.

Multi-Head Attention
Multi-head attention is an extension of the self-attention mechanism in deep learning,
primarily used in transformers (e.g., BERT, GPT). It allows the model to attend to different
parts of the input sequence simultaneously, improving its ability to capture complex
relationships.

Key Points:

 Multiple Attention Heads: Instead of a single attention mechanism, multiple


attention heads process the input in parallel.
 Diverse Focus: Each head learns different patterns (e.g., short-term vs. long-term
dependencies).
 Improved Performance: Helps models better understand contextual relationships in
NLP, vision, and time-series data.

Formula:

MultiHead(Q, K, V) = Concat(head_1, head_2, ..., head_h) W^O

where each head applies scaled dot-product attention.

Multi-head attention is an extension of the self-attention mechanism, where the model computes multiple sets of attention scores (heads) in parallel and concatenates their outputs. This allows the model to focus on different parts of the input sequence simultaneously, capturing multiple types of relationships between tokens.

Benefits:

 Improves the ability to capture different patterns in the input.


 Provides more robust representations by combining multiple attention heads.
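A sketch using the built-in Keras multi-head attention layer for self-attention (query, key, and value are the same tensor; the head count and dimensions are illustrative):

```python
import tensorflow as tf

mha = tf.keras.layers.MultiHeadAttention(num_heads=8, key_dim=64)

x = tf.random.normal((2, 10, 512))        # (batch, sequence length, model dimension)
out = mha(query=x, value=x, key=x)        # heads run in parallel; outputs are concatenated + projected
print(out.shape)                          # (2, 10, 512)
```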

Positional Encoding in Transformers

Since transformers do not have a built-in notion of sequence order (unlike RNNs or CNNs),
positional encoding (PE) is used to provide positional information about words in a
sequence.

Why Positional Encoding?

Transformers process input in parallel rather than sequentially. Without positional encoding,
the model would treat words as a bag of words, ignoring word order. PE allows the model to
distinguish the positions of words in a sentence.
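A minimal NumPy sketch of the sinusoidal positional encoding from "Attention Is All You Need" (sequence length and model dimension are illustrative):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]                  # (max_len, 1) positions
    i = np.arange(d_model)[None, :]                    # (1, d_model) dimension indices
    angle = pos / np.power(10000, (2 * (i // 2)) / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle[:, 0::2])               # even dimensions: sine
    pe[:, 1::2] = np.cos(angle[:, 1::2])               # odd dimensions: cosine
    return pe

pe = positional_encoding(max_len=50, d_model=128)
# The encoding is added to the token embeddings so the model can tell positions apart.
```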

What is Normalization
Normalization in deep learning refers to the process of transforming data or model outputs to have specific statistical properties, typically a mean of zero and a variance of one.

Key Benefits:

✅ Prevents exploding/vanishing gradients


✅ Stabilizes training for deeper models
✅ Speeds up training and improves efficiency
✅ Works well with attention mechanisms
✅ Independent of batch size (unlike BatchNorm)

Modern transformers mostly use Pre-Normalization (Pre-LN) for better stability.

Normalization in Transformers
Normalization in Transformers stabilizes training and improves convergence. The most common
method is Layer Normalization (LN), which normalizes across features for each token independently,
making it effective for variable sequence lengths

Why Normalization is Needed?



 Transformers process sequences where activations can have varying magnitudes.

 Normalization ensures stable gradients and efficient learning.

 It helps prevent issues like internal covariate shift.

why not use batch normalization in transformer

Batch Normalization (BN) is not used in Transformers because it depends on batch statistics, which
are unstable due to variable sequence lengths and small batch sizes. Instead, Layer Normalization
(LN) is preferred as it normalizes across features per token, making it more stable for self-attention
mechanisms. BN can also interfere with attention dynamics, whereas LN works more effectively in
Transformers.

Transformer as an Autoregressive models

The Transformer decoder is autoregressive at inference time and non-autoregressive at training time.

In the context of deep learning, autoregressive models are a class of models that generate data points in a sequence by conditioning each new point on the previously generated points.

Masked Multi-Head Self-Attention

1. What is It?

Masked Multi-Head Self-Attention is a modified version of self-attention used in the Transformer


decoder. It prevents the model from "looking ahead" at future tokens when generating text,
ensuring that predictions are made sequentially.

2. Why is Masking Needed?

When generating text, the model should predict the next word based only on previous words. If it
could see future words during training, it would "cheat" and learn incorrect patterns.

How Masking Works


A masking matrix is applied to the attention scores before the softmax step. The mask:

Sets future token scores to −∞ (or a very large negative number), making their softmax
probabilities zero.

Ensures the model cannot attend to tokens that come after the current position.

Mathematically, masking modifies the attention mechanism as:

Masked Attention(Q, K, V) = softmax(QK^T / √d_k + M) V

where M is a matrix where:

Allowed positions (past & current tokens) have 0.

Future positions have −∞.
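A minimal NumPy sketch of the causal mask described above: future positions receive a very large negative score, so their softmax weight becomes effectively zero:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def masked_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    mask = np.triu(np.ones_like(scores), k=1) * -1e9   # 0 for past/current, ~-inf for future
    return softmax(scores + mask) @ V

rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(5, 8))     # 5 target positions, dimension 8
out = masked_attention(Q, K, V)         # position t only attends to positions <= t
```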

What is Cross Attention

Cross attention is a mechanism used in transformer architectures, particularly in tasks involving sequence-to-sequence data like translation or summarization. It allows a model to focus on different parts of an input sequence when generating an output sequence.

Transformer Architecture in Deep Learning

The Transformer is a deep learning model introduced in the paper "Attention Is All You Need" by
Vaswani et al. (2017). It has revolutionized Natural Language Processing (NLP) and is widely used in
large language models (LLMs) like GPT, BERT, and T5. Unlike recurrent networks, Transformers
process input in parallel, making them more efficient and scalable.

Advantages of Transformers

✅ Parallelization – Unlike RNNs, they process all tokens at once.


✅ Long-Range Dependencies – Self-attention allows capturing relationships between distant words.
✅ Scalability – Suitable for training large models with massive datasets.

Summary of Transformer Encoder


The encoder in a Transformer processes an input sequence and generates a rich,
contextualized representation of each token. It consists of multiple identical layers (typically
6 or more), where each layer has two main components:

1. Multi-Head Self-Attention – Helps each word attend to all other words in the sequence.
2. Feed-Forward Neural Network (FFN) – Applies non-linearity and transformations.

Each layer also includes residual connections and layer normalization to stabilize learning.

Processing Steps in Each Encoder Layer:

1. Input Embedding + Positional Encoding – Converts words into vectors and adds positional
information.
2. Multi-Head Self-Attention – Computes attention scores to capture relationships between
words.
3. Add & Normalize – Uses residual connections and layer normalization.
4. Feed-Forward Neural Network (FFN) – Applies non-linear transformations.
5. Add & Normalize (Again) – Another residual connection and layer normalization.

After passing through multiple encoder layers, the final output is a contextual
representation of the input, which is used by the decoder for further processing.


🔍 Transformer Decoder – Summary

The Transformer Decoder takes the encoder’s output and a target sequence (e.g. previous
words in translation) to generate the final output sequence one token at a time during
inference.

🧱 Structure

 Composed of N identical layers (typically same as the encoder: 6).


 Each decoder layer has three main sub-layers:
1. Masked Multi-Head Self-Attention – Allows each position to attend only to
earlier positions in the target sequence (to prevent "cheating").
2. Multi-Head Cross-Attention – Lets the decoder attend to the encoder's
output.
3. Feed-Forward Neural Network (FFN) – Adds transformations and non-
linearity.

🔄 Steps in Each Decoder Layer

1. Input Embedding + Positional Encoding


o Target tokens are embedded + positional info added.
2. Masked Multi-Head Self-Attention
o Prevents attending to future tokens (important during training and inference).

3. Add & Normalize


o Residual connection + layer normalization.
4. Multi-Head Cross-Attention (Encoder-Decoder Attention)
o Queries from decoder attend to key-values from encoder outputs.
5. Add & Normalize
o Another residual connection + normalization.
6. Feed-Forward Network (FFN)
o Applies transformation to each position independently.
7. Add & Normalize
o Final normalization before passing to next decoder layer or output layer.

🧠 Final Output

 The decoder generates a probability distribution over the next token in the
sequence.
 During inference, it generates tokens step-by-step, feeding previous outputs back in.


1. What is BERT, and why is it important?


Answer:
BERT (Bidirectional Encoder Representations from Transformers) is a pre-trained NLP model
developed by Google. It is designed to understand the context of words in a sentence
bidirectionally, unlike traditional NLP models that process text sequentially.

 It is important because it improves accuracy in NLP tasks such as question answering,


sentiment analysis, text classification, and named entity recognition (NER).

2. How does BERT differ from previous NLP models?

Answer:
BERT is different because:

1. It uses bidirectional training (considers both left and right context in a sentence).

2. Previous models like Word2Vec and GloVe created static word embeddings, but BERT generates contextual embeddings based on the sentence.

3. It is pre-trained on large datasets and then fine-tuned for specific NLP tasks.

3. What are the key components of BERT's architecture?

Answer:
BERT is based on the Transformer architecture and consists of:

 Multi-head Self-Attention (helps in understanding word dependencies)

 Feed-forward Neural Networks

 Positional Encoding (as Transformers do not process text sequentially)

 Layer Normalization

 Residual Connections

5. What are the main pre-training tasks used in BERT?



Answer:
BERT is pre-trained using two tasks:

1. Masked Language Model (MLM):

o 15% of words in the input are masked, and BERT predicts the missing words.

2. Next Sentence Prediction (NSP):

o BERT is given two sentences and learns to predict if the second sentence follows the
first one or not.

6. How does BERT handle out-of-vocabulary (OOV) words?

Answer:
BERT uses WordPiece tokenization, which splits rare words into smaller subwords or characters.
Example:

 The word “unaffordable” → is split into [‘un’, ‘##afford’, ‘##able’]


This allows BERT to handle unknown words efficiently.

7. How is BERT fine-tuned for specific NLP tasks?

Answer:

 A pre-trained BERT model is taken.

 A task-specific layer (such as a classifier or regression layer) is added on top.

 The model is fine-tuned using a smaller labeled dataset for specific tasks like sentiment analysis, named entity recognition (NER), question answering, or text classification.
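A minimal Hugging Face transformers sketch of this fine-tuning setup (here for binary sentiment classification; the label count and training details are assumptions):

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)         # pre-trained encoder + new task-specific head

inputs = tokenizer("This movie was great!", return_tensors="pt")
outputs = model(**inputs)                       # logits for the 2 classes
# Fine-tune the whole model with transformers.Trainer or a standard PyTorch
# training loop on the labeled task data.
```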

8. What are the advantages and limitations of BERT?

Answer:

✅ Advantages:

 Better context understanding due to bidirectional processing.


 Improves accuracy in NLP tasks like search engines, chatbots, and QA systems.
 Pre-trained on large corpora, reducing training effort.

❌ Limitations:

 High computational cost (requires powerful GPUs/TPUs).


 Slow inference time due to its size.
 Fixed-length input (512 tokens max), making it difficult for long documents.

Summary of BERT Training

BERT (Bidirectional Encoder Representations from Transformers) is pre-trained on:

1. Book Corpus (800M words) – A large collection of books to help BERT learn long-
range dependencies.
2. English Wikipedia (2.5B words) – Wikipedia text (excluding tables and lists) to
provide factual and diverse language understanding.

BERT is trained using Masked Language Modelling (MLM) and Next Sentence
Prediction (NSP). After pre-training, it can be fine-tuned on specific NLP tasks like question
answering, sentiment analysis, and named entity recognition (NER).

Why use BERT for fine-tuning?

“BERT is already trained on a huge amount of text, so it understands the language very well.
Instead of training a model from scratch, we can fine-tune BERT on our specific task, like sentiment
analysis or question answering. This saves time, needs less data, and gives better results because
BERT already knows how language works.”

Short example (if asked):

“For example, if I want to build a model to detect spam messages, I can take BERT and fine-tune it
on labeled spam data. BERT already understands grammar and context, so it learns the spam
patterns quickly.”
How can you improve your BERT NER model?

Simple Interview Answer:

"I can improve my BERT NER model in a few ways:

 Use a better version of BERT like RoBERTa or DistilBERT.

 Add more training data or create similar examples (data augmentation).

 Tune the training settings like learning rate, batch size, and number of training steps.

 Handle imbalanced data so the model learns all entity types better.

 Add a CRF layer on top of BERT to improve how the model predicts sequences.

 Clean and preprocess the data well, like fixing label errors or token issues."


1. What is the T5 model in deep learning?


Answer:
T5 (Text-to-Text Transfer Transformer) is a transformer-based NLP model developed by Google
Research, where all NLP tasks (summarization, translation, question answering, etc.) are treated as
text-to-text problems. It uses an encoder-decoder architecture and is trained on a large dataset
called C4 (Colossal Clean Crawled Corpus).

2. How is T5 different from BERT and GPT?

Answer:

Feature | T5 | BERT | GPT
Architecture | Encoder-Decoder | Encoder-only | Decoder-only
Objective | Text-to-Text | Masked LM | Auto-Regressive
Task Type | Generation + Understanding | Understanding | Generation
Strengths | Handles multiple NLP tasks | Strong contextual understanding | Natural text generation
Weaknesses | Requires large-scale training | Cannot generate text | Expensive

4. What dataset was used to train T5?

Answer:
T5 was trained on C4 (Colossal Clean Crawled Corpus), a massive dataset created from web pages.
This dataset was cleaned to remove low-quality text and was used for pretraining.

5. What is the training objective of T5?

Answer:
T5 is trained using a denoising autoencoder objective. This means:

 The input text is corrupted by randomly masking some words.


 The model learns to reconstruct the original sentence.

Example:

 Input: "The quick brown [MASK] jumps over the lazy [MASK]."
 Output: "The quick brown fox jumps over the lazy dog."

This is similar to BERT's masked language modeling, but T5 generates the full sequence
instead of just predicting individual words.

7. What is the architecture of T5?

Answer:
T5 follows the Transformer Encoder-Decoder architecture:

 Encoder: Reads the input text and creates representations.

 Decoder: Generates the target text.

 Self-Attention: Helps model long-range dependencies.

 Feedforward Networks: Used in both encoder and decoder.
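A minimal Hugging Face transformers sketch of T5's text-to-text interface (the prompt and model size are illustrative):

```python
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

text = "translate English to German: The house is wonderful."   # task expressed as text
inputs = tokenizer(text, return_tensors="pt")                    # encoder input
output_ids = model.generate(**inputs, max_new_tokens=20)         # decoder generates target text
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```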

10. What are the advantages and limitations of T5?

Answer:

✅ Advantages:

 Unified framework for multiple NLP tasks.

 Pretrained on a large corpus, making it powerful.

 Flexible (can be fine-tuned for any task).

❌ Limitations:

 Computationally expensive (especially larger versions).

 Requires a large dataset for fine-tuning.

 May generate incorrect or biased outputs if the training data contains biases.