Deep learning notes
Deep learning is part of a broader family of machine learning methods based on artificial neural networks with representation learning.
Deep learning algorithms use multiple layers to progressively extract higher-level features from the raw input. For example, in image processing, lower layers may identify edges, while higher layers may identify concepts relevant to a human, such as digits, letters, or faces.
Why Deep Learning?
1. High Accuracy – Achieves state-of-the-art results in vision, NLP, and speech tasks.
2. Big Data Availability – Thrives on large datasets, which are now more accessible.
3. Powerful Hardware – GPUs and TPUs enable efficient training of deep networks.
4. Automation & Adaptability – Learns complex patterns without manual feature
engineering.
5. Real-World Applications – Used in AI assistants, self-driving cars, medical
diagnosis, and more.
Its success is driven by improved algorithms, computing power, and vast amounts of data.
When to Use
Answer:
Deep learning is a subset of machine learning that uses neural networks with multiple layers
(deep neural networks) to learn from large amounts of data.
A Perceptron is the simplest type of artificial neural network, primarily used for
binary classification. It is a fundamental building block in deep learning and serves as the
basis for more complex neural networks.
Key Points:
Types:
Limitations: A single-layer perceptron cannot solve the XOR problem, but MLPs
(with backpropagation) can.
More complex problems require Multi-Layer Perceptrons (MLPs) with hidden layers
and non-linear activations.
A Multilayer Perceptron (MLP) is a type of artificial neural network composed of multiple layers of
neurons. Unlike a simple perceptron, which has a single layer of neurons, an MLP includes input,
hidden, and output layers. Each layer is fully connected to the next one.
1. Structure of MLP:
2. How It Works:
Forward Propagation:
o The input data is passed through the network layer by layer.
o Each neuron in a layer calculates a weighted sum of its inputs and applies an
activation function.
o The result of one layer is passed as input to the next layer until the output layer
provides the final prediction.
Backward Propagation:
o After forward propagation, the loss (error) between the predicted output and
the actual target is calculated.
o The error is propagated backward through the network to compute the
gradients of the loss function with respect to the weights.
o These gradients are used to adjust the weights using an optimization algorithm
like gradient descent.
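To make the structure concrete, here is a minimal PyTorch sketch of an MLP forward pass (the layer sizes and batch size are illustrative assumptions, not taken from these notes):

```python
import torch
import torch.nn as nn

# A minimal MLP: input -> hidden -> output, fully connected layers with a non-linear activation
mlp = nn.Sequential(
    nn.Linear(784, 128),   # input layer -> hidden layer (weighted sums)
    nn.ReLU(),             # non-linear activation
    nn.Linear(128, 10),    # hidden layer -> output layer (class scores)
)

x = torch.randn(32, 784)   # a batch of 32 illustrative input vectors
logits = mlp(x)            # forward propagation, layer by layer
print(logits.shape)        # torch.Size([32, 10])
```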
Regression loss:
MSE (Mean Squared Error) -> squares the errors, so large errors (outliers) are penalized heavily.
MAE (Mean Absolute Error) -> penalizes errors linearly, so it is more robust to outliers.
Huber loss -> combines MSE and MAE: if outliers are present in the data it behaves like MAE, and if outliers are not present it behaves like MSE.
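A small PyTorch sketch comparing these regression losses (the values are illustrative; nn.HuberLoss assumes a reasonably recent PyTorch version):

```python
import torch
import torch.nn as nn

pred   = torch.tensor([2.5, 0.0, 2.0, 8.0])
target = torch.tensor([3.0, -0.5, 2.0, 7.0])

mse   = nn.MSELoss()(pred, target)             # squares errors: sensitive to outliers
mae   = nn.L1Loss()(pred, target)              # absolute errors: robust to outliers
huber = nn.HuberLoss(delta=1.0)(pred, target)  # MSE-like for small errors, MAE-like beyond delta
print(mse.item(), mae.item(), huber.item())
```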
Classification loss:
Gradient Descent is an optimization algorithm used in deep learning to minimize the error
(loss function) by adjusting the model’s parameters (weights & biases). It helps neural
networks learn by iteratively improving predictions.
How It Works:
o Forward Pass: Compute predictions using the current
parameters.
o Loss Calculation: Measure the error between predictions and
true labels.
o Backpropagation: Compute gradients of the loss with
respect to each parameter.
o Parameter Update: Adjust parameters in the direction of the
negative gradient.
o Repeat: Iterate until the loss converges.
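A minimal sketch of this loop in PyTorch (the model, data, and learning rate here are illustrative placeholders):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)                      # illustrative model
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

x = torch.randn(64, 10)                       # illustrative data
y = torch.randn(64, 1)

for step in range(100):
    pred = model(x)                           # forward pass
    loss = loss_fn(pred, y)                   # loss calculation
    optimizer.zero_grad()
    loss.backward()                           # backpropagation: compute gradients
    optimizer.step()                          # parameter update along the negative gradient
```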
Stochastic Gradient Descent (SGD): updates weights after each training example. Pros: faster updates, good for online learning. Cons: noisy updates can lead to instability.
Mini-Batch Gradient Descent: uses a small batch of data for each update. Pros: balances speed and stability. Cons: requires tuning the batch size.
📌 Example: SGD is widely used in deep learning frameworks like TensorFlow & PyTorch
for training deep networks efficiently.
The vanishing gradient problem occurs when deep neural networks struggle to update weights in the earlier layers because the gradients become very small during backpropagation. This slows down or even stops learning. Here are key solutions:
Overfitting
Benefits:
✔ Prevents overfitting
✔ Reduces training time
✔ Improves generalization
(The nodes in a layer are dropped only during training; at test time all nodes are available, and the weights are scaled by (1 − p).)
Dropout is a technique in deep learning that helps prevent overfitting by randomly turning off a
percentage of neurons during training. This forces the network to learn more generalized patterns
instead of relying too much on specific neurons. The dropout rate controls how many neurons are
deactivated (e.g., 0.5 means half the neurons are dropped). During testing or inference, all neurons
are active, but their outputs are scaled to maintain consistency with training.
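A minimal sketch of dropout in PyTorch (layer sizes and the dropout rate are illustrative; note that PyTorch uses inverted dropout, scaling during training rather than at test time, but the effect described above is the same):

```python
import torch
import torch.nn as nn

net = nn.Sequential(
    nn.Linear(256, 128),
    nn.ReLU(),
    nn.Dropout(p=0.5),     # randomly zeroes 50% of activations during training
    nn.Linear(128, 10),
)

net.train()                # dropout active: random neurons are dropped
out_train = net(torch.randn(4, 256))

net.eval()                 # dropout inactive: all neurons are used at inference
out_eval = net(torch.randn(4, 256))
```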
1. L1 and L2 Regularization
L1 Regularization (Lasso):
o Adds the sum of the absolute values of the weights to the loss
function.
o Encourages sparsity by driving some weights to zero,
effectively performing feature selection.
o Loss function: Loss=Original Loss + λ∑∣wi∣.
L2 Regularization (Ridge):
o Adds the sum of the squared values of the weights to the loss
function.
o Encourages small weights but does not force them to zero.
o Loss function: Loss = Original Loss + λ ∑wi².
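A hedged sketch of both penalties in PyTorch (the λ values are illustrative; weight_decay is PyTorch's built-in way to apply an L2-style penalty, while the L1 term is added to the loss by hand):

```python
import torch
import torch.nn as nn

model = nn.Linear(20, 1)
criterion = nn.MSELoss()

# L2-style regularization via weight_decay (penalizes large weights at each update)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)

x, y = torch.randn(8, 20), torch.randn(8, 1)
loss = criterion(model(x), y)

# L1 regularization added manually: lambda * sum(|w|)
l1_lambda = 1e-4
l1_penalty = sum(p.abs().sum() for p in model.parameters())
loss = loss + l1_lambda * l1_penalty

optimizer.zero_grad()
loss.backward()
optimizer.step()
```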
Activation Functions
1. Linear
o f(x) = ax
o Issue: Cannot handle non-linear problems, limiting learning.
2. Leaky ReLU
o f(x) = x if x>0, else 0.01x
o Fixes dead neurons issue in ReLU.
3. Sigmoid (Logistic Function)
o f(x) = 1 / (1 + e^(−x))
o Pros: Converts inputs to range (0,1), useful for probability-based tasks.
o Cons: Prone to vanishing gradients, slow learning.
4. Tanh (Hyperbolic Tangent Function)
o Pros: Outputs in range (-1,1), better than sigmoid for deep networks.
o Cons: Still suffers from vanishing gradients.
5. Softmax
o Used in multi-class classification to convert logits into probabilities.
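A quick sketch of these activations using PyTorch functions:

```python
import torch
import torch.nn.functional as F

x = torch.linspace(-3, 3, 7)

print(F.relu(x))                              # max(0, x)
print(F.leaky_relu(x, negative_slope=0.01))   # x if x > 0 else 0.01 * x
print(torch.sigmoid(x))                       # 1 / (1 + exp(-x)), range (0, 1)
print(torch.tanh(x))                          # range (-1, 1)
print(F.softmax(x, dim=0))                    # probabilities summing to 1 (multi-class output)
```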
Optimizers are algorithms that adjust the weights of a neural network to minimize the loss function
and improve model performance. They help in faster and more stable training.
Or
Optimizers are algorithms or methods used to change the attributes of a neural network, such as the weights and the learning rate, in order to reduce the loss.
Gradient Descent is essential for training deep learning models but faces
challenges like vanishing/exploding gradients, slow convergence, local
minima, learning rate, and sensitivity to hyperparameters.
Types of Optimizers
Adagrad: adjusts learning rates based on past gradients (larger updates for infrequent parameters).
Pros: Works well for sparse data (where most values are zero).
Cons: Learning rate keeps decreasing, which can slow down training.
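For reference, a sketch of how common optimizers (including Adagrad) are instantiated in PyTorch; the learning rates are illustrative defaults, not recommendations from these notes:

```python
import torch

params = [torch.nn.Parameter(torch.randn(10, 10))]

sgd     = torch.optim.SGD(params, lr=0.01, momentum=0.9)  # simple, often needs LR tuning
adagrad = torch.optim.Adagrad(params, lr=0.01)            # per-parameter LR, good for sparse data
rmsprop = torch.optim.RMSprop(params, lr=0.001)           # counters Adagrad's decaying LR
adam    = torch.optim.Adam(params, lr=0.001)              # momentum + adaptive LR, common default
```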
What is CNN
A CNN (Convolutional Neural Network) is a type of deep learning model primarily used for image
processing, computer vision, and pattern recognition tasks
Key Layers (Components):
Applications:
Advantages:
Automatically learns spatial hierarchies of features.
Highly effective for image-related tasks.
Reduces the need for manual feature engineering
Limitations:
A Convolutional Neural Network (CNN) is a deep learning model used for image recognition
and processing. Its key layers include:
CNNs are effective in tasks like image classification, object detection, and facial recognition.
CNN Architecture
Backpropagation in CNN
Backpropagation in Convolutional Neural Networks (CNNs) is the process of updating the network’s
weights based on the error computed at the output layer. It follows the same fundamental principles
as backpropagation in traditional neural networks but accounts for the unique layers in CNNs, such
as convolutional and pooling layers.
Input image is passed through multiple convolutional, activation, pooling, and fully
connected layers.
Compute the derivative of the loss with respect to the output layer weights (Fully
Connected Layer).
Backpropagate the error through the fully connected layers using the chain rule.
Compute gradients for the convolutional layers by applying the chain rule to the activation
maps and filters.
Pooling layers (if max pooling) propagate gradients only for the max activation locations.
Update the weights of the filters and fully connected layers using gradient descent:
W=W−α∂L/∂W
The process is repeated over multiple epochs until the loss stabilizes.
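A minimal CNN sketch in PyTorch showing one forward and backward pass (the architecture and sizes are illustrative assumptions):

```python
import torch
import torch.nn as nn

# A minimal CNN: convolution -> ReLU -> pooling -> fully connected classifier
cnn = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1),  # convolutional layer (8 filters)
    nn.ReLU(),
    nn.MaxPool2d(2),                            # 2x2 max pooling
    nn.Flatten(),
    nn.Linear(8 * 14 * 14, 10),                 # fully connected classifier
)

x = torch.randn(16, 1, 28, 28)                  # e.g. a batch of 28x28 grayscale images
labels = torch.randint(0, 10, (16,))
loss = nn.CrossEntropyLoss()(cnn(x), labels)
loss.backward()                                 # backprop through FC, pooling, and conv layers
```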
Answer:
Pretrained models in Convolutional Neural Networks (CNNs) are deep learning models that
have already been trained on large datasets (such as ImageNet). These models can be used for
transfer learning, where they are fine-tuned on new tasks with smaller datasets, significantly
reducing training time and computational cost.
. Pretrained on ImageNet:
VGG16 is pretrained on the ImageNet dataset, which contains over 1 million images across 1000 classes. This makes it a powerful feature extractor for many computer vision tasks.
. Transfer Learning:
It can be fine-tuned for specific tasks, saving time and computational resources.
VGG16 Architecture
VGG16 is a deep convolutional neural network (CNN) with 16 weighted layers (13 convolutional + 3
fully connected). It uses 3×3 convolutional filters, ReLU activation, and 2×2 max-pooling layers for
feature extraction. The final layers consist of three fully connected layers, ending with a Softmax
classifier for image classification.
Applications:
What is ResNet?
ResNet (Residual Network) is a deep learning architecture designed to train very deep
neural networks effectively. It was introduced by Microsoft in 2015 and became famous for
winning the ImageNet competition due to its superior performance.
Skip Connections
. These "shortcut" connections allow information from earlier layers to bypass some layers and contribute directly to later layers, helping gradients flow.
. They prevent problems like vanishing/exploding gradients and improve regularization.
Experiments:
. ResNet has been successfully tested with 100-1000 layers on datasets like CIFAR-10,
demonstrating its scalability and effectiveness.
Y = F(X) + X
Where:
. Y is the output of the residual block,
. F(X) is the residual mapping learned by the stacked layers,
. X is the input passed unchanged through the skip connection.
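A minimal residual-block sketch in PyTorch mirroring Y = F(X) + X (the channel counts are illustrative):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # F(X): two conv layers that learn the residual mapping
        self.f = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.f(x) + x)   # Y = F(X) + X (skip connection)

block = ResidualBlock(64)
print(block(torch.randn(1, 64, 32, 32)).shape)  # torch.Size([1, 64, 32, 32])
```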
Transfer learning
Transfer learning is a research problem in machine learning that focuses on storing
knowledge gained while solving one problem and applying it to a different but related problem.
Transfer learning is typically applied in two ways: Feature Extraction and Fine-Tuning.
Feature Extraction
✅ Freeze pre-trained model layers and use them as a fixed feature extractor.
✅ Replace the last few layers with a new classifier for the target task.
✅ Best for small datasets and when features from the pre-trained model are already useful.
✅ Fast training but may not generalize well if tasks are too different.
Fine-Tuning
✅ Unfreeze some or all pre-trained layers and retrain them on new data.
✅ Helps the model adapt its features to the new dataset.
✅ Best for larger datasets and when the new task requires more customization.
✅ Better accuracy but requires more training time and data.
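A hedged sketch of both strategies using a pretrained VGG16 from torchvision (the weights argument, the 5-class head, and which layers to unfreeze are illustrative and depend on the torchvision version and the task):

```python
import torch.nn as nn
from torchvision import models

model = models.vgg16(weights="IMAGENET1K_V1")   # pretrained on ImageNet

# Feature extraction: freeze the pretrained layers, train only a new classifier head
for param in model.features.parameters():
    param.requires_grad = False
model.classifier[6] = nn.Linear(4096, 5)        # e.g. 5 target classes

# Fine-tuning: additionally unfreeze some (or all) pretrained layers and retrain them
for param in model.features[-5:].parameters():  # e.g. unfreeze the last conv block
    param.requires_grad = True
```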
What is RNN
RNN (Recurrent Neural Network) is a type of neural network designed for sequential data
processing. It retains memory of past inputs using hidden states, making it useful for tasks like
speech recognition, language modelling, and time series prediction. However, RNNs suffer from
vanishing/exploding gradients, which limit their ability to learn long-term dependencies. Variants
like LSTM and GRU help overcome these issues.
Why RNNs?
Architecture of RNN
. Input Layer:
-> Accepts sequential data (e.g., time series, text).
-> Data is usually represented as vectors (e.g., word embeddings for text).
. Hidden Layers with Recurrent Connections:
-> Contain neurons that not only process the current input but also receive recurrent connections carrying information from past time steps.
-> The hidden state stores information from previous time steps, allowing the network to "remember" past inputs.
. Output layer:
->Generates the output at each time step based on the hidden state.
Time Dependency:
->The network processes one input at a time step and uses the hidden state to link
information across time steps.
(Diagram: an RNN cell with an input, a hidden state h, and an output, plus a feedback connection from the hidden state back to itself; the cell is unfolded across time steps.)
This unfolding process is key to understanding how RNNs handle sequential data and
maintain a "memory" of past inputs.
It highlights the recurrent nature of RNNs, where the same parameters are reused across
time steps, making them efficient for tasks like time series prediction, NLP, and speech
recognition.
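A minimal sketch of a recurrent layer in PyTorch (the input size, hidden size, and sequence length are illustrative):

```python
import torch
import torch.nn as nn

rnn = nn.RNN(input_size=8, hidden_size=16, batch_first=True)

x = torch.randn(4, 10, 8)        # batch of 4 sequences, 10 time steps, 8 features each
h0 = torch.zeros(1, 4, 16)       # initial hidden state

output, hn = rnn(x, h0)          # output: hidden state at every time step
print(output.shape)              # torch.Size([4, 10, 16])
print(hn.shape)                  # torch.Size([1, 4, 16]) -- final hidden state ("memory")
```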
Types of RNN
RNNs can be categorized based on how inputs and outputs are structured over time.
Key Takeaways
What is LSTM
LSTM (Long Short-Term Memory) is a type of recurrent neural network (RNN) designed to handle sequential data and long-term dependencies by addressing the vanishing gradient problem.
Architecture of LSTM
1. Cell State: The memory of the LSTM, which carries important information
throughout the sequence and is updated by the gates. It is passed along from one time
step to the next.
2. Forget Gate: Decides what information from the cell state should be discarded. It
uses a sigmoid activation function to output values between 0 (completely forget) and
1 (completely retain).
3. Input Gate: Determines what new information should be added to the cell state. It
has two parts:
o A sigmoid layer, which decides which values to update.
o A tanh layer, which creates a vector of new candidate values that could be
added to the state.
4. Cell State Update: The cell state is updated by combining the old state (after being
filtered by the forget gate) with the new candidate values from the input gate.
5. Output Gate: Decides what the next hidden state should be (i.e., the output for the
current time step). It uses a sigmoid layer to filter information and a tanh layer to
ensure that the values stay between -1 and 1.
1. The forget gate filters out irrelevant information from the previous cell state.
2. The input gate updates the cell state with new relevant information.
3. The cell state is updated and passed on to the next time step.
This structure allows LSTMs to effectively capture long-term dependencies in sequential data, solving
issues that arise in traditional RNNs, such as the vanishing gradient problem
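A minimal nn.LSTM sketch in PyTorch; the forget, input, and output gates described above are computed internally by the layer (the sizes are illustrative):

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)

x = torch.randn(4, 10, 8)                 # 4 sequences, 10 time steps, 8 features
output, (hn, cn) = lstm(x)                # hn: final hidden state, cn: final cell state
print(output.shape, hn.shape, cn.shape)   # [4, 10, 16], [1, 4, 16], [1, 4, 16]
```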
LSTM Applications:
LSTM (Long Short-Term Memory) networks are widely used in sequential data processing due to
their ability to retain long-term dependencies. Their key applications include:
1. Natural Language Processing (NLP) – Used in text generation, machine translation, speech
recognition, and chatbots.
2. Time Series Forecasting – Helps predict stock prices, weather patterns, and sales trends.
3. Speech & Audio Processing – Applied in speech recognition, music generation, and speaker
identification.
4. Healthcare & Biomedical – Used for ECG analysis, medical diagnosis, and drug discovery.
5. Computer Vision & Video Analysis – Used in action recognition, handwriting recognition,
and image captioning.
LSTMs are essential for tasks requiring memory of past sequences, making them valuable in AI
applications.
GRU vs LSTM
GRUs are often preferred when computational efficiency and simplicity are
priorities, especially for smaller datasets or tasks where long-term dependencies
are less critical. However, LSTMs may still be better for tasks requiring the
modeling of very long-term dependencies or when working with large datasets.
Seq2seq model
High-Level Overview
Decoder Problems: Exposure bias, error accumulation, slow inference, difficulty with rare
words, and alignment issues.
Solutions include attention mechanisms, transformers, beam search, pre-training, and improved
training techniques like scheduled sampling.
Attention Mechanism
The attention mechanism is a technique used in neural networks to focus on the most
relevant parts of the input when making predictions. It is widely used in sequence-to-
sequence (Seq2Seq) models, especially in tasks like machine translation, text
summarization, and speech recognition.
How It Works
Instead of encoding the entire input into a fixed-length vector (which can
lead to information loss), attention dynamically assigns different weights
to different input elements. This allows the model to focus on the most
important parts of the input at each step.
1. Calculate Attention Scores: The model computes similarity scores
between each input token and the current decoding step.
2. Assign Weights: These scores are converted into probabilities (using
Softmax), determining how much attention each input token gets.
3. Weighted Sum: The weighted sum of the input representations is
computed and used for the final prediction.
Transformer
Why Transformers?
They handle long-range dependencies, support parallel processing, and
scale efficiently.
Key Features:
Applications:
NLP: BERT, GPT, T5 for text tasks.
Vision: Vision Transformers (ViT) for image processing.
Speech: Whisper, Wav2Vec for speech-to-text.
Advantages of Transformers
Disadvantages of Transformers
[Self-attention is a mechanism that takes static embeddings as input and generates contextual embeddings, which are much more useful for any kind of NLP application.]
1. Inputs (Tokens): Represented by green blocks at the top, these could be words like
"money," "bank," etc.
2. Query, Key, and Value Vectors (Q, K, V):
o The pink blocks represent Key (K) vectors.
o The green blocks represent Query (Q) vectors.
o The blue blocks at the bottom represent Value (V) vectors.
3. Softmax Layer:
o The yellow box labeled "Softmax" computes attention scores for each token
by comparing the Query with all Keys.
o It assigns importance scores (s11, s12, s13, etc.) based on similarity.
4. Weighted Sum of Values:
o The final output (purple block) is obtained by summing the values (V)
weighted by the attention scores.
How It Works
1. Each token attends to all other tokens in the sequence, computing attention scores using
the dot product of its Query (Q) with all Keys (K).
2. Softmax normalizes the scores, ensuring they sum to 1.
3. Weighted sum of Value (V) vectors gives the final contextualized representation for each
word.
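A minimal single-head self-attention sketch in PyTorch following these steps (the dimensions and random projection matrices are illustrative):

```python
import torch
import torch.nn.functional as F

def self_attention(x, wq, wk, wv):
    # x: (seq_len, d_model) static embeddings; wq/wk/wv: learned projection matrices
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / (k.shape[-1] ** 0.5)   # dot-product similarity, scaled by sqrt(d_k)
    weights = F.softmax(scores, dim=-1)       # attention scores sum to 1 for each token
    return weights @ v                        # weighted sum of values = contextual embeddings

d_model = 8
x = torch.randn(5, d_model)                   # 5 tokens
wq, wk, wv = (torch.randn(d_model, d_model) for _ in range(3))
print(self_attention(x, wq, wk, wv).shape)    # torch.Size([5, 8])
```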
Points to consider :
Multi-Head Attention
Multi-head attention is an extension of the self-attention mechanism in deep learning,
primarily used in transformers (e.g., BERT, GPT). It allows the model to attend to different
parts of the input sequence simultaneously, improving its ability to capture complex
relationships.
Key Points:
Formula:
MultiHead(Q, K, V) = Concat(head_1, head_2, ..., head_h) W^O, where each head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V).
This allows the model to focus on different parts of the input sequence simultaneously,
capturing multiple types of relationships between tokens.
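A short sketch using PyTorch's built-in multi-head attention layer (the embedding size and head count are illustrative):

```python
import torch
import torch.nn as nn

mha = nn.MultiheadAttention(embed_dim=64, num_heads=8, batch_first=True)

x = torch.randn(2, 10, 64)                  # batch of 2 sequences, 10 tokens, d_model = 64
out, attn_weights = mha(x, x, x)            # self-attention: query = key = value = x
print(out.shape)                            # torch.Size([2, 10, 64])
print(attn_weights.shape)                   # torch.Size([2, 10, 10]) -- averaged over heads
```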
Benefits:
Positional Encoding
Since transformers do not have a built-in notion of sequence order (unlike RNNs or CNNs),
positional encoding (PE) is used to provide positional information about words in a
sequence.
Transformers process input in parallel rather than sequentially. Without positional encoding,
the model would treat words as a bag of words, ignoring word order. PE allows the model to
distinguish the positions of words in a sentence.
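A minimal sketch of sinusoidal positional encoding, the scheme used in the original Transformer paper (the sequence length and model dimension are illustrative):

```python
import torch

def sinusoidal_positional_encoding(seq_len, d_model):
    # PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
    # PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
    pos = torch.arange(seq_len).unsqueeze(1).float()
    i = torch.arange(0, d_model, 2).float()
    angles = pos / (10000 ** (i / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angles)
    pe[:, 1::2] = torch.cos(angles)
    return pe

pe = sinusoidal_positional_encoding(seq_len=10, d_model=64)
# pe is added to the token embeddings so the model can distinguish word positions
```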
What is Normalization
Normalization in deep learning refers to the process of transforming data or model outputs to have specific statistical properties, typically a mean of zero and a variance of one.
Key Benefits:
Normalization in Transformers
Normalization in Transformers stabilizes training and improves convergence. The most common
method is Layer Normalization (LN), which normalizes across features for each token independently,
making it effective for variable sequence lengths
Batch Normalization (BN) is not used in Transformers because it depends on batch statistics, which
are unstable due to variable sequence lengths and small batch sizes. Instead, Layer Normalization
(LN) is preferred as it normalizes across features per token, making it more stable for self-attention
mechanisms. BN can also interfere with attention dynamics, whereas LN works more effectively in
Transformers.
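A quick LayerNorm sketch in PyTorch showing normalization across the feature dimension for each token (the shapes are illustrative):

```python
import torch
import torch.nn as nn

ln = nn.LayerNorm(64)                  # normalizes across the feature dimension per token

x = torch.randn(2, 10, 64)             # (batch, sequence length, features)
y = ln(x)
print(y.mean(dim=-1)[0, 0], y.std(dim=-1)[0, 0])  # roughly 0 and 1 for each token
```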
Masked Self-Attention
1. What is it?
When generating text, the model should predict the next word based only on previous words. If it
could see future words during training, it would "cheat" and learn incorrect patterns.
2. How it works:
Sets future token scores to −∞ (or a very large negative number), making their softmax
probabilities zero.
Ensures the model cannot attend to tokens that come after the current position.
Masked Attention(Q, K, V) = softmax(QKᵀ / √d_k + M) V
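A minimal sketch of causal (masked) attention implementing this formula (the dimensions are illustrative):

```python
import torch
import torch.nn.functional as F

seq_len, d_k = 5, 8
q = torch.randn(seq_len, d_k)
k = torch.randn(seq_len, d_k)
v = torch.randn(seq_len, d_k)

# Causal mask M: 0 on and below the diagonal, -inf above it (future positions)
mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)

scores = q @ k.T / (d_k ** 0.5) + mask       # softmax(QK^T / sqrt(d_k) + M) V
weights = F.softmax(scores, dim=-1)          # future tokens get probability 0
out = weights @ v
print(weights[0])                            # the first token can only attend to itself
```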
The Transformer is a deep learning model introduced in the paper "Attention Is All You Need" by
Vaswani et al. (2017). It has revolutionized Natural Language Processing (NLP) and is widely used in
large language models (LLMs) like GPT, BERT, and T5. Unlike recurrent networks, Transformers
process input in parallel, making them more efficient and scalable.
Transformer Encoder
Each encoder layer has two main components:
1. Multi-Head Self-Attention – Helps each word attend to all other words in the sequence.
2. Feed-Forward Neural Network (FFN) – Applies non-linearity and transformations.
Each layer also includes residual connections and layer normalization to stabilize learning.
1. Input Embedding + Positional Encoding – Converts words into vectors and adds positional
information.
2. Multi-Head Self-Attention – Computes attention scores to capture relationships between
words.
3. Add & Normalize – Uses residual connections and layer normalization.
4. Feed-Forward Neural Network (FFN) – Applies non-linear transformations.
5. Add & Normalize (Again) – Another residual connection and layer normalization.
After passing through multiple encoder layers, the final output is a contextual
representation of the input, which is used by the decoder for further processing.
Transformer Decoder
The Transformer Decoder takes the encoder’s output and a target sequence (e.g. previous
words in translation) to generate the final output sequence one token at a time during
inference.
🧱 Structure
🧠 Final Output
The decoder generates a probability distribution over the next token in the
sequence.
During inference, it generates tokens step-by-step, feeding previous outputs back in.
Answer:
BERT is different because:
1. It uses bidirectional training (considers both left and right context in a sentence).
2. Previous models like Word2Vec and GloVe created static word embeddings, but BERT
generates contextual embeddings based on the sentence.
3. It is pre-trained on large datasets and then fine-tuned for specific NLP tasks.
Answer:
BERT is based on the Transformer architecture and consists of:
Layer Normalization
Residual Connections
Answer:
BERT is pre-trained using two tasks:
1. Masked Language Modelling (MLM):
o 15% of words in the input are masked, and BERT predicts the missing words.
2. Next Sentence Prediction (NSP):
o BERT is given two sentences and learns to predict if the second sentence follows the first one or not.
Answer:
BERT uses WordPiece tokenization, which splits rare words into smaller subwords or characters.
Example: a rare word like "embeddings" may be split into "em", "##bed", "##ding", "##s".
Answer:
The model is fine-tuned using a smaller labeled dataset for specific tasks like sentiment
analysis, named entity recognition (NER), question answering, or text classification.
Answer:
✅ Advantages:
❌ Limitations:
1. BooksCorpus (800M words) – A large collection of books to help BERT learn long-range dependencies.
2. English Wikipedia (2.5B words) – Wikipedia text (excluding tables and lists) to provide factual and diverse language understanding.
BERT is trained using Masked Language Modelling (MLM) and Next Sentence
Prediction (NSP). After pre-training, it can be fine-tuned on specific NLP tasks like question
answering, sentiment analysis, and named entity recognition (NER).
“BERT is already trained on a huge amount of text, so it understands the language very well.
Instead of training a model from scratch, we can fine-tune BERT on our specific task, like sentiment
analysis or question answering. This saves time, needs less data, and gives better results because
BERT already knows how language works.”
“For example, if I want to build a model to detect spam messages, I can take BERT and fine-tune it
on labeled spam data. BERT already understands grammar and context, so it learns the spam
patterns quickly.”
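A hedged sketch of that spam example using the Hugging Face transformers library (the model name, label count, and example message are illustrative assumptions):

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

inputs = tokenizer("Win a free prize now!!!", return_tensors="pt")
outputs = model(**inputs)            # new classification head on top of pretrained BERT
print(outputs.logits.shape)          # torch.Size([1, 2]) -- e.g. spam vs. not spam
```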
How can you improve your BERT NER model?
Tune the training settings like learning rate, batch size, and number of training steps.
Handle imbalanced data so the model learns all entity types better.
Add a CRF layer on top of BERT to improve how the model predicts sequences.
Clean and preprocess the data well, like fixing label errors or token issues.
Answer:
T5 was trained on C4 (Colossal Clean Crawled Corpus), a massive dataset created from web pages.
This dataset was cleaned to remove low-quality text and was used for pretraining.
Answer:
T5 is trained using a denoising autoencoder objective. This means:
Example:
Input: "The quick brown [MASK] jumps over the lazy [MASK]."
Output: "The quick brown fox jumps over the lazy dog."
This is similar to BERT's masked language modeling, but T5 generates the full sequence
instead of just predicting individual words.
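A hedged sketch of T5's text-to-text usage with the Hugging Face transformers library (the checkpoint name and task prefix are illustrative; the tokenizer needs the sentencepiece package):

```python
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# T5 treats every task as text-to-text; a task prefix tells it what to do
inputs = tokenizer("translate English to German: The house is wonderful.", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```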
Answer:
T5 follows the Transformer Encoder-Decoder architecture:
Answer:
✅ Advantages:
❌ Limitations:
May generate incorrect or biased outputs if the training data contains biases.