Unit 2 (Second Order Methods)
Introduction to CNN:
Convolution
Pooling
Deep CNNs
Different deep CNN architectures - LeNet, AlexNet, VGG
Training a CNN:
Weights initialization
Batch normalization
Hyperparameter optimization
Understanding and visualizing CNNs
Second-Order Optimization Methods
1. The Hessian Matrix:
o The Hessian is the matrix of second-order partial derivatives of the loss function with respect to the model parameters.
o The Hessian captures how the gradient changes in response to changes in the parameters.
2. Newton's Method:
o Basic Idea: Newton's method is a classic second-order optimization technique that iteratively updates the model parameters
by considering both the gradient and the Hessian. The update rule is:
θt+1 = θt − H−1∇L(θt)
o Where ∇L(θ) is the gradient and H−1 is the inverse of the Hessian matrix.
o Advantages: Newton's method can converge faster than first-order methods, especially near the minimum, because it takes
into account the curvature of the loss surface, leading to more accurate updates.
o Challenges: The main challenge is that computing the Hessian and its inverse is computationally expensive, especially for
deep networks with a large number of parameters.
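As a minimal illustration of the Newton update above, the following NumPy sketch applies it to a small quadratic loss; the matrix A, vector b, and iteration count are illustrative assumptions, not part of these notes.

```python
import numpy as np

# Illustrative quadratic loss L(theta) = 0.5 theta^T A theta - b^T theta
A = np.array([[3.0, 0.5],
              [0.5, 1.0]])             # positive definite, so it is also the Hessian of L
b = np.array([1.0, -2.0])

def gradient(theta):
    return A @ theta - b               # ∇L(θ)

def hessian(theta):
    return A                           # constant Hessian for a quadratic loss

theta = np.zeros(2)
for _ in range(5):
    g, H = gradient(theta), hessian(theta)
    # Newton update: solve H d = g rather than forming H^{-1} explicitly
    theta = theta - np.linalg.solve(H, g)

print(theta, np.linalg.solve(A, b))    # reaches the minimizer A^{-1} b in one step
```

For a quadratic loss the Newton step lands on the minimizer immediately, which is exactly the curvature-aware behaviour described above; for general losses the update is repeated until convergence.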
3. Quasi-Newton Methods:
o BFGS and L-BFGS: Quasi-Newton methods approximate the Hessian instead of computing it directly. The BFGS (Broyden–
Fletcher–Goldfarb–Shanno) algorithm and its limited-memory variant L-BFGS are popular examples.
o The BFGS algorithm is a popular optimization method for minimizing nonlinear functions, especially when the Hessian
matrix (the matrix of second-order derivatives) is difficult or expensive to compute directly. It belongs to the family of quasi-Newton
methods, which approximate the Hessian to perform optimization without needing to compute it exactly.
o Key Concepts of BFGS:
o Quasi-Newton Method: BFGS is classified as a quasi-Newton method, meaning that instead of computing the true
Hessian matrix, it builds an approximation of the inverse Hessian matrix iteratively. This makes it computationally more
efficient than methods that require the exact Hessian.
o Update Rule: In each iteration, BFGS updates the estimate of the inverse Hessian matrix based on the gradients of the
objective function at different points. The update rule is designed to ensure that the matrix satisfies the secant equation,
which helps approximate the curvature of the objective function.
o Inverse Hessian Approximation: Rather than storing and updating the Hessian matrix itself (which would require large
memory for high-dimensional problems), BFGS works with the inverse of the Hessian matrix. This inverse
approximation is updated iteratively using gradient information.
o Iterative Process:
o At each step, BFGS computes the gradient of the objective function.
o The current inverse Hessian approximation is updated based on the change in gradient and the change in the
parameters.
o The next set of parameters is computed by using a line search algorithm to minimize the objective function along a
certain direction (similar to Newton's method but with the approximated Hessian).
o BFGS Update Formula:
o If Hk is the approximation of the inverse Hessian matrix at iteration k, and sk = xk+1 − xk and yk = ∇f(xk+1) − ∇f(xk), the BFGS
update formula for the inverse Hessian approximation Hk+1 is:
Hk+1 = (I − ρk sk ykᵀ) Hk (I − ρk yk skᵀ) + ρk sk skᵀ, where ρk = 1 / (ykᵀ sk)
(a NumPy sketch of this update appears after this item)
o Advantages: These methods offer a good trade-off between the computational efficiency of first-order methods and the faster
convergence of Newton's method. L-BFGS, in particular, is suitable for large-scale problems as it approximates the Hessian
using a limited amount of memory.
o Usage in Deep Learning: L-BFGS has been used in training smaller or medium-sized neural networks and for fine-tuning
tasks where the loss surface is well-behaved.
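A compact NumPy sketch of the inverse-Hessian update given in the BFGS Update Formula above, embedded in a bare-bones quasi-Newton loop; the quadratic objective and the fixed damped step (standing in for a proper Wolfe line search) are illustrative assumptions.

```python
import numpy as np

def bfgs_update(H, s, y):
    """One BFGS update of the inverse-Hessian approximation H, given
    s = x_{k+1} - x_k and y = grad f(x_{k+1}) - grad f(x_k)."""
    rho = 1.0 / (y @ s)
    I = np.eye(len(s))
    return ((I - rho * np.outer(s, y)) @ H @ (I - rho * np.outer(y, s))
            + rho * np.outer(s, s))

# Illustrative objective: f(x) = 0.5 x^T A x - b^T x, with gradient A x - b
A = np.array([[4.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
grad = lambda x: A @ x - b

x, H = np.zeros(2), np.eye(2)          # start from the identity approximation
g = grad(x)
for _ in range(25):
    p = -H @ g                         # search direction from the approximate inverse Hessian
    x_new = x + 0.2 * p                # fixed damped step in place of a Wolfe line search
    g_new = grad(x_new)
    s, y = x_new - x, g_new - g
    if y @ s > 1e-10:                  # curvature condition: only update when y^T s > 0
        H = bfgs_update(H, s, y)
    x, g = x_new, g_new

print(x, np.linalg.solve(A, b))        # BFGS iterate vs. exact minimizer
```

In practice one would use a library routine (for example L-BFGS as provided by standard optimization packages) rather than a hand-rolled loop; the sketch only shows how the inverse-Hessian approximation is refreshed from gradient differences.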
4. Hessian-Free Optimization:
o Basic Idea: Hessian-free optimization avoids explicitly computing the Hessian. Instead, it uses techniques like Conjugate
Gradient to implicitly solve the system involving the Hessian. This method is particularly useful when the Hessian is too large
to compute directly.
o Advantages: It allows for the benefits of second-order methods (like better convergence rates) without the full computational
burden of calculating the Hessian matrix.
o Applications: Hessian-free methods have been used in training deep networks, particularly in tasks where the loss landscape
is complex and requires more nuanced parameter updates.
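The key ingredient of Hessian-free optimization is the Hessian-vector product Hv, which can be obtained by differentiating the gradient a second time and then handed to a conjugate-gradient solver, all without forming H. A minimal PyTorch sketch of that product follows; the tiny linear model and random batch are illustrative assumptions.

```python
import torch

# Illustrative tiny model and batch
model = torch.nn.Linear(10, 1)
x, y = torch.randn(32, 10), torch.randn(32, 1)
loss = torch.nn.functional.mse_loss(model(x), y)
params = [p for p in model.parameters() if p.requires_grad]

# First backward pass: keep the graph so the gradient can be differentiated again
grads = torch.autograd.grad(loss, params, create_graph=True)
flat_grad = torch.cat([g.reshape(-1) for g in grads])

# Hessian-vector product: H v = d(grad . v)/d(params), without ever forming H
v = torch.randn_like(flat_grad)
hv = torch.autograd.grad(flat_grad @ v, params, retain_graph=True)
flat_hv = torch.cat([h.reshape(-1) for h in hv])
print(flat_hv.shape)   # same size as the flattened parameter vector
```
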
5. Trust Region Methods:
o Basic Idea: These methods restrict the update step to a region around the current parameters where the second-order
approximation is considered valid. The step size is adjusted based on whether the approximation accurately predicts the
behaviour of the loss function.
o Advantages: Trust region methods can handle cases where the loss surface has sharp changes in curvature, leading to more
stable and reliable updates.
o Challenges: Like other second-order methods, these can be computationally expensive and are less commonly used in large-
scale deep learning tasks.
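A rough sketch of the accept/shrink/grow logic described above; the quadratic model, the radius-truncated Newton step, and the 0.25/0.75 thresholds are conventional illustrative choices rather than a specific algorithm from these notes.

```python
import numpy as np

def trust_region_step(f, grad, hess, x, radius):
    """One simplified trust-region iteration: take a Newton step truncated to the
    trust-region radius, then grow or shrink the radius according to how well the
    quadratic model predicted the actual decrease in f."""
    g, H = grad(x), hess(x)
    p = -np.linalg.solve(H, g)                    # model-minimizing (Newton) step
    if np.linalg.norm(p) > radius:
        p *= radius / np.linalg.norm(p)           # stay inside the trust region
    predicted = -(g @ p + 0.5 * p @ H @ p)        # decrease predicted by the model
    actual = f(x) - f(x + p)                      # decrease actually observed
    rho = actual / predicted if predicted > 0 else -1.0
    if rho < 0.25:
        return x, 0.5 * radius                    # poor model: reject the step, shrink region
    if rho > 0.75:
        radius = min(2.0 * radius, 10.0)          # very good model: expand the region
    return x + p, radius                          # accept the step
```
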
Practical Considerations:
• Scalability: Second-order methods are typically more computationally intensive than first-order methods, making them challenging to
apply to large-scale deep learning models with millions of parameters.
First-Order vs. Second-Order Optimization Methods
1. First-Order Methods
• Gradient Information Used: Only the first-order derivative (the gradient) of the loss function.
• Examples:
○ Gradient Descent and its variants, such as Stochastic Gradient Descent (SGD) and mini-batch gradient descent.
• Update Rule: The basic gradient descent update for weights w is:
wt+1 = wt − η∇L(wt)
where η is the learning rate, and ∇L(wt) is the gradient of the loss function.
• Advantages:
○ Scalable: Works well with large datasets and high-dimensional problems.
○ Simple and computationally inexpensive: Each update step involves computing only the gradient, which is
faster for large-scale models.
• Disadvantages:
○ Slow Convergence: Can take many iterations to converge, especially near saddle points or flat regions of the
loss function.
○ Less precise: Gradient descent may not take an efficient path to the minimum, especially when the loss
function's surface is complex.
2. Second-Order Methods
• Gradient Information Used: Both the first-order derivative (the gradient) and the second-order derivative (i.e., the
Hessian, which measures curvature) of the loss function.
• Examples:
○ Newton’s Method: A well-known second-order method that uses the inverse of the Hessian matrix to rescale the
gradient step, adjusting both its direction and size.
○ Quasi-Newton Methods: Such as BFGS (Broyden-Fletcher-Goldfarb-Shanno), which approximate the Hessian
matrix rather than computing it directly to reduce computation cost.
• Update Rule: In Newton's method, the update rule for weights w is:
wt+1 = wt − H−1∇L(wt)
where H is the Hessian of the loss function evaluated at wt.
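The following NumPy sketch runs the two update rules side by side on the same illustrative quadratic loss to show the difference in convergence speed; the matrix A, vector b, and learning rate η = 0.1 are assumptions for the demo.

```python
import numpy as np

A = np.array([[5.0, 1.0], [1.0, 1.0]])     # Hessian of the illustrative quadratic loss
b = np.array([1.0, 0.0])
grad = lambda w: A @ w - b                 # ∇L(w)

w_gd, w_newton, eta = np.zeros(2), np.zeros(2), 0.1
for _ in range(20):
    w_gd = w_gd - eta * grad(w_gd)                              # first-order update
    w_newton = w_newton - np.linalg.solve(A, grad(w_newton))    # second-order (Newton) update

w_star = np.linalg.solve(A, b)
print(np.linalg.norm(w_gd - w_star),       # still noticeably off after 20 steps
      np.linalg.norm(w_newton - w_star))   # essentially zero after the first step
```
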
Dropout
It is a regularization technique used to prevent overfitting in neural networks, including Convolutional Neural Networks
(CNNs). It works by randomly "dropping out" or deactivating a subset of neurons during each training iteration. These
deactivated neurons do not participate in the forward or backward pass for that iteration, effectively reducing the co-
dependency between neurons and forcing the network to learn more robust and generalized features.
Example:
Let’s consider a fully connected layer in a CNN that has 100 neurons. If we apply dropout with a rate of 0.5, then during
each training iteration, approximately 50 of the neurons will be deactivated randomly. These neurons will not participate
in forward propagation or backpropagation, forcing the remaining neurons to adjust and learn the task without full network
reliance. During testing, all neurons are used, but their outputs are scaled by the factor (1 - dropout rate), ensuring
consistency.
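A minimal NumPy sketch of the behaviour described in this example, using the classic formulation above in which units are dropped during training and outputs are scaled by (1 − dropout rate) at test time; the layer size of 100 and the random seed are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(activations, drop_rate, training):
    """Classic dropout as described above: randomly deactivate units while training,
    scale the outputs by (1 - drop_rate) at test time."""
    if training:
        keep_mask = rng.random(activations.shape) >= drop_rate   # ~half the units survive at drop_rate=0.5
        return activations * keep_mask
    return activations * (1.0 - drop_rate)       # test time: every unit active, outputs scaled

h = rng.standard_normal((1, 100))                # outputs of a 100-neuron fully connected layer
train_out = dropout_forward(h, drop_rate=0.5, training=True)
test_out = dropout_forward(h, drop_rate=0.5, training=False)
print(int((train_out == 0).sum()))               # roughly 50 of the 100 units were dropped
```

Most frameworks implement the equivalent "inverted" variant, which scales by 1/(1 − dropout rate) during training so that no scaling is needed at test time.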
DropConnect
DropConnect is a regularization technique closely related to Dropout: instead of deactivating entire neurons, it randomly sets
individual connections (weights) to zero during each training iteration, so every neuron stays active but receives only a random
subset of its inputs.
Example:
Let’s assume we have a fully connected layer with 100 neurons and each neuron is connected to all the neurons in the next
layer. If DropConnect is applied with a rate of 0.5, approximately 50% of the connections (weights) will be randomly set
to zero during each iteration, leaving only half of the weights active. The neurons themselves, however, remain active, but
some of their inputs will be disabled due to the dropped connections.
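A matching NumPy sketch for DropConnect, where the random mask is applied to the weight matrix rather than to the neuron outputs; the 100-to-50 layer shape and the expected-weight approximation at test time are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def dropconnect_forward(x, W, b, drop_rate, training):
    """DropConnect as described above: randomly zero individual weights (connections)
    during training; the neurons themselves stay active but lose some inputs."""
    if training:
        weight_mask = rng.random(W.shape) >= drop_rate   # ~50% of connections kept each iteration
        return x @ (W * weight_mask) + b
    # Test time: a common simplification is to use the expected weights
    return x @ (W * (1.0 - drop_rate)) + b

x = rng.standard_normal((1, 100))      # activations of the 100-neuron layer
W = rng.standard_normal((100, 50))     # weights to a 50-neuron next layer
b = np.zeros(50)
out = dropconnect_forward(x, W, b, drop_rate=0.5, training=True)
print(out.shape)                       # (1, 50)
```
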
Batch Normalization
Batch Normalization (BatchNorm) is a technique used to improve the training of deep neural networks by normalizing
the inputs to each layer. It helps address issues like internal covariate shift, accelerates training, and reduces the
dependency on careful initialization of weights. It also has a regularization effect, reducing the need for techniques like
Dropout in some cases.
1. Normalization: Each input is normalized using the mean and variance of the current mini-batch:
x̂i = (xi − μB) / √(σ²B + ϵ)
Where:
○ xi is the input to the layer
○ μB is the mean of the mini-batch
○ σ²B is the variance of the mini-batch
○ ϵ is a small constant added to avoid division by zero
2. Learnable Parameters: After normalization, the network applies two learnable parameters, scale (γ) and shift (β), producing
the output yi = γx̂i + β, which allows the model to adjust the normalized values back to a useful range for learning.
3. This allows the network to learn how much to scale and shift the normalized inputs, ensuring that the model can still
represent complex functions even after normalization.
4. Batch-Wise Calculation: The normalization is computed for each mini-batch independently during training, which
helps reduce internal covariate shift (the changes in the distribution of network activations as the parameters are
updated during training).
5. During Inference: During testing or inference, Batch Normalization uses the running mean and variance that were
computed during training, rather than the mean and variance of the current mini-batch, to ensure stable and consistent
behavior.
Example:
Let’s say you have a layer in a CNN with an input of size 64×64×128 (i.e., 128 feature maps of size 64×64). Applying
BatchNorm to this layer will involve:
1. Computing the mean and variance for each feature map over the mini-batch.
2. Normalizing each feature map by subtracting the mean and dividing by the standard deviation.
3. Applying the learnable scale and shift parameters to adjust the normalized values.
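A minimal NumPy sketch of these three steps for a convolutional activation shaped (batch, channels, height, width); the mini-batch size of 16 is an assumption, and the running statistics used at inference are omitted for brevity.

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """Batch normalization for a convolutional activation of shape (N, C, H, W):
    the mean and variance are computed per feature map (channel) over the mini-batch,
    then the learnable scale (gamma) and shift (beta) are applied."""
    mu = x.mean(axis=(0, 2, 3), keepdims=True)       # step 1: per-channel mean
    var = x.var(axis=(0, 2, 3), keepdims=True)       #         and variance over the batch
    x_hat = (x - mu) / np.sqrt(var + eps)            # step 2: normalize
    return gamma.reshape(1, -1, 1, 1) * x_hat + beta.reshape(1, -1, 1, 1)   # step 3: scale and shift

x = np.random.randn(16, 128, 64, 64)                 # mini-batch of 16, 128 feature maps of size 64x64
gamma, beta = np.ones(128), np.zeros(128)
y = batchnorm_forward(x, gamma, beta)
print(round(float(y.mean()), 3), round(float(y.std()), 3))   # ~0.0 and ~1.0 before gamma/beta change them
```
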