Unit 2 (Second Order Methods)
Introduction to CNN:
Convolution
Pooling
Deep CNNs
Different deep CNN architectures - LeNet, AlexNet, VGG
Training a CNN:
Weights initialization
Batch normalization
Hyperparameter optimization
Understanding and visualizing CNNs
Second-Order Optimization Methods
1. The Hessian Matrix:
o The Hessian is the matrix of second-order partial derivatives of the loss function with respect to the model parameters.
o The Hessian captures how the gradient changes in response to changes in the parameters.
2. Newton's Method:
o Basic Idea: Newton's method is a classic second-order optimization technique that iteratively updates the model parameters
by considering both the gradient and the Hessian. The update rule is:
θt+1 = θt − H−1∇L(θt)
o Where ∇L(θ) is the gradient and H−1 is the inverse of the Hessian matrix.
o Advantages: Newton's method can converge faster than first-order methods, especially near the minimum, because it takes
into account the curvature of the loss surface, leading to more accurate updates.
o Challenges: The main challenge is that computing the Hessian and its inverse is computationally expensive, especially for
deep networks with a large number of parameters.
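As a minimal illustration of the Newton update above, the following NumPy sketch applies it to a small quadratic loss; the matrix A, vector b, and iteration count are illustrative assumptions, not part of these notes.

```python
import numpy as np

# Illustrative quadratic loss L(theta) = 0.5 theta^T A theta - b^T theta
A = np.array([[3.0, 0.5],
              [0.5, 1.0]])             # positive definite, so it is also the Hessian of L
b = np.array([1.0, -2.0])

def gradient(theta):
    return A @ theta - b               # ∇L(θ)

def hessian(theta):
    return A                           # constant Hessian for a quadratic loss

theta = np.zeros(2)
for _ in range(5):
    g, H = gradient(theta), hessian(theta)
    # Newton update: solve H d = g rather than forming H^{-1} explicitly
    theta = theta - np.linalg.solve(H, g)

print(theta, np.linalg.solve(A, b))    # reaches the minimizer A^{-1} b in one step
```

For a quadratic loss the Newton step lands on the minimizer immediately, which is exactly the curvature-aware behaviour described above; for general losses the update is repeated until convergence.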
3. Quasi-Newton Methods:
o BFGS and L-BFGS: Quasi-Newton methods approximate the Hessian instead of computing it directly. The BFGS (Broyden–
Fletcher–Goldfarb–Shanno) algorithm and its limited-memory variant L-BFGS are popular examples.
o The BFGS algorithm is a popular optimization method for minimizing nonlinear functions, especially when the Hessian
matrix (the matrix of second-order derivatives) is difficult or expensive to compute directly. It belongs to the family of quasi-Newton
methods, which approximate the Hessian to perform optimization without needing to compute it exactly.
o Key Concepts of BFGS:
o Quasi-Newton Method: BFGS is classified as a quasi-Newton method, meaning that instead of computing the true
Hessian matrix, it builds an approximation of the inverse Hessian matrix iteratively. This makes it computationally more
efficient than methods that require the exact Hessian.
o Update Rule: In each iteration, BFGS updates the estimate of the inverse Hessian matrix based on the gradients of the
objective function at different points. The update rule is designed to ensure that the matrix satisfies the secant equation,
which helps approximate the curvature of the objective function.
o Inverse Hessian Approximation: Rather than storing and updating the Hessian matrix itself (which would require large
memory for high-dimensional problems), BFGS works with the inverse of the Hessian matrix. This inverse
approximation is updated iteratively using gradient information.
o Iterative Process:
o At each step, BFGS computes the gradient of the objective function.
o The current inverse Hessian approximation is updated based on the change in gradient and the change in the
parameters.
o The next set of parameters is computed by using a line search algorithm to minimize the objective function along a
certain direction (similar to Newton's method but with the approximated Hessian).
o BFGS Update Formula:
o If Hk is the approximation of the inverse Hessian matrix at iteration k, and sk = xk+1 − xk and yk = ∇f(xk+1) − ∇f(xk), the BFGS
update formula for the inverse Hessian approximation Hk+1 is:
Hk+1 = (I − ρk sk ykᵀ) Hk (I − ρk yk skᵀ) + ρk sk skᵀ, where ρk = 1 / (ykᵀ sk)
(a NumPy sketch of this update appears after this item)
o Advantages: These methods offer a good trade-off between the computational efficiency of first-order methods and the faster
convergence of Newton's method. L-BFGS, in particular, is suitable for large-scale problems as it approximates the Hessian
using a limited amount of memory.
o Usage in Deep Learning: L-BFGS has been used in training smaller or medium-sized neural networks and for fine-tuning
tasks where the loss surface is well-behaved.
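A compact NumPy sketch of the inverse-Hessian update given in the BFGS Update Formula above, embedded in a bare-bones quasi-Newton loop; the quadratic objective and the fixed damped step (standing in for a proper Wolfe line search) are illustrative assumptions.

```python
import numpy as np

def bfgs_update(H, s, y):
    """One BFGS update of the inverse-Hessian approximation H, given
    s = x_{k+1} - x_k and y = grad f(x_{k+1}) - grad f(x_k)."""
    rho = 1.0 / (y @ s)
    I = np.eye(len(s))
    return ((I - rho * np.outer(s, y)) @ H @ (I - rho * np.outer(y, s))
            + rho * np.outer(s, s))

# Illustrative objective: f(x) = 0.5 x^T A x - b^T x, with gradient A x - b
A = np.array([[4.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
grad = lambda x: A @ x - b

x, H = np.zeros(2), np.eye(2)          # start from the identity approximation
g = grad(x)
for _ in range(25):
    p = -H @ g                         # search direction from the approximate inverse Hessian
    x_new = x + 0.2 * p                # fixed damped step in place of a Wolfe line search
    g_new = grad(x_new)
    s, y = x_new - x, g_new - g
    if y @ s > 1e-10:                  # curvature condition: only update when y^T s > 0
        H = bfgs_update(H, s, y)
    x, g = x_new, g_new

print(x, np.linalg.solve(A, b))        # BFGS iterate vs. exact minimizer
```

In practice one would use a library routine (for example L-BFGS as provided by standard optimization packages) rather than a hand-rolled loop; the sketch only shows how the inverse-Hessian approximation is refreshed from gradient differences.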
4. Hessian-Free Optimization:
o Basic Idea: Hessian-free optimization avoids explicitly computing the Hessian. Instead, it uses techniques like Conjugate
Gradient to implicitly solve the system involving the Hessian. This method is particularly useful when the Hessian is too large
to compute directly.
o Advantages: It allows for the benefits of second-order methods (like better convergence rates) without the full computational
burden of calculating the Hessian matrix.
o Applications: Hessian-free methods have been used in training deep networks, particularly in tasks where the loss landscape
is complex and requires more nuanced parameter updates.
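The key ingredient of Hessian-free optimization is the Hessian-vector product Hv, which can be obtained by differentiating the gradient a second time and then handed to a conjugate-gradient solver, all without forming H. A minimal PyTorch sketch of that product follows; the tiny linear model and random batch are illustrative assumptions.

```python
import torch

# Illustrative tiny model and batch
model = torch.nn.Linear(10, 1)
x, y = torch.randn(32, 10), torch.randn(32, 1)
loss = torch.nn.functional.mse_loss(model(x), y)
params = [p for p in model.parameters() if p.requires_grad]

# First backward pass: keep the graph so the gradient can be differentiated again
grads = torch.autograd.grad(loss, params, create_graph=True)
flat_grad = torch.cat([g.reshape(-1) for g in grads])

# Hessian-vector product: H v = d(grad . v)/d(params), without ever forming H
v = torch.randn_like(flat_grad)
hv = torch.autograd.grad(flat_grad @ v, params, retain_graph=True)
flat_hv = torch.cat([h.reshape(-1) for h in hv])
print(flat_hv.shape)   # same size as the flattened parameter vector
```
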
5. Trust Region Methods:
o Basic Idea: These methods restrict the update step to a region around the current parameters where the second-order
approximation is considered valid. The step size is adjusted based on whether the approximation accurately predicts the
behaviour of the loss function.
o Advantages: Trust region methods can handle cases where the loss surface has sharp changes in curvature, leading to more
stable and reliable updates.
o Challenges: Like other second-order methods, these can be computationally expensive and are less commonly used in large-
scale deep learning tasks.
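A rough sketch of the accept/shrink/grow logic described above; the quadratic model, the radius-truncated Newton step, and the 0.25/0.75 thresholds are conventional illustrative choices rather than a specific algorithm from these notes.

```python
import numpy as np

def trust_region_step(f, grad, hess, x, radius):
    """One simplified trust-region iteration: take a Newton step truncated to the
    trust-region radius, then grow or shrink the radius according to how well the
    quadratic model predicted the actual decrease in f."""
    g, H = grad(x), hess(x)
    p = -np.linalg.solve(H, g)                    # model-minimizing (Newton) step
    if np.linalg.norm(p) > radius:
        p *= radius / np.linalg.norm(p)           # stay inside the trust region
    predicted = -(g @ p + 0.5 * p @ H @ p)        # decrease predicted by the model
    actual = f(x) - f(x + p)                      # decrease actually observed
    rho = actual / predicted if predicted > 0 else -1.0
    if rho < 0.25:
        return x, 0.5 * radius                    # poor model: reject the step, shrink region
    if rho > 0.75:
        radius = min(2.0 * radius, 10.0)          # very good model: expand the region
    return x + p, radius                          # accept the step
```
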
Practical Considerations:
• Scalability: Second-order methods are typically more computationally intensive than first-order methods, making them challenging to
apply to large-scale deep learning models with millions of parameters.
First-Order vs. Second-Order Optimization Methods
1. First-Order Methods
• Gradient Information Used: Only the first-order derivative (the gradient) of the loss function.
• Examples:
○ Gradient Descent and its variants, such as Stochastic Gradient Descent (SGD) and mini-batch gradient descent.
• Update Rule: The basic gradient descent update for weights w is:
wt+1 = wt − η∇L(wt)
where η is the learning rate, and ∇L(wt) is the gradient of the loss function.
• Advantages:
○ Scalable: Works well with large datasets and high-dimensional problems.
○ Simple and computationally inexpensive: Each update step involves computing only the gradient, which is
faster for large-scale models.
• Disadvantages:
○ Slow Convergence: Can take many iterations to converge, especially near saddle points or flat regions of the
loss function.
○ Less precise: Gradient descent may not take an efficient path to the minimum, especially when the loss
function's surface is complex.
2. Second-Order Methods
• Gradient Information Used: Both the first-order derivative (the gradient) and the second-order derivative (i.e., the
Hessian, which measures curvature) of the loss function.
• Examples:
○ Newton’s Method: A well-known second-order method that uses the inverse of the Hessian matrix to rescale the
gradient step, adjusting both its direction and size.
○ Quasi-Newton Methods: Such as BFGS (Broyden-Fletcher-Goldfarb-Shanno), which approximate the Hessian
matrix rather than computing it directly to reduce computation cost.
• Update Rule: In Newton's method, the update rule for weights w is:
wt+1 = wt − H−1∇L(wt)
where H is the Hessian of the loss function evaluated at wt.
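The following NumPy sketch runs the two update rules side by side on the same illustrative quadratic loss to show the difference in convergence speed; the matrix A, vector b, and learning rate η = 0.1 are assumptions for the demo.

```python
import numpy as np

A = np.array([[5.0, 1.0], [1.0, 1.0]])     # Hessian of the illustrative quadratic loss
b = np.array([1.0, 0.0])
grad = lambda w: A @ w - b                 # ∇L(w)

w_gd, w_newton, eta = np.zeros(2), np.zeros(2), 0.1
for _ in range(20):
    w_gd = w_gd - eta * grad(w_gd)                              # first-order update
    w_newton = w_newton - np.linalg.solve(A, grad(w_newton))    # second-order (Newton) update

w_star = np.linalg.solve(A, b)
print(np.linalg.norm(w_gd - w_star),       # still noticeably off after 20 steps
      np.linalg.norm(w_newton - w_star))   # essentially zero after the first step
```
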
Dropout
It is a regularization technique used to prevent overfitting in neural networks, including Convolutional Neural Networks
(CNNs). It works by randomly "dropping out" or deactivating a subset of neurons during each training iteration. These
deactivated neurons do not participate in the forward or backward pass for that iteration, effectively reducing the co-
dependency between neurons and forcing the network to learn more robust and generalized features.
Example:
Let’s consider a fully connected layer in a CNN that has 100 neurons. If we apply dropout with a rate of 0.5, then during
each training iteration, approximately 50 of the neurons will be deactivated randomly. These neurons will not participate
in forward propagation or backpropagation, forcing the remaining neurons to adjust and learn the task without full network
reliance. During testing, all neurons are used, but their outputs are scaled by the factor (1 - dropout rate), ensuring
consistency.
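A minimal NumPy sketch of the behaviour described in this example, using the classic formulation above in which units are dropped during training and outputs are scaled by (1 − dropout rate) at test time; the layer size of 100 and the random seed are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(activations, drop_rate, training):
    """Classic dropout as described above: randomly deactivate units while training,
    scale the outputs by (1 - drop_rate) at test time."""
    if training:
        keep_mask = rng.random(activations.shape) >= drop_rate   # ~half the units survive at drop_rate=0.5
        return activations * keep_mask
    return activations * (1.0 - drop_rate)       # test time: every unit active, outputs scaled

h = rng.standard_normal((1, 100))                # outputs of a 100-neuron fully connected layer
train_out = dropout_forward(h, drop_rate=0.5, training=True)
test_out = dropout_forward(h, drop_rate=0.5, training=False)
print(int((train_out == 0).sum()))               # roughly 50 of the 100 units were dropped
```

Most frameworks implement the equivalent "inverted" variant, which scales by 1/(1 − dropout rate) during training so that no scaling is needed at test time.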
DropConnect
DropConnect is a regularization technique closely related to Dropout: instead of deactivating entire neurons, it randomly sets
individual connections (weights) to zero during each training iteration, so every neuron stays active but receives only a random
subset of its inputs.
Example:
Let’s assume we have a fully connected layer with 100 neurons and each neuron is connected to all the neurons in the next
layer. If DropConnect is applied with a rate of 0.5, approximately 50% of the connections (weights) will be randomly set
to zero during each iteration, leaving only half of the weights active. The neurons themselves, however, remain active, but
some of their inputs will be disabled due to the dropped connections.
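A matching NumPy sketch for DropConnect, where the random mask is applied to the weight matrix rather than to the neuron outputs; the 100-to-50 layer shape and the expected-weight approximation at test time are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def dropconnect_forward(x, W, b, drop_rate, training):
    """DropConnect as described above: randomly zero individual weights (connections)
    during training; the neurons themselves stay active but lose some inputs."""
    if training:
        weight_mask = rng.random(W.shape) >= drop_rate   # ~50% of connections kept each iteration
        return x @ (W * weight_mask) + b
    # Test time: a common simplification is to use the expected weights
    return x @ (W * (1.0 - drop_rate)) + b

x = rng.standard_normal((1, 100))      # activations of the 100-neuron layer
W = rng.standard_normal((100, 50))     # weights to a 50-neuron next layer
b = np.zeros(50)
out = dropconnect_forward(x, W, b, drop_rate=0.5, training=True)
print(out.shape)                       # (1, 50)
```
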
Batch Normalization
Batch Normalization (BatchNorm) is a technique used to improve the training of deep neural networks by normalizing
the inputs to each layer. It helps address issues like internal covariate shift, accelerates training, and reduces the
dependency on careful initialization of weights. It also has a regularization effect, reducing the need for techniques like
Dropout in some cases.
1. Normalization: Each input is normalized using the mean and variance of the current mini-batch:
x̂i = (xi − μB) / √(σ²B + ϵ)
Where:
○ xi is the input to the layer
○ μB is the mean of the mini-batch
○ σ²B is the variance of the mini-batch
○ ϵ is a small constant added to avoid division by zero
2. Learnable Parameters: After normalization, the network applies two learnable parameters, scale (γ) and shift (β), producing
the output yi = γx̂i + β, which allows the model to adjust the normalized values back to a useful range for learning.
3. This allows the network to learn how much to scale and shift the normalized inputs, ensuring that the model can still
represent complex functions even after normalization.
4. Batch-Wise Calculation: The normalization is computed for each mini-batch independently during training, which
helps reduce internal covariate shift (the changes in the distribution of network activations as the parameters are
updated during training).
5. During Inference: During testing or inference, Batch Normalization uses the running mean and variance that were
computed during training, rather than the mean and variance of the current mini-batch, to ensure stable and consistent
behavior.
Example:
Let’s say you have a layer in a CNN with an input of size 64×64×128 (i.e., 128 feature maps of size 64×64). Applying
BatchNorm to this layer will involve:
1. Computing the mean and variance for each feature map over the mini-batch.
2. Normalizing each feature map by subtracting the mean and dividing by the standard deviation.
3. Applying the learnable scale and shift parameters to adjust the normalized values.
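A minimal NumPy sketch of these three steps for a convolutional activation shaped (batch, channels, height, width); the mini-batch size of 16 is an assumption, and the running statistics used at inference are omitted for brevity.

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """Batch normalization for a convolutional activation of shape (N, C, H, W):
    the mean and variance are computed per feature map (channel) over the mini-batch,
    then the learnable scale (gamma) and shift (beta) are applied."""
    mu = x.mean(axis=(0, 2, 3), keepdims=True)       # step 1: per-channel mean
    var = x.var(axis=(0, 2, 3), keepdims=True)       #         and variance over the batch
    x_hat = (x - mu) / np.sqrt(var + eps)            # step 2: normalize
    return gamma.reshape(1, -1, 1, 1) * x_hat + beta.reshape(1, -1, 1, 1)   # step 3: scale and shift

x = np.random.randn(16, 128, 64, 64)                 # mini-batch of 16, 128 feature maps of size 64x64
gamma, beta = np.ones(128), np.zeros(128)
y = batchnorm_forward(x, gamma, beta)
print(round(float(y.mean()), 3), round(float(y.std()), 3))   # ~0.0 and ~1.0 before gamma/beta change them
```
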