DL 4
Explain in detail
Learning in machine learning differs from pure optimization because of how objectives are
defined and optimized. In pure optimization, we aim to minimize a cost function directly as the
final goal. However, in machine learning, especially in training deep models, we often optimize
indirectly.
In most cases, we care about a performance measure P defined on the test set, which is typically
difficult to compute exactly. Instead of optimizing P directly, we minimize a different cost
function, J(θ), on the training data, hoping this will also improve P on the test data. Thus, while
pure optimization treats minimizing J as the goal in itself, machine learning uses J(θ) as a proxy
for improving test performance.
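The objective function is usually defined as the average loss over the training set, represented
as:
J(\theta) = \mathbb{E}_{(x, y) \sim \hat{p}_{\text{data}}} L(f(x; \theta), y) = \frac{1}{m} \sum_{i=1}^{m} L(f(x^{(i)}; \theta), y^{(i)})
where:
L is the per-example loss function,
f(x; θ) is the predicted output for input x,
p̂_data is the empirical distribution defined by the m training examples.
For supervised learning, y is the target label. Ideally, we want to minimize the true objective
function J*(θ), where the expectation is over the actual data-generating distribution p_data:
J^*(\theta) = \mathbb{E}_{(x, y) \sim p_{\text{data}}} L(f(x; \theta), y)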
In summary, the key difference is that in machine learning, J(θ) is optimized on a finite training
set as a means to improve performance on unseen data, while pure optimization directly seeks to
minimize a function for its own sake.
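To make the proxy relationship concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the one-parameter model f(x; θ) = θx, the squared-error loss, and the synthetic distribution y = 2x + noise standing in for p_data. It minimizes the average training loss on a small sample and then estimates the true objective with a much larger fresh sample.

```python
import random

def loss(prediction, target):
    """Per-example squared-error loss L(f(x; theta), y)."""
    return (prediction - target) ** 2

def model(x, theta):
    """Illustrative one-parameter model f(x; theta) = theta * x."""
    return theta * x

def sample(n, rng):
    """Draw n (x, y) pairs from a made-up data-generating distribution."""
    data = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        y = 2.0 * x + rng.gauss(0.0, 0.3)  # assumed ground truth: y = 2x + noise
        data.append((x, y))
    return data

def average_loss(theta, data):
    """Mean per-example loss over a finite data set."""
    return sum(loss(model(x, theta), y) for x, y in data) / len(data)

rng = random.Random(0)
train = sample(20, rng)        # small training set: the only thing we can optimize on
held_out = sample(20000, rng)  # large fresh sample standing in for p_data

# Closed-form minimizer of the training objective for this particular model.
theta_hat = sum(x * y for x, y in train) / sum(x * x for x, _ in train)

train_loss = average_loss(theta_hat, train)
test_loss = average_loss(theta_hat, held_out)
print(f"theta_hat={theta_hat:.3f}  train loss={train_loss:.4f}  held-out loss={test_loss:.4f}")
```

The training loss is what the optimizer sees; the held-out loss only approximates the quantity we actually care about, and the two generally differ.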
2. What are the most prominent challenges involved in optimization for training deep
models?
Training deep learning models involves several optimization challenges due to the high
complexity and non-convex nature of deep networks. Here are some of the most prominent
challenges:
1. *Vanishing and Exploding Gradients*: In very deep networks, gradients can become
extremely small (vanish) or large (explode) during backpropagation, making it difficult for
weights to update effectively. This often leads to slow learning in early layers or unstable
training.
2. *Local Minima and Saddle Points*: The loss landscape in deep networks is highly non-
convex, with numerous local minima and saddle points. Although local minima are usually less
problematic in deep learning, saddle points can trap the optimization process, slowing down
convergence.
4. *High Dimensionality*: Deep networks involve millions of parameters, creating a vast search
space for optimization. Navigating this space requires substantial computational resources and
efficient algorithms.
5. *Overfitting and Generalization*: As models become complex, they can fit the training data
too closely, leading to overfitting and poor generalization. Regularization techniques, such as
dropout and weight decay, are needed to address this challenge.
7. *Data and Label Noise*: Deep models are sensitive to noisy or mislabeled data, which can
mislead the optimization process. This requires techniques such as robust loss functions and
data cleaning strategies.
Addressing these challenges is key to successful optimization in deep learning and often
requires a combination of advanced algorithms, careful tuning, and regularization strategies.
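The vanishing and exploding gradient problem can be seen in a toy setting. The sketch below assumes a chain of identical scalar linear layers (a deliberate simplification, not a real network), so that backpropagation multiplies the same local derivative once per layer. The function name and the chosen weights are hypothetical.

```python
def gradient_through_chain(weight, depth):
    """d(output)/d(input) for `depth` stacked scalar layers h -> weight * h.

    Backpropagation multiplies one local derivative per layer, so the
    result equals weight ** depth.
    """
    grad = 1.0
    for _ in range(depth):
        grad *= weight
    return grad

depth = 50
for weight in (0.5, 1.0, 1.5):
    # Weights below 1 shrink the gradient, weights above 1 blow it up.
    print(f"weight={weight}: gradient={gradient_through_chain(weight, depth):.3e}")
```

In a real network the per-layer factor is a Jacobian matrix rather than a scalar, but the same compounding effect applies to its singular values.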
3. Write the Stochastic Gradient Descent algorithm for optimization.
4. Write a short note on Momentum.
5. Write a short note on Nesterov Momentum.
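Questions 3 to 5 concern SGD and its two momentum variants. As a hedged illustration, not a reproduction of any particular textbook listing, the sketch below applies the three update rules to a one-parameter least-squares problem. The function names, the synthetic data, and the values of the learning rate, momentum coefficient and batch size are all assumptions chosen for the example.

```python
import random

def minibatch_gradient(theta, batch):
    """Gradient of the mean of 0.5 * (theta * x - y)**2 over a minibatch."""
    return sum((theta * x - y) * x for x, y in batch) / len(batch)

def fit(data, method="sgd", lr=0.1, momentum=0.9, steps=500, batch_size=8, seed=0):
    rng = random.Random(seed)
    theta, velocity = 0.0, 0.0
    for _ in range(steps):
        batch = rng.sample(data, batch_size)  # minibatch drawn from the training set
        if method == "sgd":
            # Plain SGD: step against the minibatch gradient estimate.
            theta -= lr * minibatch_gradient(theta, batch)
        elif method == "momentum":
            # Velocity is an exponentially decaying accumulation of past gradients.
            velocity = momentum * velocity - lr * minibatch_gradient(theta, batch)
            theta += velocity
        elif method == "nesterov":
            # Gradient is evaluated at the interim point theta + momentum * velocity.
            lookahead = theta + momentum * velocity
            velocity = momentum * velocity - lr * minibatch_gradient(lookahead, batch)
            theta += velocity
        else:
            raise ValueError(f"unknown method: {method}")
    return theta

# Synthetic data with an assumed true slope of 3.
data_rng = random.Random(1)
data = []
for _ in range(200):
    x = data_rng.uniform(-1.0, 1.0)
    data.append((x, 3.0 * x + data_rng.gauss(0.0, 0.1)))

for method in ("sgd", "momentum", "nesterov"):
    print(method, round(fit(data, method), 3))
```

The only difference between the two momentum variants here is where the gradient is evaluated: at the current parameters for standard momentum, and after the velocity has been provisionally applied for Nesterov momentum.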
6. What are the Parameter Initialization Strategies for Optimization?
Parameter initialization strategies for optimization in deep learning focus on finding a good
starting point to ensure effective learning. Here are the main strategies:
1. Symmetry Breaking: To prevent hidden units from learning identical features, initialize each
unit with different values, typically with random initialization, to break symmetry.
2. Random Initialization: Initialize weights randomly, often using values from Gaussian or
uniform distributions. The choice of distribution matters less, but the scale (or size) of values
affects optimization and generalization.
3. Heuristic Initialization:
- Xavier Initialization (Glorot & Bengio): Weights for a layer with m inputs and n outputs are
sampled from a uniform distribution on [-sqrt(6/(m+n)), sqrt(6/(m+n))], a range chosen to keep
activation variance and gradient variance roughly consistent across layers.
- Orthogonal Initialization: Saxe et al. propose initializing weights as random orthogonal
matrices with a carefully chosen scaling (gain) factor to preserve gradient and activation norms.
4. Scaling Factor Adjustment: Adjust the initial weight scale to control forward and backward
signal flow. This helps avoid exploding or vanishing gradients, especially in deep networks.
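5. Sparse Initialization: Initialize weights sparsely, giving each unit only a fixed, small number
of non-zero weights. This keeps the inputs to different units diverse without forcing individual
weights to shrink as the layer gets wider.
7. Bias Initialization: Biases are often set to zero. For ReLU units a small positive value (for
example 0.1) is sometimes used so that units start out active rather than saturated at zero. For
output units, the bias can be set to match the marginal statistics of the targets, such as the
class frequencies in classification.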
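As one concrete instance of the strategies above, here is a minimal sketch of Xavier (Glorot) uniform initialization for a single fully connected layer. The function name, the layer sizes and the use of Python's standard random module are assumptions for illustration; a real framework would provide its own initializer.

```python
import math
import random

def xavier_uniform(fan_in, fan_out, rng):
    """Return a fan_out x fan_in weight matrix drawn from U(-limit, limit)."""
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-limit, limit) for _ in range(fan_in)] for _ in range(fan_out)]

rng = random.Random(0)
weights = xavier_uniform(fan_in=256, fan_out=128, rng=rng)
biases = [0.0] * 128  # zero biases; the random weights already break symmetry

# A uniform distribution on (-a, a) has variance a**2 / 3, which here is 2 / (fan_in + fan_out).
flat = [w for row in weights for w in row]
variance = sum(w * w for w in flat) / len(flat)
print(f"sample variance {variance:.5f}, target {2.0 / (256 + 128):.5f}")
```

Because every weight is drawn independently, no two units start with identical parameters, which is the symmetry-breaking requirement from item 1.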
1. Delta-Bar-Delta Algorithm
Introduced by: Jacobs, 1988
Purpose: This early approach aims to adjust learning rates for each model parameter
individually during training.
How It Works: If the partial derivative of the loss with respect to a parameter keeps the same
sign from one step to the next, the learning rate for that parameter is increased. If it changes
sign, the learning rate is decreased.
Limitations: The delta-bar-delta rule is a heuristic, and because it relies on comparing gradient
signs across steps it applies only to full batch optimization, where those signs are not corrupted
by minibatch noise. This restricts its use in more modern deep learning approaches.
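The sign rule described above can be sketched in a few lines. This is a simplified version written for illustration, not Jacobs's exact formulation: it compares the current partial derivative with the previous one and rescales the rate multiplicatively. The function name, the constants `up` and `down`, and the test problem are all assumptions.

```python
def sign_based_rates(grad_fn, params, rates, steps=100, up=1.1, down=0.5):
    """Full-batch gradient descent with one learning rate per parameter.

    A rate grows by `up` while its partial derivative keeps its sign and
    shrinks by `down` when the sign flips (illustrative constants).
    """
    params, rates = list(params), list(rates)
    previous = [0.0] * len(params)
    for _ in range(steps):
        grads = grad_fn(params)  # exact (full-batch) gradient
        for i, g in enumerate(grads):
            if g * previous[i] > 0:
                rates[i] *= up      # same sign as last step: speed up
            elif g * previous[i] < 0:
                rates[i] *= down    # sign flipped: we overshot, slow down
            params[i] -= rates[i] * g
        previous = grads
    return params, rates

def quadratic_grad(p):
    """Gradient of the badly conditioned f(p) = 0.5 * (100 * p0**2 + p1**2)."""
    return [100.0 * p[0], 1.0 * p[1]]

final_params, final_rates = sign_based_rates(quadratic_grad, [1.0, 1.0], [0.001, 0.001])
print(final_params, final_rates)
```

On this problem the two parameters end up with very different learning rates, which is the point of adapting the rate per parameter.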
Batch normalization is a technique that optimizes the training process of deep neural networks
by normalizing the inputs to each layer. It addresses the difficulties that arise when updating
very deep models, where changes in one layer affect all subsequent layers, making it hard to
select the right learning rate. Here’s a breakdown of the process and key benefits of batch
normalization:
1. Normalization: For each layer in the network, batch normalization first calculates the mean
and standard deviation of the activations in a minibatch of training data. These statistics are
used to normalize each unit in the layer, transforming the activations to have a mean of zero and
a standard deviation of one:
H' = \frac{H - \mu}{\sigma}
2. Scaling and Shifting: To retain the expressiveness of the network, batch normalization
introduces two new learnable parameters, γ and β, which scale and shift the normalized output:
\gamma H' + \beta
4. Running Averages for Inference: During training, the mean and standard deviation are
computed from each minibatch. However, for consistency during inference (when using a single
sample), batch normalization relies on running averages of these statistics calculated over the
training process.
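The steps above can be sketched as a small forward pass. This is an illustrative implementation under stated assumptions, not a library API: the class name, the exponential-moving-average form of the running statistics with `momentum=0.1`, and the small constant `eps` added for numerical stability are all choices made for the example.

```python
import math

class BatchNormSketch:
    """Batch normalization over the feature columns of a minibatch (sketch)."""

    def __init__(self, num_features, momentum=0.1, eps=1e-5):
        self.gamma = [1.0] * num_features         # learnable scale
        self.beta = [0.0] * num_features          # learnable shift
        self.running_mean = [0.0] * num_features  # used at inference time
        self.running_var = [1.0] * num_features
        self.momentum = momentum
        self.eps = eps

    def forward(self, batch, training=True):
        out = [[0.0] * len(self.gamma) for _ in batch]
        for j in range(len(self.gamma)):
            if training:
                column = [row[j] for row in batch]
                mean = sum(column) / len(column)
                var = sum((v - mean) ** 2 for v in column) / len(column)
                # Update running averages of the minibatch statistics.
                self.running_mean[j] += self.momentum * (mean - self.running_mean[j])
                self.running_var[j] += self.momentum * (var - self.running_var[j])
            else:
                mean, var = self.running_mean[j], self.running_var[j]
            std = math.sqrt(var + self.eps)
            for i, row in enumerate(batch):
                # Normalize, then scale by gamma and shift by beta.
                out[i][j] = self.gamma[j] * (row[j] - mean) / std + self.beta[j]
        return out

bn = BatchNormSketch(num_features=2)
batch = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]]
normalized = bn.forward(batch, training=True)
single = bn.forward([[2.5, 25.0]], training=False)  # inference on one sample
```

During training each column of `normalized` has approximately zero mean and unit variance; at inference the single sample is normalized with the stored running statistics instead, since a batch of one has no meaningful variance.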
Benefits of Batch Normalization
Reduces Internal Covariate Shift: Batch normalization reduces changes in the distribution of
layer inputs, stabilizing the learning process and making it easier to train very deep networks.
Improved Learning Dynamics: By normalizing the mean and variance, it minimizes higher-
order effects that make the gradient steps unpredictable, thereby allowing larger learning rates
and faster convergence.
### Applications
Coordinate Descent is widely used in Lasso regression, sparse learning, and other scenarios
where only a subset of variables is relevant, making it an efficient choice in feature selection
and large-scale machine learning problems.
10. Write a short note on Supervised Pretraining
Supervised Pretraining is a strategy in deep learning where a model is trained on simpler tasks
or with simpler architectures before addressing a more complex target task. This approach can
improve the model's performance by gradually guiding it towards learning essential features
without overwhelming it with complexity early on.
1. Simplified Training for Complex Models: Directly training deep models on challenging tasks
can be difficult. Supervised pretraining addresses this by breaking down the task into simpler,
manageable parts, or by initially training a simpler model, then incrementally adding
complexity.
2. Greedy Layer-Wise Pretraining: Pioneered by Bengio et al. (2007), this approach involves
training each hidden layer in isolation within a shallow supervised model. Each new layer is
pretrained using the output of the previously trained layer. After pretraining, the full model is
fine-tuned jointly. This helps stabilize learning by gradually guiding deeper layers.
3. Transfer Learning: Pretrained models, particularly deep convolutional networks, are often
used as starting points for new tasks. For example, Yosinski et al. (2014) pretrained a model on
ImageNet and then transferred the initial layers to a new model, adapting it to related tasks with
less data.
4. FitNets Approach: Developed by Romero et al. (2015), FitNets use a "teacher" model to help
train a deeper "student" model. The student model learns not only the output but also the
intermediate layer representations of the teacher model, which serves as a "hint" to guide its
training. This approach helps train deeper models with fewer parameters, enhancing
generalization.
Supervised pretraining is valuable both for optimization, making training more stable, and for
generalization, as it helps the model learn useful representations that transfer well to complex
tasks. This technique is widely used in deep learning to improve performance and
computational efficiency.
2. *Regularization Techniques*:
- Use *dropout* to prevent overfitting by randomly deactivating neurons during training.
- Implement *batch normalization*, which normalizes activations and can improve
convergence speed.
- Apply *L2 regularization* to penalize large weights and reduce overfitting.
6. *Gradient Clipping*:
- Use gradient clipping in cases where gradients may explode (common in RNNs) to maintain
training stability.
7. *Early Stopping*:
- Monitor the validation loss and stop training once it begins to increase, preventing
overfitting.
8. *Transfer Learning*:
- Use pre-trained models on large datasets and fine-tune them on your specific task. This often
yields faster and better results than training from scratch.
9. *Hyperparameter Tuning*:
- Use grid or random search, or leverage more advanced methods like *Bayesian
optimization* to find optimal hyperparameters, including learning rate, batch size, and network
depth.
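Two of the items above, gradient clipping and early stopping, are simple enough to sketch directly. The helper names are hypothetical, clipping is done by rescaling to a maximum L2 norm (one common choice among several), and the `patience` parameter is an added assumption: the rule as stated in the list corresponds to stopping at the first increase.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient values so their L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]

def early_stopping_epoch(val_losses, patience=2):
    """Return the index of the epoch at which training would stop.

    Training stops once the validation loss has failed to improve on its
    best value for `patience` consecutive epochs.
    """
    best, waited = float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, waited = loss, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch
    return len(val_losses) - 1  # never triggered: train to the end

clipped = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
stop_at = early_stopping_epoch([0.9, 0.7, 0.6, 0.65, 0.7, 0.8], patience=2)
print(clipped, stop_at)
```

Clipping preserves the direction of the gradient and only limits its length. In practice early stopping is usually paired with restoring the parameters from the epoch with the best validation loss.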