DL 4

1. How does learning differ from pure optimization? Explain in detail.
Learning in machine learning differs from pure optimization in how objectives are defined and optimized. In pure optimization, minimizing a cost function is itself the final goal. In machine learning, especially when training deep models, we usually optimize indirectly.

In most cases we care about some performance measure P defined with respect to the test set, which is typically difficult to compute exactly. Instead of optimizing the true objective J*(θ) directly, we minimize a different cost function, J(θ), on the training data, hoping this also improves performance on the test data. Thus, while pure optimization focuses on directly minimizing a given function, machine learning performs indirect optimization, with J(θ) acting as a proxy for test performance.

The objective function is usually defined as the average loss over the training set, represented as:

J(\theta) = \mathbb{E}_{(x,y) \sim \hat{p}_{\text{data}}} L(f(x; \theta), y),

where:

L is the loss function for each example,

f(x; θ) is the model's predicted output for input x,

p̂_data represents the empirical data distribution.

For supervised learning, y is the target label. Ideally, we want to minimize the true objective function J*(θ), where the expectation is taken over the actual data-generating distribution p_data:

J^*(\theta) = \mathbb{E}_{(x,y) \sim p_{\text{data}}} L(f(x; \theta), y).

In summary, the key difference is that in machine learning, J(θ) is optimized on a finite training set as a means to improve performance on unseen data, while pure optimization directly seeks to minimize a function for its own sake.
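To make the distinction concrete, here is a minimal NumPy sketch (the linear model, squared-error loss, and data-generating process are all illustrative assumptions): training minimizes the empirical risk J(θ) on a finite sample, while the quantity we actually care about, J*(θ), can only be estimated from held-out data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data-generating process: y = 2x + noise.
def sample(n):
    x = rng.normal(size=n)
    y = 2.0 * x + 0.1 * rng.normal(size=n)
    return x, y

x_train, y_train = sample(100)    # finite training set ~ p̂_data
x_test,  y_test  = sample(10000)  # large held-out stand-in for p_data

def risk(theta, x, y):
    """Average squared-error loss L(f(x; theta), y) over a dataset."""
    return np.mean((theta * x - y) ** 2)

# Gradient descent minimizes the empirical risk J(theta) on the training set...
theta = 0.0
for _ in range(200):
    grad = np.mean(2 * (theta * x_train - y_train) * x_train)
    theta -= 0.1 * grad

# ...but what we actually care about is the (estimated) true risk J*(theta).
print("empirical risk J(theta):      ", risk(theta, x_train, y_train))
print("estimated true risk J*(theta):", risk(theta, x_test, y_test))
```

The two printed values generally differ; that gap is exactly why minimizing J(θ) is only a proxy for minimizing J*(θ).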
2. What are the most prominent challenges involved in optimization for training deep models?
Training deep learning models involves several optimization challenges due to the high
complexity and non-convex nature of deep networks. Here are some of the most prominent
challenges:

1. *Vanishing and Exploding Gradients*: In very deep networks, gradients can become
extremely small (vanish) or large (explode) during backpropagation, making it difficult for
weights to update effectively. This often leads to slow learning in early layers or unstable
training.

2. *Local Minima and Saddle Points*: The loss landscape in deep networks is highly non-
convex, with numerous local minima and saddle points. Although local minima are usually less
problematic in deep learning, saddle points can trap the optimization process, slowing down
convergence.

3. *Ill-conditioned Problems*: Some parameters or directions in the optimization landscape may change at very different rates, causing gradients to oscillate or become erratic. This makes it challenging to find a consistent path to minimize the loss function efficiently.

4. *High Dimensionality*: Deep networks involve millions of parameters, creating a vast search
space for optimization. Navigating this space requires substantial computational resources and
efficient algorithms.

5. *Overfitting and Generalization*: As models become complex, they can fit the training data
too closely, leading to overfitting and poor generalization. Regularization techniques, such as
dropout and weight decay, are needed to address this challenge.

6. *Hyperparameter Sensitivity*: The performance of deep networks is highly sensitive to hyperparameters like learning rate, batch size, and momentum. Finding optimal values often requires extensive tuning, which is computationally expensive.

7. *Data and Label Noise*: Deep models are sensitive to noisy or mislabeled data, which can
mislead the optimization process. This requires techniques such as robust loss functions and
data cleaning strategies.

Addressing these challenges is key to successful optimization in deep learning and often
requires a combination of advanced algorithms, careful tuning, and regularization strategies.
3. Write the Stochastic Gradient Descent algorithm for optimization.
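A minimal NumPy sketch of minibatch SGD (the gradient function `grad`, the data arrays, and the hyperparameters are illustrative assumptions): at each step the gradient of the average loss is estimated on a small random minibatch, and the parameters take a step of size `lr` against it.

```python
import numpy as np

def sgd(theta, grad, X, Y, lr=0.01, epochs=10, batch_size=32, seed=0):
    """Stochastic Gradient Descent: update parameters using the gradient
    estimated on a random minibatch rather than the full training set."""
    rng = np.random.default_rng(seed)
    n = len(X)
    for _ in range(epochs):
        order = rng.permutation(n)               # reshuffle every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            g = grad(theta, X[idx], Y[idx])      # minibatch gradient estimate
            theta = theta - lr * g               # step opposite the gradient
    return theta
```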
4. Write a short note on Momentum.
5. Write a short note on Nesterov Momentum.
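For questions 4 and 5: momentum accumulates an exponentially decaying moving average of past gradients in a velocity term v and moves the parameters along it, which damps oscillations and speeds progress along directions of consistent gradient. Nesterov momentum differs only in where the gradient is evaluated: at the look-ahead point θ + μv, which acts as a correction factor to standard momentum. A minimal sketch of both update rules (names and hyperparameters are illustrative):

```python
import numpy as np

def momentum_step(theta, v, grad_fn, lr=0.01, mu=0.9):
    """Classical momentum: v accumulates past gradients; theta moves along v."""
    v = mu * v - lr * grad_fn(theta)
    return theta + v, v

def nesterov_step(theta, v, grad_fn, lr=0.01, mu=0.9):
    """Nesterov momentum: evaluate the gradient at the look-ahead point
    theta + mu * v before applying the velocity update."""
    v = mu * v - lr * grad_fn(theta + mu * v)
    return theta + v, v
```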
6. What are the parameter initialization strategies for optimization?

Parameter initialization strategies for optimization in deep learning focus on finding a good
starting point to ensure effective learning. Here are the main strategies:

1. Symmetry Breaking: To prevent hidden units from learning identical features, initialize each
unit with different values, typically with random initialization, to break symmetry.

2. Random Initialization: Initialize weights randomly, often using values from Gaussian or
uniform distributions. The choice of distribution matters less, but the scale (or size) of values
affects optimization and generalization.

3. Heuristic Initialization:
- Xavier Initialization (Glorot & Bengio): Weights are sampled from a uniform distribution, with the range set to ensure consistent activation variance across layers. This helps balance gradients across layers.
- Orthogonal Initialization: Saxe et al. propose initializing weights as random orthogonal matrices with a specific scaling factor to preserve gradient and activation norms.

4. Scaling Factor Adjustment: Adjust the initial weight scale to control forward and backward
signal flow. This helps avoid exploding or vanishing gradients, especially in deep networks.

5. Sparse Initialization: Initialize weights sparsely, assigning only a few non-zero values per
unit to preserve input diversity without shrinking weights.

6. Hyperparameter Tuning: Treat initialization scales as hyperparameters to find optimal values that lead to efficient training and good generalization.

7. Bias Initialization: Biases are often set to zero or a small positive value for ReLUs to avoid
saturation. For specific tasks, biases are set based on marginal statistics of the output, such as in
classification.

8. Transfer Learning: Initialize parameters using a pre-trained model on a similar or unrelated task, which can help achieve faster convergence by leveraging prior knowledge.
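As a concrete instance of strategy 3, a minimal sketch of Xavier/Glorot uniform initialization (the layer sizes are assumed for illustration):

```python
import numpy as np

def xavier_uniform(fan_in, fan_out, rng=None):
    """Glorot & Bengio (2010): sample W ~ U(-a, a) with
    a = sqrt(6 / (fan_in + fan_out)), chosen so that activation and
    gradient variance stay roughly constant across layers."""
    if rng is None:
        rng = np.random.default_rng(0)
    a = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-a, a, size=(fan_in, fan_out))

W1 = xavier_uniform(784, 256)  # e.g. an MLP's first layer (sizes assumed)
b1 = np.zeros(256)             # biases commonly start at zero (see point 7)
```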

7. List and Explain Algorithms with Adaptive Learning Rates

Here are the key algorithms with adaptive learning rates:

1. Delta-Bar-Delta Algorithm
Introduced by: Jacobs, 1988
Purpose: This early approach aims to adjust learning rates for each model parameter
individually during training.
How It Works: If the partial derivative of the loss with respect to a parameter keeps the same sign, the learning rate for that parameter is increased. If it changes sign, the learning rate is decreased.
Limitations: The delta-bar-delta rule is a heuristic approach and is limited to full batch
optimization, which restricts its use in more modern deep learning approaches.

2. AdaGrad (Adaptive Gradient Algorithm)
Introduced by: Duchi et al., 2011
Purpose: AdaGrad adjusts the learning rate of each parameter based on the historical gradient
information, aiming for faster convergence.
How It Works: The learning rate is scaled inversely with the square root of the cumulative sum of all previous squared gradients for each parameter. Parameters with consistently large gradients therefore have their learning rates decay faster, while parameters with smaller gradients keep a relatively higher learning rate.
Benefits: It is useful for convex optimization problems and gives greater weight to directions
with smaller gradients.
Limitations: In deep learning, where gradients can accumulate over time, the effective learning
rate may decay too quickly, resulting in premature stagnation.
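A minimal sketch of one AdaGrad update as described above (variable names and ε are illustrative):

```python
import numpy as np

def adagrad_step(theta, r, grad, lr=0.01, eps=1e-8):
    """AdaGrad: accumulate squared gradients in r and scale the step by
    1/sqrt(r), so frequently-updated directions get smaller effective rates."""
    r = r + grad ** 2                              # cumulative squared gradients
    theta = theta - lr * grad / (np.sqrt(r) + eps)
    return theta, r
```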

3. RMSProp (Root Mean Square Propagation)
Introduced by: Hinton, 2012
Purpose: RMSProp modifies AdaGrad to improve performance in non-convex optimization
settings.
How It Works: Instead of a cumulative sum of squared gradients, RMSProp uses an
exponentially weighted moving average to keep track of past gradients. This decay factor
prevents the learning rate from becoming too small too quickly.
Benefits: RMSProp is effective in non-convex settings like deep learning, as it can "forget"
older gradients and adapt faster to recent gradients, helping it converge more rapidly.
Limitations: RMSProp introduces a decay hyperparameter, which controls the averaging scale
and needs to be set carefully.
4. Adam (Adaptive Moment Estimation)
Introduced by: Kingma and Ba, 2014
Purpose: Adam combines ideas from both RMSProp and momentum to create a more stable and
adaptive optimization method.
How It Works: Adam tracks two moving averages: the first moment (mean) of the gradients and
the second moment (uncentered variance) of the gradients. Both averages are corrected for bias,
which allows for a more accurate gradient estimate. Adam essentially integrates the benefits of
RMSProp and momentum into a single framework.
Benefits: Adam is robust and effective in practice, widely used in deep learning due to its
adaptability and reduced sensitivity to hyperparameters.
Limitations: It may still require slight tuning of the learning rate in some cases.
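A minimal sketch of one Adam update following the description above (the default hyperparameter values are those suggested by Kingma and Ba; variable names are illustrative):

```python
import numpy as np

def adam_step(theta, m, v, grad, t, lr=0.001,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: first/second moment estimates with bias correction.
    t is the step count, starting at 1."""
    m = beta1 * m + (1 - beta1) * grad           # first moment (mean)
    v = beta2 * v + (1 - beta2) * grad ** 2      # second moment (uncentered variance)
    m_hat = m / (1 - beta1 ** t)                 # bias-corrected estimates
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```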
8. Explain the Process of Batch Normalization in optimizing Deep Neural Networks

Batch normalization is a technique that optimizes the training process of deep neural networks
by normalizing the inputs to each layer. It addresses the difficulties that arise when updating
very deep models, where changes in one layer affect all subsequent layers, making it hard to
select the right learning rate. Here’s a breakdown of the process and key benefits of batch
normalization:

Process of Batch Normalization

1. Normalization: For each layer in the network, batch normalization first calculates the mean
and standard deviation of the activations in a minibatch of training data. These statistics are
used to normalize each unit in the layer, transforming the activations to have a mean of zero and
a standard deviation of one:
H' = \frac{H - \mu}{\sigma}

2. Scaling and Shifting: To retain the expressiveness of the network, batch normalization introduces two new learnable parameters, γ and β, which scale and shift the normalized output:

\text{normalized output} = \gamma H' + \beta

3. Backpropagation Through Normalization: Batch normalization allows gradients to flow more predictably through the network by backpropagating through the normalization steps. This stabilizes the learning process by preventing drastic changes in layer statistics.

4. Running Averages for Inference: During training, the mean and standard deviation are
computed from each minibatch. However, for consistency during inference (when using a single
sample), batch normalization relies on running averages of these statistics calculated over the
training process.
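Steps 1 and 2 can be written compactly. Below is a minimal NumPy sketch of the training-time forward pass for one layer, where H holds a minibatch of activations (rows are examples) and ε is a small constant assumed for numerical stability:

```python
import numpy as np

def batch_norm_forward(H, gamma, beta, eps=1e-5):
    """Normalize each unit over the minibatch (axis 0), then scale and shift."""
    mu = H.mean(axis=0)                  # per-unit minibatch mean
    sigma = H.std(axis=0)                # per-unit minibatch std
    H_norm = (H - mu) / (sigma + eps)    # H' = (H - mu) / sigma
    return gamma * H_norm + beta         # learnable scale gamma, shift beta
```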
Benefits of Batch Normalization

Reduces Internal Covariate Shift: Batch normalization reduces changes in the distribution of
layer inputs, stabilizing the learning process and making it easier to train very deep networks.

Improved Learning Dynamics: By normalizing the mean and variance, it minimizes higher-
order effects that make the gradient steps unpredictable, thereby allowing larger learning rates
and faster convergence.

Reduced Sensitivity to Initialization and Hyperparameters: Batch normalization makes the network less sensitive to initial weights and learning rates, enabling more consistent performance across different configurations.
Regularization Effect: By adding a small amount of noise (from minibatch statistics) during
training, batch normalization can act as a form of regularization, potentially reducing the need
for other regularizers like dropout.
Batch normalization has become a standard component in deep learning architectures because it
simplifies training and enhances model performance by maintaining stable input distributions
across layers.
9. What is Coordinate Descent in practice? Elaborate on the strategy.

Coordinate Descent is an optimization algorithm primarily used in machine learning and statistical modeling. It involves optimizing an objective function by updating one variable
(coordinate) at a time while keeping others fixed. This iterative process continues until a
convergence criterion is met, which might be a minimum value of the objective function or a set
number of iterations.

### Strategy of Coordinate Descent


1. *Initialization*: Start with an initial guess or random values for the variables.
2. *Coordinate Update*: Pick one coordinate (variable) and optimize the objective function
with respect to this coordinate while keeping the others constant.
3. *Sequential Optimization*: Move sequentially through each coordinate, updating it and
keeping others fixed, and repeat until all coordinates have been optimized once (a full pass).
4. *Iteration*: After each pass, the algorithm checks if the convergence criterion is satisfied. If
not, it continues iterating, updating each coordinate until convergence.

### Why Coordinate Descent Works


- *Simplicity*: By breaking down a complex, high-dimensional optimization problem into
smaller, one-dimensional problems, it simplifies the computation.
- *Efficiency*: Often faster for high-dimensional problems where each dimension can be
optimized independently.
- *Convergence*: Works well for problems where the objective function is convex. In non-
convex scenarios, it might converge to a local minimum.

### Applications
Coordinate Descent is widely used in Lasso regression, sparse learning, and other scenarios
where only a subset of variables is relevant, making it an efficient choice in feature selection
and large-scale machine learning problems.
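A minimal sketch of the strategy on a convex quadratic f(x) = 0.5 x^T A x - b^T x, where each one-dimensional subproblem has a closed-form solution (A, b, and the stopping tolerance are illustrative):

```python
import numpy as np

# Illustrative convex quadratic: f(x) = 0.5 x^T A x - b^T x, A positive definite.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])

x = np.zeros(2)                            # 1. initialization
for sweep in range(100):                   # 4. iterate full passes
    x_old = x.copy()
    for i in range(len(x)):                # 2./3. one coordinate at a time
        # Exact 1-D minimizer of f along coordinate i, others held fixed.
        x[i] = (b[i] - A[i] @ x + A[i, i] * x[i]) / A[i, i]
    if np.max(np.abs(x - x_old)) < 1e-10:  # convergence criterion
        break

print(x, np.linalg.solve(A, b))            # coordinate descent vs direct solve
```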
10. Write a short note on Supervised Pretraining.

Supervised Pretraining is a strategy in deep learning where a model is trained on simpler tasks
or with simpler architectures before addressing a more complex target task. This approach can
improve the model's performance by gradually guiding it towards learning essential features
without overwhelming it with complexity early on.

Key Points of Supervised Pretraining

1. Simplified Training for Complex Models: Directly training deep models on challenging tasks
can be difficult. Supervised pretraining addresses this by breaking down the task into simpler,
manageable parts, or by initially training a simpler model, then incrementally adding
complexity.

2. Greedy Layer-Wise Pretraining: Pioneered by Bengio et al. (2007), this approach involves
training each hidden layer in isolation within a shallow supervised model. Each new layer is
pretrained using the output of the previously trained layer. After pretraining, the full model is
fine-tuned jointly. This helps stabilize learning by gradually guiding deeper layers.

3. Transfer Learning: Pretrained models, particularly deep convolutional networks, are often
used as starting points for new tasks. For example, Yosinski et al. (2014) pretrained a model on
ImageNet and then transferred the initial layers to a new model, adapting it to related tasks with
less data.

4. FitNets Approach: Developed by Romero et al. (2015), FitNets use a "teacher" model to help
train a deeper "student" model. The student model learns not only the output but also the
intermediate layer representations of the teacher model, which serves as a "hint" to guide its
training. This approach helps train deeper models with fewer parameters, enhancing
generalization.
Supervised pretraining is valuable both for optimization, making training more stable, and for
generalization, as it helps the model learn useful representations that transfer well to complex
tasks. This technique is widely used in deep learning to improve performance and
computational efficiency.
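To make point 3 concrete, here is a minimal PyTorch sketch (the architectures and the "pretrained" network are stand-ins, not a real checkpoint) that copies the early layers of one model into another and freezes them so only the new head is fine-tuned:

```python
import torch.nn as nn

# Illustrative architectures: the two models share their early layers.
def make_net(n_out):
    return nn.Sequential(
        nn.Linear(784, 256), nn.ReLU(),   # early layers: generic features
        nn.Linear(256, 128), nn.ReLU(),
        nn.Linear(128, n_out),            # task-specific head (module "4")
    )

pretrained = make_net(1000)   # stand-in for a model trained on a source task
student = make_net(10)        # new target task with fewer classes

# Copy all matching weights except the final head; strict=False skips the rest.
state = {k: v for k, v in pretrained.state_dict().items()
         if not k.startswith("4.")}
student.load_state_dict(state, strict=False)

# Freeze the transferred layers; only the new head is fine-tuned.
for name, p in student.named_parameters():
    p.requires_grad = name.startswith("4.")
```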

11. What is the best strategy to improve optimization?


Improving optimization in deep learning involves refining the model training process to ensure
faster convergence, better accuracy, and reduced resource usage. Here are some effective
strategies:

1. *Optimizers and Learning Rates*:
- Use adaptive optimizers like *Adam* or *RMSprop* which adjust learning rates for
different parameters.
- Employ *learning rate schedules* (decay, warm-up) or use *cyclical learning rates* to help
the model converge faster.
- Consider *momentum* (SGD with momentum) to push past local minima.

2. *Regularization Techniques*:
- Use *dropout* to prevent overfitting by randomly deactivating neurons during training.
- Implement *batch normalization*, which normalizes activations and can improve
convergence speed.
- Apply *L2 regularization* to penalize large weights and reduce overfitting.

3. *Data Augmentation and Quality*:
- Augment training data (rotations, scaling, cropping, etc.) to improve generalization.
- Ensure data is clean and representative to improve model accuracy and avoid overfitting.
4. *Batch Size Tuning*:
- Experiment with *mini-batch sizes*—larger batches can make training more stable, while
smaller batches introduce noise which can help escape local minima.

5. *Model Architecture Optimization*:
- Optimize the number of layers and neurons per layer for your dataset.
- Prune unnecessary parameters or use *neural architecture search* for efficient architecture
selection.

6. *Gradient Clipping*:
- Use gradient clipping in cases where gradients may explode (common in RNNs) to maintain training stability; a short sketch is given at the end of this answer.

7. *Early Stopping*:
- Monitor the validation loss and stop training once it begins to increase, preventing
overfitting.

8. *Transfer Learning*:
- Use pre-trained models on large datasets and fine-tune them on your specific task. This often
yields faster and better results than training from scratch.

9. *Hyperparameter Tuning*:
- Use grid or random search, or leverage more advanced methods like *Bayesian
optimization* to find optimal hyperparameters, including learning rate, batch size, and network
depth.

10. *Mixed Precision Training*:
- Use mixed precision (FP16) training to speed up computation on modern GPUs without
significantly compromising model accuracy.
Applying these strategies should help achieve a well-optimized deep learning model with better
performance and faster convergence.
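As referenced under point 6, a minimal sketch of gradient clipping by global norm (the threshold is an illustrative choice):

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale a list of gradient arrays so that their combined L2 norm
    is at most max_norm; gradients below the threshold are left untouched."""
    total = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    scale = min(1.0, max_norm / (total + 1e-12))
    return [g * scale for g in grads]
```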

12. Write a short note on Continuation Methods and Curriculum Learning.

*Continuation Methods* in deep learning refer to gradually solving an optimization problem by first solving an easier version, then progressively making it harder. This approach helps avoid poor local minima by allowing the model to settle into a good solution with simpler tasks before handling more complex ones. Often used in non-convex optimization, continuation methods smooth out the objective function, gradually reducing the smoothing to converge on the desired solution.

*Curriculum Learning* is inspired by human learning, where models are trained on simpler tasks before progressively tackling harder ones. In deep learning, this involves
feeding the model easy examples initially, then gradually increasing the difficulty. This
approach helps models converge faster and find better generalization by allowing them to
develop strong foundational representations. Curriculum learning is particularly effective
in tasks requiring hierarchical feature learning, like natural language processing or
complex visual tasks.
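A minimal sketch of a curriculum schedule, assuming each training example comes with a precomputed difficulty score (the scoring and pacing are illustrative choices): early stages sample minibatches only from the easiest examples, and later stages widen the pool.

```python
import numpy as np

def curriculum_batches(X, difficulty, n_stages=4, batch_size=32, seed=0):
    """Yield minibatches from progressively larger, harder subsets:
    stage k draws from the easiest k/n_stages fraction of the data.
    Assumes len(X) >= batch_size."""
    rng = np.random.default_rng(seed)
    order = np.argsort(difficulty)      # easiest examples first
    n = len(X)
    for stage in range(1, n_stages + 1):
        pool = order[: max(batch_size, stage * n // n_stages)]
        for _ in range(pool.size // batch_size):
            idx = rng.choice(pool, size=batch_size, replace=False)
            yield stage, X[idx]
```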
