Hyperparameter Tuning in Machine Learning: A Comprehensive Review
Authors’ contributions
This work was carried out in collaboration among all authors. All authors read and approved the final
manuscript.
Article Information
DOI: https://doi.org/10.9734/jerr/2024/v26i61188
Review Article
Received: 29/03/2024
Accepted: 03/06/2024
Published: 07/06/2024
Cite as: A Ilemobayo, Justus, Olamide Durodola, Oreoluwa Alade, Opeyemi J Awotunde, Adewumi T Olanrewaju, Olumide
Falana, Adedolapo Ogungbire, Abraham Osinuga, Dabira Ogunbiyi, Ark Ifeanyi, Ikenna E Odezuligbo, and Oluwagbotemi E
Edu. 2024. “Hyperparameter Tuning in Machine Learning: A Comprehensive Review”. Journal of Engineering Research and
Reports 26 (6):388-95. https://doi.org/10.9734/jerr/2024/v26i61188.
ABSTRACT
Hyperparameter tuning is essential for optimizing the performance and generalization of machine
learning (ML) models. This review explores the critical role of hyperparameter tuning in ML,
detailing its importance, applications, and various optimization techniques. Key factors influencing
ML performance, such as data quality, algorithm selection, and model complexity, are discussed,
along with the impact of hyperparameters like learning rate and batch size on model training.
Various tuning methods are examined, including grid search, random search, Bayesian
optimization, and meta-learning. Special focus is given to the learning rate in deep learning,
highlighting strategies for its optimization. Trade-offs in hyperparameter tuning, such as balancing
computational cost and performance gain, are also addressed. Concluding with challenges and
future directions, this review provides a comprehensive resource for improving the effectiveness
and efficiency of ML models.
Keywords: Hyperparameter tuning; learning rate; batch size; grid search; random search; Bayesian
optimization; meta-learning; neural networks.
Hyperparameters are the parameters that govern the training process and structure of machine learning models. Unlike model parameters, which are learned during training, hyperparameters are set before the training process begins. They play a critical role in determining the performance of the model. Examples of hyperparameters include the learning rate in neural networks, the number of trees in a random forest, the depth of a decision tree, the penalty term in support vector machines, momentum, learning rate decay (a gradual reduction in the learning rate over time), and the regularization constant.

The relationship between hyperparameters and performance is complex. Properly tuned hyperparameters can lead to significant improvements in model performance, while poorly chosen hyperparameters can result in suboptimal models. For instance, in neural networks, the learning rate controls how quickly the model updates its weights during training. A learning rate that is too high can cause the model to converge too quickly to a suboptimal solution, while a learning rate that is too low can make the training process unnecessarily slow [8]. Momentum, in turn, determines how strongly the previous update direction influences the next step.
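To make the roles of these two hyperparameters concrete, the following sketch implements a plain stochastic-gradient-descent update with momentum in NumPy. It is a minimal illustration rather than code from the reviewed works; the quadratic toy loss and the specific values of `lr` and `momentum` are arbitrary choices for demonstration.

```python
import numpy as np

def sgd_momentum_step(weights, grad, velocity, lr=0.1, momentum=0.9):
    """One SGD-with-momentum update: lr scales the step size, while momentum
    carries over a fraction of the previous update direction."""
    velocity = momentum * velocity - lr * grad
    return weights + velocity, velocity

# Toy problem: minimise f(w) = ||w||^2, whose gradient is 2w.
w = np.array([5.0, -3.0])
v = np.zeros_like(w)
for _ in range(100):
    grad = 2 * w  # gradient of the toy loss at the current weights
    w, v = sgd_momentum_step(w, grad, v, lr=0.1, momentum=0.9)

print(w)  # approaches the optimum [0, 0]; a much larger lr would diverge instead
```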
The importance of achieving high performance in ML cannot be overstated. High-performing models are needed for:

1. Operational Efficiency: High performance translates to better decision-making and operational efficiency. In industrial applications, this means optimized processes, reduced downtime, and increased productivity.
2. Competitive Advantage: Businesses leveraging high-performing ML models can gain a competitive edge by offering better products and services. For instance, recommendation systems used by companies like Amazon and Netflix rely on high-performing models to provide personalized experiences that keep customers engaged [9].
3. Advancement of Research: In scientific research, high-performing models enable the discovery of new knowledge and insights. For example, in drug discovery, ML models can predict the efficacy of new compounds, accelerating the development of new treatments [10].
1.1 Factors Influencing Performance of Machine Learning Models

Several factors influence the performance of machine learning models. High-quality data that accurately represents the problem domain is crucial. Data preprocessing steps, such as cleaning, normalization, and feature engineering, enhance data quality. Additionally, having a large dataset provides more information, enabling the model to learn better and generalize well [11]. The choice of algorithm is also critical, as different algorithms have different strengths and are suitable for different types of problems. Selecting an appropriate algorithm that aligns with the problem's nature and data characteristics is essential for achieving high performance [12].
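As a concrete illustration of how preprocessing and algorithm choice fit together, the scikit-learn sketch below chains normalization with a classifier in a single pipeline. The dataset, the scaler, and the choice of logistic regression are illustrative assumptions, not recommendations from the review.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Normalization and the learning algorithm are bundled so the same
# preprocessing is applied consistently during training and prediction.
pipeline = Pipeline([
    ("scale", StandardScaler()),                  # data preprocessing step
    ("clf", LogisticRegression(max_iter=1000)),   # chosen algorithm
])
pipeline.fit(X_train, y_train)
print("held-out accuracy:", pipeline.score(X_test, y_test))
```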
Hyperparameter tuning is another significant factor, as hyperparameters control the behavior and complexity of the model. Properly tuned hyperparameters can lead to significant improvements in performance. For example, in deep learning, hyperparameters such as learning rate, batch size, and the number of layers and units in the network can greatly affect the convergence and accuracy of the model [13]. Model complexity, defined by its architecture and the number of parameters, also affects performance. A model that is too simple may underfit the data, failing to capture underlying patterns, while a model that is too complex may overfit, capturing noise and spurious correlations [14].
Regularization techniques, such as L1 and L2 regularization, dropout, and early stopping, help prevent overfitting by adding constraints to the model. These techniques maintain a balance between bias and variance, leading to better generalization [15,16]. Finally, the choice of evaluation methods, such as cross-validation and bootstrapping, influences the assessment of model performance. Proper evaluation ensures that performance metrics are reliable and not biased by the specificities of the training and test datasets [17].
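The snippet below sketches how k-fold cross-validation gives a less biased performance estimate than a single train/test split; the dataset, the L2-regularized classifier, and the choice of five folds are assumptions made purely for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# L2-regularized linear classifier; alpha is its regularization hyperparameter.
model = RidgeClassifier(alpha=1.0)

# 5-fold cross-validation: each fold serves once as the validation set,
# so the reported score does not depend on one particular split.
scores = cross_val_score(model, X, y, cv=5)
print("mean accuracy: %.3f (+/- %.3f)" % (scores.mean(), scores.std()))
```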
2. HYPERPARAMETER TUNING IN MACHINE LEARNING

Hyperparameter tuning is the process of finding the optimal set of hyperparameters that yield the best performance for a machine learning model. This process is critical because hyperparameters control the learning process and the structure of
the model, such as the learning rate, the number of neurons in a neural network, or the kernel size in a support vector machine, directly impacting its performance. Unlike model parameters, which are learned from the data, hyperparameters are set before training and require careful selection. Hyperparameter tuning can improve the performance and generalization of the model.

The importance of hyperparameter tuning in machine learning cannot be overstated. Proper hyperparameter tuning can significantly enhance model performance. For example, selecting the right learning rate in neural networks can speed up convergence and improve accuracy [13]. Additionally, hyperparameter tuning helps achieve a balance between bias and variance, thereby improving the model's ability to generalize to unseen data. This is crucial for the model's robustness and reliability in real-world applications [14]. Moreover, by identifying optimal hyperparameters, computational resources are used more efficiently, reducing training time and costs. This efficiency is particularly important for large-scale models and datasets [18].
Hyperparameters play a crucial role in the performance of various machine learning models. In neural networks, hyperparameters such as learning rate, batch size, number of layers, and number of units per layer significantly influence the model's performance. Proper tuning of these hyperparameters can lead to faster convergence and higher accuracy [8,19]. For support vector machines, the penalty parameter (C) and the kernel parameters, such as gamma in the RBF kernel, are critical in determining the decision boundary and margin. Tuning these parameters enhances the model's ability to handle non-linearly separable data [20].

In decision trees and random forests, hyperparameters such as the depth of the tree, the minimum samples per leaf, and the number of trees in a random forest influence the model's complexity and performance. Proper tuning of these hyperparameters can prevent overfitting and improve generalization [21]. Similarly, in gradient boosting machines such as XGBoost and LightGBM, hyperparameters like learning rate, number of estimators, and maximum depth of trees are essential for capturing complex patterns. Tuning these parameters can significantly enhance performance in predictive tasks [22].
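The sketch below shows where these hyperparameters appear in practice when the corresponding models are instantiated in scikit-learn. The particular values are arbitrary starting points, not tuned recommendations, and scikit-learn's GradientBoostingClassifier stands in for XGBoost/LightGBM to keep the example dependency-free.

```python
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.svm import SVC

# Support vector machine: C is the penalty term, gamma shapes the RBF kernel.
svm = SVC(kernel="rbf", C=1.0, gamma="scale")

# Random forest: number of trees, tree depth, and leaf size control complexity.
forest = RandomForestClassifier(
    n_estimators=200,       # number of trees
    max_depth=10,           # maximum depth of each tree
    min_samples_leaf=5,     # minimum samples per leaf
    random_state=0,
)

# Gradient boosting: learning rate, number of estimators, and tree depth
# play the same roles discussed for XGBoost and LightGBM.
gbm = GradientBoostingClassifier(
    learning_rate=0.1,
    n_estimators=300,
    max_depth=3,
    random_state=0,
)
```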
Hyperparameters in various machine learning algorithms include the learning rate, batch size, number of layers in neural networks, regularization parameters, and the number and depth of trees. The learning rate in gradient-based optimization algorithms determines the step size during each iteration of the optimization process. A suitable learning rate is crucial for ensuring that the model converges to a good solution without overshooting or slow convergence. In stochastic gradient descent (SGD), the batch size defines the number of samples used to compute the gradient at each step. Smaller batch sizes yield noisier gradient estimates but more frequent updates, whereas larger batches give more accurate gradient estimates at a higher computational cost per step.

The architecture of neural networks, including the number of layers and the number of units per layer, determines the model's capacity to learn complex representations. These hyperparameters must be carefully selected to balance model capacity and computational efficiency. Regularization parameters, such as L1 and L2, control the penalty applied to the model's parameters, helping to prevent overfitting. L1 regularization promotes sparsity, while L2 regularization discourages large parameter values.
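A quick way to see the difference between the two penalties is to compare the coefficients of an L1-regularized (lasso) and an L2-regularized (ridge) linear model fitted to the same data; the synthetic regression problem and the alpha values below are assumptions chosen purely for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic data where only 5 of 20 features are actually informative.
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)   # L1 penalty
ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty

# L1 drives many coefficients exactly to zero (sparsity);
# L2 merely shrinks them towards zero without eliminating them.
print("zero coefficients (lasso):", np.sum(lasso.coef_ == 0))
print("zero coefficients (ridge):", np.sum(ridge.coef_ == 0))
```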
For support vector machines (SVM), kernel parameters such as gamma in the radial basis function (RBF) kernel influence the model's ability to handle non-linearly separable data. Proper tuning of these parameters is essential for achieving good classification performance. In random forests, the number of trees and the maximum depth of each tree determine the model's complexity and its ability to capture interactions between features. Proper tuning of these hyperparameters can improve both accuracy and generalization.

In natural language processing (NLP) applications, the dimensionality of the word embeddings can affect the accuracy of predictions. Proper tuning of hyperparameters such as the context window size and the embedding dimension helps strike a balance between computational efficiency and model performance [23].
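As an example of where these embedding hyperparameters are set, the sketch below trains a small word2vec model with gensim. The toy corpus and the specific `vector_size` and `window` values are illustrative assumptions, and the keyword names correspond to the gensim 4.x API.

```python
from gensim.models import Word2Vec

# Toy corpus: a list of tokenized sentences.
sentences = [
    ["hyperparameter", "tuning", "improves", "model", "performance"],
    ["learning", "rate", "and", "batch", "size", "are", "hyperparameters"],
    ["grid", "search", "explores", "hyperparameter", "combinations"],
]

model = Word2Vec(
    sentences=sentences,
    vector_size=50,   # embedding dimension: larger captures more nuance, costs more
    window=3,         # context window size: how many neighbouring words are considered
    min_count=1,      # keep every token in this tiny corpus
    epochs=20,
)

print(model.wv["hyperparameter"].shape)  # (50,)
```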
2.1 Techniques Used for Hyperparameter Tuning

Several techniques have been developed to automate and optimize the hyperparameter tuning process. Grid search is a brute-force
technique that exhaustively searches over a predefined set of hyperparameters. Although straightforward and easy to implement, it can be computationally expensive, especially for large hyperparameter spaces [24]. Random search offers a more efficient alternative, sampling hyperparameters randomly from a distribution. This method has proven to be more effective in finding optimal hyperparameters as it explores a larger and more diverse set of combinations [24].
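Both approaches are available off the shelf in scikit-learn; the sketch below runs them over an SVM's C and gamma. The search ranges, the number of random samples, and the dataset are assumptions for illustration only.

```python
from scipy.stats import loguniform
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Grid search: every combination in the predefined grid is evaluated.
grid = GridSearchCV(
    SVC(),
    param_grid={"C": [0.1, 1, 10, 100], "gamma": [1e-4, 1e-3, 1e-2, 1e-1]},
    cv=5,
)
grid.fit(X, y)
print("grid search best:", grid.best_params_, grid.best_score_)

# Random search: a fixed budget of configurations drawn from distributions.
rand = RandomizedSearchCV(
    SVC(),
    param_distributions={"C": loguniform(1e-2, 1e2), "gamma": loguniform(1e-5, 1e-1)},
    n_iter=20,
    cv=5,
    random_state=0,
)
rand.fit(X, y)
print("random search best:", rand.best_params_, rand.best_score_)
```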
Bayesian optimization is a probabilistic model-based approach that builds a surrogate model to approximate the objective function. It iteratively selects the most promising hyperparameters to evaluate, balancing exploration and exploitation, making it particularly useful for optimizing expensive functions [18].
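One common way to apply this idea in practice is through a library such as Optuna, whose default sampler performs sequential model-based optimization; the objective below, the search ranges, and the trial budget are assumptions used only to sketch the workflow.

```python
import optuna
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

def objective(trial):
    # Each trial proposes a new configuration based on a surrogate model
    # fitted to the results of previous trials.
    c = trial.suggest_float("C", 1e-2, 1e2, log=True)
    gamma = trial.suggest_float("gamma", 1e-5, 1e-1, log=True)
    model = SVC(C=c, gamma=gamma)
    return cross_val_score(model, X, y, cv=3).mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=30)
print(study.best_params, study.best_value)
```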
Genetic algorithms, inspired by the process of natural selection, use a population-based approach to search for optimal hyperparameters. They apply genetic operators such as mutation, crossover, and selection to evolve the population towards better solutions, effectively exploring complex and large hyperparameter spaces [25].
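The following self-contained sketch shows the three operators on a toy problem: each individual is a (learning-rate, regularization-strength) pair and the fitness function is a stand-in for a real validation score. The population size, mutation scale, and fitness surface are all illustrative assumptions.

```python
import random

random.seed(0)

def fitness(individual):
    """Stand-in for a validation score; peaks near lr=0.1, reg=0.01."""
    lr, reg = individual
    return -((lr - 0.1) ** 2 + (reg - 0.01) ** 2)

def crossover(parent_a, parent_b):
    """Mix one hyperparameter from each parent."""
    return (parent_a[0], parent_b[1])

def mutate(individual, scale=0.05):
    """Perturb each hyperparameter slightly, keeping it positive."""
    return tuple(max(1e-6, gene + random.gauss(0.0, scale)) for gene in individual)

# Initial random population of hyperparameter configurations.
population = [(random.uniform(0.0, 1.0), random.uniform(0.0, 0.5)) for _ in range(20)]

for generation in range(30):
    # Selection: keep the fittest half of the population.
    population.sort(key=fitness, reverse=True)
    survivors = population[:10]
    # Crossover + mutation: refill the population from the survivors.
    children = [mutate(crossover(random.choice(survivors), random.choice(survivors)))
                for _ in range(10)]
    population = survivors + children

print("best configuration found:", max(population, key=fitness))
```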
Early stopping is a regularization technique that monitors the model's performance on a validation set and halts training when performance starts to degrade, preventing overfitting and saving computational resources [26]. Hyperband is an adaptive resource allocation and early-stopping strategy for hyperparameter optimization. It evaluates a large number of hyperparameter configurations and allocates more resources to promising ones, effectively balancing exploration and exploitation [27,19].
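Early stopping is usually implemented as a patience counter around the training loop, as in the generic sketch below; `train_one_epoch` and `validation_loss` are hypothetical placeholders for whatever training and evaluation routines a project actually uses.

```python
def train_with_early_stopping(model, train_one_epoch, validation_loss,
                              max_epochs=100, patience=5):
    """Stop training once the validation loss has not improved for `patience` epochs."""
    best_loss = float("inf")
    epochs_without_improvement = 0

    for epoch in range(max_epochs):
        train_one_epoch(model)                  # one pass over the training data
        current_loss = validation_loss(model)   # performance on the held-out set

        if current_loss < best_loss:
            best_loss = current_loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                print(f"stopping early at epoch {epoch}")
                break

    return model
```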
Meta-learning, or learning to learn, leverages past experiences to accelerate the hyperparameter tuning process. It uses knowledge from previously optimized models to inform the search for optimal hyperparameters in new tasks [28]. Multi-fidelity optimization techniques employ approximations of the objective function at different levels of fidelity to speed up the hyperparameter tuning process. By evaluating cheaper approximations first and refining promising configurations with more expensive evaluations, these methods significantly reduce computational costs [29].

Automated Machine Learning (AutoML) aims to automate the entire machine learning pipeline, including hyperparameter tuning. AutoML combines various optimization techniques, such as Bayesian optimization, meta-learning, and neural architecture search, to create robust and efficient models with minimal human intervention [30].

2.2 Learning Rate as a Hyperparameter in Deep Learning

The learning rate is one of the most critical hyperparameters in deep learning, governing how much to change the model in response to the estimated error each time the model weights are updated. It directly influences the convergence rate and final performance of neural networks. Selecting an appropriate learning rate is crucial for training neural networks efficiently. There are several strategies to optimize the learning rate.
Using learning rate schedules can help adjust the learning rate during training. Common schedules include step decay, where the learning rate is reduced by a factor after a fixed number of epochs; exponential decay, where the learning rate decreases exponentially; and cosine annealing, which uses a cosine function to decrease the learning rate. Adaptive learning rates are another effective strategy. Algorithms such as the Adaptive Gradient Algorithm (AdaGrad), Root Mean Square Propagation (RMSProp), and Adaptive Moment Estimation (Adam) adjust the learning rate based on the gradients. These adaptive methods help improve convergence by scaling the learning rate according to the historical gradient information.
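The three schedules can be written as simple functions of the epoch index, as in the sketch below; the base learning rate, decay factors, and training horizon are illustrative assumptions.

```python
import math

BASE_LR = 0.1
TOTAL_EPOCHS = 100

def step_decay(epoch, drop=0.5, epochs_per_drop=20):
    """Reduce the learning rate by a fixed factor every `epochs_per_drop` epochs."""
    return BASE_LR * (drop ** (epoch // epochs_per_drop))

def exponential_decay(epoch, decay_rate=0.05):
    """Smooth exponential decrease of the learning rate."""
    return BASE_LR * math.exp(-decay_rate * epoch)

def cosine_annealing(epoch):
    """Decrease the learning rate along a half cosine wave down towards zero."""
    return 0.5 * BASE_LR * (1 + math.cos(math.pi * epoch / TOTAL_EPOCHS))

for epoch in (0, 25, 50, 99):
    print(epoch, step_decay(epoch), exponential_decay(epoch), cosine_annealing(epoch))
```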
Cyclical learning rates involve periodically varying the learning rate between a lower and upper bound. This approach can help escape local minima and saddle points, potentially leading to better solutions. Another useful technique is learning rate warm-up, where the learning rate is gradually increased at the beginning of training. This method can stabilize training and prevent divergence, which is especially useful when training large models or using large batch sizes [31].
promising regions. Techniques like Bayesian
392
Ilemobayo et al.; J. Eng. Res. Rep., vol. 26, no. 6, pp. 388-395, 2024; Article no.JERR.118312
393
Ilemobayo et al.; J. Eng. Res. Rep., vol. 26, no. 6, pp. 388-395, 2024; Article no.JERR.118312
394
Ilemobayo et al.; J. Eng. Res. Rep., vol. 26, no. 6, pp. 388-395, 2024; Article no.JERR.118312
25. Bergstra J, Bengio Y. Random search for hyper-parameter optimization. Journal of Machine Learning Research. 2012;13(2).
26. Santos EC, Monteiro GL, Moura-Pires F. A genetic algorithm for neural network hyperparameter optimization. In Proceedings of the International Conference on Artificial Neural Networks. 2010;217-225.
27. Prechelt L. Early stopping - but when? In Neural Networks: Tricks of the Trade. Springer, Berlin, Heidelberg; 1998;55-69.
28. Li L, Jamieson KG, DeSalvo G, Rostamizadeh A, Talwalkar A. Hyperband: A novel bandit-based approach to hyperparameter optimization. Journal of Machine Learning Research. 2017;18(1):6765-6816.
29. Andrychowicz M, Denil M, Gomez S, Hoffman MW, Pfau D, Schaul T, de Freitas N. Learning to learn by gradient descent by gradient descent. In Advances in Neural Information Processing Systems. 2016;3981-3989.
30. Kandasamy K, Dasarathy G, Oliva JB, Schneider J, Póczos B. Multi-fidelity Bayesian optimisation with continuous approximations. In Proceedings of the 34th International Conference on Machine Learning. 2016;48:1799-1808.
31. Feurer M, Klein A, Eggensperger K, Springenberg JT, Blum M, Hutter F. Efficient and robust automated machine learning. In Advances in Neural Information Processing Systems. 2015;2962-2970.
32. Goyal P, et al. Accurate, large minibatch SGD: Training ImageNet in 1 hour; 2018.
33. Molnar C, Casalicchio G, Bischl B. Interpretable machine learning - A brief history, state-of-the-art and challenges. ECML PKDD 2020 Workshops. 2020;417-431. Available: https://doi.org/10.1007/978-3-030-65965-3_28
34. Elsken T, Metzen JH, Hutter F. Neural architecture search: A survey. Journal of Machine Learning Research. 2019;20(55):1-21.
35. Henderson P, Islam R, Bachman P, Pineau J, Precup D, Meger D. Deep reinforcement learning that matters. In Proceedings of the AAAI Conference on Artificial Intelligence. 2018;32(1).
36. Friedman JH. Greedy function approximation: A gradient boosting machine. The Annals of Statistics. 2001;29(5). DOI: 10.1214/aos/1013203451
© Copyright (2024): Author(s). The licensee is the journal publisher. This is an Open Access article distributed under the terms
of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use,
distribution, and reproduction in any medium, provided the original work is properly cited.
Peer-review history:
The peer review history for this paper can be accessed here:
https://www.sdiarticle5.com/review-history/118312