Ch-4 Ensemble Learning
Advantages of AdaBoost
1. Improved Accuracy:
o AdaBoost can significantly enhance the performance of weak
learners, leading to high accuracy in classification tasks.
2. Less Prone to Overfitting:
o While boosting can overfit if the base learners are too complex, AdaBoost with simple weak learners (such as decision stumps) is often quite resistant to overfitting in practice.
3. Flexibility:
o It can be used with various weak learners, including decision
trees, support vector machines, and others.
4. Feature Importance:
o Like Random Forest, AdaBoost can provide insights into
feature importance, helping to identify influential variables.
Disadvantages of AdaBoost
1. Sensitivity to Noisy Data:
o AdaBoost can be sensitive to outliers and noisy data since it
assigns higher weights to misclassified instances.
2. Performance with Complex Models:
o If the weak learners are too complex, AdaBoost may overfit the
training data.
3. Computational Cost:
o Sequential training can be computationally intensive, especially
with large datasets.
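A minimal sketch of AdaBoost in scikit-learn on a synthetic dataset is shown below; the data and hyperparameter values are illustrative assumptions, and the default weak learner is a depth-1 decision tree (a decision stump).

# Illustrative AdaBoost sketch (synthetic data; parameter values are assumed)
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# The default weak learner is a depth-1 decision tree (a "decision stump")
ada = AdaBoostClassifier(n_estimators=100, learning_rate=0.5, random_state=42)
ada.fit(X_train, y_train)

print("Test accuracy:", ada.score(X_test, y_test))
print("Feature importances:", ada.feature_importances_)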
Gradient Boosting
Gradient Boosting is a popular boosting algorithm in machine learning used for classification and regression tasks. Boosting is an ensemble learning method that trains models sequentially, with each new model trying to correct the errors of the previous ones. It combines several weak learners into a strong learner.
Gradient Boosting is a powerful boosting algorithm that combines several weak learners into a strong learner: each new model is trained to minimize the loss function of the current ensemble (such as mean squared error for regression or cross-entropy for classification) using gradient descent.
In each iteration, the algorithm computes the gradient of the loss function with respect to the predictions of the current ensemble and then trains a new weak model to predict the negative of this gradient. The predictions of the new model, scaled by a learning rate, are added to the ensemble, and the process is repeated until a stopping criterion is met.
In contrast to AdaBoost, the weights of the training instances are not adjusted; instead, each predictor is trained using the residual errors of its predecessor as labels. A widely used variant is Gradient Boosted Trees, whose base learner is CART (Classification and Regression Trees). For regression problems, each new tree is fitted to the residuals of the current ensemble and its scaled predictions are added to the ensemble's output, as sketched below.
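A minimal from-scratch sketch of this residual-fitting loop for regression with squared loss, where the residuals are exactly the negative gradient; the synthetic data, tree depth, learning rate, and number of rounds are illustrative assumptions.

# Illustrative gradient-boosting loop for regression (squared loss)
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)

n_rounds, learning_rate = 100, 0.1
prediction = np.full_like(y, y.mean())   # start from the mean prediction
trees = []

for _ in range(n_rounds):
    residuals = y - prediction           # negative gradient of squared loss
    tree = DecisionTreeRegressor(max_depth=2)
    tree.fit(X, residuals)               # each weak learner fits the residuals
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)

def ensemble_predict(X_new):
    out = np.full(len(X_new), y.mean())
    for tree in trees:
        out += learning_rate * tree.predict(X_new)
    return out

print("Training MSE:", np.mean((y - prediction) ** 2))
print("Prediction at x = 0:", ensemble_predict(np.array([[0.0]])))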
Advantages
High Performance: Gradient boosting can achieve state-of-the-art
results on many problems.
Flexibility: It can handle various types of data and supports different loss functions for different tasks (e.g., regression, binary classification).
Feature Importance: It provides insights into the importance of
each feature in the prediction process.
Popular Implementations
XGBoost: An optimized and efficient implementation of gradient
boosting that is widely used in competitions.
LightGBM: A fast, distributed version of gradient boosting that uses a
histogram-based approach.
CatBoost: Designed to handle categorical features effectively,
reducing the need for extensive preprocessing.
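As a brief illustration of how these libraries are typically called, the sketch below uses XGBoost's scikit-learn-compatible interface; it assumes the xgboost package is installed, and the dataset and parameter values are arbitrary.

# Illustrative XGBoost usage (assumes the xgboost package is installed)
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = XGBClassifier(n_estimators=200, learning_rate=0.1, max_depth=3)
model.fit(X_train, y_train)

print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))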
Considerations
Overfitting: Gradient boosting can overfit on small datasets, so
techniques like regularization and early stopping are important.
Parameter Tuning: It often requires careful tuning of
hyperparameters (e.g., learning rate, number of trees, depth of trees)
to achieve the best performance.
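The sketch below shows one way to address both considerations with scikit-learn's GradientBoostingRegressor, combining a small learning rate, subsampling, and validation-based early stopping; all parameter values are illustrative assumptions.

# Illustrative regularization and early stopping for gradient boosting
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

gbr = GradientBoostingRegressor(
    n_estimators=1000,        # upper bound; early stopping decides when to stop
    learning_rate=0.05,       # small learning rate slows down overfitting
    max_depth=3,
    subsample=0.8,            # stochastic gradient boosting adds regularization
    validation_fraction=0.1,  # hold out part of the data for early stopping
    n_iter_no_change=10,      # stop if the validation score stops improving
    random_state=0,
)
gbr.fit(X, y)

print("Trees actually used:", gbr.n_estimators_)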
4.4 Voting
Voting is a simple and effective ensemble learning technique used to improve
the accuracy and robustness of predictions by combining the outputs of
multiple models (also known as "base learners" or "weak learners").
Types of Voting
1. Hard Voting:
o In hard voting, each model in the ensemble predicts a class label,
and the final prediction is made based on the majority vote.
o For example, if three models predict classes A, A, and B, the
ensemble's prediction will be A.
2. Soft Voting:
o In soft voting, each model provides a probability distribution
over the class labels. The final prediction is made by averaging
these probabilities and selecting the class with the highest
average probability.
o This method can be more effective, especially when the
classifiers have different confidence levels in their predictions.
Steps in Voting
1. Training Multiple Models: Train different classifiers (e.g., decision
trees, logistic regression, support vector machines) on the same
dataset.
2. Making Predictions: Each model makes its predictions (either class
labels in hard voting or probabilities in soft voting).
3. Combining Predictions:
o Hard Voting: Tally the votes and select the class with the most
votes.
o Soft Voting: Average the predicted probabilities and choose the
class with the highest average.
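The steps above can be sketched with scikit-learn's VotingClassifier, which supports both hard and soft voting; the dataset and base models below are illustrative choices.

# Illustrative hard vs. soft voting with three diverse base models
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

base_models = [
    ("lr", LogisticRegression(max_iter=1000)),
    ("dt", DecisionTreeClassifier(max_depth=5)),
    ("svc", SVC(probability=True)),  # probability estimates needed for soft voting
]

hard = VotingClassifier(estimators=base_models, voting="hard").fit(X_train, y_train)
soft = VotingClassifier(estimators=base_models, voting="soft").fit(X_train, y_train)

print("Hard voting accuracy:", hard.score(X_test, y_test))
print("Soft voting accuracy:", soft.score(X_test, y_test))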
Advantages
Improved Accuracy: Combining predictions can lead to better
performance than individual models, especially if the models are
diverse.
Robustness: Voting helps mitigate the risk of overfitting by balancing
the predictions of multiple models.
Flexibility: Can be applied to various types of models, making it
adaptable to different problems.
Considerations
Model Diversity: The effectiveness of voting often depends on having
diverse models. If all models make similar errors, the ensemble may
not perform better than individual models.
Computational Cost: Training multiple models can be
computationally expensive, particularly with complex algorithms.
Use Cases
Voting is commonly used in various applications, including:
Image classification
Sentiment analysis
Medical diagnosis
4.5 Stacking
Stacking, or stacked generalization, is an ensemble learning technique that
combines multiple models to improve predictive performance. Unlike simple
methods like bagging or boosting, stacking involves training a new model
(often called a meta-learner) to learn how to best combine the predictions of
several base models.
How Stacking Works
1. Train Base Models:
o Multiple base models (e.g., decision trees, logistic regression,
support vector machines) are trained on the same training
dataset.
2. Generate Predictions:
o For each base model, predictions are made on either the training
set or a separate validation set. This step is often performed
using k-fold cross-validation to avoid overfitting.
3. Create a New Dataset:
o The predictions from each base model form a new dataset. Each
column represents the predictions of a different base model, and
each row corresponds to a data point.
4. Train Meta-Learner:
o A new model (the meta-learner) is trained on this new dataset of
predictions. The goal of the meta-learner is to learn the best way
to combine the outputs of the base models.
5. Final Prediction:
o For new, unseen data, the base models generate predictions,
which are then input into the trained meta-learner to produce
the final prediction.
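A minimal sketch of this workflow using scikit-learn's StackingClassifier, which generates the base-model predictions with internal cross-validation before training the meta-learner; the dataset and model choices are illustrative.

# Illustrative stacking: two base models combined by a logistic-regression meta-learner
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=2)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2)

stack = StackingClassifier(
    estimators=[
        ("dt", DecisionTreeClassifier(max_depth=5)),
        ("svc", SVC()),
    ],
    final_estimator=LogisticRegression(),  # the meta-learner
    cv=5,  # k-fold predictions form the meta-learner's training set
)
stack.fit(X_train, y_train)

print("Stacking accuracy:", stack.score(X_test, y_test))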
Advantages
Improved Performance: Stacking can often achieve better
performance than individual models by leveraging their strengths.
Flexibility: You can combine different types of models, allowing for
greater diversity in predictions.
Reduction of Overfitting: By using a meta-learner, stacking can
reduce the risk of overfitting that may occur with individual models.
Considerations
Complexity: Stacking can be more complex to implement than
simpler ensemble methods.
Computational Cost: Training multiple models and a meta-learner
can be computationally intensive, especially with large datasets.
A Gaussian Mixture Model (GMM) represents the data density as a weighted sum of K Gaussian components:
p(x) = Σ (k = 1 to K) πk N(x∣μk, Σk)
Where:
K is the number of Gaussian components.
πk are the mixing coefficients (weights) for each component.
N(x∣μk, Σk) is the Gaussian distribution with mean μk and covariance Σk.
3. Expectation-Maximization (EM) Algorithm: GMMs are typically fitted using the EM algorithm, which iteratively optimizes the parameters (means, covariances, and weights) to best fit the data.
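A minimal sketch of fitting a GMM via the EM algorithm with scikit-learn's GaussianMixture; the two synthetic clusters and the parameter values are illustrative assumptions.

# Illustrative GMM fit: GaussianMixture runs the EM algorithm internally
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=[0, 0], scale=1.0, size=(200, 2)),   # first synthetic cluster
    rng.normal(loc=[5, 5], scale=1.5, size=(200, 2)),   # second synthetic cluster
])

gmm = GaussianMixture(n_components=2, random_state=0)
gmm.fit(X)

print("Mixing weights (pi_k):", gmm.weights_)
print("Component means (mu_k):", gmm.means_)
print("Cluster assignments for the first 5 points:", gmm.predict(X[:5]))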