
Chapter 4 Ensemble Learning

In machine learning, an ensemble refers to a collection of multiple models
that work together to make a prediction or classification. The idea is to
combine the strengths of individual models to improve overall performance
and reduce errors. By aggregating the outputs of multiple models, an
ensemble often achieves better predictive performance than any single
model within the ensemble.
There are several techniques for creating ensembles, and they can be
broadly categorized into three types: bagging, boosting, and stacking.
4.1 Model Combination Schemes
Bagging: Multiple models (often of the same type) are trained on different
subsets of the training data, which are created by bootstrapping (sampling
with replacement).
Bagging is short for Bootstrap Aggregating and is used to improve the
accuracy of a model. It is applied to both classification and regression
problems. The technique improves accuracy by reducing variance, which
helps prevent overfitting. Bagging can be applied with any type of machine
learning method, but it is most commonly implemented with decision trees.
Aggregation is the step in bagging that combines the predictions of all the
individual models into a single final output.
Example: Random Forest, where many decision trees are built on different
samples of the data and their predictions are averaged (for regression) or
voted on (for classification).
Boosting: Boosting is an ensemble technique that converts weak machine
learning models into a strong model. Its main goal is to reduce the bias and
variance of the model in order to improve accuracy. Each new predictor
learns from the mistakes of the previous predictors, so the ensemble makes
progressively better predictions.
Example: AdaBoost, Gradient Boosting Machines (GBM), and XGBoost.
Each model contributes to the final prediction, often weighted by its accuracy.
Voting: Voting is the simplest ensemble method in machine learning. It is
mainly used for classification. In this technique, the first step is to train
multiple classification models on a training dataset; their predictions are
then combined by voting. When voting is applied to regression problems,
the prediction is the average of the predictions of multiple regression models.
Types:
 Hard Voting: Each model votes for a class, and the class with the
most votes is selected.
 Soft Voting: Probabilities predicted by each model are averaged, and
the class with the highest probability is chosen.
Stacking: This method combines multiple regression or classification models
with a meta-regressor or meta-classifier. Stacking is different from bagging
and boosting: those methods mainly combine homogeneous weak learners
(many models of the same type), whereas stacking mainly combines
heterogeneous learners, i.e., different algorithms altogether.
Example: Using logistic regression as a meta-learner to combine the
outputs of several different classifiers.
4.2 Bagging: Random Forest Trees

Bagging, short for Bootstrap Aggregating, is an ensemble learning
technique that improves the stability and accuracy of machine learning
algorithms. It works particularly well with models that have high variance,
such as decision trees.
Key Concepts of Bagging
1. Bootstrap Sampling:
o Random subsets of the training data are created by sampling
with replacement. This means some instances may be included
multiple times, while others may not be included at all.
2. Independent Model Training:
o Each subset of data is used to train a separate model (often of
the same type). These models operate independently of one
another.
3. Aggregation:
o The predictions from all the individual models are combined to
produce a final prediction. For classification tasks, this is
usually done via majority voting, while for regression tasks, the
predictions are averaged.
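As a minimal sketch of how these three steps come together in code, assuming
scikit-learn is available (the synthetic dataset, variable names, and parameter
values below are illustrative, not prescribed):

```python
# Minimal bagging sketch: many decision trees trained on bootstrap samples,
# with their predictions aggregated by majority vote.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(),  # `base_estimator` in older scikit-learn
    n_estimators=100,                    # 100 independently trained trees
    bootstrap=True,                      # sample with replacement
    random_state=42,
)
bagging.fit(X_train, y_train)
print("Test accuracy:", bagging.score(X_test, y_test))
```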

Random Forest

Random Forest is an extension of the bagging technique specifically
designed for decision trees. It enhances the diversity among the trees in the
ensemble, leading to better performance. Here’s how it works:

1. Creating Decision Trees:
o Like in standard bagging, a number of decision trees are
trained using bootstrap samples of the original dataset.
2. Feature Randomness:
o In addition to using different subsets of data, Random Forest
introduces randomness in the feature selection process. When
creating a split in each tree, only a random subset of features is
considered. This further decorrelates the trees, making the
model more robust.
3. Final Prediction:
o For classification, each tree in the forest votes for a class label,
and the class with the majority vote is chosen as the final
output. For regression, the average of all tree predictions is
taken.
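A minimal Random Forest sketch, again assuming scikit-learn; the dataset and
settings are illustrative:

```python
# Minimal Random Forest sketch: bagged trees plus random feature selection
# at each split, with feature importances as a by-product.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(
    n_estimators=200,     # number of trees in the forest
    max_features="sqrt",  # random subset of features considered at each split
    random_state=0,
)
forest.fit(X_train, y_train)
print("Test accuracy:", forest.score(X_test, y_test))
print("Feature importances:", forest.feature_importances_)
```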

Advantages of Random Forest


1. Robustness: The combination of multiple trees helps to reduce
overfitting and variance, leading to a more generalized model.
2. Handles Missing Values: Random Forest can maintain accuracy
even when a significant proportion of data is missing.
3. Feature Importance: It provides a measure of feature importance,
which helps in understanding the influence of different features on
the predictions.
4. Scalability: Random Forest can handle large datasets and is capable
of running efficiently on multicore processors.
Disadvantages of Random Forest
1. Complexity: The model can become complex, making it harder to
interpret than a single decision tree.
2. Slower Predictions: Since it involves multiple trees, predictions can
be slower compared to simpler models, especially for large datasets.
3. Memory Usage: It can require more memory to store multiple trees
compared to a single model.
4.3 Boosting: AdaBoost, Gradient Boosting
Boosting is an ensemble learning technique designed to improve the
performance of weak learners by combining them into a single strong
learner. It works by sequentially training models, each focusing on the
mistakes made by previous models. One of the most well-known boosting
algorithms is AdaBoost (Adaptive Boosting).
Key Concepts of Boosting
1. Weak Learners:
o In boosting, weak learners are models that perform slightly
better than random guessing. These are often simple models,
such as decision stumps (one-level decision trees).
2. Sequential Learning:
o Boosting involves training models in sequence. Each new
model is trained to correct the errors of the combined
ensemble of previous models.
3. Weight Adjustment:
o After each iteration, the weights of the training instances are
adjusted. Misclassified instances receive higher weights,
making them more influential in the training of subsequent
models.
4. Final Prediction:
o The final model is a weighted sum of the predictions from all
the weak learners, where models that perform better have
more influence on the final output.
How AdaBoost Works
1. Initialization:
o Assign equal weights to all training instances.
2. Training Iterations:
o For a specified number of iterations:
 Train a weak learner on the weighted dataset.
 Evaluate the model’s performance and calculate the error
rate.
 Compute the model’s weight based on its accuracy.
 Update the weights of the training instances:
 Increase the weights of misclassified instances.
 Decrease the weights of correctly classified
instances.
3. Final Model:
o Combine the weak learners into a single strong learner using
their weights. The final prediction is made by aggregating the
weighted predictions of all the models.
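A minimal AdaBoost sketch, assuming scikit-learn is available; decision stumps
are used as the weak learners, and the dataset and parameter values are
illustrative:

```python
# Minimal AdaBoost sketch: sequentially trained decision stumps, with
# instance weights updated after each round.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

ada = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),  # weak learner: decision stump
    n_estimators=100,
    learning_rate=0.5,
    random_state=1,
)
ada.fit(X_train, y_train)
print("Test accuracy:", ada.score(X_test, y_test))
```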

Advantages of AdaBoost
1. Improved Accuracy:
o AdaBoost can significantly enhance the performance of weak
learners, leading to high accuracy in classification tasks.
2. Less Prone to Overfitting:
o While boosting can overfit if the base learners are too complex,
AdaBoost with simple base learners (such as decision stumps) is
often fairly resistant to overfitting in practice.
3. Flexibility:
o It can be used with various weak learners, including decision
trees, support vector machines, and others.
4. Feature Importance:
o Like Random Forest, AdaBoost can provide insights into
feature importance, helping to identify influential variables.
Disadvantages of AdaBoost
1. Sensitivity to Noisy Data:
o AdaBoost can be sensitive to outliers and noisy data since it
assigns higher weights to misclassified instances.
2. Performance with Complex Models:
o If the weak learners are too complex, AdaBoost may overfit the
training data.
3. Computational Cost:
o Sequential training can be computationally intensive, especially
with large datasets.
Gradient Boosting
Gradient Boosting is a popular boosting algorithm used for both classification
and regression tasks. Like other boosting methods, it trains models
sequentially, with each new model trying to correct the errors of the ensemble
built so far, thereby combining several weak learners into a strong learner.
In Gradient Boosting, each new model is trained to minimize a loss function,
such as mean squared error for regression or cross-entropy for classification,
evaluated on the current ensemble's predictions, using gradient descent.
In each iteration, the algorithm computes the gradient of the loss function
with respect to the predictions of the current ensemble and then fits a new
weak model to the negative of this gradient (the pseudo-residuals). The
predictions of the new model are then added to the ensemble, and the process
is repeated until a stopping criterion is met.
In contrast to AdaBoost, the weights of the training instances are not
adjusted; instead, each predictor is trained using the residual errors of its
predecessor as labels. A common variant is Gradient Boosted Trees, whose
base learner is CART (Classification and Regression Trees). The following
describes how gradient-boosted trees are trained for regression problems.

The ensemble consists of M trees. Tree1 is trained using the feature
matrix X and the labels y. The predictions labeled y1(hat) are used to
determine the training set residual errors r1.
Tree2 is then trained using the feature matrix X and the residual errors r1 of
Tree1 as labels. The predicted results r1(hat) are then used to determine the
residual r2.
The process is repeated until all the M trees forming the ensemble are
trained.
There is an important parameter used in this technique known
as Shrinkage. Shrinkage means that the contribution of each tree in the
ensemble is scaled by the learning rate η (eta), which ranges between 0 and 1,
before being added to the ensemble. The final prediction is given by the formula:
y(pred) = y1 + (η * r1) + (η * r2) + ... + (η * rN)
Steps in Gradient Boosting
1. Initialization: Start with an initial prediction, often the mean of the
target values.
2. Iterative Training:
o Calculate the residuals (errors) from the current model’s
predictions.
o Fit a new weak learner to these residuals.
o Update the predictions by adding a scaled version of the new
learner’s predictions.
3. Learning Rate: The contribution of each new weak learner is scaled
by a learning rate (often between 0 and 1) to control how much
influence each learner has on the final prediction.
4. Stopping Criteria: The process continues until a predetermined
number of models are added, or the improvement in the loss function
falls below a certain threshold.
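To make these steps concrete, here is a minimal hand-rolled sketch of gradient
boosting for regression with squared loss, assuming NumPy and scikit-learn are
available; the dataset, tree depth, and learning rate are illustrative choices.
In practice one would typically use a library implementation such as
scikit-learn's GradientBoostingRegressor, XGBoost, LightGBM, or CatBoost.

```python
# Minimal gradient-boosting sketch for regression with squared loss:
# start from the mean, then repeatedly fit trees to the residuals and add
# their predictions scaled by a learning rate (shrinkage).
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)

eta = 0.1                                 # learning rate (shrinkage)
n_trees = 100
prediction = np.full(len(y), y.mean())    # step 1: initial prediction
trees = []

for _ in range(n_trees):
    residuals = y - prediction            # residual errors of current ensemble
    tree = DecisionTreeRegressor(max_depth=3, random_state=0)
    tree.fit(X, residuals)                # fit a weak learner to the residuals
    prediction += eta * tree.predict(X)   # shrunken update of the predictions
    trees.append(tree)

def predict(X_new):
    # Final prediction: initial value plus the scaled contribution of each tree.
    out = np.full(X_new.shape[0], y.mean())
    for tree in trees:
        out += eta * tree.predict(X_new)
    return out

print("Training MSE:", np.mean((y - predict(X)) ** 2))
```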

Advantages
 High Performance: Gradient boosting can achieve state-of-the-art
results on many problems.
 Flexibility: It can handle various types of data and loss functions
(e.g., regression, binary classification).
 Feature Importance: It provides insights into the importance of
each feature in the prediction process.
Popular Implementations
 XGBoost: An optimized and efficient implementation of gradient
boosting that is widely used in competitions.
 LightGBM: A fast, distributed version of gradient boosting that uses a
histogram-based approach.
 CatBoost: Designed to handle categorical features effectively,
reducing the need for extensive preprocessing.
Considerations
 Overfitting: Gradient boosting can overfit on small datasets, so
techniques like regularization and early stopping are important.
 Parameter Tuning: It often requires careful tuning of
hyperparameters (e.g., learning rate, number of trees, depth of trees)
to achieve the best performance.

4.4 Voting
Voting is a simple and effective ensemble learning technique used to improve
the accuracy and robustness of predictions by combining the outputs of
multiple models (also known as "base learners" or "weak learners").
Types of Voting
1. Hard Voting:
o In hard voting, each model in the ensemble predicts a class label,
and the final prediction is made based on the majority vote.
o For example, if three models predict classes A, A, and B, the
ensemble's prediction will be A.
2. Soft Voting:
o In soft voting, each model provides a probability distribution
over the class labels. The final prediction is made by averaging
these probabilities and selecting the class with the highest
average probability.
o This method can be more effective, especially when the
classifiers have different confidence levels in their predictions.
Steps in Voting
1. Training Multiple Models: Train different classifiers (e.g., decision
trees, logistic regression, support vector machines) on the same
dataset.
2. Making Predictions: Each model makes its predictions (either class
labels in hard voting or probabilities in soft voting).
3. Combining Predictions:
o Hard Voting: Tally the votes and select the class with the most
votes.
o Soft Voting: Average the predicted probabilities and choose the
class with the highest average.
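A minimal voting sketch, assuming scikit-learn is available; the base models,
dataset, and settings are illustrative:

```python
# Minimal voting sketch: three different classifiers combined by hard voting
# (majority of class labels) and by soft voting (averaged probabilities).
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=2)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2)

estimators = [
    ("lr", LogisticRegression(max_iter=1000)),
    ("tree", DecisionTreeClassifier(max_depth=5)),
    ("svm", SVC(probability=True)),  # probability=True is required for soft voting
]

hard = VotingClassifier(estimators=estimators, voting="hard").fit(X_train, y_train)
soft = VotingClassifier(estimators=estimators, voting="soft").fit(X_train, y_train)
print("Hard voting accuracy:", hard.score(X_test, y_test))
print("Soft voting accuracy:", soft.score(X_test, y_test))
```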
Advantages
 Improved Accuracy: Combining predictions can lead to better
performance than individual models, especially if the models are
diverse.
 Robustness: Voting helps mitigate the risk of overfitting by balancing
the predictions of multiple models.
 Flexibility: Can be applied to various types of models, making it
adaptable to different problems.
Considerations
 Model Diversity: The effectiveness of voting often depends on having
diverse models. If all models make similar errors, the ensemble may
not perform better than individual models.
 Computational Cost: Training multiple models can be
computationally expensive, particularly with complex algorithms.
Use Cases
Voting is commonly used in various applications, including:
 Image classification
 Sentiment analysis
 Medical diagnosis
4.5 Stacking
Stacking, or stacked generalization, is an ensemble learning technique that
combines multiple models to improve predictive performance. Unlike simple
methods like bagging or boosting, stacking involves training a new model
(often called a meta-learner) to learn how to best combine the predictions of
several base models.
How Stacking Works
1. Train Base Models:
o Multiple base models (e.g., decision trees, logistic regression,
support vector machines) are trained on the same training
dataset.
2. Generate Predictions:
o For each base model, predictions are made on either the training
set or a separate validation set. This step is often performed
using k-fold cross-validation to avoid overfitting.
3. Create a New Dataset:
o The predictions from each base model form a new dataset. Each
column represents the predictions of a different base model, and
each row corresponds to a data point.
4. Train Meta-Learner:
o A new model (the meta-learner) is trained on this new dataset of
predictions. The goal of the meta-learner is to learn the best way
to combine the outputs of the base models.
5. Final Prediction:
o For new, unseen data, the base models generate predictions,
which are then input into the trained meta-learner to produce
the final prediction.
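A minimal stacking sketch, assuming scikit-learn is available, with logistic
regression as the meta-learner; the base models, dataset, and cv setting are
illustrative:

```python
# Minimal stacking sketch: two heterogeneous base models whose cross-validated
# predictions are combined by a logistic-regression meta-learner.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=20, random_state=3)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=3)

base_models = [
    ("rf", RandomForestClassifier(n_estimators=100, random_state=3)),
    ("svm", SVC(probability=True, random_state=3)),
]
stack = StackingClassifier(
    estimators=base_models,
    final_estimator=LogisticRegression(max_iter=1000),  # the meta-learner
    cv=5,  # meta-learner is trained on 5-fold cross-validated base predictions
)
stack.fit(X_train, y_train)
print("Test accuracy:", stack.score(X_test, y_test))
```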
Advantages
 Improved Performance: Stacking can often achieve better
performance than individual models by leveraging their strengths.
 Flexibility: You can combine different types of models, allowing for
greater diversity in predictions.
 Reduction of Overfitting: By using a meta-learner, stacking can
reduce the risk of overfitting that may occur with individual models.
Considerations
 Complexity: Stacking can be more complex to implement than
simpler ensemble methods.
 Computational Cost: Training multiple models and a meta-learner
can be computationally intensive, especially with large datasets.

4.6 Error-Correcting Output Codes


Error-Correcting Output Codes (ECOC) is a technique used in ensemble
learning to improve the performance of multi-class classification problems. It
involves transforming a multi-class problem into a series of binary
classification problems, making it easier for individual classifiers to manage.
How ECOC Works
1. Codebook Creation:
o A binary codeword is assigned to each class in the multi-class
problem. For example, in a three-class problem (A, B, C), you
might create a codebook with one bit per class:
 Class A: 100
 Class B: 010
 Class C: 001
2. Binary Classifiers:
o For each bit position in the codewords, a separate binary
classifier is trained. Each classifier learns to distinguish the
classes whose codewords have a 1 in that position from those
that have a 0. With the codebook above, the three classifiers are
trained to predict:
 Class A vs. (B + C)
 Class B vs. (A + C)
 Class C vs. (A + B)
3. Prediction:
o When making predictions, each classifier outputs a binary value
(0 or 1) corresponding to its assigned bit.
o The final prediction is determined by interpreting the combined
output of all classifiers. The output is mapped back to the class
with the closest corresponding code.
4. Error Correction:
o If a classifier makes a mistake, the error-correcting property
comes into play. Because the binary codes are designed to be
distinct, the system can often still determine the correct class
even if some classifiers misclassify their input.
Advantages of ECOC
 Robustness: The error-correcting nature allows the ensemble to be
more resilient to individual classifier errors, improving overall
accuracy.
 Flexibility: It can combine different types of classifiers and works well
with both linear and non-linear models.
 Scalability: ECOC can be extended to handle a large number of classes,
making it suitable for complex classification tasks.
Considerations
 Code Design: The effectiveness of ECOC depends on the design of the
codebook. Efficient code design can enhance performance.
 Computational Complexity: Training multiple binary classifiers can
increase computational costs, especially with large datasets.
Example: Suppose you have a dataset with three classes: Cat, Dog, and
Rabbit. You might define the following binary codewords:
 Cat: 100
 Dog: 010
 Rabbit: 001
You would then train one binary classifier per bit position:
1. Classifier 1: Distinguishes between "Cat" (100) and "Not Cat" (010, 001).
2. Classifier 2: Distinguishes between "Dog" (010) and "Not Dog" (100, 001).
3. Classifier 3: Distinguishes between "Rabbit" (001) and "Not Rabbit" (100,
010).
During prediction, if you receive the outputs (0, 0, 1) from the classifiers, you
can decode this back to the class "Rabbit", whose codeword 001 is the closest
match in the codebook.
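A minimal ECOC sketch, assuming scikit-learn is available. Note that
OutputCodeClassifier generates a random codebook rather than a hand-designed
one like the example above; the dataset and settings are illustrative:

```python
# Minimal ECOC sketch: a random codebook of binary problems, each solved by
# a logistic-regression classifier, with decoding to the nearest codeword.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OutputCodeClassifier

# Three classes standing in for Cat, Dog, and Rabbit.
X, y = make_classification(n_samples=600, n_features=20, n_informative=10,
                           n_classes=3, random_state=4)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=4)

ecoc = OutputCodeClassifier(
    estimator=LogisticRegression(max_iter=1000),
    code_size=2.0,  # number of binary classifiers = code_size * n_classes
    random_state=4,
)
ecoc.fit(X_train, y_train)
print("Test accuracy:", ecoc.score(X_test, y_test))
```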
4.7 Gaussian Mixture Models
Gaussian Mixture Models (GMMs) are a probabilistic model used in machine
learning and statistics to represent the presence of multiple Gaussian
distributions within a dataset. While not exclusively an ensemble learning
technique, GMMs can be integrated into ensemble methods to enhance their
performance, especially in clustering and density estimation tasks.
Key Concepts of GMMs
1. Mixture Model: A GMM assumes that the data is generated from a
mixture of several Gaussian distributions (also known as
components), each characterized by its own mean and variance.
2. Probability Density Function: The overall probability density
function of a GMM is a weighted sum of the individual Gaussian
distributions:
p(x) = Σ (k = 1 to K) πk · N(x | μk, Σk)
Where:
 K is the number of Gaussian components.
 πk are the mixing coefficients (weights) for each component, which sum to 1.
 N(x | μk, Σk) is the Gaussian distribution with mean μk and covariance
Σk.
3. Expectation-Maximization (EM) Algorithm: GMMs are typically
fitted using the EM algorithm, which iteratively optimizes the
parameters (means, variances, and weights) to best fit the data.
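A minimal GMM sketch, assuming scikit-learn is available; the synthetic data
and number of components are illustrative:

```python
# Minimal GMM sketch: fit a mixture of three Gaussian components with the
# EM algorithm and inspect the estimated mixture parameters.
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=500, centers=3, random_state=5)

gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=5)
gmm.fit(X)                     # parameters are estimated with EM

print("Mixing coefficients (pi_k):", gmm.weights_)
print("Component means (mu_k):", gmm.means_)
labels = gmm.predict(X)        # most probable component for each point
probs = gmm.predict_proba(X)   # soft (probabilistic) assignments
```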

Applications in Ensemble Learning


While GMMs themselves are not an ensemble method, they can be used in
various ensemble learning frameworks in the following ways:
1. Clustering: GMMs can serve as base learners in ensemble methods
designed for clustering. For example, different GMMs can be trained
on various subsets of the data, and their results can be combined to
achieve a more robust clustering solution.
2. Anomaly Detection: In an ensemble context, multiple GMMs can be
trained on different features or data splits, and their predictions
regarding the likelihood of observations can be aggregated to improve
anomaly detection.
3. Model Blending: GMMs can be used to blend the predictions of
different models. For instance, after training several classifiers, the
probabilities produced by these classifiers can be modeled as a GMM,
allowing for a probabilistic interpretation of their combined outputs.
Advantages of GMMs
 Flexibility: GMMs can model complex data distributions as they can
adapt to different shapes of data through multiple Gaussian
components.
 Probabilistic Framework: GMMs provide a natural probabilistic
framework, allowing for uncertainty quantification in predictions.
 Robustness: By combining multiple GMMs, the ensemble can
become more robust to noise and outliers in the data.
Considerations
 Model Complexity: GMMs can become complex with a large number
of components, requiring careful selection of the number of mixtures.
 Overfitting: There's a risk of overfitting when using too many
Gaussian components, especially with limited data.
 Convergence Issues: The EM algorithm can converge to local
optima, making the initialization of parameters crucial.
4.8 The Expectation-Maximization (EM) Algorithm
A latent variable model consists of observable variables along
with unobservable variables. Observed variables are those variables in the
dataset that can be measured whereas unobserved (latent/hidden)
variables are inferred from the observed variables.
 It can be used to find local maximum likelihood estimates (MLE) or
maximum a posteriori (MAP) estimates of the parameters of a
statistical model with latent variables.
 It can be used to predict missing values in the dataset, provided we
know the general form of the probability distribution associated with
these latent variables.
 In simple words, the basic idea behind this algorithm is to use the
observed data to estimate the values of the unobserved (latent)
variables, and then use these estimates to update the model
parameters. This process is repeated until the values converge.
The EM algorithm consists of two main steps that are applied iteratively until
convergence:
 Given a set of incomplete data, start with a set of initialized
parameters.
 Expectation step (E-step): Using the observed data and the current
parameter estimates, estimate (or "guess") the values of the missing or
latent data. After this step, we have a completed dataset with no
missing values.
 Maximization step (M-step): Use the completed data prepared in the
expectation step to update the parameter estimates.
 Repeat the E-step and M-step until the estimates converge (a minimal
numerical sketch is given below).
Flow chart for EM algorithm
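The following is the minimal numerical sketch referred to above: EM for a
one-dimensional mixture of two Gaussians, assuming NumPy and SciPy are
available; the initial values and stopping rule are illustrative.

```python
# Minimal EM sketch for a 1-D mixture of two Gaussians, following the
# E-step / M-step loop described above.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
# Synthetic data drawn from two Gaussians.
data = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 1.5, 200)])

# Initialization: mixing weights, means, standard deviations.
pi = np.array([0.5, 0.5])
mu = np.array([-1.0, 1.0])
sigma = np.array([1.0, 1.0])

for _ in range(200):
    # E-step: responsibility of each component for each data point.
    dens = np.vstack([p * norm.pdf(data, m, s) for p, m, s in zip(pi, mu, sigma)])
    resp = dens / dens.sum(axis=0)

    # M-step: update the parameters using the responsibilities.
    Nk = resp.sum(axis=1)
    new_mu = (resp * data).sum(axis=1) / Nk
    sigma = np.sqrt((resp * (data - new_mu[:, None]) ** 2).sum(axis=1) / Nk)
    pi = Nk / len(data)
    if np.allclose(new_mu, mu, atol=1e-6):  # convergence check on the means
        mu = new_mu
        break
    mu = new_mu

print("Estimated weights:", pi, "means:", mu, "std devs:", sigma)
```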

Advantages and Disadvantages of EM algorithm


Advantages
 The two basic steps of the EM algorithm, the E-step and the M-step, are
often quite easy to implement for many machine learning problems.
 The solution to the M-step often exists in closed form.
 The value of the likelihood is guaranteed not to decrease after each
iteration.
Disadvantages
 It has slow convergence.
 It converges only to a local optimum.
 It takes both forward and backward probabilities into account, in
contrast to numerical optimization, which considers only forward
probabilities.
Applications of EM Algorithm
The latent variable model has several real-life applications in Machine
learning:
 Used to estimate the parameters of Gaussian density functions.
 Helpful for filling in missing data in a sample.
 It finds plenty of use in different domains such as Natural Language
Processing (NLP), Computer Vision, etc.
 Used for image reconstruction in the fields of Medicine and Structural
Engineering.
 Used for estimating the parameters of Hidden Markov Models (HMMs)
and other mixture models such as Gaussian Mixture Models.
 Used for finding the values of latent variables.

This concludes Chapter 4.
