

Chapter 9
Ensemble Learning

supplementary slides to
Machine Learning Fundamentals

© Hui Jiang 2020
published by Cambridge University Press

August 2020


Outline

1 Formulation of Ensemble Learning

2 Bagging

3 Boosting


Ensemble Learning

ensemble learning: combine multiple base models that are learned separately
for the same task

how to choose base models?
◦ neural networks, linear models, decision trees, etc.

how to learn base models to ensure diversity?
◦ re-sampling the training set, re-weighting training samples, etc.

how to combine base models optimally?
◦ bagging, boosting, stacking


Decision Trees (I)

a popular non-parametric model for regression or classification tasks

a tree-structured model:
◦ each non-terminal node is associated with a binary question regarding an
  input feature element x_i and a threshold t_j, e.g. x_i ≤ t_j
◦ each leaf node represents a homogeneous region R_l in the input space

each decision tree represents a particular partition of the input space

decision trees are a highly interpretable machine learning method

Decision Trees (II)


fit a simple model to all y values in each region R_l
◦ regression: use a constant c_l for each R_l
◦ classification: assign all x in each R_l to one particular class

[figure: a generic ML model maps an input x to an output y = f̄(x)]

approximate the unknown target function by a piece-wise constant function

    y = f(x) = Σ_l c_l I(x ∈ R_l)

where I(x ∈ R_l) = 1 if x ∈ R_l and 0 otherwise
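As a concrete illustration (my own sketch, not from the slides), a tiny tree data structure whose leaves store the constants c_l realizes exactly this piece-wise constant function; all class and field names below are hypothetical:

import numpy as np

# Minimal sketch: a binary decision tree whose leaves store constants c_l,
# so predict() evaluates f(x) = sum_l c_l * I(x in R_l).
class TreeNode:
    def __init__(self, feature=None, threshold=None,
                 left=None, right=None, value=None):
        self.feature = feature      # index i of the feature x_i asked about
        self.threshold = threshold  # threshold t_j in the question x_i <= t_j
        self.left = left            # subtree for x_i <= t_j
        self.right = right          # subtree for x_i > t_j
        self.value = value          # constant c_l if this node is a leaf

    def predict(self, x):
        if self.value is not None:             # leaf: x has reached a region R_l
            return self.value
        if x[self.feature] <= self.threshold:  # binary question x_i <= t_j
            return self.left.predict(x)
        return self.right.predict(x)

# example: a depth-1 tree splitting on feature 0 at threshold 0.5
tree = TreeNode(feature=0, threshold=0.5,
                left=TreeNode(value=1.2), right=TreeNode(value=3.4))
print(tree.predict(np.array([0.3])))  # 1.2
print(tree.predict(np.array([0.9])))  # 3.4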

Decision Trees for Regression



a training set: D = { (x^(n), y^(n)) | n = 1, 2, · · · , N }

construct the loss functional using a loss function l(·):

    L(f; D) = (1/N) Σ_{n=1}^N l( y^(n), f(x^(n)) ) = (1/N) Σ_{n=1}^N ( y^(n) − f(x^(n)) )²

it is computationally infeasible to find the best partition to minimize the
above loss

use a greedy algorithm to recursively find one optimal split x_i* ≤ t_j* at a time:

    {x_i*, t_j*} = arg min_{x_i, t_j} [ Σ_{x^(n) ∈ D_l} ( y^(n) − c_l* )² + Σ_{x^(n) ∈ D_r} ( y^(n) − c_r* )² ]

where D_l = { (x^(n), y^(n)) | x_i^(n) ≤ t_j }, D_r = { (x^(n), y^(n)) | x_i^(n) > t_j },
and c_l* and c_r* are the centroids of D_l and D_r
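A minimal Python sketch of this greedy search (my own illustration, not the book's code): it scans every feature and every candidate threshold, using the left/right means as c_l* and c_r*:

import numpy as np

def best_split(X, y):
    """Exhaustively search for the split (i*, t*) minimizing squared error.

    X: (N, d) array of inputs, y: (N,) array of targets.
    Returns (feature index, threshold, left mean c_l*, right mean c_r*).
    """
    best = (None, None, None, None)
    best_loss = np.inf
    N, d = X.shape
    for i in range(d):                        # candidate feature x_i
        for t in np.unique(X[:, i])[:-1]:     # candidate thresholds t_j
            left = y[X[:, i] <= t]
            right = y[X[:, i] > t]
            c_l, c_r = left.mean(), right.mean()
            loss = ((left - c_l) ** 2).sum() + ((right - c_r) ** 2).sum()
            if loss < best_loss:
                best_loss, best = loss, (i, t, c_l, c_r)
    return best

# toy example: one feature with a clear jump after x = 3
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([1.0, 1.1, 0.9, 5.0, 5.2])
print(best_split(X, y))   # splits at threshold 3.0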

Decision Trees for Classification


classification problem involving K classes, i.e. ω_1, ω_2, · · · , ω_K

p_lk (k = 1, 2, · · · , K): the portion of class k among all training samples
assigned to leaf node l representing R_l

    p_lk = (1/N_l) Σ_{x^(n) ∈ R_l} I( y^(n) = ω_k )

all input x in each region R_l is assigned to the majority class

    k_l* = arg max_k p_lk

the criteria for the best split {x_i*, t_j*}:
◦ misclassification error: (1/N_l) Σ_{x^(n) ∈ R_l} I( y^(n) ≠ ω_{k_l*} ) = 1 − p_{l k_l*}
◦ Gini index: 1 − Σ_{k=1}^K p_lk²
◦ entropy: − Σ_{k=1}^K p_lk log(p_lk)
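These three criteria are simple functions of the class proportions p_lk; the sketch below (my own, with a hypothetical helper name) computes them for the labels falling into one leaf:

import numpy as np

def impurities(labels):
    """Misclassification error, Gini index, and entropy for one leaf node.

    labels: 1-D array of class labels y^(n) for the samples falling in R_l.
    """
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()          # class proportions p_lk
    misclass = 1.0 - p.max()           # 1 - p_{l k_l*}
    gini = 1.0 - np.sum(p ** 2)        # 1 - sum_k p_lk^2
    entropy = -np.sum(p * np.log(p))   # - sum_k p_lk log p_lk
    return misclass, gini, entropy

# example: a leaf holding 8 samples of class 0 and 2 of class 1
print(impurities(np.array([0] * 8 + [1] * 2)))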


Bagging and Random Forests

bagging stands for bootstrap aggregating

bootstrap (sample with replacement) the training set into M subsets
use the M bootstrap subsets to independently learn M models
combine the M models by averaging or majority voting

random forests: use decision trees as the base models in bagging
◦ row sampling
◦ column sampling
◦ sub-optimal splitting

random forests are much more powerful than single decision trees
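A compact sketch of bagging with decision trees, roughly in the spirit of a random forest (my own illustration, assuming scikit-learn and numpy-array inputs; the hyper-parameter values are arbitrary): rows are bootstrapped per tree and max_features handles column sub-sampling at each split:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_bagged_trees(X, y, M=50, seed=0):
    """Learn M trees, each on a bootstrap (row-sampled) replica of the training set."""
    rng = np.random.default_rng(seed)
    N = len(X)
    trees = []
    for _ in range(M):
        idx = rng.integers(0, N, size=N)                    # row sampling with replacement
        tree = DecisionTreeClassifier(max_features="sqrt")  # column sub-sampling per split
        tree.fit(X[idx], y[idx])
        trees.append(tree)
    return trees

def predict_majority(trees, X):
    """Combine the M trees by majority voting (labels assumed to be non-negative ints)."""
    votes = np.stack([t.predict(X) for t in trees])         # shape (M, number of inputs)
    return np.array([np.bincount(col.astype(int)).argmax() for col in votes.T])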


Boosting: Outline

1 Gradient Boosting

2 AdaBoost

3 Gradient Tree Boosting


Boosting
consider an additive model for ensemble learning

    F_m(x) = w_1 f_1(x) + w_2 f_2(x) + · · · + w_m f_m(x)

each base model f_m(x) ∈ H, then F_m(x) ∈ lin(H) ⊇ H

ensemble learning ⇐⇒ functional minimization:

    F_m(x) = arg min_{f ∈ lin(H)} Σ_{n=1}^N l( f(x_n), y_n )

boosting: a sequential learning strategy to add a new base model to improve
the current ensemble

    F_m(x) = F_{m−1}(x) + w_m f_m(x)


Gradient Boosting
gradient boosting: estimate the new base model along the direction of the
gradient at the current ensemble F_{m−1}:

    ∇l(F_{m−1}(x)) = ∂l(f(x), y) / ∂f |_{f = F_{m−1}}

project the gradient into H, using the inner product
⟨f, g⟩ ≜ (1/N) Σ_{i=1}^N f(x_i) g(x_i):

    f_m = arg max_{f ∈ H} ⟨ f, −∇l(F_{m−1}(x)) ⟩

estimate the optimal weight:

    w_m = arg min_w Σ_{n=1}^N l( F_{m−1}(x_n) + w f_m(x_n), y_n )
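A sketch of one gradient-boosting round (my own, with the squared loss as a concrete choice so the negative gradient is the residual; fitting a regression tree to it stands in for the projection onto H, and the weight is found by a simple grid line search — tree depth and grid are arbitrary):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boosting_round(F_prev, X, y, depth=2, w_grid=np.linspace(0.0, 2.0, 201)):
    """One gradient-boosting round under the squared loss l(f, y) = (f - y)^2 / 2.

    F_prev: array of current ensemble outputs F_{m-1}(x_n) on the training set.
    Returns the fitted base model f_m and its weight w_m.
    """
    neg_grad = y - F_prev                        # negative functional gradient = residual
    f_m = DecisionTreeRegressor(max_depth=depth).fit(X, neg_grad)
    pred = f_m.predict(X)
    # line search for w_m = arg min_w sum_n l(F_{m-1}(x_n) + w f_m(x_n), y_n)
    losses = [np.sum((F_prev + w * pred - y) ** 2) / 2 for w in w_grid]
    w_m = w_grid[int(np.argmin(losses))]
    return f_m, w_m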


AdaBoost (I)
apply gradient boosting to binary classification problems

H: all binary functions, i.e. ∀f ∈ H, f(x) ∈ {−1, +1}

the exponential loss function: l(F(x), y) = e^{−y F(x)}

given a training set D = { (x_1, y_1), (x_2, y_2), · · · , (x_N, y_N) },
where x_n ∈ R^d and y_n ∈ {−1, +1}

the functional gradient:

    ∇l(F_{m−1}(x)) ≜ ∂l(f(x), y) / ∂f |_{f = F_{m−1}} = −y e^{−y F_{m−1}(x)}

project into H:

    f_m = arg max_{f ∈ H} ⟨ f, −∇l(F_{m−1}(x)) ⟩
        = arg max_{f ∈ H} (1/N) Σ_{n=1}^N y_n f(x_n) e^{−y_n F_{m−1}(x_n)}

AdaBoost (II)
denote α_n^(m) ≜ exp(−y_n F_{m−1}(x_n)):

    f_m = arg max_{f ∈ H} [ Σ_{y_n = f(x_n)} α_n^(m) − Σ_{y_n ≠ f(x_n)} α_n^(m) ]
        = arg max_{f ∈ H} [ Σ_{n=1}^N α_n^(m) − 2 Σ_{y_n ≠ f(x_n)} α_n^(m) ]
        = arg min_{f ∈ H} Σ_{y_n ≠ f(x_n)} α_n^(m)

normalizing all weights as ᾱ_n^(m) ≜ α_n^(m) / Σ_{n=1}^N α_n^(m), we have

    f_m = arg min_{f ∈ H} Σ_{y_n ≠ f(x_n)} ᾱ_n^(m)


AdaBoost (III)
estimate f_m to minimize the weighted classification error:

    ε_m = Σ_{y_n ≠ f_m(x_n)} ᾱ_n^(m)        (0 ≤ ε_m ≤ 1)

replace the 0-1 loss function with a weighted loss function, where ᾱ_n^(m) is
treated as the loss if (x_n, y_n) is misclassified

estimate the optimal weight:

    w_m = arg min_w Σ_{n=1}^N e^{−y_n (F_{m−1}(x_n) + w f_m(x_n))}

    ⟹ w_m = (1/2) ln( Σ_{y_n = f_m(x_n)} ᾱ_n^(m) / Σ_{y_n ≠ f_m(x_n)} ᾱ_n^(m) ) = (1/2) ln((1 − ε_m) / ε_m)


AdaBoost (IV)

AdaBoost algorithm

input: { (x_1, y_1), · · · , (x_N, y_N) }, where x_n ∈ R^d and y_n ∈ {−1, +1}
output: an ensemble model F_m(x)

m = 1 and F_0(x) = 0
initialize ᾱ_n^(1) = 1/N for all n = 1, 2, · · · , N
while not converged do
    learn a binary classifier f_m(x) to minimize ε_m = Σ_{y_n ≠ f_m(x_n)} ᾱ_n^(m)
    estimate ensemble weight: w_m = (1/2) ln((1 − ε_m) / ε_m)
    add to ensemble: F_m(x) = F_{m−1}(x) + w_m f_m(x)
    update ᾱ_n^(m+1) = ᾱ_n^(m) e^{−y_n w_m f_m(x_n)} / Σ_{n=1}^N ᾱ_n^(m) e^{−y_n w_m f_m(x_n)}
        for all n = 1, 2, · · · , N
    m = m + 1
end while
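A runnable sketch of this loop (my own illustration, not the book's code), using scikit-learn decision stumps as base classifiers and a fixed number of rounds plus an early stop in place of the convergence test:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, M=20):
    """Train an AdaBoost ensemble; y must take values in {-1, +1}.

    Returns lists of base classifiers f_m and ensemble weights w_m.
    """
    N = len(X)
    alpha = np.full(N, 1.0 / N)                  # normalized sample weights
    models, weights = [], []
    for m in range(M):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=alpha)     # minimize the weighted error
        pred = stump.predict(X)
        eps = alpha[pred != y].sum()             # weighted classification error
        if eps <= 0 or eps >= 0.5:               # stop if no better than chance
            break
        w = 0.5 * np.log((1 - eps) / eps)        # ensemble weight w_m
        alpha *= np.exp(-y * w * pred)           # re-weight training samples
        alpha /= alpha.sum()                     # renormalize
        models.append(stump)
        weights.append(w)
    return models, weights

def predict(models, weights, X):
    """Sign of the weighted sum F_m(x) = sum_m w_m f_m(x)."""
    F = sum(w * f.predict(X) for w, f in zip(weights, models))
    return np.sign(F)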


AdaBoost (V)

Theorem
If AdaBoost generates m base models with errors ε_1, ε_2, · · · , ε_m, the error
of the ensemble model F_m(x) is bounded as:

    ε ≤ 2^m ∏_{t=1}^m √( ε_t (1 − ε_t) )

AdaBoost combines many weak classifiers into a strong classifier, i.e.
ε → 0 as m → ∞ if all ε_t ≠ 1/2 (better than random guessing)

AdaBoost generalizes well to unseen samples since it improves the margin
distribution of the training samples


Gradient Tree Boosting (I)


apply gradient boosting to regression problems

H: all decision trees

use the squared error as the loss function: l(f(x), y) = (1/2)(f(x) − y)²

the functional gradient: ∇l(F_{m−1}(x)) = F_{m−1}(x) − y

given a training set D = { (x_1, y_1), (x_2, y_2), · · · , (x_N, y_N) },
project it to minimize:

    f_m = arg min_{f ∈ H} ‖ f + ∇l(F_{m−1}(x)) ‖²
        = arg min_{f ∈ H} Σ_{n=1}^N ( f(x_n) − (y_n − F_{m−1}(x_n)) )²

where y_n − F_{m−1}(x_n) is the residual of the current ensemble at x_n


Gradient Tree Boosting (II)


gradient tree boosting: build a decision tree f_m to approximate the
residuals, i.e. y_n − F_{m−1}(x_n), for all n

    y = f_m(x) = Σ_l c_ml I(x ∈ R_ml)

where c_ml is the mean of all residuals in the region R_ml

a.k.a. gradient boosting machine (GBM), gradient boosted regression tree (GBRT)

use a pre-set "shrinkage" parameter ν as the weight:

    F_m(x) = F_{m−1}(x) + ν f_m(x)

also applicable to multi-class classification problems



Gradient Tree Boosting (III)


Gradient Tree Boosting algorithm

input: { (x_1, y_1), (x_2, y_2), · · · , (x_N, y_N) }
output: an ensemble model F_m(x)

fit a regression tree f_0(x) to { (x_1, y_1), (x_2, y_2), · · · , (x_N, y_N) }
F_0(x) = ν f_0(x)
m = 1
while not converged do
    compute the negative gradients as pseudo outputs:
        ỹ_n = −∇l(F_{m−1}(x_n)) for all n = 1, 2, · · · , N
    fit a regression tree f_m(x) to { (x_1, ỹ_1), · · · , (x_N, ỹ_N) }
    F_m(x) = F_{m−1}(x) + ν f_m(x)
    m = m + 1
end while
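A compact sketch of this loop (my own, assuming scikit-learn and the squared loss, so the pseudo outputs are just the residuals; tree depth, number of rounds, and the shrinkage value are illustrative choices):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_tree_boosting(X, y, M=100, nu=0.1, max_depth=3):
    """Fit an additive model F_M(x) = nu * sum_m f_m(x) of regression trees."""
    trees = []
    F = np.zeros(len(y))                  # current ensemble outputs F_{m-1}(x_n)
    for _ in range(M + 1):                # fits f_0, f_1, ..., f_M
        residual = y - F                  # pseudo outputs: negative gradients
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residual)
        F += nu * tree.predict(X)         # F_m = F_{m-1} + nu * f_m
        trees.append(tree)
    return trees

def predict(trees, X, nu=0.1):
    """Evaluate the ensemble on new inputs."""
    return nu * sum(t.predict(X) for t in trees)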