Lab4 RBM DBN Extra Slides
Lab 4 support
DD2437
Pawel Herman
CST/EECS/KTH
Restricted Boltzmann machine (RBM)
Visible and hidden units are conditionally independent given one another:
$p(\mathbf{h} \mid \mathbf{v}) = \prod_i p(h_i \mid \mathbf{v}), \qquad p(\mathbf{v} \mid \mathbf{h}) = \prod_j p(v_j \mid \mathbf{h})$
Following the same principle of maximising the log likelihood by means of gradient ascent, one obtains:
$\Delta w_{ji} \propto \frac{\partial L(\mathbf{W})}{\partial w_{ji}} = \langle v_j h_i \rangle_{\text{data}} - \langle v_j h_i \rangle_{\text{model}}$
The conditional probabilities are logistic (sigmoid) functions of the input from the other layer:
$P(h_i = 1 \mid \mathbf{v}) = \frac{1}{1 + \exp\left(-\mathrm{bias}_{h_i} - \mathbf{v}^{T}\mathbf{W}_{:,i}\right)}$
$P(v_j = 1 \mid \mathbf{h}) = \frac{1}{1 + \exp\left(-\mathrm{bias}_{v_j} - \mathbf{W}_{j,:}\mathbf{h}\right)}$
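For the lab, these conditionals are straightforward to code. A minimal NumPy sketch (the names sigmoid, p_h_given_v, p_v_given_h, bias_h, bias_v are illustrative assumptions, not fixed by the slides):

import numpy as np

def sigmoid(x):
    # logistic function used for both conditionals
    return 1.0 / (1.0 + np.exp(-x))

def p_h_given_v(v, W, bias_h):
    # P(h_i = 1 | v) for all hidden units i (factorises over i)
    # v: (n_visible,) or (N, n_visible), W: (n_visible, n_hidden)
    return sigmoid(bias_h + v @ W)

def p_v_given_h(h, W, bias_v):
    # P(v_j = 1 | h) for all visible units j (factorises over j)
    return sigmoid(bias_v + h @ W.T)

def sample_bernoulli(p, rng):
    # draw binary states from element-wise Bernoulli probabilities
    return (rng.random(p.shape) < p).astype(float)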
RBM learning with Contrastive Divergence (CD)
Gibbs sampling:
$P(h_i = 1 \mid \mathbf{v}) = \frac{1}{1 + \exp\left(-\mathrm{bias}_{h_i} - \mathbf{v}^{T}\mathbf{W}_{:,i}\right)}$
$P(v_j = 1 \mid \mathbf{h}) = \frac{1}{1 + \exp\left(-\mathrm{bias}_{v_j} - \mathbf{W}_{j,:}\mathbf{h}\right)}$
Increase energy "elsewhere", especially in areas of low energy for the observed data (Hinton, 2003).
GOOD TO KNOW: Contrastive Divergence does not optimise the likelihood, but it works effectively!
CDk recipe for training RBM
Objective: estimate $\langle v_j h_i \rangle_{\text{data}}$ and $\langle v_j h_i \rangle_{\text{model}}$ by Gibbs sampling.
1) Set (clamp) the visible units with an input vector and update the hidden units (binary states):
$P(h_i = 1 \mid \mathbf{v}) = \frac{1}{1 + \exp\left(-\mathrm{bias}_{h_i} - \mathbf{v}^{T}\mathbf{W}_{:,i}\right)}$
2) Update all the visible units in parallel to get a reconstruction (probabilities can be used):
$P(v_j = 1 \mid \mathbf{h}) = \frac{1}{1 + \exp\left(-\mathrm{bias}_{v_j} - \mathbf{W}_{j,:}\mathbf{h}\right)}$
3) Collect the statistics for the correlations after the k-th step using mini-batches (N samples) and update the weights:
$\Delta w_{j,i} \propto \frac{1}{N} \sum_{n=1}^{N} \left( v_j^{(n)} h_i^{(n)} - \hat{v}_j^{(n)} \hat{h}_i^{(n)} \right)$
The final update of the hidden units should use the probability.
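A possible NumPy sketch of one CD-k mini-batch update following the recipe above; cd_k_update, lr and all variable names are assumptions made for illustration, not part of the lab specification:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd_k_update(v0, W, bias_h, bias_v, k=1, lr=0.01, rng=None):
    # v0: mini-batch of N data vectors, shape (N, n_visible)
    rng = np.random.default_rng() if rng is None else rng
    N = v0.shape[0]

    # 1) clamp the visible units to the data and update hidden units (binary states)
    ph0 = sigmoid(bias_h + v0 @ W)
    h = (rng.random(ph0.shape) < ph0).astype(float)

    # 2) k steps of alternating Gibbs sampling to get the reconstruction
    for _ in range(k):
        pv = sigmoid(bias_v + h @ W.T)              # reconstruction probabilities
        v = (rng.random(pv.shape) < pv).astype(float)
        ph = sigmoid(bias_h + v @ W)                # final hidden update uses the probability
        h = (rng.random(ph.shape) < ph).astype(float)

    # 3) collect the correlation statistics over the mini-batch and update the weights
    # (positive phase uses the hidden probabilities here; binary states would also work)
    dW = (v0.T @ ph0 - v.T @ ph) / N
    W += lr * dW
    bias_h += lr * (ph0 - ph).mean(axis=0)
    bias_v += lr * (v0 - v).mean(axis=0)
    return W, bias_h, bias_v

Using the hidden probabilities rather than binary samples when collecting the statistics reduces sampling noise, which is what the note about the final hidden update refers to.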
CD1 case
For k = 1 the update becomes:
$\Delta w_{j,i} \propto \langle v_j^{(0)} h_i^{(0)} \rangle - \langle \hat{v}_j^{(1)} \hat{h}_i^{(1)} \rangle \approx \frac{1}{N} \sum_{n=1}^{N} \left( v_j^{(0,n)} h_i^{(0,n)} - \hat{v}_j^{(1,n)} \hat{h}_i^{(1,n)} \right)$
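With the hypothetical cd_k_update sketch above, CD1 is simply the special case k = 1; the 784/500 sizes below match the 28x28-image example later in these slides:

import numpy as np

rng = np.random.default_rng(0)
W = 0.01 * rng.standard_normal((784, 500))               # 28x28 images -> 500 hidden units
bias_v = np.zeros(784)
bias_h = np.zeros(500)
v_batch = (rng.random((20, 784)) < 0.5).astype(float)    # placeholder mini-batch of binarised "images"
W, bias_h, bias_v = cd_k_update(v_batch, W, bias_h, bias_v, k=1)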
Deep belief nets
Greedy layer‐wise training approach
with the use of RBMs
[Figure: stack of RBMs — visible layer v, then hidden layers h(1), h(2), h(3) connected by weights W(1), W(2), W(3)] (Salakhutdinov, 2015)
[Figure: in the resulting DBN, the top two layers h(2)–h(3) form the undirected part of the network (bipartite graph of an RBM), while h(2)–h(1)–v form the directed part of the network (sigmoid belief network)] (Salakhutdinov, 2015)
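A schematic of the greedy layer-wise procedure, reusing the hypothetical cd_k_update and sigmoid helpers from the CD-k sketch: each RBM is trained on the hidden-unit probabilities produced by the layer below, and the resulting weights are stacked. All sizes, epoch counts and batch sizes are placeholder assumptions:

import numpy as np

def pretrain_dbn(data, layer_sizes, epochs=10, k=1, lr=0.01, batch_size=100, rng=None):
    # data: (N, n_visible); layer_sizes: e.g. [500, 500] for the lower part of the stack
    rng = np.random.default_rng() if rng is None else rng
    rbms = []
    x = data
    n_vis = x.shape[1]
    for n_hid in layer_sizes:
        W = 0.01 * rng.standard_normal((n_vis, n_hid))
        bias_v = np.zeros(n_vis)
        bias_h = np.zeros(n_hid)
        for _ in range(epochs):
            for batch in np.array_split(x, max(1, len(x) // batch_size)):
                W, bias_h, bias_v = cd_k_update(batch, W, bias_h, bias_v, k=k, lr=lr, rng=rng)
        rbms.append((W, bias_h, bias_v))
        x = sigmoid(bias_h + x @ W)     # propagate the data upwards as P(h = 1 | x)
        n_vis = n_hid
    return rbms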
Deep belief nets
Approach 1: bottom-up pass by stochastically activating higher layers in time.
Hinton et al.'s (2006) architecture
Building the stack of RBMs
[Figure: RBM1 — 28x28 pixel image ↔ 500 units; RBM2 — 500 units ↔ 500 units; RBM3 — 500 units + 10 label units (soft-max) ↔ 2000 top-level units]
Hinton et al.'s (2006) architecture
The network is used to model the joint distribution of digit images and labels.
Once the top layer is added, the connections between the layers below (now hidden) get decoupled and become unidirectional.
Hinton et al.'s (2006) architecture
Pretraining with labels once the stack of RBMs has been built
[Figure: the 10 label units (soft-max) together with the 500-unit layer form the visible layer of the top RBM with 2000 top-level units]
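One hedged way to realise "pretraining with labels": concatenate a one-hot label vector with the activity of the 500-unit layer and use the result as the visible vector of the top RBM. The helper below (top_rbm_visible) is an illustrative assumption, not code from Hinton et al.:

import numpy as np

def top_rbm_visible(h_penultimate, labels, n_classes=10):
    # h_penultimate: (N, 500) activity of the layer below the top RBM
    # labels: (N,) integer class labels, turned into the 10 label (soft-max) units
    one_hot = np.eye(n_classes)[labels]
    return np.concatenate([one_hot, h_penultimate], axis=1)   # (N, 10 + 500)

# the top RBM (visible size 510, hidden size 2000) can then be trained with CD
# on these joint vectors, e.g. with the cd_k_update sketch from earlier.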
Approximate sampling from DBN
Gibbs sampling chain in the RBM part (undirected part of the graph), followed by single-run sampling through the directed part of the graph.
Hinton et al.'s (2006) architecture
Generating samples
[Figure: Gibbs sampling in the top RBM (2000 top-level units ↔ 500 units + 10 label units), followed by a single top-down pass through the directed layers down to the 28x28 pixel image]
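A rough sketch of this generation procedure, under the assumptions that the directed layers use the (tied) transposed RBM weights top-down and that the chosen label is clamped on the 10 label units during the Gibbs chain; generate_sample and all parameter names are illustrative:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def generate_sample(directed_rbms, W_top, bh_top, bv_top, label,
                    n_gibbs=200, n_classes=10, rng=None):
    # directed_rbms: list of (W, bias_h, bias_v) for the lower, now-directed layers
    # W_top: (n_classes + n_pen, 2000) weights of the top-level RBM
    rng = np.random.default_rng() if rng is None else rng
    n_pen = bv_top.shape[0] - n_classes           # size of the penultimate (500-unit) layer
    one_hot = np.zeros(n_classes)
    one_hot[label] = 1.0
    v = np.concatenate([one_hot, (rng.random(n_pen) < 0.5).astype(float)])

    # Gibbs sampling chain in the top RBM, keeping the label units clamped
    for _ in range(n_gibbs):
        ph = sigmoid(bh_top + v @ W_top)
        h = (rng.random(ph.shape) < ph).astype(float)
        pv = sigmoid(bv_top + h @ W_top.T)
        v = (rng.random(pv.shape) < pv).astype(float)
        v[:n_classes] = one_hot

    # single top-down pass through the directed part of the graph
    x = v[n_classes:]
    for W, bias_h, bias_v in reversed(directed_rbms):
        x = sigmoid(bias_v + x @ W.T)             # top-down (generative) prediction
    return x                                      # pixel probabilities of the 28x28 image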
Hinton et al.'s (2006) architecture
Fine-tuning with a contrastive wake-sleep algorithm
Bottom-up wake phase
1. Drive the network bottom-up by providing input digit images and, using the recognition weights, propagate binary samples from the 28x28 pixel image up to the visible layer of the top RBM.
Hinton et al.'s (2006) architecture
Fine-tuning with a contrastive wake-sleep algorithm
Learning that results from the wake phase (based on network activities sampled during the wake phase):
$\Delta w_{ji} \propto x_i \left( y_j - \hat{y}_j \right)$
where $\hat{y}_j$ is the prediction (probability) of the sampled state $y_j$ given $x_i$.
Hinton et al.'s (2006) architecture
Fine-tuning with a contrastive wake-sleep algorithm
CDk learning of the top RBM (labels are not clamped here; soft-max is used):
$\Delta w_{j,i} \propto \langle v_j^{(0)} h_i^{(0)} \rangle - \langle v_j^{(k)} h_i^{(k)} \rangle$
computed with binary states (probabilities could be used too).
Hinton et al.'s (2006) architecture
Fine-tuning with a contrastive wake-sleep algorithm
Learning that results from the sleep phase (based on network activities sampled during the sleep phase):
$\Delta w_{ij} \propto x_j \left( y_i - \hat{y}_i \right)$
where $\hat{y}_i$ is the prediction (probability) of the sampled state $y_i$ given $x_j$.
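Both delta rules have the same form: the state of the unit driving the prediction times the difference between the sampled state and the predicted probability of the unit being predicted. A compact, hypothetical sketch (delta_rule_update is not a name from the slides):

import numpy as np

def delta_rule_update(W, pre, post_sampled, post_pred, lr=0.01):
    # Generic wake/sleep delta rule: delta w proportional to pre * (post_sampled - post_pred)
    # Wake phase:  pre = x (layer above), post = y (layer below), W = generative weights
    # Sleep phase: pre = x (layer below), post = y (layer above), W = recognition weights
    W += lr * np.outer(pre, post_sampled - post_pred)
    return W

In this scheme the wake phase adapts the generative weights while the recognition weights produce the samples, and the sleep phase does the reverse.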