Lab4 RBM DBN Extra Slides

The document discusses restricted Boltzmann machines (RBMs) and their training with contrastive divergence (CD). It details the RBM structure and the equations for the conditional probabilities between visible and hidden units. CD training runs Gibbs sampling for k steps to approximate maximum likelihood training: the weights are updated based on the difference between the correlations measured on the data and those measured on the model's reconstruction after k steps of Gibbs sampling. CD1 uses just one reconstruction step and thus performs efficient, approximate maximum likelihood training of RBMs.


Extra lab 4 support
DD2437
Pawel Herman
CST/EECS/KTH



• RBMs and CD learning
• DBNs (stacking RBMs)
• Hinton et al.’s (2006) model 


Restricted Boltzmann machine (RBM)

Visible and hidden units are conditionally independent given one another:

p(h | v) = \prod_i p(h_i | v),    P(h_i = 1 | v) = \frac{1}{1 + \exp(-bias^{(h)}_i - v^T W_{:,i})}

p(v | h) = \prod_j p(v_j | h),    P(v_j = 1 | h) = \frac{1}{1 + \exp(-bias^{(v)}_j - W_{j,:} h)}

Following the same principle of maximising the log likelihood by means of gradient ascent, one obtains:

\Delta w_{ji} \propto \frac{\partial L(W)}{\partial w_{ji}} = \langle v_j h_i \rangle_{data} - \langle v_j h_i \rangle_{model}
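To make these conditional probabilities concrete, here is a minimal NumPy sketch (an illustration, not part of the original slides; it assumes a weight matrix W of shape n_visible x n_hidden, with v and h given as 1-D arrays or mini-batch matrices):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def p_h_given_v(v, W, bias_h):
    # P(h_i = 1 | v) = sigmoid(bias_h[i] + v^T W[:, i]), computed for all i at once
    return sigmoid(bias_h + v @ W)

def p_v_given_h(h, W, bias_v):
    # P(v_j = 1 | h) = sigmoid(bias_v[j] + W[j, :] h), computed for all j at once
    return sigmoid(bias_v + h @ W.T)
```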




RBM learning with Contrastive Divergence (CD)

[Figure (Hinton, 2003): the Gibbs sampling chain used in CD; the update increases energy "elsewhere", especially in areas of low energy for the observed data.]

P(h_i = 1 | v) = \frac{1}{1 + \exp(-bias^{(h)}_i - v^T W_{:,i})}

P(v_j = 1 | h) = \frac{1}{1 + \exp(-bias^{(v)}_j - W_{j,:} h)}

GOOD TO KNOW: Contrastive Divergence does not optimise the likelihood, but it works effectively!




CDk recipe for training an RBM

Objective: estimate \langle v_j h_i \rangle_{data} and \langle v_j h_i \rangle_{model} by Gibbs sampling.

1) Set (clamp) the visible units to an input vector and update the hidden units (binary states):

   P(h_i = 1 | v) = \frac{1}{1 + \exp(-bias^{(h)}_i - v^T W_{:,i})}

2) Update all the visible units in parallel to get a reconstruction (probabilities can be used):

   P(v_j = 1 | h) = \frac{1}{1 + \exp(-bias^{(v)}_j - W_{j,:} h)}

3) Collect the correlation statistics after k steps over a mini-batch of N samples and update the weights (the hats denote values after the k-th Gibbs step):

   \Delta w_{j,i} = \frac{1}{N} \sum_{n=1}^{N} \left( v_j^{(n)} h_i^{(n)} - \hat v_j^{(n)} \hat h_i^{(n)} \right)

The final update of the hidden units should use the probability.
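A minimal NumPy sketch of this recipe, assuming binary units, a weight matrix W of shape n_visible x n_hidden, and a mini-batch V0 with N rows; the learning rate and the use of probabilities for the reconstruction are illustrative choices, not prescribed by the slide:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd_k_weight_update(V0, W, bias_v, bias_h, k=1, lr=0.01, rng=np.random.default_rng(0)):
    """One CD-k weight update computed on a mini-batch V0 of shape (N, n_visible)."""
    N = V0.shape[0]
    # 1) clamp the visible units to the data and update the hidden units (binary states)
    ph0 = sigmoid(bias_h + V0 @ W)
    H0 = (rng.random(ph0.shape) < ph0).astype(float)
    H = H0
    for step in range(k):
        # 2) update all visible units in parallel to get a reconstruction (probabilities used here)
        Vk = sigmoid(bias_v + H @ W.T)
        ph = sigmoid(bias_h + Vk @ W)
        # the final hidden update uses probabilities, intermediate ones binary samples
        H = ph if step == k - 1 else (rng.random(ph.shape) < ph).astype(float)
    # 3) correlation statistics: <v_j h_i> on the data minus the same after k Gibbs steps
    dW = (V0.T @ H0 - Vk.T @ H) / N
    return W + lr * dW
```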




CD1 case (one step of Gibbs sampling)

\Delta w_{j,i} = v_j^{(0)} h_i^{(0)} - \hat v_j^{(1)} \hat h_i^{(1)} = \frac{1}{N} \sum_{n=1}^{N} \left( v_j^{(0,n)} h_i^{(0,n)} - \hat v_j^{(1,n)} \hat h_i^{(1,n)} \right)

(v^{(0)}: data samples, h^{(0)}: binary states, \hat v^{(1)} and \hat h^{(1)}: probabilities)

\Delta bias_j^{(v)} = v_j^{(0)} - v_j^{(1)}

\Delta bias_i^{(h)} = h_i^{(0)} - h_i^{(1)}
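The same update for the CD1 case, now including the bias rules, as a short NumPy sketch (shapes as before; the assignment of probabilities vs. binary samples follows the annotations above and is an assumption where the slide is ambiguous):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(V0, W, bias_v, bias_h, lr=0.01, rng=np.random.default_rng(0)):
    """One CD-1 update of weights and biases on a mini-batch V0 of shape (N, n_visible)."""
    N = V0.shape[0]
    ph0 = sigmoid(bias_h + V0 @ W)                      # P(h = 1 | v^(0))
    H0 = (rng.random(ph0.shape) < ph0).astype(float)    # h^(0): binary states
    V1 = sigmoid(bias_v + H0 @ W.T)                     # v^(1): reconstruction probabilities
    ph1 = sigmoid(bias_h + V1 @ W)                      # h^(1): probabilities (final hidden update)
    W = W + lr * (V0.T @ H0 - V1.T @ ph1) / N           # v^(0) h^(0) - v^(1) h^(1)
    bias_v = bias_v + lr * (V0 - V1).mean(axis=0)       # Δbias^(v) = v^(0) - v^(1)
    bias_h = bias_h + lr * (H0 - ph1).mean(axis=0)      # Δbias^(h) = h^(0) - h^(1)
    return W, bias_v, bias_h
```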




Deep belief nets

Greedy layer-wise training approach with the use of RBMs.

[Diagram (Salakhutdinov, 2015): a stack of layers v - W(1) - h(1) - W(2) - h(2) - W(3) - h(3)]


[Diagram (Salakhutdinov, 2015): the top two layers, h(2)-h(3), form the undirected part of the network (the bipartite graph of an RBM); the layers below, v-h(1)-h(2), form the directed part of the network (a sigmoid belief network).]




Deep belief nets

Approach 1: bottom-up pass by stochastically activating higher layers in time.




Hinton et al.'s (2006) architecture
Building the stack of RBMs

[Diagram: RBM1 - the 28x28-pixel input image as the visible layer with 500 hidden units on top]


[Diagram: RBM2 - the 500 hidden units of RBM1 form its visible layer, with another 500 hidden units above; the 28x28-pixel image remains at the bottom of the stack]

The visible layer of RBM2 is treated as probabilities (just like v(0) in CD).
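A compact sketch of the greedy layer-wise stacking (illustrative only: mini-batching and stopping criteria are omitted, cd1_update refers to the CD1 sketch given earlier, and the layer sizes in the example call are hypothetical):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_dbn_greedy(X, hidden_sizes, epochs=10, rng=np.random.default_rng(0)):
    """Greedy layer-wise pretraining: one RBM per layer, trained bottom-up with CD1."""
    rbms, data = [], X
    n_vis = X.shape[1]
    for n_hid in hidden_sizes:
        W = 0.01 * rng.standard_normal((n_vis, n_hid))
        bias_v, bias_h = np.zeros(n_vis), np.zeros(n_hid)
        for _ in range(epochs):
            W, bias_v, bias_h = cd1_update(data, W, bias_v, bias_h, rng=rng)
        rbms.append((W, bias_v, bias_h))
        # the hidden probabilities of this RBM become the "visible" data of the next RBM
        data = sigmoid(bias_h + data @ W)
        n_vis = n_hid
    return rbms

# e.g. rbms = train_dbn_greedy(images, hidden_sizes=[500, 500])  # hypothetical call
```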


[Diagram: RBM3 - the top-level RBM with 2000 hidden units; its visible layer is formed by the 500 units of the layer below together with 10 label units (soft-max)]


The full network is used to model the joint distribution of digit images and labels.

[Diagram: 28x28-pixel image - 500 units - 500 units (+ 10 soft-max label units) - 2000 top-level units]

Once the top layer is added, the connections between the layers below (now hidden) get decoupled and become unidirectional (directed).


Hinton et al.'s (2006) architecture
Pretraining with labels once the stack of RBMs has been built

[Diagram: top RBM = 2000 top-level units <-> (500 units + 10 soft-max label units); below it, another 500-unit layer and the 28x28-pixel image]

1. Clamp the label unit corresponding to the input digit image (or rather its probabilistic representation).

2. Run Gibbs sampling for CD learning (CD1) of the weights in the top RBM, 500+10 <-> 2000.
   * Label units are clamped (forced / set) only in the first iteration; in subsequent Gibbs iterations the soft-max is used.
   * The 500-unit layer is a probabilistic representation (consistent with the notion of CD1 learning of an RBM).

3. Test recognition by Gibbs sampling (20 iterations) with soft-max for the 10 label units: initialise the 10 label units uniformly with 0.1 and use binary sample representations for the 500 units.
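A rough sketch of this recognition test (a hypothetical helper, not from the slides: it assumes the top RBM's weight matrix W_top has shape (510, 2000) with the last 10 visible rows being the label units, that pen_probs holds the 500-unit probabilities obtained by propagating the image up, and that those 500 units stay fixed to their binary samples during the Gibbs iterations):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def recognise(pen_probs, W_top, bias_v, bias_h, n_iters=20, rng=np.random.default_rng(0)):
    """Infer the digit label by Gibbs sampling in the top RBM with a soft-max over the label units."""
    pen = (rng.random(pen_probs.shape) < pen_probs).astype(float)  # binary samples for the 500 units
    labels = np.full(10, 0.1)                                      # label units initialised uniformly
    for _ in range(n_iters):
        v = np.concatenate([pen, labels])                          # visible layer = 500 units + 10 labels
        ph = sigmoid(bias_h + v @ W_top)
        h = (rng.random(ph.shape) < ph).astype(float)              # 2000 top-level units, binary
        # only the label part of the visible layer is updated, with a soft-max
        labels = softmax(bias_v[-10:] + h @ W_top[-10:, :].T)
    return int(np.argmax(labels))
```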

Approximate sampling from the DBN

[Diagram: a Gibbs sampling chain is run in the RBM part (the undirected part of the graph), followed by single-run sampling down through the directed part of the graph]




Hinton et al.'s (2006) architecture
Generating samples

[Diagram: top RBM = 2000 top-level units <-> (500 units + 10 label units); below it, another 500-unit layer and the 28x28-pixel image, connected by generative weights]

1. Keep the label unit corresponding to the requested digit label clamped (fixed).

2. Run Gibbs sampling for 200 iterations in the top RBM (500+10 <-> 2000) to converge:
   • 2000 units: binary states
   • 10 label units: clamped in all iterations
   • 500 units: binary samples
   The 500-unit layer can be initialised with either a random sample (binomial distribution), a sample from the biases, or a sample drawn from the distribution obtained by propagating a random image all the way up from the input.
3. Generate binary samples from the probabilities and propagate them down through the generative weights to the input layer, where the probabilities can again be viewed as images.
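A rough sketch of this generative procedure (hypothetical shapes: W_top is (510, 2000) with the last 10 visible rows as label units; gen_layers lists the top-down generative weight matrices and biases from the 500-unit layer down to the image; the random binary initialisation of the 500-unit layer is just one of the options mentioned above):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def generate_digit(label, W_top, bias_v, bias_h, gen_layers, n_iters=200,
                   rng=np.random.default_rng(0)):
    """Generate a digit image of the requested class from the trained DBN."""
    labels = np.zeros(10)
    labels[label] = 1.0                                    # 1. keep the requested label clamped
    pen = (rng.random(500) < 0.5).astype(float)            # 500-unit layer: random binary initialisation
    for _ in range(n_iters):                               # 2. Gibbs sampling in the top RBM
        v = np.concatenate([pen, labels])
        ph = sigmoid(bias_h + v @ W_top)
        h = (rng.random(ph.shape) < ph).astype(float)      # 2000 units: binary states
        pv = sigmoid(bias_v + h @ W_top.T)
        pen = (rng.random(500) < pv[:500]).astype(float)   # 500 units: binary samples; labels stay clamped
    x, p = pen, pen
    for W_gen, bias in gen_layers:                         # 3. propagate down through generative weights
        p = sigmoid(bias + x @ W_gen)                      # probabilities at each layer
        x = (rng.random(p.shape) < p).astype(float)        # binary samples for the next step down
    return p                                               # image-layer probabilities, viewable as an image
```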


Hinton et al.'s (2006) architecture
Fine-tuning with a contrastive wake-sleep algorithm

[Diagram: top RBM = 2000 top-level units <-> (500 units + 10 soft-max label units); below it, another 500-unit layer and the 28x28-pixel image, connected by recognition (bottom-up) and generative (top-down) weights]

Bottom-up wake phase
1. Drive the network bottom-up by providing input digit images and, using the recognition weights, propagate binary samples up to the visible layer of the top RBM.

2. Input a label unit corresponding to the digit.
3. Run Gibbs sampling for 10-20 iterations in the top RBM (500+10 <-> 2000).

Top-down sleep phase
4. Propagate the activity down using the generative weights (binary sampling all the way) to the input layer, represented with probabilities.

Learning that results from the wake phase (based on network activities sampled during the wake phase):

\Delta w_{ji} = x_i (y_j - \hat y_j)

where x_i is the sampled state of the unit in the layer above, y_j the sampled state of the unit below it, and \hat y_j the top-down prediction (probability) for that unit.


CDk learning of the top RBM (the labels are not clamped here; the soft-max is used):

\Delta w_{j,i} = v_j^{(0)} h_i^{(0)} - v_j^{(k)} h_i^{(k)}

with binary states throughout (probabilities could be used too).


Learning that results from the sleep phase (based on network activities sampled during the sleep phase):

\Delta w_{ij} = x_j (y_i - \hat y_i)

where x_j is the sampled state of the unit in the layer below, y_i the sampled state of the unit above it, and \hat y_i the bottom-up prediction (probability) for that unit.
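A minimal sketch of the two local delta rules (x, y, and the predictions are NumPy vectors; attributing the wake-phase rule to the generative weights and the sleep-phase rule to the recognition weights follows the standard wake-sleep algorithm and is stated here as an assumption, since the slides only show the local rules):

```python
import numpy as np

def wake_phase_update(x_above, y_below, y_pred_below, W_gen, lr=0.01):
    """Wake phase: Δw_ji = x_i (y_j - ŷ_j), applied to the top-down (generative) weights."""
    return W_gen + lr * np.outer(x_above, y_below - y_pred_below)

def sleep_phase_update(x_below, y_above, y_pred_above, W_rec, lr=0.01):
    """Sleep phase: Δw_ij = x_j (y_i - ŷ_i), applied to the bottom-up (recognition) weights."""
    return W_rec + lr * np.outer(x_below, y_above - y_pred_above)
```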
