Machine Learning: Chapter 4. Artificial Neural Networks
Tom M. Mitchell
Artificial Neural Networks
Threshold units
Gradient descent
Multilayer networks
Backpropagation
Hidden layer representations
Example: Face Recognition
Advanced topics
Connectionist Models (1/2)
Consider humans:
Neuron switching time ~ 0.001 second
Number of neurons ~ 10^10
Connections per neuron ~ 10^4 to 10^5
Scene recognition time ~ 0.1 second
100 inference steps (0.1 s ÷ 0.001 s per step) doesn’t seem like enough
→ much parallel computation
Connectionist Models (2/2)
Properties of artificial neural nets (ANNs):
Many neuron-like threshold switching units
Many weighted interconnections among units
Highly parallel, distributed process
Emphasis on tuning weights automatically
When to Consider Neural Networks
Input is high-dimensional discrete or real-valued
(e.g. raw sensor input)
Output is discrete or real-valued
Output is a vector of values
Possibly noisy data
Form of target function is unknown
Human readability of result is unimportant
Examples:
Speech phoneme recognition [Waibel]
Image classification [Kanade, Baluja, Rowley]
Financial prediction
ALVINN drives 70 mph on highways
Perceptron
o(x1, ..., xn) = 1 if w0 + w1 x1 + ... + wn xn > 0, and -1 otherwise
Sometimes written as o(x) = sgn(w · x), where sgn(y) = 1 if y > 0 and -1 otherwise
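A minimal runnable sketch of this threshold rule (not from the slides; the function name and example weights are illustrative):

```python
# Sketch of a perceptron threshold unit; names and weights are illustrative.
def perceptron_output(w, x):
    """Return 1 if w0 + w1*x1 + ... + wn*xn > 0, else -1.
    w holds n+1 weights (w[0] is the bias weight w0); x holds n inputs."""
    net = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
    return 1 if net > 0 else -1

# Example: weights that implement Boolean AND over inputs in {0, 1}
w_and = [-1.5, 1.0, 1.0]
print(perceptron_output(w_and, [1, 1]))   # 1
print(perceptron_output(w_and, [1, 0]))   # -1
```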
Decision Surface of a Perceptron
Gradient Descent (2/4)
Gradient:
∇E[w] ≡ [∂E/∂w0, ∂E/∂w1, ..., ∂E/∂wn]
Training rule:
Δw = -η ∇E[w]
i.e.,
Δwi = -η ∂E/∂wi
where E[w] ≡ 1/2 Σd∈D (td - od)^2
Gradient Descent (3/4)
Gradient Descent (4/4)
GRADIENT-DESCENT(training_examples, η)
Initialize each wi to some small random value
Until the termination condition is met, Do
– Initialize each Δwi to zero.
– For each <x, t> in training_examples, Do
  * Input the instance x to the unit and compute the output o
  * For each linear unit weight wi, Do
    Δwi ← Δwi + η (t – o) xi
– For each linear unit weight wi, Do
  wi ← wi + Δwi
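A runnable sketch of this batch procedure for a single linear unit; the function name, learning rate, epoch count, and the toy data below are illustrative assumptions, not part of the slides.

```python
import random

def gradient_descent(training_examples, eta=0.05, epochs=1000):
    """Batch gradient descent for a linear unit o = w0 + w1*x1 + ... + wn*xn.
    training_examples is a list of (x, t) pairs, x a list of n inputs."""
    n = len(training_examples[0][0])
    w = [random.uniform(-0.05, 0.05) for _ in range(n + 1)]  # small random weights
    for _ in range(epochs):                    # termination condition: fixed epoch count
        delta_w = [0.0] * (n + 1)              # initialize each Δwi to zero
        for x, t in training_examples:
            o = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))  # linear unit output
            delta_w[0] += eta * (t - o)        # bias weight w0 uses x0 = 1
            for i, xi in enumerate(x, start=1):
                delta_w[i] += eta * (t - o) * xi
        w = [wi + dwi for wi, dwi in zip(w, delta_w)]  # apply the accumulated update
    return w

# Example: fit t = 2*x1 + 1 from a few noiseless points; prints roughly [1.0, 2.0]
print(gradient_descent([([0.0], 1.0), ([1.0], 3.0), ([2.0], 5.0)]))
```

The termination condition here is simply a fixed number of epochs; the slide leaves it open.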
Summary
Perceptron training rule guaranteed to succeed if
Training examples are linearly separable
Sufficiently small learning rate
Incremental (Stochastic) Gradient Descent (2/2)
Incremental gradient descent can approximate batch gradient descent arbitrarily closely if η is made small enough.
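A minimal sketch of the incremental variant, again for a single linear unit (illustrative names and constants); the only change from the batch sketch above is that the weights are updated immediately after each example:

```python
import random

def stochastic_gradient_descent(training_examples, eta=0.05, epochs=1000):
    """Incremental (stochastic) gradient descent for a linear unit:
    apply the weight update right after each training example."""
    n = len(training_examples[0][0])
    w = [random.uniform(-0.05, 0.05) for _ in range(n + 1)]
    for _ in range(epochs):
        for x, t in training_examples:
            o = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
            w[0] += eta * (t - o)                      # per-example update, x0 = 1
            for i, xi in enumerate(x, start=1):
                w[i] += eta * (t - o) * xi
    return w
```

Updating per example also adds a little noise to the search, which can sometimes help avoid shallow local minima once we move to multilayer networks.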
Multilayer Networks of Sigmoid Units
Sigmoid Unit
σ(net) = 1 / (1 + e^(-net))
Nice property: dσ(x)/dx = σ(x) (1 - σ(x))
We can derive gradient descent rules to train
One sigmoid unit
Multilayer networks of sigmoid units → Backpropagation
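A tiny sketch of the sigmoid and a numerical check of the “nice property” above (function names are illustrative):

```python
import math

def sigmoid(net):
    """sigma(net) = 1 / (1 + e^(-net))"""
    return 1.0 / (1.0 + math.exp(-net))

def sigmoid_prime(net):
    """d sigma / d net = sigma(net) * (1 - sigma(net))"""
    s = sigmoid(net)
    return s * (1.0 - s)

# Compare the closed form against a finite-difference estimate at net = 0.3
eps = 1e-6
print(sigmoid_prime(0.3))
print((sigmoid(0.3 + eps) - sigmoid(0.3 - eps)) / (2 * eps))  # ~ the same value
```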
Error Gradient for a Sigmoid Unit
∂E/∂wi = ∂/∂wi 1/2 Σd (td - od)^2 = -Σd (td - od) ∂od/∂wi
But we know:
∂od/∂netd = od (1 - od) and ∂netd/∂wi = xi,d
So:
∂E/∂wi = -Σd (td - od) od (1 - od) xi,d
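A small numerical sanity check of that final formula for one sigmoid unit; the data X, targets t, and weights w below are made up purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))          # five examples d, three inputs each
t = rng.uniform(size=5)              # targets t_d
w = rng.normal(size=3)               # weights w_i (no bias, for brevity)

def outputs(w):
    return 1.0 / (1.0 + np.exp(-X @ w))          # o_d = sigma(net_d)

def error(w):
    return 0.5 * np.sum((t - outputs(w)) ** 2)   # E = 1/2 sum_d (t_d - o_d)^2

o = outputs(w)
analytic = -np.sum(((t - o) * o * (1 - o))[:, None] * X, axis=0)  # formula above

eps = 1e-6
numeric = np.array([(error(w + eps * e) - error(w - eps * e)) / (2 * eps)
                    for e in np.eye(3)])
print(analytic)
print(numeric)   # should agree to several decimal places
```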
Backpropagation Algorithm
Initialize all weights to small random numbers. Until satisfied, Do
For each training example, Do
1. Input the training example to the network and compute the network outputs
2. For each output unit k: δk ← ok (1 - ok) (tk - ok)
3. For each hidden unit h: δh ← oh (1 - oh) Σk∈outputs wh,k δk
4. Update each network weight wi,j: wi,j ← wi,j + Δwi,j
   where Δwi,j = η δj xi,j
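A compact runnable sketch of this procedure for a single hidden layer of sigmoid units, following the per-example updates above; all names (backprop_train, n_hidden, the default η and epoch count) are illustrative assumptions.

```python
import numpy as np

def backprop_train(X, T, n_hidden=3, eta=0.3, epochs=5000, seed=0):
    """Train a two-layer network of sigmoid units with backpropagation.
    X: (n_examples, n_in) inputs; T: (n_examples, n_out) targets in [0, 1]."""
    rng = np.random.default_rng(seed)
    n_in, n_out = X.shape[1], T.shape[1]
    # Weight matrices include a bias row; initialized to small random values.
    W_hid = rng.uniform(-0.05, 0.05, size=(n_in + 1, n_hidden))
    W_out = rng.uniform(-0.05, 0.05, size=(n_hidden + 1, n_out))

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    for _ in range(epochs):                            # "until satisfied": fixed epoch count
        for x, t in zip(X, T):                         # one training example at a time
            x1 = np.append(x, 1.0)                     # append bias input x0 = 1
            h = sigmoid(x1 @ W_hid)                    # hidden unit outputs
            h1 = np.append(h, 1.0)
            o = sigmoid(h1 @ W_out)                    # output unit outputs
            delta_o = o * (1 - o) * (t - o)            # step 2: output-unit deltas
            delta_h = h * (1 - h) * (W_out[:-1] @ delta_o)  # step 3: hidden-unit deltas
            W_out += eta * np.outer(h1, delta_o)       # step 4: weight updates
            W_hid += eta * np.outer(x1, delta_h)
    return W_hid, W_out
```

For example, backprop_train(np.eye(8), np.eye(8), n_hidden=3) should learn the 8 × 3 × 8 identity task discussed under “Learning Hidden Layer Representations” below.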
More on Backpropagation
Gradient descent over entire network weight vector
Easily generalized to arbitrary directed graphs
Will find a local, not necessarily global error minimum
– In practice, often works well (can run multiple times)
Often include weight momentum α (see the sketch after this list)
Δwi,j(n) = η δj xi,j + α Δwi,j(n - 1)
Minimizes error over training examples
– Will it generalize well to subsequent examples?
Training can take thousands of iterations → slow!
Using network after training is very fast
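A minimal standalone sketch of the momentum rule above; eta, alpha, and the function name are illustrative.

```python
import numpy as np

eta, alpha = 0.3, 0.9                       # learning rate and momentum constant (illustrative)

def momentum_update(W, prev_dW, grad_term):
    """One weight update with momentum: grad_term plays the role of δj * xi,j.
    Returns the new weights and the Δw to remember for the next iteration."""
    dW = eta * grad_term + alpha * prev_dW  # Δw(n) = η δ x + α Δw(n - 1)
    return W + dW, dW

W = np.zeros((4, 2))                        # some weight matrix wi,j
prev_dW = np.zeros_like(W)                  # Δwi,j(n - 1), initially zero
```

The momentum term keeps the weight change moving in the same direction across iterations, which can carry the search through small local minima and speed progress along flat regions of the error surface.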
Learning Hidden Layer Representations (1/2)
A target function: the identity function over eight one-hot inputs (10000000 → 10000000, 01000000 → 01000000, ...), to be learned by an 8 × 3 × 8 network.
Can this be learned??
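The training set is easy to write down; a short sketch (backprop_train refers to the illustrative backpropagation sketch given earlier, not to anything in the original slides):

```python
import numpy as np

X = np.eye(8)     # the eight inputs: 10000000, 01000000, ..., 00000001
T = X.copy()      # the target output for each input is the input itself

# With only 3 hidden units the network is forced to invent a compact
# (roughly binary) encoding of the 8 inputs at the hidden layer.
# W_hid, W_out = backprop_train(X, T, n_hidden=3)   # earlier illustrative sketch
```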
Training (1/3)
Training (2/3)
Training (3/3)
Convergence of Backpropagation
Gradient descent to some local minimum
Perhaps not global minimum...
Add momentum
Stochastic gradient descent
Train multiple nets with different initial weights
Nature of convergence
Initialize weights near zero
Therefore, initial networks near-linear
Increasingly non-linear functions possible as training progresses
Expressive Capabilities of ANNs
Boolean functions:
Every Boolean function can be represented by a network with a single hidden layer,
but might require a number of hidden units exponential in the number of inputs
Continuous functions:
Every bounded continuous function can be approximated with arbitrarily small error by a network with one hidden layer [Cybenko 1989; Hornik et al. 1989]
Any function can be approximated to arbitrary accuracy by a network with two hidden layers [Cybenko 1988]
Overfitting in ANNs (1/2)
Overfitting in ANNs (2/2)
Neural Nets for Face Recognition
http://www.cs.cmu.edu/~tom/faces.html
Alternative Error Functions
Penalize large weights:
E(w) ≡ 1/2 Σd∈D Σk∈outputs (tkd - okd)^2 + γ Σi,j wji^2
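A sketch of how this penalty changes the gradient step: the derivative of γ Σ wji^2 with respect to a weight is 2γ wji, so each update also shrinks the weights toward zero (eta, gamma, and the function name are illustrative).

```python
import numpy as np

eta, gamma = 0.3, 1e-4        # learning rate and weight-decay strength (illustrative)

def decayed_update(W, grad_W):
    """Gradient step on E plus the gamma * sum(w^2) penalty.
    grad_W is the gradient of the unpenalized squared-error term."""
    return W - eta * (grad_W + 2.0 * gamma * W)   # extra term pulls weights toward 0
```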
Recurrent Networks
[Figure: (a) a feedforward network; (b) a recurrent network; (c) the recurrent network unfolded in time]