
Machine Learning
Chapter 4. Artificial Neural Networks

Tom M. Mitchell
Artificial Neural Networks
• Threshold units
• Gradient descent
• Multilayer networks
• Backpropagation
• Hidden layer representations
• Example: Face Recognition
• Advanced topics
2
Connectionist Models (1/2)
Consider humans:
• Neuron switching time ~ .001 second
• Number of neurons ~ 10^10
• Connections per neuron ~ 10^4 to 10^5
• Scene recognition time ~ .1 second
• 100 inference steps doesn't seem like enough
→ much parallel computation

3
Connectionist Models (2/2)
Properties of artificial neural nets (ANNs):
• Many neuron-like threshold switching units
• Many weighted interconnections among units
• Highly parallel, distributed process
• Emphasis on tuning weights automatically

4
When to Consider Neural Networks
• Input is high-dimensional discrete or real-valued (e.g., raw sensor input)
• Output is discrete or real-valued
• Output is a vector of values
• Possibly noisy data
• Form of target function is unknown
• Human readability of result is unimportant
Examples:
• Speech phoneme recognition [Waibel]
• Image classification [Kanade, Baluja, Rowley]
• Financial prediction

5
ALVINN drives 70 mph on highways

6
Perceptron

o(x1, …, xn) = 1 if w0 + w1 x1 + ⋯ + wn xn > 0, and −1 otherwise

Sometimes we'll use simpler vector notation:

o(x) = sgn(w · x), with x0 = 1, where sgn(y) = 1 if y > 0 and −1 otherwise
7
Decision Surface of a Perceptron

Represents some useful functions
• What weights represent g(x1, x2) = AND(x1, x2)?

But some functions are not representable
• e.g., those not linearly separable
• Therefore, we'll want networks of these...
8
Perceptron training rule
wi ← wi + Δwi
where Δwi = η (t − o) xi

Where:
• t = c(x) is target value
• o is perceptron output
• η is small constant (e.g., .1) called learning rate

Can prove it will converge
• If training data is linearly separable
• and η sufficiently small
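A minimal sketch of this rule in Python/NumPy (not from the slides; the function name, the x0 = 1 bias handling, and the toy AND data are illustrative):

```python
import numpy as np

def train_perceptron(X, t, eta=0.1, epochs=100):
    """Perceptron training rule: wi <- wi + eta * (t - o) * xi."""
    X = np.hstack([np.ones((X.shape[0], 1)), X])  # prepend x0 = 1 so w[0] acts as the threshold weight
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x, target in zip(X, t):
            o = 1 if w @ x > 0 else -1            # thresholded output in {-1, +1}
            w += eta * (target - o) * x           # no change when o == target
    return w

# Toy example: AND of two inputs, targets in {-1, +1}
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
t = np.array([-1, -1, -1, 1])
w = train_perceptron(X, t)
```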
9
Gradient Descent (1/4)
• To understand, consider a simpler linear unit, where
  o = w0 + w1x1 + ··· + wnxn
• Let's learn the wi's that minimize the squared error
  E[w] ≡ ½ Σd∈D (td − od)²
• where D is the set of training examples

10
Gradient Descent (2/4)

Gradient:
∇E[w] ≡ [∂E/∂w0, ∂E/∂w1, …, ∂E/∂wn]

Training rule:
Δw = −η ∇E[w]

i.e.,
Δwi = −η ∂E/∂wi

11
Gradient Descent (3/4)

∂E/∂wi = ∂/∂wi ½ Σd (td − od)²
       = Σd (td − od) · ∂/∂wi (td − w · xd)
       = −Σd (td − od) xi,d
12
Gradient Descent (4/4)
• Initialize each wi to some small random value
• Until the termination condition is met, Do
  – Initialize each Δwi to zero
  – For each <x, t> in training_examples, Do
    * Input the instance x to the unit and compute the output o
    * For each linear unit weight wi, Do
      Δwi ← Δwi + η (t − o) xi
  – For each linear unit weight wi, Do
    wi ← wi + Δwi
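A hedged Python/NumPy sketch of the batch procedure above; the variable names and the small-random-weight range are illustrative assumptions:

```python
import numpy as np

def train_linear_unit(X, t, eta=0.01, epochs=500):
    """Batch gradient descent (delta rule) for a linear unit o = w . x."""
    X = np.hstack([np.ones((X.shape[0], 1)), X])    # x0 = 1 for the bias weight w0
    w = np.random.uniform(-0.05, 0.05, X.shape[1])  # small random initial values
    for _ in range(epochs):
        delta_w = np.zeros_like(w)
        for x, target in zip(X, t):
            o = w @ x                               # linear unit output
            delta_w += eta * (target - o) * x       # accumulate eta * (t - o) * xi over D
        w += delta_w                                # one update per pass through the data
    return w
```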
13
Summary
Perceptron training rule guaranteed to succeed if
• Training examples are linearly separable
• Sufficiently small learning rate η

Linear unit training rule uses gradient descent
• Guaranteed to converge to hypothesis with minimum squared error
• Given sufficiently small learning rate η
• Even when training data contains noise
• Even when training data not separable by H
14
Incremental (Stochastic) Gradient Descent (1/2)

Batch mode Gradient Descent:
Do until satisfied
1. Compute the gradient ∇ED[w]
2. w ← w − η ∇ED[w]

Incremental mode Gradient Descent:
Do until satisfied
• For each training example d in D
  1. Compute the gradient ∇Ed[w]
  2. w ← w − η ∇Ed[w]

where ED[w] ≡ ½ Σd∈D (td − od)² and Ed[w] ≡ ½ (td − od)²
15
Incremental (Stochastic) Gradient Descent (2/2)

Incremental Gradient Descent can approximate Batch Gradient Descent arbitrarily closely if η is made small enough
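For comparison, a sketch of the incremental (stochastic) variant, where the weights move after every example rather than once per pass over D (again illustrative, not from the slides):

```python
import numpy as np

def train_linear_unit_sgd(X, t, eta=0.01, epochs=500):
    """Incremental (stochastic) gradient descent for a linear unit."""
    X = np.hstack([np.ones((X.shape[0], 1)), X])    # x0 = 1 for the bias weight
    w = np.random.uniform(-0.05, 0.05, X.shape[1])
    for _ in range(epochs):
        for x, target in zip(X, t):
            o = w @ x
            w += eta * (target - o) * x   # step along -grad Ed[w] for this single example
    return w
```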

16
Multilayer Networks of Sigmoid Units

17
Sigmoid Unit

σ(x) is the sigmoid function: σ(x) = 1 / (1 + e^−x)

Nice property: dσ(x)/dx = σ(x) (1 − σ(x))
We can derive gradient descent rules to train
• One sigmoid unit
• Multilayer networks of sigmoid units → Backpropagation
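A short Python sketch of the sigmoid and its "nice property" (helper names are illustrative):

```python
import numpy as np

def sigmoid(x):
    """sigma(x) = 1 / (1 + e^-x)."""
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_deriv(output):
    """The 'nice property': d sigma(x)/dx = sigma(x) * (1 - sigma(x)),
    written in terms of the unit's output sigma(x)."""
    return output * (1.0 - output)
```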
18
Error Gradient for a Sigmoid Unit

∂E/∂wi = ∂/∂wi ½ Σd (td − od)² = Σd (td − od) (−∂od/∂wi)

But we know:
∂od/∂netd = od (1 − od)   and   ∂netd/∂wi = xi,d

So:
∂E/∂wi = −Σd (td − od) od (1 − od) xi,d
19
Backpropagation Algorithm
Initialize all weights to small random numbers. Until satisfied, Do
• For each training example, Do
  1. Input the training example to the network and compute the network outputs
  2. For each output unit k:
     δk ← ok (1 − ok) (tk − ok)
  3. For each hidden unit h:
     δh ← oh (1 − oh) Σk∈outputs wh,k δk
  4. Update each network weight wi,j:
     wi,j ← wi,j + Δwi,j,  where Δwi,j = η δj xi,j

20
More on Backpropagation
• Gradient descent over entire network weight vector
• Easily generalized to arbitrary directed graphs
• Will find a local, not necessarily global, error minimum
  – In practice, often works well (can run multiple times)
• Often include weight momentum α:
  Δwi,j(n) = η δj xi,j + α Δwi,j(n − 1)
• Minimizes error over training examples
  – Will it generalize well to subsequent examples?
• Training can take thousands of iterations → slow!
• Using network after training is very fast
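A small illustrative helper for the momentum update Δw(n) = η δj xi,j + α Δw(n − 1); the function name and the default α are assumptions, not from the slides:

```python
import numpy as np

def momentum_step(w, grad_step, prev_delta, alpha=0.9):
    """Weight momentum: Delta_w(n) = grad_step + alpha * Delta_w(n-1),
    where grad_step plays the role of eta * delta_j * x_ij."""
    delta = grad_step + alpha * prev_delta
    return w + delta, delta          # return the new delta to reuse at step n+1

# Usage inside a training loop: start with prev_delta = np.zeros_like(w), then
#   w, prev_delta = momentum_step(w, eta * d * x, prev_delta)
```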

21
Learning Hidden Layer Representations (1/2)

A target function (the 8-3-8 identity mapping: each one-hot input should be reproduced at the output):

10000000 → 10000000
01000000 → 01000000
00100000 → 00100000
00010000 → 00010000
00001000 → 00001000
00000100 → 00000100
00000010 → 00000010
00000001 → 00000001

Can this be learned??
22
Learning Hidden Layer Representations (2/2)

A network with 8 inputs, 3 hidden units, and 8 outputs learns the mapping; the learned hidden layer representation assigns each of the eight inputs a distinct, roughly binary 3-bit code.
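As a usage sketch (assuming the backprop function from the earlier sketch is in scope), the eight one-hot patterns can be fed as both inputs and targets; with three hidden units the network is pushed toward a compact, roughly binary hidden code:

```python
import numpy as np

X = np.eye(8)        # the eight one-hot input patterns
T = X.copy()         # identity target function: reproduce the input at the output
W_h, W_o = backprop(X, T, n_hidden=3, eta=0.3, epochs=5000)

# Inspect the hidden code learned for each input pattern
def hidden_code(x):
    return 1.0 / (1.0 + np.exp(-(W_h @ np.append(x, 1.0))))

for x in X:
    print(np.round(hidden_code(x), 2))   # tends toward eight distinct, roughly binary codes
```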

23
Training (1/3)

24
Training (2/3)

25
Training (3/3)

26
Convergence of Backpropagation
Gradient descent to some local minimum
• Perhaps not global minimum...
• Add momentum
• Stochastic gradient descent
• Train multiple nets with different initial weights

Nature of convergence
• Initialize weights near zero
• Therefore, initial networks near-linear
• Increasingly non-linear functions possible as training progresses
27
Expressive Capabilities of ANNs
Boolean functions:
• Every boolean function can be represented by a network with a single hidden layer
• but it might require a number of hidden units exponential in the number of inputs
Continuous functions:
• Every bounded continuous function can be approximated with arbitrarily small error by a network with one hidden layer [Cybenko 1989; Hornik et al. 1989]
• Any function can be approximated to arbitrary accuracy by a network with two hidden layers [Cybenko 1988]

28
Overfitting in ANNs (1/2)

29
Overfitting in ANNs (2/2)

30
Neural Nets for Face Recognition

• 90% accurate learning head pose, and recognizing 1-of-20 faces
31
Learned Hidden Unit Weights

http://www.cs.cmu.edu/tom/faces.html

32
Alternative Error Functions
Penalize large weights (weight decay):
E(w) ≡ ½ Σd∈D Σk∈outputs (tkd − okd)² + γ Σi,j wji²

Train on target slopes as well as values:

Tie together weights:
• e.g., in phoneme recognition network
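A small illustrative helper showing how the γ Σ w² penalty changes a gradient step (it adds −2 η γ w to each weight update); the function name and constants are assumptions:

```python
import numpy as np

def weight_decay_step(W, grad_step, eta=0.05, gamma=1e-4):
    """Gradient step when the error includes a gamma * sum(w^2) penalty:
    the penalty contributes an extra -2 * eta * gamma * w to each update."""
    return W + grad_step - 2.0 * eta * gamma * W
```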

33
Recurrent Networks
(a) Feedforward network
(b) Recurrent network
(c) Recurrent network unfolded in time

34
