
11-Nonlinear Models (Neural Networks)

Linear models
In previous topics, we mainly dealt with linear models
• Regression: $h(\boldsymbol{x}) = \boldsymbol{w}^\top \boldsymbol{x} + b$

• Classification:
$h(\boldsymbol{x}) = \arg\max_i \, \boldsymbol{w}_i^\top \boldsymbol{x} + b_i$

Category $i$ is preferred to category $j$ iff


$\boldsymbol{w}_i^\top \boldsymbol{x} + b_i \ge \boldsymbol{w}_j^\top \boldsymbol{x} + b_j$, i.e., $(\boldsymbol{w}_i - \boldsymbol{w}_j)^\top \boldsymbol{x} + (b_i - b_j) \ge 0$
Binary classification with logistic regression is a special case of the above.

Nonlinear models
• Non-linear features $\boldsymbol{\phi}(\boldsymbol{x})$
○ E.g., Gaussian discriminant analysis with different covariance
matrices, in which case we obtain quadratic features of $\boldsymbol{x}$.
• Non-linear kernel $k(\boldsymbol{x}_i, \boldsymbol{x}_j)$
○ A kernel is the inner product of two data samples after they are
transformed into a certain vector space. The vector space could be very
high-dimensional (possibly with infinitely many dimensions). A linear
classifier in such a high-dimensional space can be non-linear in the
original low-dimensional space (see the sketch after this list).
• Learnable non-linear mapping
○ We can probably stack a few layers of learnable non-linear functions
(e.g., logistic functions) to learn the non-linear feature $\boldsymbol{\phi}(\boldsymbol{x})$ or a
non-linear kernel that is appropriate to the task at hand.
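To make the feature/kernel connection above concrete, here is a minimal NumPy sketch (an illustration, not from the notes): for 2-D inputs, the quadratic polynomial kernel $k(\boldsymbol{x}, \boldsymbol{z}) = (\boldsymbol{x}^\top \boldsymbol{z})^2$ equals the inner product of explicit quadratic features $\boldsymbol{\phi}(\boldsymbol{x}) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$.

```python
import numpy as np

def phi(x):
    """Explicit quadratic feature map for a 2-D input x."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def k(x, z):
    """Quadratic polynomial kernel: the inner product in feature space,
    computed without ever forming the feature vectors."""
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
print(np.dot(phi(x), phi(z)))  # 1.0
print(k(x, z))                 # 1.0 -- same value, cheaper to compute
```

A linear classifier on $\boldsymbol{\phi}(\boldsymbol{x})$ draws quadratic decision boundaries in the original space.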

Motivation: XOR, a nonlinear classification problem


• The problem cannot be solved by logistic regression, because the four XOR points are not linearly separable


• What if we stack multiple logistic regression classifiers?

• The XOR problem is solvable by three linear classifiers


○ One built upon the other two
○ But this is programming [HW]
§ There is some machinery that allows you to specify certain things
§ Programming means you specify these things (usually
heuristically, by human intelligence) so that they can be fed as
input to the machinery
§ The machinery accomplishes a certain task according to your
input (program).
○ Programming is very tedious and only feasible for simple tasks (a
hand-programmed solution is sketched below)
○ We want to learn the weights.
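To illustrate the "programming" view above, here is a minimal NumPy sketch with hand-chosen weights (one choice among many; the notes do not specify values): the first unit computes OR, the second computes AND, and the third, built upon the other two, fires when OR holds but AND does not, which is exactly XOR.

```python
import numpy as np

def step(z):
    """Binary thresholding activation: 1 if z >= 0, else 0."""
    return (z >= 0).astype(float)

def xor(x):
    # Hidden unit 1: x1 OR x2   (fires when x1 + x2 - 0.5 >= 0)
    h1 = step(x @ np.array([1.0, 1.0]) - 0.5)
    # Hidden unit 2: x1 AND x2  (fires when x1 + x2 - 1.5 >= 0)
    h2 = step(x @ np.array([1.0, 1.0]) - 1.5)
    # Output unit, built upon the other two: OR but not AND
    return step(1.0 * h1 - 2.0 * h2 - 0.5)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
print(xor(X))  # [0. 1. 1. 0.]
```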

• Can we learn the weights?


○ Yes, still by gradient descent.

• Can we compute the gradient?


○ Yes, it's still a differentiable function

○ Again, brute-force computation of the gradient is very tedious.


○ We need a systematic way of
§ Defining a deep architecture, and
§ Computing its gradient.

Artificial Neural Network


• A perceptron [Rosenblatt, 1958]

$z = \boldsymbol{w}^\top \boldsymbol{x} + b$
$y = f(z)$

where $f$ is a binary thresholding function

• A perceptron-like neuron, unit, or node


$z = \boldsymbol{w}^\top \boldsymbol{x} + b$
$y = f(z)$

$f$ is an activation function, e.g., sigmoid, tanh, ReLU

○ Usually we use a nonlinear activation (a sketch of the common
choices follows below)

○ A linear activation may be used for regression
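For concreteness, a minimal NumPy sketch of the three activation functions named above (the standard textbook definitions):

```python
import numpy as np

def sigmoid(z):
    # Squashes z into (0, 1); saturates for large |z|.
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    # Squashes z into (-1, 1); zero-centered, also saturating.
    return np.tanh(z)

def relu(z):
    # max(0, z): non-saturating for z > 0, cheap to compute.
    return np.maximum(0.0, z)

z = np.linspace(-3, 3, 7)
print(sigmoid(z), tanh(z), relu(z), sep="\n")
```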

• A multi-layer neural network, or a multi-layer perceptron

A common structure is layer-wise fully connected

For each node $j$ at layer $L$:

$z_j^{(L)} = \sum_i w_{ij}^{(L)} y_i^{(L-1)} + b_j^{(L)}, \qquad y_j^{(L)} = f\big(z_j^{(L)}\big)$

To simplify notation, we omit the layer index $L$ and call the output of the
current layer $y$ and the input of the current layer $x$, which is the output
of the lower layer. In the simplified notation,

$z_j = \sum_i w_{ij} x_i + b_j, \qquad y_j = f(z_j)$

Since we have multiple layers, we need a recursive algorithm that
computes the activations of all nodes automatically.

Forward propagation (FP)

○ Initialization: set the bottom layer's outputs to the input features,
$\boldsymbol{y}^{(0)} = \boldsymbol{x}$

○ Recursion: for each layer $l = 1, \dots, L$ in order, compute
$z_j^{(l)} = \sum_i w_{ij}^{(l)} y_i^{(l-1)} + b_j^{(l)}$ and $y_j^{(l)} = f\big(z_j^{(l)}\big)$

○ Termination: the top layer's output $\boldsymbol{y}^{(L)}$ is the network's prediction
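A minimal NumPy sketch of the FP recursion (the layer sizes and the use of a sigmoid activation at every layer are illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    """Forward propagation through a layer-wise fully connected network.

    weights[l] has shape (n_out, n_in); biases[l] has shape (n_out,).
    Returns the activations of every layer (needed later by BP).
    """
    ys = [x]                      # initialization: y^(0) = x
    for W, b in zip(weights, biases):
        z = W @ ys[-1] + b        # recursion: z = W y + b
        ys.append(sigmoid(z))     # y = f(z)
    return ys                     # termination: ys[-1] is the output

# Example: a 2-3-1 network with random parameters.
rng = np.random.default_rng(0)
weights = [rng.normal(size=(3, 2)), rng.normal(size=(1, 3))]
biases = [np.zeros(3), np.zeros(1)]
print(forward(np.array([1.0, 0.0]), weights, biases)[-1])
```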

Gradient of multi-layer neural networks


Main idea: if we can compute the gradient for one layer, we may use the
chain rule to compute the gradient for all layers.

Recursion on what?

We consider a local layer: the recursion is on the gradient of the loss with
respect to each layer's output. Given that gradient for the current layer,
we can compute the gradients with respect to the layer's inputs and
parameters.

Backpropagation (BP)
○ Initialization: compute the gradient of the loss $E$ with respect to
the network's output, $\partial E / \partial y_j^{(L)}$

○ Recursion: propagate the gradient backward one layer at a time with
the chain rule, $\frac{\partial E}{\partial x_i} = \sum_j \frac{\partial E}{\partial y_j} f'(z_j)\, w_{ij}$, accumulating along the way
$\frac{\partial E}{\partial w_{ij}} = \frac{\partial E}{\partial y_j} f'(z_j)\, x_i$ and $\frac{\partial E}{\partial b_j} = \frac{\partial E}{\partial y_j} f'(z_j)$

○ Termination: upon reaching the bottom layer, we have the gradient
with respect to every weight and bias
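A minimal NumPy sketch of the BP recursion, written to pair with the forward() sketch above (the sigmoid activation, for which $f'(z) = y(1-y)$, is an illustrative assumption):

```python
import numpy as np

def backward(ys, weights, dE_dy_top):
    """Backpropagation through the layers visited by forward().

    ys: activations from the forward pass, ys[0] = input.
    dE_dy_top: gradient of the loss w.r.t. the network output
               (the initialization step).
    Returns gradients w.r.t. every weight matrix and bias vector.
    """
    grads_W, grads_b = [], []
    dE_dy = dE_dy_top
    for W, y_in, y_out in zip(weights[::-1], ys[-2::-1], ys[:0:-1]):
        # For sigmoid, f'(z) = y (1 - y), so delta = dE/dy * f'(z).
        delta = dE_dy * y_out * (1.0 - y_out)
        grads_W.append(np.outer(delta, y_in))  # dE/dW = delta x^T
        grads_b.append(delta)                  # dE/db = delta
        dE_dy = W.T @ delta                    # recursion: layer below
    return grads_W[::-1], grads_b[::-1]        # termination: all gradients

# Usage with the forward() sketch above, squared-error E = 0.5*(y - t)^2:
#   ys = forward(x, weights, biases)
#   gW, gb = backward(ys, weights, dE_dy_top=ys[-1] - t)
```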

• A few more thoughts

○ Non-layerwise connection: process the nodes in topological order
(topological sort)
○ Multiple losses: BP is a linear system, so the gradients from
multiple losses simply add
○ Tied weights: the total derivative is the summation over all
occurrences of the shared weight

Auto-differentiation in general

Numerical gradient checking
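Numerical gradient checking verifies a BP implementation by comparing its analytic gradient against a centered finite-difference estimate, $\partial E / \partial \theta_i \approx \big(E(\boldsymbol{\theta} + \epsilon \boldsymbol{e}_i) - E(\boldsymbol{\theta} - \epsilon \boldsymbol{e}_i)\big) / (2\epsilon)$. A minimal sketch (step size and tolerance are conventional choices, not from the notes):

```python
import numpy as np

def numerical_gradient(f, theta, eps=1e-5):
    """Centered finite-difference estimate of df/dtheta.

    f: scalar-valued loss function of a parameter vector theta.
    Perturbs one coordinate at a time: (f(t+eps) - f(t-eps)) / (2 eps).
    """
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta)
        e[i] = eps
        grad[i] = (f(theta + e) - f(theta - e)) / (2 * eps)
    return grad

# Example check on a function with a known gradient: f(t) = sum(t^2).
theta = np.array([1.0, -2.0, 0.5])
analytic = 2 * theta  # what an analytic (BP-style) gradient would report
numeric = numerical_gradient(lambda t: np.sum(t ** 2), theta)
assert np.allclose(analytic, numeric, atol=1e-6)
```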
