Neural Networks and Deep Learning
Om Prabhu
19D170018
Undergraduate, Department of Energy Science and Engineering
Indian Institute of Technology Bombay
Last updated January 31, 2021
NOTE: This document is a brief compilation of my notes taken during the course ‘Neural Networks and Deep
Learning’. You are free to use it and my project files for your own personal use & modification. You may check
out the course and/or specialization here: deeplearning.ai.
Contents

1 Introduction
  1.1 About myself
  1.2 A little bit about deep learning
2 Overview of Deep Learning
3 Logistic Regression as a Neural Network
4 Python and Vectorization
5 Shallow Neural Networks
6 Deep Neural Networks
1 Introduction
1.1 About myself
Hello. I am Om Prabhu, currently an undergrad at the Department of Energy Science and Engineering, IIT
Bombay. If you have gone through my website (https://omprabhu31.github.io/) earlier, which is probably
where you found this document too, you will know that I am quite a bit into programming and tinkering with
code to try and do cool stuff. Additionally, I love playing video games, listening to music and engaging in a
little bit of creative writing as and when I get time. With this brief self-introduction, let us get into why I
decided to pursue this course.
2 Overview of Deep Learning
The terminology in AI is still not very well defined. For example, some people say that neural networks are a subset of deep learning, while others use the two terms almost interchangeably. Throughout these notes, we will treat deep learning as the science of building and training neural networks. With that said, let's look at what a neural network is.
All this single-neuron network does is compute the curve shown and ‘learn’ it in order to map any value of price
to the appropriate value of demand. A single neuron can be thought of as a Lego brick, and a neural network
as a very complicated stack, often in multiple layers, of such bricks.
Let’s look at a more complicated example. Suppose that instead of just the price, we have more variables
like shipping cost, marketing cost and material. Then we will have multiple factors that influence demand like
affordability, consumer awareness and perceived quality. We might then have a slightly more complicated neural
net like the one below:
This slightly more complicated neural network maps the 4 input parameters to the output, which is the demand. From the way we have discussed neural networks above, it appears as if we have to actually figure out the key intermediate factors such as affordability, awareness and perceived quality. However, things do not work this way. One of the best things about neural networks is that we only have to provide them the inputs and the outputs − all of the stuff in the middle, they figure out by themselves. The network automatically 'learns' and trains itself to find the most accurate possible function mapping the input to the output.
Another small correction: from the diagram, it seems like the first node takes in only the price and shipping cost as inputs, and so on. This is not the case. In practice, every node is fed all of the inputs, and we let the individual neurons decide how many of the inputs they want to use and how they use them.
Some common applications of supervised learning with neural networks include:

- online advertising: the input is ad & user info, using which the algorithm tries to figure out whether the user is likely to click on the ad
- computer vision: a very vast application area of AI (for example, the input is an image and the algorithm tries to classify it as one of, say, 1000 possible categories)
- speech recognition: input an audio clip and output a text transcript based on the input
- machine translation: input text or documents in one language and receive a translated output
- autonomous driving: an AI algorithm uses image & sensor info to figure out the positions of nearby cars so it can avoid them
It turns out that slightly different types of neural networks are useful for different applications. For example,
convolutional neural networks (CNN) are most commonly used for image applications while recurrent neural
networks (RNN) are better for processing sequential data such as audio. More complicated applications often require custom neural network architectures.
Finally, supervised learning can be applied to both structured data (highly organized data such as tables,
spreadsheets and databases) as well as unstructured data (data in the form of images, audio, video and even
raw text).
2.3 Why is deep learning taking off now?
The ideas for deep learning and neural networks have been around for decades. Why is it that they have taken
off only recently?
One of the major drivers of deep learning is scale. This refers to the ability to train very large neural networks on huge amounts of data, with performance improving as both grow. This can be illustrated through a graph as follows:
- for traditional learning algorithms, the data vs performance curve maxes out pretty early
- on training increasingly complex neural networks with larger amounts of data, performance keeps getting better for much longer
Hence, to achieve the highest performance levels, we need two things. Firstly, it helps to have a lot of data. Secondly, we need the ability to train large neural networks. Earlier, it was almost as if AI systems didn't know what to do with huge amounts of data. Now, with fast & powerful CPUs and GPUs, it is possible for us to train large neural networks on huge amounts of data.
Another factor is that of algorithmic innovation in order to increase the training speeds of neural networks. One
of the major breakthroughs in this area has been the switch from sigmoid functions to ReLU functions:
It turns out that on the tapering ends of the sigmoid function, the gradient is nearly zero, so the parameters hardly change and training becomes very slow. On the ReLU curve, the gradient does not gradually decay to zero for positive inputs, so there is no regime where the learning speed of the network slows to a crawl.
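To see the contrast concretely, here is a quick numerical sketch of my own (using numpy) that evaluates both gradients at a few points:

> import numpy as np

def sigmoid_grad(z):
    s = 1/(1 + np.exp(-z))        # the sigmoid itself
    return s*(1 - s)              # its derivative, s(1 - s)

def relu_grad(z):
    return (z > 0).astype(float)  # derivative of max(0, z): 0 or 1

z = np.array([-10.0, -1.0, 0.5, 10.0])
print(sigmoid_grad(z))  # ~[0.00005, 0.197, 0.235, 0.00005] -- vanishes at the tails
print(relu_grad(z))     # [0. 0. 1. 1.] -- stays 1 for every positive z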
Fast computation is very important in deep learning, since training a neural network involves a lot of iteration − we input the data, train the neural network, get feedback about how well it performed, then make changes to the data and the network, and repeat the cycle over and over until the network is trained well enough. Hence, reducing computation times leads to a huge rise in productivity while building out neural networks for AI projects.
END OF WEEK 1
This is the end of the course notes for Week 1. Keep on reading further for notes from further weeks, or spend
some time gaining further insight into the previously discussed topics.
3 Logistic Regression as a Neural Network
Now that we have an idea of what neural networks are and what they can do, let us dive into the basics of
neural network programming. Many of these ideas can be discussed using the concept of logistic regression.
NOTE: Throughout the document, there will be extensive use of the following notation:
1. a single input feature vector has dimension $n_x$
2. a single training example is represented by $(x, y)$, where $x \in \mathbb{R}^{n_x}$, $y \in \{0, 1\}$
3. a training set comprises $m$ (or $m_{train}$) training examples, i.e. $\{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(m)}, y^{(m)})\}$
4. a test set similarly contains $m$ (or $m_{test}$) test examples
5. to put all training examples $x$ into a more compact notation, we define a matrix $X \in \mathbb{R}^{n_x \times m}$ as follows:
$$X = \begin{bmatrix} \vdots & \vdots & & \vdots \\ x^{(1)} & x^{(2)} & \cdots & x^{(m)} \\ \vdots & \vdots & & \vdots \end{bmatrix}$$
6. to put all output labels $y$ into a more compact notation, we define a matrix $Y \in \mathbb{R}^{1 \times m}$ as follows:
$$Y = \begin{bmatrix} y^{(1)} & y^{(2)} & \cdots & y^{(m)} \end{bmatrix}$$
7. terms of the form $x^{(i)}$, $y^{(i)}$, etc. are associated with the $i^{th}$ training example
If you wish to know more about derivatives and/or limits, feel free to go through the resources listed below. Expertise in differential calculus is not required, but a working knowledge always helps:
Limits and Continuity (by Khan Academy)
Humans perceive images by their subtle features. A computer looks at an image as an array of pixels. It stores 3 separate matrices, each the size of the pixel map (e.g. 1920×1080, 1024×768, etc.), corresponding to the RGB channels of the image. To process images, a computer 'unpacks' the pixel intensity values to create what is known as a feature vector.
We just take all the red pixel values and list them one
by one, followed by the green pixel values and then
the blue pixel values, as shown below:
$$x = \begin{bmatrix} 253 \\ 144 \\ \vdots \\ 14 \\ 125 \\ \vdots \\ 165 \\ 187 \\ \vdots \end{bmatrix}$$
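As a small illustration, here is roughly how this 'unpacking' could be done with numpy (the 64 × 64 image size is just an assumption for the example):

> import numpy as np

# a hypothetical 64x64 RGB image: pixel intensities in [0, 255]
img = np.random.randint(0, 256, size=(64, 64, 3))

# list every red value first, then every green, then every blue,
# exactly as described above
x = img.transpose(2, 0, 1).reshape(-1, 1)
print(x.shape)   # (12288, 1), i.e. n_x = 64*64*3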
Instead what we do in logistic regression is $\hat{y} = \sigma(w^Tx + b)$, where $\sigma$ is the sigmoid function (shown in the diagram alongside; ignore the $\phi(z)$ notation). Defining $z = w^Tx + b$,
$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
- If $z$ is a large positive number, $e^{-z} \to 0$, $\therefore \sigma(z) \to 1$
- If $z$ is a large negative number, $e^{-z} \to \infty$, $\therefore \sigma(z) \to 0$
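Here is a minimal sketch of the resulting prediction in numpy (all values below are illustrative):

> import numpy as np

def sigmoid(z):
    return 1/(1 + np.exp(-z))

n_x = 4
w = np.zeros((n_x, 1))         # parameters, here simply initialized to 0
b = 0.0
x = np.random.randn(n_x, 1)    # one input feature vector

y_hat = sigmoid(np.dot(w.T, x) + b)   # a prediction in (0, 1)
print(sigmoid(100), sigmoid(-100))    # ~1.0 and ~0.0, matching the limits above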
We have now seen the logistic regression model in quite a bit of
detail. However, to train the parameters w and b, we need to
define a cost function.
The loss for a single training example is $L(a, y) = -[y \log a + (1 - y)\log(1 - a)]$, and the cost $J(w, b)$ averages this loss over all $m$ training examples (both appear again in the sections that follow). Using this, we try to find parameters $w$ & $b$ that minimize the overall cost function $J$.
We naturally want to find w, b that minimize J(w, b). We do this using what is known as the gradient descent
algorithm. Here’s an illustration of gradient descent.
Note that, in practice, w is of a much higher dimension, but for
illustrative purposes let’s assume it is a single real number. As
seen, J(w, b) has a single global optimum, and we want to find
w, b corresponding to this minimum.
To do this, we first initialize w & b to some value (for convenience,
we usually initialize them both to 0).
After this, the gradient descent algorithm takes a step in the steepest downhill direction. There can be multiple iterations of the gradient descent cycle until the algorithm decides that it has reached the global optimum.
This gives us a brief overview of how gradient descent works. Let’s dive deeper into the details. Consider a
one-dimensional plot with only w as the parameter.
The gradient descent algorithm repeatedly carries out the following update until it converges to the global optimum value:
$$w := w - \alpha\frac{dJ(w)}{dw}$$
where $\alpha$ is the learning rate, which controls how big each step in the gradient descent is.
Since $J$ is convex, no matter where we initialize the value of $w$, the gradient descent algorithm will always approach the global optimum at some point.
Since we actually have a function $J$ in 2 parameters, the loop of gradient descent becomes as follows:
$$w := w - \alpha\frac{\partial J(w, b)}{\partial w}; \qquad b := b - \alpha\frac{\partial J(w, b)}{\partial b}$$
When we actually get down to coding such an algorithm, we will use the following (mathematically inconsistent) notation:
$$dw \equiv \frac{\partial J(w, b)}{\partial w}; \quad db \equiv \frac{\partial J(w, b)}{\partial b} \longrightarrow w := w - \alpha\,dw; \quad b := b - \alpha\,db$$
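To make the update rule concrete, here is a tiny sketch of my own that runs gradient descent on the toy function $J(w) = w^2$, where $\frac{dJ}{dw} = 2w$:

> # gradient descent on the toy function J(w) = w**2
alpha = 0.1                  # learning rate (illustrative value)
w = 5.0                      # an arbitrary initialization
for iteration in range(100):
    dw = 2*w                 # the derivative of this particular J
    w = w - alpha*dw         # step in the steepest downhill direction
print(w)                     # ~0.0, the global optimum of J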
dJ
Let’s see what happens with . Say we change the value of a from 5 to 5.001. Subsequently, v is changed from
da
11 to 11.001, and J from 33 to 33.003. Again, we see that every unit of change in a causes 3 units of change in
dJ
J. Similarly, we can compute that = 3.
du
3.6.2 The chain rule for derivatives
Finally, let us consider the derivative of $J$ wrt $b$. Say we change the value of $b$ from 3 to 3.001. Note that the value of $u$ is affected not only by $b$ but also by $c$. Hence, the value of $u$ will change to 6.002, that of $v$ to 11.002 and that of $J$ to 33.006. So, the value of $\frac{dJ}{db}$ is 6, right? Well, yes and no.
Let's think of it this way. What if $b = 3$ and $c = 3$? We get $u = 9$, $v = 14$ and $J = 42$. Changing $b$ from 3 to 3.001 would change $u$ from 9 to 9.003, $v$ from 14 to 14.003 and finally, $J$ from 42 to 42.009. Now it seems like $\frac{dJ}{db} = 9$. How did this happen? It seems like the value of $c$ affects the derivative term $\frac{dJ}{db}$, and indeed it is so.
We can look at it as changing $b$ changes $u$ first. This change in $u$ causes a change in $v$. Finally, the change in $v$ causes a change in $J$. This is where the chain rule for derivatives comes in. According to the chain rule,
$$\frac{dJ}{db} = \frac{\partial J}{\partial v} \times \frac{\partial v}{\partial u} \times \frac{\partial u}{\partial b}$$
We can verify this by quickly computing the values of the individual derivatives and multiplying them. It is important to clarify here that even though the chain rule makes it look like derivatives behave like ordinary fractions, this is NOT the case in general.
With this understanding of the chain rule, now the term ‘backward propagation’ really starts to make sense.
To find the derivative of J wrt b, we ‘propagate backwards’ from J to v, then from v to u and finally, from u
to b. This concept might seem trivial now, but it helps quite a lot in the context of logistic regression.
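We can verify these numbers directly with a few lines of Python (a quick sketch using the graph $J = 3v$, $v = a + u$, $u = bc$):

> def J(a, b, c):
    u = b*c         # first node of the computation graph
    v = a + u       # second node
    return 3*v      # output node

eps = 0.001
print((J(5, 3 + eps, 2) - J(5, 3, 2))/eps)   # ~6.0, i.e. dJ/db when c = 2
print((J(5, 3 + eps, 3) - J(5, 3, 3))/eps)   # ~9.0, i.e. dJ/db when c = 3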
NOTE: Whenever we use computation graphs, there will usually be an output variable that we want to
optimize. Hence, a lot of the computation will be about calculating the derivative of the output variable (J)
with respect to some other variable (let’s call it var). When we implement it in code, we will have to assign
the derivative a variable name. While we could name it something like dFinalOutput/dVar or dJ/dvar, these
are not good variable names. For our convenience, we will name the variable as simply dvar. For example, if
we want to compute dJ/da, we will name the derivative variable da. This convention will be used in further
sections where we will actually code algorithms in Python.
$$z = w^Tx + b; \qquad \hat{y} = a = \sigma(z)$$
We want to modify the parameters $w_1$, $w_2$, $b$ in order to reduce the loss $L(a, y)$. This requires us to go through backward propagation to compute the derivatives of $L$ wrt $w_1$, $w_2$, $b$. To do this, we follow these steps:
$$\frac{\partial L(a, y)}{\partial a} = -\frac{y}{a} + \frac{1-y}{1-a} \quad \text{(while writing code, the variable name for this will be simply da)}$$
$$\frac{\partial L(a, y)}{\partial z} = \frac{\partial L(a, y)}{\partial a} \times \frac{\partial a}{\partial z} = a - y \quad \text{(again, the variable name will be dz)}$$
$$\frac{\partial L(a, y)}{\partial w_1} = x_1(a - y); \quad \frac{\partial L(a, y)}{\partial w_2} = x_2(a - y); \quad \frac{\partial L(a, y)}{\partial b} = a - y \quad \text{(variable names dw1, dw2, db)}$$
3.7.2 Derivation of $\frac{\partial L}{\partial z}$ (optional)
Let us explore the derivation of the expression for $\frac{\partial L}{\partial z}$ in much greater detail:
Proof. Step 1: $\frac{\partial L}{\partial a}$
$$L = -[y \log a + (1 - y) \log(1 - a)]$$
$$\therefore \frac{\partial L}{\partial a} = -\left[y\,\frac{d \log a}{da} + (1 - y)\,\frac{d \log(1 - a)}{da}\right]$$
$$\therefore \frac{\partial L}{\partial a} = -\left[y \times \frac{1}{a} + (1 - y) \times \frac{1}{1 - a} \times (-1)\right]$$
$$\therefore \frac{\partial L(a, y)}{\partial a} = -\frac{y}{a} + \frac{1 - y}{1 - a} \tag{1}$$
Step 2: $\frac{\partial a}{\partial z}$
$$a = \sigma(z) \implies \frac{\partial a}{\partial z} = \frac{\partial \sigma(z)}{\partial z}$$
$$\therefore \frac{\partial a}{\partial z} = \frac{\partial}{\partial z}\left(\frac{1}{1 + e^{-z}}\right)$$
$$\therefore \frac{\partial a}{\partial z} = -\frac{1}{(1 + e^{-z})^2}(-e^{-z}) = \frac{e^{-z}}{(1 + e^{-z})^2}$$
$$\therefore \frac{\partial a}{\partial z} = \frac{1}{1 + e^{-z}}\left(1 - \frac{1}{1 + e^{-z}}\right) = \sigma(z)(1 - \sigma(z))$$
$$\therefore \frac{\partial a}{\partial z} = a(1 - a) \tag{2}$$
Step 3: $\frac{\partial L}{\partial z}$ (from (1) & (2), using the chain rule for derivatives)
$$\frac{\partial L}{\partial z} = \frac{\partial L}{\partial a} \times \frac{\partial a}{\partial z}$$
$$\therefore \frac{\partial L}{\partial z} = \left(-\frac{y}{a} + \frac{1 - y}{1 - a}\right) \times a(1 - a)$$
$$\therefore \frac{\partial L}{\partial z} = \frac{a - y}{a(1 - a)} \times a(1 - a)$$
$$\therefore \frac{\partial L}{\partial z} = a - y$$
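As a sanity check, we can compare this analytic result against a numerical (finite-difference) derivative − a rough sketch, with arbitrary values of z and y:

> import numpy as np

def loss(z, y):
    a = 1/(1 + np.exp(-z))                       # a = sigmoid(z)
    return -(y*np.log(a) + (1 - y)*np.log(1 - a))

z, y, eps = 0.7, 1.0, 1e-6
numeric = (loss(z + eps, y) - loss(z - eps, y))/(2*eps)
analytic = 1/(1 + np.exp(-z)) - y                # a - y
print(numeric, analytic)                         # the two agree closely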
What we need to do is compute the derivatives on all the individual training examples and average them over the training set to be able to carry out gradient descent. Let's wrap all of this into a concrete algorithm with which we can implement one step of gradient descent (variable names are to be understood as defined earlier):

- initialise J=0, dw1=0, dw2=0, db=0
- using a for loop, compute the derivatives corresponding to each training example and add them up − a very rough sample program would look like this (obviously, it is far from complete):
> J,db = 0,0
for k in range(n_x):
    dw[k] = 0
for i in range(m):
    z[i] = np.dot(w.T, x[:,i]) + b    # z = w'x + b for the i-th example
    a[i] = sigma(z[i])
    J += -(y[i]*math.log(a[i]) + (1-y[i])*math.log(1-a[i]))
    dz[i] = a[i] - y[i]
    for k in range(n_x):
        dw[k] += x[k,i]*dz[i]
    db += dz[i]
J /= m
for k in range(n_x):
    dw[k] /= m
db /= m
When implementing deep learning algorithms, it is found that having explicit for loops greatly slows down the
learning speed of the algorithm. As we move to bigger and bigger datasets, it becomes even more cumbersome.
Hence, to avoid using explicit for loops, there is a set of techniques called vectorization.
Maximising this function is equivalent to maximising the individual probabilities, for which we again resort to maximising $\log P(y^{(i)} \mid x^{(i)})$ rather than $P(y^{(i)} \mid x^{(i)})$.
$$\log P_{train} = \sum_{i=1}^{m} \log P(y^{(i)} \mid x^{(i)}) = -\sum_{i=1}^{m} L(\hat{y}^{(i)}, y^{(i)})$$
This now starts to look like the form of the cost function. Now, since we want to minimize the cost, we take the negative of $\log P_{train}$. And finally, to make sure that our quantities are better scaled, we divide by $m$ (we can work without this, but the value of the function may sometimes get too high for large datasets). This brings us to our final form of the cost function:
$$J(w, b) = \frac{1}{m}\sum_{i=1}^{m} L(a^{(i)}, y^{(i)}) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log a^{(i)} + (1 - y^{(i)})\log(1 - a^{(i)})\right]$$
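In vectorized numpy form (a minimal sketch of my own, assuming A and Y are 1 × m arrays of predictions and labels), this cost is essentially a one-liner:

> import numpy as np

def cost(A, Y):
    m = Y.shape[1]
    return -np.sum(Y*np.log(A) + (1 - Y)*np.log(1 - A))/m

A = np.array([[0.9, 0.2, 0.8]])   # illustrative predictions
Y = np.array([[1, 0, 1]])         # corresponding labels
print(cost(A, Y))                 # ~0.18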
4 Python and Vectorization
Vectorization is a technique used to get rid of excessive for loops in programs. This is especially important
in deep learning because the presence of for loops in deep learning algorithms significantly increases the time
that it takes the code to run. With small datasets, this might be okay but with larger ones, it can take days on
end and this really slows down the training process.
4.1 Vectorization
In logistic regression, we needed to compute $z = w^Tx + b$, where $w, x \in \mathbb{R}^{n_x}$. A non-vectorized implementation of this computation would involve something like this:

> z = 0
for i in range(n_x):
    z += w[i]*x[i]
z += b
In contrast, a vectorized implementation of the same using the numpy library for Python would involve a much
simpler and faster piece of code:
> z = np.dot(w,x) + b
We could actually run a speed test with both versions by getting them to compute the dot product of 2 random arrays with one million elements each:

> import time
import numpy as np

a = np.random.rand(1000000)
b = np.random.rand(1000000)
initial = time.time()
c = np.dot(a,b)
final = time.time()
print("Vectorized version: " + str(1000*(final-initial)) + "ms")

> import time
import numpy as np

c = 0
a = np.random.rand(1000000)
b = np.random.rand(1000000)
initial = time.time()
for i in range(1000000):
    c += a[i]*b[i]
final = time.time()
print("For loop version: " + str(1000*(final-initial)) + "ms")

On typical hardware, the vectorized version turns out to be faster by a couple of orders of magnitude.
It turns out that we can easily get rid of the nested for loop in the earlier code. This can be done by replacing the individual accumulators dw1, dw2 and so on with a single numpy vector dw. The code may look something like this:
> J,db = 0,0
dw = np.zeros((n_x,1))
for i in range(m):
    z[i] = np.dot(w.T, x[:,i]) + b
    a[i] = sigma(z[i])
    J += -(y[i]*math.log(a[i]) + (1-y[i])*math.log(1-a[i]))
    dz[i] = a[i] - y[i]
    dw += x[:,i].reshape(n_x,1)*dz[i]    # one vector update replaces the inner loop
    db += dz[i]
J /= m
dw /= m
db /= m
We’ve gone from having 2 for loops to just one. Though, we still loop over the individual training examples.
Using these 2 matrices, we can express the system of equations $z^{(i)} = w^Tx^{(i)} + b$ in a single matrix equation, which translates into a single line of code:
$$\begin{aligned} z^{(1)} &= w^Tx^{(1)} + b \\ z^{(2)} &= w^Tx^{(2)} + b \\ &\;\;\vdots \\ z^{(m)} &= w^Tx^{(m)} + b \end{aligned} \quad\longrightarrow\quad Z = w^TX + B \quad\longrightarrow\quad \texttt{Z = np.dot(w.T,X) + b}$$
There is one subtlety in the above piece of code. We don't explicitly need to define b as a 1 × m vector. Even if b is defined as a real number, given that the term np.dot(w.T,X) is of dimension 1 × m, Python automatically expands b out into a vector of the same dimension. This operation is also known as 'broadcasting' in Python.
Finally, we would like to compute the variables $a^{(i)}$. We can define a matrix $A \in \mathbb{R}^{1 \times m}$ similar to how we defined $Z$. We can then implement a vector-valued sigmoid function to output $A$ (a minimal version is sketched below):
$$A = \begin{bmatrix} a^{(1)} & a^{(2)} & \cdots & a^{(m)} \end{bmatrix} = \sigma(Z)$$
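A minimal numpy version of this vector-valued sigmoid (the shapes below are purely illustrative, for a quick check):

> import numpy as np

def sigma(Z):
    # np.exp works elementwise, so this handles scalars,
    # vectors and matrices alike
    return 1/(1 + np.exp(-Z))

w, b = np.random.randn(3, 1), 0.0   # illustrative shapes: n_x = 3, m = 5
X = np.random.randn(3, 5)
A = sigma(np.dot(w.T, X) + b)
print(A.shape)                       # (1, 5) -- one activation per example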
We got rid of the for loop which calculated dw[k] by initializing dw as a single numpy zero vector. But we still
have the for loop over the m examples in the training set. Let’s now attempt to vectorize all these operations
and get rid of this for loop as well.
Step 1: for db, we initialize it as db=0 and then add dz[i] to it in each iteration of the for loop:

db += dz[1]
db += dz[2]
...
db += dz[m]

followed by db /= m −→ db = np.sum(dZ)/m
Step 2: for dw, we again initialize it as 0 and then add x[i]*dz[i] in each iteration of the for loop:

dw += x[1]*dz[1]
dw += x[2]*dz[2]
...
dw += x[m]*dz[m]

followed by dw /= m −→ dw = np.dot(X,dZ.T)/m
Now that we have vectorized the code for all parameters, it’s time to put it all together into a single program.
This is our original, highly inefficient non-vectorized implementation:
> J,db = 0,0
for k in range(n_x):
    dw[k] = 0
for i in range(m):
    z[i] = np.dot(w.T, x[:,i]) + b
    a[i] = sigma(z[i])
    J += -(y[i]*math.log(a[i]) + (1-y[i])*math.log(1-a[i]))
    dz[i] = a[i] - y[i]
    for k in range(n_x):
        dw[k] += x[k,i]*dz[i]
    db += dz[i]
J /= m
for k in range(n_x):
    dw[k] /= m
db /= m
A vectorized version of the same would look like this:

> Z = np.dot(w.T,X) + b
A = sigma(Z)
dZ = A - Y
dw = np.dot(X,dZ.T)/m
db = np.sum(dZ)/m
w = w - alpha*dw
b = b - alpha*db
We have successfully implemented one iteration of gradient descent without using any for loops. However, note that to run multiple iterations of gradient descent, we will still have to loop over the number of iterations.
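Putting the pieces together, one possible sketch of the full training loop looks like this (num_iterations and alpha are illustrative choices, not prescribed values):

> import numpy as np

def train(X, Y, alpha=0.01, num_iterations=1000):
    n_x, m = X.shape
    w = np.zeros((n_x, 1))              # initialize the parameters to 0
    b = 0.0
    for _ in range(num_iterations):     # the one loop we cannot avoid
        Z = np.dot(w.T, X) + b          # forward pass
        A = 1/(1 + np.exp(-Z))
        dZ = A - Y                      # backward pass
        dw = np.dot(X, dZ.T)/m
        db = np.sum(dZ)/m
        w = w - alpha*dw                # gradient descent update
        b = b - alpha*db
    return w, b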
> import numpy as np

A = np.array([[56.0, 0.0, 4.4, 68.0],    # an illustrative 3x4 matrix
              [1.2, 104.0, 52.0, 8.0],   # (the values are arbitrary)
              [1.8, 135.0, 99.0, 0.9]])
cal = A.sum(axis=0)   # axis=0 --> sum vertically; axis=1 --> sum horizontally
percentage = 100*A/cal
print(percentage)
The line percentage = 100*A/cal is an example of broadcasting in Python. What we have essentially done is divide a 3 × 4 matrix by a 1 × 4 matrix. This type of broadcasting works with both column vectors and row vectors. Let's go through another example. If we try to perform the following addition in Python,
$$\begin{bmatrix} 1 \\ 2 \\ 3 \\ 4 \end{bmatrix} + 100 \quad\longrightarrow\quad \begin{bmatrix} 1 \\ 2 \\ 3 \\ 4 \end{bmatrix} + \begin{bmatrix} 100 \\ 100 \\ 100 \\ 100 \end{bmatrix} = \begin{bmatrix} 101 \\ 102 \\ 103 \\ 104 \end{bmatrix}$$
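We can confirm this behaviour in two lines:

> import numpy as np

a = np.array([[1], [2], [3], [4]])   # a 4x1 column vector
print(a + 100)                       # 100 is broadcast to a 4x1 vector of 100s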
We saw this type of broadcasting in our earlier code for logistic regression with the line Z = np.dot(w.T,X)+b.
> a = np.random.randn(5)
print(a.shape)
print(a)              # prints out the array
print(a.T)            # still prints out the same array
print(np.dot(a,a.T))  # prints a single number instead of a matrix
The first line prints out (5,), indicating that the array is of rank 1. This type of data structure does not behave consistently as either a row vector or a column vector. Its transpose is identical to itself. This can lead to some unexpected results with multiplication, etc.
To avoid this, it is better to initialise a = np.random.randn(5,1). Another option is to check the dimension with an assertion, assert(a.shape == (5,1)), which throws an error if the shape is wrong. If, for some reason, you end up with a rank 1 array, it can be reshaped using a.reshape(5,1).
END OF WEEK 2
This is the end of the course notes for Week 2. Keep on reading further for the notes from further weeks, or spend some time gaining further insight into the previously discussed topics.
5 Shallow Neural Networks
In the previous section, we saw how to implement a single iteration of logistic regression, i.e. a neural network with a single node. The node computed 2 variables z and a.
Most often, neural networks will have multiple nodes distributed across multiple 'hidden' layers. Let us see an example of a neural network with one hidden layer:
- the second layer (a single node) has parameters $W^{[2]}, b^{[2]}$, receives the input $a^{[1]}$ from the first layer, and follows a similar computation:
$$z^{[2]} = W^{[2]}a^{[1]} + b^{[2]} \longrightarrow a^{[2]} = \sigma(z^{[2]}) \longrightarrow L(a^{[2]}, y)$$
NOTE: The above notation will be used throughout the next sections. The notation $W^{[i]}, b^{[i]}, a^{[i]}$ corresponds to the different layers in the neural network and is not to be confused with $W^{(i)}, b^{(i)}, a^{(i)}$, which corresponds to individual training examples.
5.2 Computing the output of a neural network
In logistic regression, we had a single node which computed $z = w^Tx + b$ and $a = \sigma(z)$. A neural network just does this a lot more times. Consider the first node in the hidden layer of the network in Section 5.1. It computes $z_1^{[1]} = w_1^{[1]T}x + b_1^{[1]}$ and $a_1^{[1]} = \sigma(z_1^{[1]})$. Similarly, the second node computes $z_2^{[1]} = w_2^{[1]T}x + b_2^{[1]}$ and $a_2^{[1]} = \sigma(z_2^{[1]})$, and so on. Naturally, using a for loop in such a case is inefficient, and even more so as the number of nodes increases. Let's see how we can vectorize the set of 4 equations corresponding to the hidden layer.
Let’s start with computing z as a vector. We define the vector Z [1] and matrices W [1] , B [1] as
[1]T [1] [1]T [1] [1]
· · · w1 ··· b1 w1 x + b1 z1
x1
· · · w[1]T
[1] [1]T [1] [1]
· · ·
x2 + b2 = W [1] x + B [1] = w2 x + b2 = z2
Z [1] =
2
· · · w[1]T · · · b[1] w[1]T x + b[1] z [1]
3 x3 3 3 3 3
[1]T [1] [1]T [1] [1]
· · · w4 ··· b4 w4 x + b4 z4
Finally, we define another vector $A^{[1]} = \sigma(Z^{[1]})$ which simply stacks up all the values $a_1^{[1]}, a_2^{[1]}, a_3^{[1]}, a_4^{[1]}$. The computation in the final node is very similar, except for a change in the layer index.
We now have to refine the matrices $Z^{[1]}$ and $A^{[1]}$ so that we vertically span over the number of nodes in the hidden layer and horizontally span across the number of training examples:
$$X = A^{[0]} = \begin{bmatrix} \vdots & \vdots & & \vdots \\ x^{(1)} & x^{(2)} & \cdots & x^{(m)} \\ \vdots & \vdots & & \vdots \end{bmatrix}; \quad Z^{[1]} = \begin{bmatrix} \vdots & \vdots & & \vdots \\ z^{[1](1)} & z^{[1](2)} & \cdots & z^{[1](m)} \\ \vdots & \vdots & & \vdots \end{bmatrix}; \quad A^{[1]} = \begin{bmatrix} \vdots & \vdots & & \vdots \\ a^{[1](1)} & a^{[1](2)} & \cdots & a^{[1](m)} \\ \vdots & \vdots & & \vdots \end{bmatrix}$$
The implementation of this into code is very similar to what we did with logistic regression − only the matrices are of a bigger order. The final vector equations we now have are:
$$Z^{[1]} = W^{[1]}X + b^{[1]} \longrightarrow A^{[1]} = \sigma(Z^{[1]})$$
$$Z^{[2]} = W^{[2]}A^{[1]} + b^{[2]} \longrightarrow A^{[2]} = \sigma(Z^{[2]})$$
It turns out that applying tanh as the activation function for the hidden layer almost always leads to better results than using the sigmoid. This is because, since its values lie between −1 and 1, the mean of the activations lies closer to 0 than to 0.5, which makes learning for the next layer easier. One exception, where the sigmoid is preferable, is the output layer in binary classification, where we want the output predictions to lie between 0 and 1.
One disadvantage of both the sigmoid and tanh functions is that for z of very large magnitude, the slope of the function ends up very close to 0, which can slow down gradient descent. This brings us to another popular activation function called the ReLU (rectified linear unit).
The ReLU function is defined as $R(z) = \max(0, z)$. Technically, the derivative of this function at $z = 0$ is not defined, but the odds of running into the exact case $z = 0$ are extremely small. To handle that case, we can artificially define the derivative there to be either 0 or 1 without affecting the accuracy much. The ReLU is increasingly becoming the default choice of activation function.
One disadvantage of the ReLU is that the derivative is zero for all $z < 0$. This can be rectified by giving the function a small positive slope for $z < 0$ − this variant is called the 'leaky' ReLU. However, despite often working slightly better than the plain ReLU, it is not used as much.
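For quick reference, here are minimal numpy versions of the activations discussed (the 0.01 leak factor is a common but arbitrary choice):

> import numpy as np

def tanh(z):
    return np.tanh(z)               # outputs in (-1, 1), zero-centred

def relu(z):
    return np.maximum(0, z)         # R(z) = max(0, z)

def leaky_relu(z, leak=0.01):
    return np.maximum(leak*z, z)    # a small slope instead of 0 for z < 0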
NOTE: Linear activation functions are usually not preferred in hidden layers; one of the rare applications of linear activation functions is in compression algorithms. Linear activation functions are also somewhat commonly used in the output layer (e.g. for regression), where the ranges of y and ŷ should match.
To then carry out gradient descent, we need to compute the gradients $\frac{\partial J}{\partial W^{[1]}}, \frac{\partial J}{\partial b^{[1]}}, \frac{\partial J}{\partial W^{[2]}}, \frac{\partial J}{\partial b^{[2]}}$. After this is done, we update the parameters on each iteration using
$$W^{[1]} := W^{[1]} - \alpha\frac{\partial J}{\partial W^{[1]}}; \quad b^{[1]} := b^{[1]} - \alpha\frac{\partial J}{\partial b^{[1]}}; \quad W^{[2]} := W^{[2]} - \alpha\frac{\partial J}{\partial W^{[2]}}; \quad b^{[2]} := b^{[2]} - \alpha\frac{\partial J}{\partial b^{[2]}}$$
Let us finally express the gradient descent algorithm as a set of equations (keep in mind that any notation of the form $dZ^{[1]}$ actually means $\frac{\partial J}{\partial Z^{[1]}}$). We use the sigmoid as the activation function for the output layer, and a function $g^{[1]}(Z^{[1]})$ for the hidden layer.
Forward Propagation:
$$Z^{[1]} = W^{[1]}X + b^{[1]}$$
$$A^{[1]} = g^{[1]}(Z^{[1]})$$
$$Z^{[2]} = W^{[2]}A^{[1]} + b^{[2]}$$
$$A^{[2]} = \sigma(Z^{[2]})$$

Back Propagation:
$$dZ^{[2]} = A^{[2]} - Y$$
$$dW^{[2]} = \frac{1}{m}\,\texttt{np.dot(dZ2, A1.T)}$$
$$db^{[2]} = \frac{1}{m}\,\texttt{np.sum(dZ2, axis=1, keepdims=True)}$$
$$dZ^{[1]} = \texttt{np.dot(W2.T, dZ2)} \times g^{[1]\prime}(Z^{[1]})$$
$$dW^{[1]} = \frac{1}{m}\,\texttt{np.dot(dZ1, X.T)}$$
$$db^{[1]} = \frac{1}{m}\,\texttt{np.sum(dZ1, axis=1, keepdims=True)}$$
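For completeness, here is a rough transcription of the table above into a single numpy training iteration (I have chosen tanh as $g^{[1]}$ for concreteness; the learning rate is illustrative):

> import numpy as np

def iteration(X, Y, W1, b1, W2, b2, alpha=0.01):
    m = X.shape[1]
    # forward propagation
    Z1 = np.dot(W1, X) + b1
    A1 = np.tanh(Z1)                         # g[1] chosen as tanh here
    Z2 = np.dot(W2, A1) + b2
    A2 = 1/(1 + np.exp(-Z2))                 # sigmoid output layer
    # back propagation
    dZ2 = A2 - Y
    dW2 = np.dot(dZ2, A1.T)/m
    db2 = np.sum(dZ2, axis=1, keepdims=True)/m
    dZ1 = np.dot(W2.T, dZ2)*(1 - A1**2)      # tanh'(z) = 1 - tanh(z)**2
    dW1 = np.dot(dZ1, X.T)/m
    db1 = np.sum(dZ1, axis=1, keepdims=True)/m
    # gradient descent updates
    W1 = W1 - alpha*dW1; b1 = b1 - alpha*db1
    W2 = W2 - alpha*dW2; b2 = b2 - alpha*db2
    return W1, b1, W2, b2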
END OF WEEK 3
This is the end of the course notes for Week 3. Keep on reading further for the notes from further weeks, or
spend some time gaining further insight into the previously discussed topics.
6 Deep Neural Networks
(This section is currently in the works. Expect it to be up within a week’s time.)