Visual Introduction Deep Learning v21-02
a visual introduction to
deep learning
meor amer
about this book
Deep learning is the algorithm powering the current renaissance of artificial
intelligence (AI). And its progress is not showing signs of slowing down. A
McKinsey report estimates that by 2030, AI will potentially deliver $13 trillion to
the global economy, or 16% of the world's current GDP. This opens up exciting
career opportunities in the coming decade.
But deep learning can be quite daunting to learn. With the abundance of learning
resources in recent years has emerged another problem—information overload.
This book aims to compress this knowledge and make the subject approachable.
By the end of this book, you will be able to build a visual intuition about deep
learning and neural networks.
This book is also for you if you already have some background in deep learning
but want to gain further intuition.
book format
This book uses a visuals-first approach. Each page of this book begins with a
visual and is supported by concise text.
This book doesn’t include math derivations and code examples. There are some
parts where basic math is involved, but generally it is kept to a minimum.
I have also worked in the telecoms data analytics space, serving clients in over 15
countries.
email: [email protected]
table of contents
Machine Learning
Deep Learning
Algorithm
Data
Computation
Roadmap
Key Concepts
Weighted Sum
Activation
Data
Dataset
Training
Testing
Introduction
Predict-Measure-Feedback-Adjust
Loss Function
Minimizing Loss
Gradient
Gradient Descent
Learning Rate
Epoch
Performance
Architecture
Predict
Measure
Feedback
Computation graph
Performance
Linear activation
Linearity
Non-linearity
ReLU activation
Performance
Activation functions
Classification vs. regression
Architecture
Sigmoid activation
Binary cross entropy
Accuracy
Performance
F1 Score
4 - Multi-class Classification
Introduction
One-hot Encoding
Softmax Activation
Categorical Cross Entropy
Performance
Improving performance
Hyperparameters
Recurrent
Generative Adversarial
Other Architectures
Conclusion
Much that goes on in our lives involves some form of prediction. These
predictions differ in one way, namely, how sure we are of them. In some tasks,
they don't feel like predictions because we feel so sure about them. In some
others, we know next to nothing about them, so they become mere guesses.
All of this depends on how simple a task is and, more importantly, how much
experience we have with it.
introduction 4
                      prediction                   decision
language translation  what is the person saying?   reply
customer service      will the customer churn?     discount
driving               is that an obstacle?         steer
examples
Let's take the example of language translation. As we listen to someone speaking,
we are predicting what the person means. The more experience we have with this
language, the better our prediction becomes, and the better our decision, that is
our reply, becomes.
As for driving, the more miles we clock, the more skilled we become and the
more adept we are at evaluating our surroundings.
[Figure: data, prediction, decision]
In the machine’s world, the experience comes in the form of data. Just as we learn
from experience, the machine learns from data.
That is what machine learning is all about—learning from the data and turning it
into predictions.
machine learning in the real world
In fact, machine learning can even handle the decision part. In some domains,
most notably self-driving cars, we are not far from seeing full automation
becoming the norm.
But in most other domains, this is still far from reality. For this reason, the focus
of this book is on the prediction part.
[Figure: prediction, with speed, $$$, and accuracy]
In the book Prediction Machines, the authors argued for a few reasons why
prediction machines are so valuable, the first being that ‘they can often produce
better, faster, and cheaper predictions than humans can’.
[Figure: prediction, cost]
[Figure: before, after]
evolution in roles
Some may worry that this will spell the end of most jobs, and rightly so. But
looking at the bigger picture, there will in fact be even more job opportunities.
The World Economic Forum’s The Future of Jobs Report 2020 estimates that by
2025, 85 million jobs may be displaced. But on the other hand, 97 million new
roles may emerge. This already takes into account the economic slowdown due to
the pandemic, and still, the net effect is positive.
Job roles will evolve, and the machine’s role is to serve us so we can pursue more
creative and challenging endeavors.
[Figure: human judgment, rules-based system, and machine learning, plotted against data complexity and rules complexity]
The first, that is without automation, is relying on human judgment, either based
on data or experience.
The second is using a rules-based system. We translate our experience into rules
that software can understand and execute based on data as inputs.
The third is machine learning, which uses data to create its own rules, guided by
the goal defined by humans.
As the data and rules become more complex, it makes sense to use machine
learning. Otherwise, it may not be cost-effective to do so.
[Figure: deep learning, plotted against data complexity and rules complexity]
Deep learning is not necessarily the most accurate, but it's extremely adaptable. And this
comes from its modular and flexible form, which will become evident throughout
this book.
algorithm
In fact, deep learning has revived the push toward Artificial Intelligence (AI) over
the past decade.
Progress is gathering pace now for three main reasons. The first is
the algorithm, which in truth, has been around for many decades.
data
The second reason is data.
The impetus came from the Internet, followed by social media, smartphones,
digital transformation, and a long list of other waves of innovation. They produce
new forms of data that we've never seen before, generated in large volumes.
This data contains invaluable information that we can now extract with the help
of algorithms.
[Figure: CPU queue, GPU queue]
computation
The third reason is computational power.
Enter the Graphics Processing Unit (GPU). Originally designed for games, it has
emerged as the perfect solution for deep learning.
This is a hot area of research as we speak. Even more efficient hardware designs
are yet to come.
[Figure: algorithm, data, computation]
[Figure: computer vision (tree, pole, car); natural language (sentence, sentiment)]
applications
Today, there are widespread applications in computer vision, natural language
processing, business automation, and beyond. And it is just the beginning.
[Figure: focus of this book: visuals, compression; depth: code, math]
This book doesn’t cover mathematical proofs and code examples. As you advance
your learning further, these are the domains you should progress into. They will
provide you with the depth you need to be successful in this field.
[Figure: roadmap of tasks. Foundations; 1 linear regression; 2 non-linear regression; 3 binary classification; 4 multi-class classification; the bigger picture]
roadmap
We'll see how deep learning works via four tasks - linear regression, non-linear
regression, binary classification, multi-class classification.
They are correspondingly split into four chapters, in which new concepts are
introduced one at a time and built upon the previous ones. Therefore, it is
recommended that you read the chapters in sequence.
On either side of these four chapters, we'll have a short section for foundations
and a final section where we take a brief look beyond those covered in the four
chapters.
data: features, target, training, testing
task: linear, non-linear, regression, classification
neural network: neurons, layers, architecture
predict: weighted sum, activation
measure: cost, metrics
feedback: gradients, backpropagation
adjust: weights, biases
key concepts
Here is a summary of the key concepts that we’ll explore in this book. As you go
through the book, it'll be useful to return to this page from time to time to keep
track of what you have learned.
Let’s begin!
foundations
A Neuron
We have so far used the term deep learning, but from now on, we’ll use neural
network instead. These terms are used interchangeably and refer to the same
thing. But as we start to go into the inner workings, neural network is a more
natural term to use.
To begin our journey, let's start with a neuron. The neuron is the smallest unit and
the building block of a neural network.
The inputs and outputs are numbers, either positive or negative. In this example,
the neuron takes two inputs. However, there is no limit to the number of inputs a
neuron can take.
[Figure: inputs x1 and x2 are multiplied by weights w1 and w2, then added together with bias b to give the weighted sum z]
z = w1x1 + w2x2 + b
weighted sum
The first computation that a neuron performs is the weighted sum. It multiplies
each input by its corresponding weight. Then all the inputs are summed and a
term called bias is added.
[Figure: a neuron's parameters: two weights and a bias]
These adjustable parameters are the medium through which a neural network
learns, which we'll explore in detail in this book.
example #1: (3.0 × 0.5) + (2.0 × 1.5) + 1.0 = 1.5 + 3.0 + 1.0 = 5.5
example #2: (3.0 × -0.5) + (2.0 × 1.0) + (-2.0) = -1.5 + 2.0 - 2.0 = -1.5
Example
Here we have a neuron with two inputs, 3.0 and 2.0. Given different weight
values, it will correspondingly output different values.
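The arithmetic in the two examples can be checked with a few lines of Python. This is only a minimal sketch: the function name `weighted_sum` is our own choice for illustration, and the numbers are the ones shown in the examples.

```python
def weighted_sum(inputs, weights, bias):
    """Multiply each input by its weight, add the products, then add the bias."""
    return sum(x * w for x, w in zip(inputs, weights)) + bias

inputs = [3.0, 2.0]

# Example #1: weights 0.5 and 1.5, bias 1.0
z1 = weighted_sum(inputs, weights=[0.5, 1.5], bias=1.0)

# Example #2: weights -0.5 and 1.0, bias -2.0
z2 = weighted_sum(inputs, weights=[-0.5, 1.0], bias=-2.0)

print(z1, z2)  # 5.5 -1.5
```

The same inputs give different outputs purely because the weights and bias differ.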
[Figure: inputs x1 and x2, weighted sum z, activation a, output y]
activation
The second computation performed by the neuron is called an activation. This is
done by taking the output of the weighted sum and passing it through an
activation function.
linear activation
The activation function gives the neural network the ability to express itself. This
will not make much sense now but will become clear by the end of this book.
There are a few common activation functions. To start with, here we have a linear
activation function. It’s as basic as it gets - it simply outputs the same input it
receives. Plotted on a chart, it gives a straight-line relationship between the input
and the output.
[Figure: weighted sum z, then activation a]
Recap
Let’s do a quick recap. When inputs are passed through a neuron, it performs a
sequence of computations.
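As a rough sketch, the sequence can be written as one small function that chains the weighted sum and the activation. The names `linear` and `neuron` are assumptions made for this illustration, and the input values are borrowed from the earlier weighted sum example.

```python
def linear(z):
    """Linear activation: outputs the same value it receives."""
    return z

def neuron(inputs, weights, bias, activation=linear):
    # Step 1: weighted sum
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    # Step 2: activation
    a = activation(z)
    return a

a = neuron(inputs=[3.0, 2.0], weights=[0.5, 1.5], bias=1.0)
print(a)  # 5.5, unchanged by the linear activation
```

Passing the activation in as an argument keeps the neuron the same when a different activation function is used later.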
Data
Neural networks are nothing without data. Let’s now turn our attention to data
and what it brings.
Learning
A machine learning algorithm, in this case a neural network, uses data to find
useful patterns and relationships. It uses these insights to learn and update itself.
[Figure: structured, semi-structured, and unstructured data]
Types of data
Data can come in many different forms. The most obvious form of data is the
tabular format. This is an example of structured data, where each data point and
its properties can be deciphered in a straightforward manner.
Sources of data
In fact, most of the data around us are in the unstructured form. According to
projections from IDC, 80 percent of worldwide data will be unstructured by 2025.
And indeed, most of the exciting innovations in deep learning today come from
unstructured data, such as text, images, videos, and so on.
a dataset
Now let’s look at how we should prepare a dataset for the neural network.
distance (mi)  rating
1.5            3.6
Example
Let's take the example of hotel room rates, a dataset we'll use throughout this
book.
Each data point represents a hotel. Here we have a hotel with a distance of 1.5
miles from the city center and a guest rating of 3.6 stars.
[Figure: features: distance (mi) and rating]
Features
These two pieces of information are called features. Features describe the
properties of a data point.
Each data point in a dataset is described with the same features, of course with
different values, making each of them unique.
From now on, we'll refer to these two features as distance and rating for short.
[Figure: target: price ($)]
Target
Recall that the goal of a neural network is to make predictions.
In this example, we want to predict the average daily room rate (or price for short)
for any given hotel. This means, given the two features earlier, we want to predict
how much each hotel will cost.
This is called the target. In other resources, you may also find the term label being
used.
[Figure: training data with columns distance (mi), rating, and price ($)]
We'll give the neural network enough data points containing the features and
target values, which it will learn from.
A machine learning task where we specify the target is called supervised learning,
which will be our focus in this book.
[Figure: training: features, target, model]
Training
We have just described a process called training.
During training, the neural network learns from the data and updates its
parameters. By this point, we'll have a trained model.
That is why the training process is sometimes also called ‘fitting the model to the
data’.
[Figure: training data and test data]
Testing
Once the training is complete, we need a way to know how the model is
performing.
[Figure: testing: features, model, target]
testing
During testing, we'll provide the neural network with just the features, without
the target. Now that it’s already trained, its job is to predict the target values.
In the coming four chapters, we'll revisit these training and testing steps.
1 - linear regression
a single-neuron neural network
Now let's look at how the neural network works. We'll start with its simplest
possible version—a network with only one neuron and one input!
the plan
We'll lay the necessary foundation in this chapter and use that in the subsequent
chapters when we start building larger neural networks.
[Figure: hotel prices ($$$, $, $$) by distance from the city center]
the goal
Let's revisit the dataset from the previous chapter, which contains a list of hotels
in a city.
Our goal is to predict the average daily room rate for a given hotel (i.e. price)
based on the features.
In this chapter, we'll use only one of the features—the distance from the city
center (i.e. distance).
distance (mi)  price ($)
0.5            146.00
1.1            149.00
1.6            140.00
2.4            134.00
3.5            127.00
4.6            110.00
6.2            112.00
9.5            81.00
0.3            156.00
0.7            168.00
4.9            116.00
8.5            99.00
the dataset
This is what the dataset looks like. It contains twelve data points, one feature,
and one target.
The distance and price values are continuous values—numeric values that can take
any value within a range.
[Figure: chart of price against distance, with four prices marked as unknown (?)]
regression
This is a type of task called regression. In regression tasks, the target value to be
predicted is a continuous value.
We’ll split the dataset into the training and testing datasets and train the model
using the training data.
Ultimately, we want the model to give good predictions for the four test data
points.
[Figure: chart of price against distance]
learning
By visual inspection, it’s clear that there is a straight-line relationship between
the feature and the target. This is called linear regression.
This is the relationship that we want our single-neuron network to capture. Let’s
see if it can do that.
training data

distance (mi)  price ($)
0.5            146.00
1.1            149.00
1.6            140.00
2.4            134.00
3.5            127.00
4.6            110.00
6.2            112.00
9.5            81.00
training data
We'll take the first eight data points as training data and leave the other four for
testing later.
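In code, this split can be as simple as slicing a list. The sketch below stores the twelve data points in the order they appear in the dataset; the variable names are our own.

```python
# (distance in miles, price in dollars), in the order listed in the dataset
dataset = [
    (0.5, 146.0), (1.1, 149.0), (1.6, 140.0), (2.4, 134.0),
    (3.5, 127.0), (4.6, 110.0), (6.2, 112.0), (9.5, 81.0),
    (0.3, 156.0), (0.7, 168.0), (4.9, 116.0), (8.5, 99.0),
]

train_data = dataset[:8]   # first eight data points
test_data = dataset[8:]    # remaining four data points

print(len(train_data), len(test_data))  # 8 4
```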
[Figure: the training cycle: predict, measure, feedback, adjust]
One cycle represents one training round. This is repeated for a number of rounds
until the neural network becomes good at the task it’s training for.
None of this will make sense to you yet, but that's exactly what we’ll learn about
next!
Also note that these four terms were chosen for this book to make it easy for
someone new to deep learning. In other resources, you will find other terms used
(e.g. forward instead of predict, backward instead of feedback, update instead of
adjust). They refer to the same concepts.
predict
In the first step of the cycle, predict, we'll pass the training data through the
neuron.
[Figure: feature, weighted sum z, activation a]
neuron computations
Recall that this means going through two steps of computations - weighted sum
and activation, one data point at a time.
[Figure: weighted sum z, with one weight and one bias]
parameter count
We’ve already seen that the number of weights of a neuron is equal to the number
of inputs. The inputs are the dataset’s features. And since we have only one
feature, there is going to be only one input, and hence, one weight.
We also saw that on top of that, a neuron has one bias value.
We’ll assign initial values for these parameters, for which there are a number of
initialization techniques we can choose from. These techniques help the neural
network learn more effectively compared to simply assigning random initial
values. However, this book doesn’t cover this topic as it is quite mathematically
involved.
activation
For this task, we'll stick to the linear activation function.
[Figure: weighted sum z, activation a, predicted values]
output
By now, we will have the neuron successfully outputting eight values. They
represent the prices that the neuron predicted.
The problem, however, is that the neuron hasn't learned anything yet. As a result,
its predictions will be completely wide of the mark.
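To make the predict step concrete, here is a minimal sketch that passes the eight training distances through one neuron with a linear activation. The starting values `weight = 1.0` and `bias = 0.0` are placeholders assumed for illustration only; they are not the book's initial values.

```python
# Distances (mi) of the eight training data points
train_distances = [0.5, 1.1, 1.6, 2.4, 3.5, 4.6, 6.2, 9.5]

# Placeholder starting parameters (assumed, not taken from the book)
weight = 1.0
bias = 0.0

def predict(distance, weight, bias):
    z = weight * distance + bias  # weighted sum (one input, one weight)
    a = z                         # linear activation
    return a

predicted_prices = [predict(d, weight, bias) for d in train_distances]
print(predicted_prices)
```

With these untrained parameters the predicted prices are just the distances themselves, nowhere near the actual prices of 81 to 149 dollars.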
measure
But how do we actually know if the neuron's prediction is good or bad?
This is when we move to the second step, measure, where we'll quantify its
performance.
[Figure: predicted value, actual value, error]
error value
Since we know the actual value of the target, we can quantify the performance by
computing the difference between the predicted and actual prices. This is called
the error value.
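A small sketch of the error computation follows. We assume the error is taken as predicted minus actual, which matches the sign used in the adjustment step later in this chapter. The predicted prices are made-up values standing in for the output of an untrained neuron.

```python
# Actual prices ($) of the eight training data points
actual_prices = [146.0, 149.0, 140.0, 134.0, 127.0, 110.0, 112.0, 81.0]

# Hypothetical predictions from an untrained neuron
predicted_prices = [0.5, 1.1, 1.6, 2.4, 3.5, 4.6, 6.2, 9.5]

# Error value for each data point (assumed convention: predicted - actual)
errors = [p - a for p, a in zip(predicted_prices, actual_prices)]
print(errors[0])  # -145.5
```

Every error here is large and negative, which tells us the neuron is predicting far too low.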
loss function
This brings us to one of the most crucial parts of designing a neural
network—choosing its loss function.
While the parameters are the dials that the network adjusts to reach its goal, the
loss function is the goal itself.
The loss function comes in various forms and it all depends on the nature of the
task. This will become clearer in Chapters 3 and 4, where we'll use other kinds of
loss functions.
[Figure: predicted values, error, squared error, mean squared error (MSE)]
[Figure: prediction line on a chart of price against distance, with its mean squared error]
minimize loss
The MSE is a measure of error. That means the smaller the MSE, the better the
network is doing.
In other words, the neuron's goal is to minimize its loss over many training
rounds.
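The MSE takes only a few lines to compute: square each error, then average. This is a minimal sketch with made-up predicted and actual values; the function name is our own.

```python
def mean_squared_error(predicted, actual):
    """Average of the squared differences between predicted and actual values."""
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return sum(squared_errors) / len(squared_errors)

# Errors are 4 and -9, so the squared errors are 16 and 81
mse = mean_squared_error(predicted=[150.0, 140.0], actual=[146.0, 149.0])
print(mse)  # 48.5

# A perfect prediction gives an MSE of zero
print(mean_squared_error([146.0, 149.0], [146.0, 149.0]))  # 0.0
```

Squaring makes every error positive, so overshooting and undershooting both count against the neuron.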
Recall that a neural network learns by adjusting its parameters - weights and
biases. Let's first focus on weights since this is where most of the learning takes
place. We’ll come back to biases later.
We want to find out how changing the weights affects the loss.
feedback
At this point, the neuron hasn't learned anything yet. And learning is exactly
what is going to happen in the third step, feedback.
learning
We have established that the neuron’s goal is to minimize the training loss by
adjusting its parameters.
This is the essence of learning in a neural network. Let’s find out how this works.
[Figure: loss plotted against weight]
minimizing loss
Let's start with one training cycle and plot the loss (i.e. MSE) on a chart.
Now, we want to bring this MSE down to be as close to zero as possible. What we
need is to find the weight value that gets us there. But how do we do this?
[Figure: loss curve plotted against weight]
minimum point
The curve's width and position may vary, but its shape will always be the same - there is a
single point where the curve reaches its minimum. And that is what we are after!
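We can see this shape by computing the MSE of the training data for a range of candidate weights. In this sketch the bias is held fixed at 150.0, an assumed value chosen only for illustration, and the candidate weights are whole numbers from -15 to 0.

```python
train_data = [
    (0.5, 146.0), (1.1, 149.0), (1.6, 140.0), (2.4, 134.0),
    (3.5, 127.0), (4.6, 110.0), (6.2, 112.0), (9.5, 81.0),
]
bias = 150.0  # held fixed; an assumed value for illustration

def loss(weight):
    """MSE over the training data for a given weight."""
    squared = [(weight * x + bias - y) ** 2 for x, y in train_data]
    return sum(squared) / len(squared)

weights = list(range(-15, 1))         # candidate weights -15, -14, ..., 0
losses = [loss(w) for w in weights]
best_weight = min(weights, key=loss)  # candidate with the lowest loss

for w, l in zip(weights, losses):
    print(w, round(l, 1))
print("lowest loss at weight", best_weight)
```

The printed losses fall steadily, bottom out at a single weight, then rise again.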
goal
And that is our goal - to get the neuron to find the weight value that will bring
the MSE to its minimum.
In practice, we won't be able to get exactly to the lowest point. But we can get
very close, and that’s good enough for most tasks.
gradient
The next question then is, how does the neuron know by how much to adjust its
weight? The answer is to find the weight gradient.
A gradient is the derivative of an output with respect to its input. In our case, the
output is the loss function, while the input is the weight. In Chapter 2, we’ll find
gradients of the loss function with respect to other types of inputs, so it’s worth
keeping this in mind.
We won’t go deep into the math, but let’s understand why this is useful for the
neural network.
Steepness
The gradient is a measure of the steepness of the curve of the loss function. And
where we are now, it's very steep. The steeper the curve, the greater the gradient.
A large gradient indicates that the weight is still far from the optimal value, so
we’ll need to adjust it by some amount.
MINIMUM gradient
But why is this so? To better understand, let's pretend that we've succeeded in
finding the ideal weight value that brings the loss to the bottom of the curve.
The gradient here is zero, which means that we no longer need to adjust the
weight.
magnitude
Notice that as we decrease the weight from its initial position to the bottom of
the curve, the gradient continues to decrease until it reaches zero.
The magnitude of the gradient informs the neuron how far its prediction is from
the actual. And by the same token, it also informs how much the neuron needs to
adjust its weights.
direction
The second property of the weight gradient is its direction.
Suppose the starting weight is on the other side of the curve. This causes the sign
of the gradient to become negative.
This indicates that the weight is too small rather than too large. Instead of
decreasing, we'll need to increase the weight to reach the bottom of the curve.
Therefore, the sign of the weight gradient informs the neuron about the direction
of weight adjustment.
[Figure: sign of the weight gradient dw: negative (-), zero (0), positive (+)]
gradient descent
The magnitude and direction of the weight gradient are the two types of
feedback returned to the network.
As the network goes through multiple training cycles, we want the weight to
move down the curve toward its minimum point.
dw = input × error
weight gradient
We’ve seen that the weight gradient is the derivative of the loss function with
respect to the weight. We won’t go into the mathematical proof, but the result is
the input value multiplied by the error. We’ll represent the weight gradient as dw
for short.
The final gradient to be passed back to the network is the average gradient from
all the training data points.
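Here is a minimal sketch of the weight gradient, averaged over the data points. It assumes the error is predicted minus actual, and it uses a tiny made-up dataset where the ideal weight is 2.0, so we can see both the magnitude and the sign at work.

```python
def weight_gradient(inputs, predicted, actual):
    """Average of input * error, with error = predicted - actual (assumed)."""
    products = [x * (p - a) for x, p, a in zip(inputs, predicted, actual)]
    return sum(products) / len(products)

inputs = [1.0, 2.0, 3.0]
actual = [2.0, 4.0, 6.0]  # made-up targets that follow actual = 2 * input

def predictions(weight):
    return [weight * x for x in inputs]

dw_too_high = weight_gradient(inputs, predictions(3.0), actual)  # weight above 2.0
dw_too_low = weight_gradient(inputs, predictions(1.0), actual)   # weight below 2.0
dw_ideal = weight_gradient(inputs, predictions(2.0), actual)     # weight exactly 2.0

print(dw_too_high, dw_too_low, dw_ideal)
```

The gradient is positive when the weight is too high, negative when it is too low, and zero at the ideal weight.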
adjust
We have now reached the fourth and final step of the training cycle, adjust. In this
step, the neuron will adjust its weight according to the gradient that it receives.
w_new = w_previous - alpha × dw
learning rate
If the gradient is a positive value, the previous weight is reduced in proportion
to the gradient's magnitude. On the other hand, if the gradient is negative, the previous
weight is increased.
Here we introduce another term called the learning rate, represented by alpha for
short. It is a value multiplied by the gradient before making the weight
adjustment.
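The adjustment itself is a single line of arithmetic. In this sketch the function name and the example numbers are our own, picked so the results are easy to check by hand.

```python
def adjust_weight(w_previous, dw, alpha):
    """Move the weight against the gradient, scaled by the learning rate."""
    return w_previous - alpha * dw

# Positive gradient: the weight is reduced
print(adjust_weight(3.0, dw=4.0, alpha=0.5))   # 1.0

# Negative gradient: the weight is increased
print(adjust_weight(1.0, dw=-4.0, alpha=0.5))  # 3.0
```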
[Figure: descent without and with the learning rate]
Without the learning rate, the descent may become unstable, or worse, never
reach the bottom of the curve.
Choosing the right value is an art in and of itself. Too small and the neuron
learns too slowly. Too high and the neuron never finds the minimum point.
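The effect of the learning rate is easy to try out. The sketch below runs 20 training rounds on a tiny made-up dataset whose ideal weight is 2.0, leaving out the bias for simplicity. The three learning rates are assumptions chosen to show the three behaviours.

```python
inputs = [1.0, 2.0, 3.0]
actual = [2.0, 4.0, 6.0]  # made-up data; the ideal weight is 2.0

def train(alpha, rounds=20, weight=0.0):
    for _ in range(rounds):
        errors = [weight * x - y for x, y in zip(inputs, actual)]
        dw = sum(x * e for x, e in zip(inputs, errors)) / len(inputs)
        weight = weight - alpha * dw
    return weight

w_small = train(alpha=0.01)    # too small: still far from 2.0 after 20 rounds
w_moderate = train(alpha=0.1)  # reasonable: very close to 2.0
w_large = train(alpha=0.5)     # too high: overshoots further on every round

print(w_small, w_moderate, w_large)
```

Which values count as too small or too high depends on the data, which is why the learning rate usually needs some experimentation.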
[Figure: weight and bias on a chart of price against distance]
bias
Let’s now bring the bias parameter into the discussion.
The weight’s role is to adjust the shape of the prediction line or curve.
Meanwhile, the bias’ role is to adjust the position of the function, shifting it up or
down.
dw = input × error
db = error
BIAS GRADIENT
We get the bias gradient db by taking the derivative of the loss function with
respect to the bias. Without going into the mathematical proof, the result is the
error value itself.
The final gradient to be passed back to the network is the average gradient from
all the training data points.
b_new = b_previous - alpha × db
adjust
As with the weight, the bias gradient is multiplied by the learning rate before
making the bias adjustment.
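The bias follows the same pattern as the weight, only with a simpler gradient. This is a minimal sketch with made-up numbers, again assuming the error is predicted minus actual.

```python
def bias_gradient(predicted, actual):
    """Average error over all data points, with error = predicted - actual (assumed)."""
    errors = [p - a for p, a in zip(predicted, actual)]
    return sum(errors) / len(errors)

def adjust_bias(b_previous, db, alpha):
    return b_previous - alpha * db

# Errors are 1, 2 and 3, so the average is 2
db = bias_gradient(predicted=[3.0, 5.0, 7.0], actual=[2.0, 3.0, 4.0])
b_new = adjust_bias(0.0, db, alpha=0.5)

print(db, b_new)  # 2.0 -1.0
```

The predictions were too high on average, so the bias is lowered to shift them down.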
[Figure: training data passing through the predict, measure, feedback, adjust cycle]
epoch
[Figure: number of epochs, number of training data points]
complete training
We'll repeat the four steps for 100 epochs. And once we've gone through all the
epochs, training will be complete!
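Putting the four steps together gives a complete training loop. Treat this as a sketch under stated assumptions: the starting values (weight and bias of 0.0) and the learning rate (0.05) are our own choices, so the final MSE will not necessarily match the figures quoted in this book.

```python
train_data = [
    (0.5, 146.0), (1.1, 149.0), (1.6, 140.0), (2.4, 134.0),
    (3.5, 127.0), (4.6, 110.0), (6.2, 112.0), (9.5, 81.0),
]

weight, bias = 0.0, 0.0  # assumed starting values
alpha = 0.05             # assumed learning rate
epochs = 100

def mse(weight, bias):
    return sum((weight * x + bias - y) ** 2 for x, y in train_data) / len(train_data)

initial_mse = mse(weight, bias)

for epoch in range(epochs):
    # Predict: weighted sum followed by linear activation
    predicted = [weight * x + bias for x, _ in train_data]
    # Measure: error for each data point (predicted - actual)
    errors = [p - y for p, (_, y) in zip(predicted, train_data)]
    # Feedback: average gradients over all training data points
    dw = sum(x * e for (x, _), e in zip(train_data, errors)) / len(train_data)
    db = sum(errors) / len(errors)
    # Adjust: update the parameters
    weight -= alpha * dw
    bias -= alpha * db

final_mse = mse(weight, bias)
print(initial_mse, final_mse)
```

Comparing the MSE before and after training shows how much the neuron has learned.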
[Figure: measure: cost and metric]
The first is cost. Cost is simply the average loss value over the training data
points (i.e. the MSE). So far, we’ve only used the term loss for simplicity, but cost
is the more precise term. In practice though, it’s not uncommon to see these two
terms used interchangeably.
The second is metric. For this task, it’s also equivalent to the MSE. It may not
make sense now why we need two measures for the same thing, but it will
become apparent when we get to Chapters 3 and 4, where the cost and metric are
different.
[Figure: cost (internal), metric (external)]
performance measure
In other words, the cost is an internal performance measure - that is, for the
algorithm.
Conversely, the metric is an external performance measure - that is, for the
human.
[Figure: cost (MSE), metric (MSE)]
monitoring
Over many epochs, we can see that the MSE continues to improve and converge
toward zero.
[Figure: prediction on training data. Predicted and actual price against distance, MSE = 16.4]
training performance
At the end of the 100 epochs, we have a trained model that gives a respectable
MSE of 16.4.
Plotting the predicted values on a chart gives us a linear regression line that’s
defined by the learned parameters.
Note that the MSE can only go to zero if the actual training data are perfectly
aligned along a straight line. In this case, they aren’t.
test data

distance (mi)  price ($)
0.3            156.00
0.7            168.00
4.9            116.00
8.5            99.00
testing
Measuring the performance of a model based on training data is a good indicator
of success, but it is far from the true measure. The reason is that we are
measuring its performance based on the data it has seen.
We need to measure its performance based on data it has never seen. For this, we
use the four test data points that we set aside.
[Figure: testing: test data, predict, measure, metric]
testing
For the test data, we don’t need to go through all four steps of the cycle. We only
need the predict and measure steps.
In predict, we pass the feature (distance) through the neural network and
get the prediction (price) at the other end.
In measure, we compute the metric (the MSE) of the prediction. The cost is
internal to the model and it’s used only during training, so we won’t need to
consider that.
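Testing can be sketched in a few lines, since only the predict and measure steps are involved. The parameters below (`weight = -7.0`, `bias = 150.0`) are hypothetical stand-ins for a trained model, not the values learned in this book, so the resulting MSE will differ from the one reported here.

```python
test_data = [(0.3, 156.0), (0.7, 168.0), (4.9, 116.0), (8.5, 99.0)]

# Hypothetical trained parameters, assumed for illustration
weight, bias = -7.0, 150.0

# Predict: only the feature (distance) is passed through the neuron
predicted = [weight * x + bias for x, _ in test_data]

# Measure: compare the predictions with the actual prices using the MSE
test_mse = sum((p - y) ** 2 for p, (_, y) in zip(predicted, test_data)) / len(test_data)

print(predicted)
print(test_mse)
```

No gradients are computed and no parameters change, so the model is judged exactly as it was at the end of training.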
[Figure: prediction on test data. Predicted and actual price against distance, MSE = 144.5]
test performance
We get an MSE that is substantially worse than that of the training data.
Examined visually, this is because the test data are more scattered compared
to the training data. This indicates a slightly different distribution of the test
data compared to the training data.
However, for this amount of data, we can’t really confirm that, and this
performance is fine.
In general, the performance with the test data will inevitably not be at the same
level as with the training data. But our job is to get them as close as possible.
linear regression
And that's the end of our first example. If you're familiar with linear regression,
you might be wondering why we're taking all the trouble to use a neural network
to build a linear regression model?
You are right! There are other simpler algorithms that would have achieved the
same outcome.
But this is where the similarity ends. In the following chapters, we'll start to
build complexities and non-linearities, which is when the neural network shines.
recap
In measure, we measure the cost, that is, how far the predictions are from the
actual values.
In feedback, we compute the parameter gradients and feed them back to the
neuron.
2 - non-linear regression
[Figure: target plotted against feature]
non-linearity
In the last chapter, our single-neuron neural network had the task of modeling a
linear relationship.
neural network
This is when we need a proper neural network.
A neural network consists of its building blocks - neurons. These neurons learn
uniquely at the individual level and synergistically at the collective level.
In its basic architecture, a neural network consists of layers. Each layer has its
own number of neurons, or units.
data flow
The lines represent the flow of data. Each neuron receives inputs from all the
neurons in the preceding layer. Then it sends its output to all neurons in the
next layer.
[Figure: network with neurons A, B, C in the first layer, D, E, F in the second layer, and G in the third]
example
The neurons and connections can make the neural network look complicated, so
let's break it down and look at a couple of examples.
Example 1: Neuron A receives three inputs directly from the data and sends its
output to 3 neurons in the next layer.
Example 2: Neuron E receives three inputs, which are the outputs of neurons A, B, and
C in the previous layer, and sends its output to one neuron in the next layer.
[Figure: the same network; each neuron computes a weighted sum z and an activation a]
neuron computations
Regardless of how many inputs and outputs a neuron is connected to, or in which
layer the neuron is located, it goes through the same set of
computations—weighted sum and activation.
Take Neuron B, for example. It takes three inputs, computes the weighted sum on
these inputs, and then performs the activation. The output of the activation is
then passed to three neurons in the next layer.
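A layer is simply the same neuron computation repeated with different parameters. The sketch below models the first layer (neurons A, B, and C), each taking the same three inputs. All weights, biases, and input values are made up for illustration, and a linear activation is assumed to keep the arithmetic easy to follow.

```python
def neuron(inputs, weights, bias):
    z = sum(x * w for x, w in zip(inputs, weights)) + bias  # weighted sum
    a = z                                                   # linear activation (assumed)
    return a

def layer(inputs, layer_weights, layer_biases):
    """Run every neuron in the layer on the same inputs."""
    return [neuron(inputs, w, b) for w, b in zip(layer_weights, layer_biases)]

inputs = [1.0, 2.0, 3.0]

# One row of weights and one bias per neuron (hypothetical values)
weights_abc = [
    [1.0, 0.0, 0.0],   # neuron A
    [0.5, -1.0, 2.0],  # neuron B
    [0.0, 0.0, 1.0],   # neuron C
]
biases_abc = [0.0, 0.5, -1.0]

outputs = layer(inputs, weights_abc, biases_abc)
print(outputs)  # [1.0, 5.0, 2.0]
```

These three outputs then become the three inputs to every neuron in the next layer.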
the goal
Let's now build the neural network architecture we need for this task.
Before that, we’ll define the goal for this task, which is the same as in Chapter
1: to predict a hotel’s average daily room rate (i.e. price).
training data (24)

distance (mi)  rating  price ($)
0.2            3.5     157.00
0.2            4.8     155.00
0.5            3.7     146.00
0.7            4.3     168.00
0.8            2.7     147.00
1.5            3.6     136.00
1.6            2.6     140.00
2.4            4.7     134.00
3.5            4.2     116.00
3.5            3.5     127.00
4.6            2.8     106.00
4.6            4.2     110.00
4.9            3.8     116.00
6.2            3.6     112.00
6.5            2.4     92.00
8.5            3.1     99.00
9.5            2.1     81.00
9.7            3.7     92.00
11.3           2.9     75.00
14.6           3.8     108.00
17.5           4.6     166.00
18.7           3.8     188.00
19.5           4.4     211.00
19.8           3.6     207.00

test data (8)

distance (mi)  rating  price ($)
0.3            4.6     156.00
0.5            4.2     162.00
1.1            3.5     149.00
1.2            4.7     145.00
2.7            2.7     123.00
3.8            4.1     118.00
7.3            4.6     82.00
19.4           4.8     209.00
the dataset
The difference is this time, we have two features instead of one. Here, we bring
back the rating feature that we left out in Chapter 1.
Another difference is the size of the dataset. We had only 12 data points in
Chapter 1. For this task, we are adding 20 more, making up a total of 32 data
points. We'll use 24 for training and 8 for testing.
The result is, instead of linear, our task now becomes a non-linear regression task.
Let's see why this is so.
[Figure: price plotted against distance; the axes extend to a price of 211 and a distance of 19.8.]
the dataset
The data points we used in Chapter 1 are on the left side of this curve. As we add
more hotels to the dataset, we find that the dynamic changes. In the beginning,
the farther we get from the city center, the cheaper the prices become. This is
expected because there will be a higher demand for hotels closer to the center.
But there is a point in the middle where the room rates get more expensive the
farther away we get. The reason is that these are resort-type hotels that
charge similar, if not higher, prices.
layers
We have seen that a neural network consists of layers. A typical neural network,
like the one we are building, has one input layer and one output layer. Everything
in between is called the hidden layer.
rating
input layer
Let's get started with the architecture.
The number of inputs is equal to the number of features, which means we'll have
two inputs.
The choice of the number of layers and units depends on the complexity of the
data and the task. In our case, we have a small dataset, so a single hidden layer
with three units is sufficient.
What do hidden layers do? A hidden layer transforms the information it receives
from the previous layer into useful forms. Guided by the goal of the task, it looks
for patterns and signals and decides which ones are important.
This cascades across the layers and up to the output layer, which will have
received a summarized and relevant piece of information to aid its predictions.
biases
So, in our case, we have a total of nine weights and four biases. And as in Chapter
1, we'll assign initial values for these parameters.
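The parameter count can be checked with a short sketch. It assumes the architecture implied by the text (two inputs, one hidden layer of three units, one output neuron) and full connectivity between layers; the function name is only illustrative.

```python
# Layer sizes assumed from the text: 2 inputs, 3 hidden units, 1 output.
layer_sizes = [2, 3, 1]

def count_parameters(sizes):
    """Count the weights and biases of a fully connected network."""
    # Every neuron is connected to every neuron in the preceding layer.
    weights = sum(n_in * n_out for n_in, n_out in zip(sizes[:-1], sizes[1:]))
    # One bias per neuron outside the input layer.
    biases = sum(sizes[1:])
    return weights, biases

n_weights, n_biases = count_parameters(layer_sizes)
print(n_weights, n_biases)  # 9 4
```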
[Figure: the training cycle over the training data: predict, measure, feedback, adjust.]
training
Now that the data and architecture are in place, it's time to start training.
predict
Let’s begin with the first step, predict.
predict
Recall that in this step, each data point is passed through the neural network and
a prediction is generated on the other side.
Now that we have more neurons, the weighted sum and activation computations
will take place at each neuron. Let’s look at a couple of examples.
$$z = x_1 w_1 + x_2 w_2 + b$$
$$a = z$$
example 1
The first example is the first neuron in the hidden layer. It takes the original
data’s features as inputs, performs the weighted sum, and adds a bias value. Then
it goes through a linear activation function, which returns the same output as the
input.
example 2
The second example is the neuron in the output layer. It takes the three outputs
of the previous layer as inputs. As in the first example, it then performs the
weighted sum, adds a bias, and performs the linear activation.
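The two examples can be put together into a minimal sketch of the predict step. The parameter values below are hypothetical placeholders, not the initial values used in the book, and the function names are illustrative.

```python
def weighted_sum(inputs, weights, bias):
    """z = x1*w1 + x2*w2 + ... + b"""
    return sum(x * w for x, w in zip(inputs, weights)) + bias

def linear(z):
    """Linear activation: returns its input unchanged (a = z)."""
    return z

def predict(features, hidden_weights, hidden_biases, output_weights, output_bias):
    # Each hidden neuron sees the same features but has its own weights and bias.
    hidden_outputs = [
        linear(weighted_sum(features, w, b))
        for w, b in zip(hidden_weights, hidden_biases)
    ]
    # The output neuron takes the three hidden outputs as its inputs.
    return linear(weighted_sum(hidden_outputs, output_weights, output_bias))

# Hypothetical parameters for a 2-3-1 network.
hidden_weights = [[0.1, 0.2], [0.3, -0.1], [-0.2, 0.4]]
hidden_biases = [0.0, 0.0, 0.0]
output_weights = [0.5, -0.3, 0.2]
output_bias = 0.0

# First training data point: distance 0.2 mi, rating 3.5.
prediction = predict([0.2, 3.5], hidden_weights, hidden_biases,
                     output_weights, output_bias)
print(prediction)
```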
measure
In the second step, measure, we quantify the performance of the prediction.
measure
This is still a regression task, so we can use the same loss function as in Chapter
1—the MSE.
Averaging the squared error over all twenty-four training data points gives us the
MSE.
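As a small sketch of the measure step, the MSE can be computed as below. The actual prices are the first three training rows; the predictions are made-up values for illustration only.

```python
def mean_squared_error(actual, predicted):
    """Average of the squared differences between predictions and actual values."""
    squared_errors = [(p - a) ** 2 for a, p in zip(actual, predicted)]
    return sum(squared_errors) / len(squared_errors)

actual_prices = [157.0, 155.0, 146.0]      # from the training data
predicted_prices = [150.0, 160.0, 140.0]   # hypothetical predictions
print(mean_squared_error(actual_prices, predicted_prices))
```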
feedback
The third step, feedback, is where it gets interesting. Here, we’ll find a lot more
things going on compared to the single-neuron case.
This part of the book will be quite dense. For this, it is helpful to keep in mind
the goal of this step, which is to find the parameter gradients so the neural
network can adjust its parameters.
feedback
We'll start with the output layer and move backward to the input layer. Again, for
simplicity, we'll focus on the weights for now and come back to the biases later.
single neuron
In Chapter 1, with a single-neuron network, we computed the weight gradient
based on the loss (MSE). But what really happened under the hood? Let’s now see
how the loss was fed back to the neural network.
[Figure: forward computation graph: the input i and the weight feed the weighted sum z, followed by the activation a.]
Here we have the forward computation graph, which represents the predict step
of the training cycle.
Note that the graph you are seeing is a slightly simplified version, sufficient to
aid our discussion.
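[Figure: backward computation graph: the loss feeds back through da and dz to the weight gradient dw.]
$$dw = \text{input} \times \text{error} = \text{input} \times dz$$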
We need these other gradients to arrive at dw. The concept is called the chain rule.
We won't cover the math, but the idea is this: We can compute a particular
gradient if we know the gradient adjacent to it. Here, we can compute da from
the loss value, which means we can compute dz, which means we can compute
dw.
In fact, whenever error was mentioned in Chapter 1, it was referring to dz, which
is the gradient adjacent to dw.
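The chain from the loss back to dw can be sketched for a single neuron and a single data point. This sketch assumes the loss for one data point is half the squared error, so that da is simply the prediction minus the actual value; the book does not derive these formulas, so treat the exact scaling as an assumption.

```python
def single_neuron_gradients(x, w, b, y):
    """Forward pass followed by the backward pass for one linear neuron."""
    z = x * w + b      # weighted sum
    a = z              # linear activation
    da = a - y         # assumed: gradient of 0.5 * (a - y)**2 with respect to a
    dz = da * 1.0      # the linear activation has a slope of 1
    dw = x * dz        # dw = input * dz
    return da, dz, dw

da, dz, dw = single_neuron_gradients(x=2.0, w=3.0, b=1.0, y=5.0)
print(da, dz, dw)  # 2.0 2.0 4.0
```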
[Figure: backward computation graph for all four neurons, with da, dz, and dw flowing back from the loss.]
backpropagation
Let’s return to this chapter’s neural network. The backward computation graph is
shown here for all four neurons. Starting from the loss value, information flows
back to all neurons so that each weight receives its gradient, dw.
[Figure: backward computation graph for neurons A and B, with two weight gradients highlighted.]
$$dw = \text{input} \times dz$$
example
In this example, we’ll focus on two weight gradients, red and blue, involving two
neurons, A and B.
Tracing the lines along the graph, we obtain the red dw by multiplying the input
data i by neuron A’s dz.
You may be wondering, how did we know the formula for dw in the first place,
and what about the formulas for dz and da? Unfortunately, we won’t cover them
in this book, but if you are curious, do check out other resources to understand
the math derivations.
$$dw = \text{input} \times dz$$
magnitude
Let's take a closer look at the magnitude of the weight gradients, starting with
the neuron in the output layer.
Inspecting the formula shows that the larger the input, the larger the
corresponding weight gradient. But what does this mean?
[Figure: inputs #1, #2, and #3 with their weights, and dw, dz, and the loss flowing back.]
error contribution
This means that the larger inputs have a greater influence on the output value,
and thus on the prediction outcome. Therefore, this formula enables the network
to assign the larger weight adjustments to inputs that make the bigger difference.
Suppose input #1 is the largest. This tells us that input #1 is the one that
contributed the most toward the error. So, we want to tell input #1 to make the
biggest weight adjustment, and we do that by giving it the biggest weight
gradient.
direction
Recall that weight gradients have both magnitude and direction. As the gradients
backpropagate, their directions can change too.
If exactly one of the input and dz is negative, then the weight gradient will be
negative. If both are negative, the gradient is positive.
negative gradient
A negative weight gradient means that the network will need to increase its
weight instead.
If you recall the gradient descent discussion in Chapter 1, a negative gradient
means the loss curve slopes downward at the current weight, so the minimum lies
in the direction of a larger weight.
The good news is, in practice, there are well-established deep learning
frameworks such as PyTorch and TensorFlow that handle all the tedious
computations on our behalf!
example neuron
Take this neuron in the middle for example. It has just received the da gradients
from its outputs. But what does it do with them?
[Figure: the neuron sums the da values (-, +, +) arriving from its outputs, then passes the result back through dz to dw.]
Example neuron
The first thing it does is to sum up these da values. This sum represents the net
gradient coming from the three outputs.
From the summed da, the neuron computes its dz. And then, we get the dw for each
input by multiplying dz by that input.
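The steps for this neuron can be sketched as follows. The da values and inputs are made up, and the activation slope of 1 assumes the linear activation used so far in this chapter; other activations would use their own slope.

```python
def backprop_neuron(inputs, da_from_outputs, activation_slope=1.0):
    """Feedback step for one neuron: returns (da, dz, list of dw)."""
    da = sum(da_from_outputs)        # net gradient arriving from the next layer
    dz = da * activation_slope       # slope is 1 for the linear activation
    dw = [x * dz for x in inputs]    # dw = input * dz, one per incoming weight
    return da, dz, dw

# Hypothetical values: three incoming inputs, three da values from the outputs.
da, dz, dw = backprop_neuron(inputs=[2.0, -1.0, 0.5],
                             da_from_outputs=[-1.0, 2.0, 1.0])
print(da, dz, dw)  # 2.0 2.0 [4.0, -2.0, 1.0]
```

Note how the negative input ends up with a negative dw even though dz is positive.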
adjust
The fourth step, adjust, is where we perform the weight adjustments.
weight adjustment
As in Chapter 1, we can now adjust the nine weights according to the gradients
that they receive.
[Figure: each weight w receives its gradient dw, and each bias b receives its gradient db.]
$$db = dz = \text{error}$$
bias
Let’s now bring the bias term back into the picture. As before, the formula for the
bias gradient db is simply equal to the error, which we now know is given by dz.
bias adjustment
The neural network can now adjust the biases based on their gradients.
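A minimal sketch of the adjust step, assuming the standard gradient descent update in which each parameter moves against its gradient, scaled by the learning rate (alpha). The value 0.08 is the learning rate the book reports using; the parameter and gradient values are hypothetical.

```python
def adjust(value, gradient, alpha=0.08):
    """Move a weight or bias a small step against its gradient."""
    return value - alpha * gradient

weight, dw = 0.5, -2.0   # a negative gradient...
bias, db = 0.1, 1.5      # ...and a positive one

new_weight = adjust(weight, dw)   # increases, because dw is negative
new_bias = adjust(bias, db)       # decreases, because db is positive
print(new_weight, new_bias)
```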
epochs
We'll repeat the training cycle for 800 epochs. This task requires more epochs
compared to Chapter 1. This is because of the non-linearity, which will take a
longer time to learn than a linear case.
error
Now that training is complete, the neural network should have learned enough to
produce decent predictions.
[Figure: predicted vs. actual price against distance, training data.]
training performance
However, this doesn’t seem to be the case. The MSE of the training data is very
poor. Meanwhile, when plotted, the predictions seem as good as random. It
appears that the neural network hasn’t learned much!
[Figure: predicted vs. actual price against distance, test data.]
test performance
The same is true with the test data. The predictions don’t seem anywhere close to
the actual values.
[Figure: the activation step of a neuron, shown as output against input.]
activation function
The answer lies in the activation function that we used. All neurons were using
linear activation.
linear activation
As we've seen, a linear activation function returns exactly the value it
receives as input.
For this reason, our neural network was unable to capture the non-linearities in
the data.
[Figure: 3D plot of the predictions against distance and rating.]
linear plane
Returning to our earlier plot of training predictions, it may not appear that the
results were linear.
But indeed, they were. If we were to plot the predictions on a 3D chart, we would
see them falling on a linear 2D plane.
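One way to see why the predictions fall on a plane: with linear activations everywhere, the whole network folds into a single weighted sum of the two features. The sketch below checks this with hypothetical parameter values; the function names are illustrative.

```python
def linear_network(x1, x2, hidden, output):
    """2-3-1 network where every neuron uses the linear activation a = z."""
    hidden_out = [w1 * x1 + w2 * x2 + b for (w1, w2, b) in hidden]
    out_weights, out_bias = output
    return sum(w * h for w, h in zip(out_weights, hidden_out)) + out_bias

def collapse(hidden, output):
    """Fold the network into one plane: prediction = c1*x1 + c2*x2 + c0."""
    out_weights, out_bias = output
    c1 = sum(w * h[0] for w, h in zip(out_weights, hidden))
    c2 = sum(w * h[1] for w, h in zip(out_weights, hidden))
    c0 = sum(w * h[2] for w, h in zip(out_weights, hidden)) + out_bias
    return c1, c2, c0

# Hypothetical parameters: (w1, w2, b) per hidden neuron, then (weights, bias).
hidden = [(0.5, -1.0, 0.1), (1.5, 0.2, -0.3), (-0.7, 0.4, 0.0)]
output = ([0.3, -0.6, 1.2], 2.0)

c1, c2, c0 = collapse(hidden, output)
for x1, x2 in [(0.2, 3.5), (9.5, 2.1), (19.8, 3.6)]:
    network_value = linear_network(x1, x2, hidden, output)
    plane_value = c1 * x1 + c2 * x2 + c0
    print(round(network_value, 6), round(plane_value, 6))  # the two always match
```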
The more layers and neurons the neural network has, the more complex
relationships in the data it can capture.
[Figure: the ReLU activation, shown as output against input.]
For any positive input, the ReLU activation outputs the same value, just like the
linear activation.
The difference is for the negative inputs. If the input is negative, the ReLU
activation will output zero.
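In code, the ReLU described here is a one-liner. This is a minimal sketch for a single input value.

```python
def relu(z):
    """Return z for positive inputs and zero otherwise."""
    return max(0.0, z)

print(relu(3.0))   # 3.0
print(relu(-2.0))  # 0.0
```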
choice
The ReLU effectively works like a gate - it turns on whenever the input is above
zero and turns off otherwise. As simple as it may look, it enables the neuron to
have a pretty powerful ability—a choice.
But with ReLU, every neuron can make a difference. It can now choose to only
‘activate’ when the input is positive.
[Figure: the network with ReLU in the hidden layer and linear activation in the output layer.]
reconfigure
We'll now replace the linear activation functions in the hidden layer with ReLU.
In the output layer, it’s fine to keep the linear activation for this task. In the next
chapter, we'll see a task where we do need to change the activation function in
the output layer.
[Figure: predicted vs. actual price against distance, training data; MSE values shown in the figure: 100.4 and 50.8.]
training performance
This time, it looks like we are heading in the right direction. The MSE of the
training predictions is much better now.
[Figure: predicted vs. actual price against distance, test data.]
test performance
And the MSE of the test predictions is not too bad either. As in Chapter 1, we can
bring it closer to the training MSE by having more data points and ensuring that
the training and test distributions are similar.
more neurons
If we want to, we can further improve the performance by adding more layers and
units.
As we add more neurons, the neural network will be able to increase the
granularity of its predictions, resulting in a prediction curve that is smoother and
more well-defined.
[Figure: output-against-input plots of several activation functions, including sigmoid and tanh.]
activation functions
There are many other types of activation functions. Some of the more commonly
used ones are shown here. Each activation function allows the neural network to
exhibit a different kind of quality compared to the others.
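3 - binary classification
[Figure: roadmap of the four tasks: linear regression, non-linear regression, binary classification, multi-class classification.]
[Figure: a regression output (1.45) next to classification outputs (car, tree).]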
classification
We have seen how the neural network performs regression tasks. In this chapter,
we'll see how it performs another type of task—classification.
[Figure: classification examples in vision and in natural language (sentence sentiment).]
In fact, there are only two major differences to note. The first is the type of
activation function and the second is the type of loss function. We’ll learn more
about them in this chapter.
the dataset
We'll use the same dataset as in Chapter 2 with one difference—we have replaced
the target with a new one.
Let's suppose that we are tracking the hotels with the highest demand,
segregating them from those which are not. This is represented by the new target
called hot which has two possible values - yes and no.
[Figure: the hot class (yes / no) of each training data point, plotted against distance.]
the dataset
Recall that for regression, we wanted to model the line (for a single-feature
scenario) or plane (for a two-feature scenario) on which the predicted values
would fall.
Shown here is the plot of the actual class for each training data point, along
with an example of a hand-drawn boundary that separates the two classes.
This is what we want our neural network to produce.
hot    value
yes    1
no     0
binary classification
Our task is called a binary classification task because the outcome will be either
one or the other - yes or no.
Before we can proceed with training, we need to convert the labels into a
numeric format. For this purpose, we'll assign discrete values 1 for yes and 0 for
no.
We'll stick with the same number of layers and units as in the previous task. The
activation function in the hidden layer also remains unchanged as ReLU.
[Figure: the network with ReLU in the hidden layer and sigmoid in the output layer, predicting hot.]
output layer
In fact, the overall architecture remains the same.
The only difference is in the output, where we’ll replace the linear activation
function with a new one called sigmoid. Let’s take a look at how it works.
We can’t achieve that with the current configuration. This is because the linear
activation outputs continuous values instead of discrete.
A sigmoid function solves this problem by squeezing its input into a value
between 0 and 1.
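A minimal sketch of the sigmoid function, using the standard formula 1 / (1 + e^(-z)). For an input of 3 it gives roughly 0.95.

```python
import math

def sigmoid(z):
    """Squeeze any real number into a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

print(round(sigmoid(3.0), 2))   # 0.95
print(sigmoid(0.0))             # 0.5
print(round(sigmoid(-3.0), 2))  # 0.05
```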
[Figure: the sigmoid activation; an input of 3 gives an output of 0.95.]
output layer
For this reason, we'll use the sigmoid activation function in the output layer.
discretize
This still leaves us with a small problem. We can now convert the output to fall
between 0 and 1, but what we need is a value of either 0 or 1. The output has to be
discrete.
For this, we can add an extra computation step to convert the output to 1 if it's
greater than 0.5, and to 0 otherwise.
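The extra step can be sketched as a simple threshold. Mapping an output of exactly 0.5 to 0 is an assumption of this sketch.

```python
def to_label(sigmoid_output, threshold=0.5):
    """Convert a value between 0 and 1 into a discrete label of 0 or 1."""
    return 1 if sigmoid_output > threshold else 0

print(to_label(0.88))  # 1
print(to_label(0.12))  # 0
```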
[Figure: a sigmoid output of 0.10 on a scale from 0 to 1.]
probability
Why does this work? We can think of the output of the sigmoid activation as
representing a probability.
For example, if the output is 0.88, the neural network is indicating that the label
has a higher probability of being 1. And if the output is 0.12, it has a higher
probability of being 0.
loss function
But how can we implement this concept? The answer is the loss function.
Recall that the loss function defines the goal of the neural network, and as a
result, dictates how it behaves.
It turns out, however (we won't cover the math here), that using the MSE in a
binary classification task results in a non-convex loss function. This means
that there can be more than one location along the loss curve where a local
minimum exists, making it difficult for the neural network to find the true, or
global, minimum.
For this reason, we'll need to use a different type of loss function.
[Figure: the sigmoid output compared with the actual value, with the goal marked for each class.]
the goal
Our goal is to output a prediction of 1 when the actual value is 1 and a prediction
of 0 when the actual value is 0.
[Figure: loss against sigmoid output, one curve for class = 1 and one for class = 0, with our goal marked on each.]
loss function
The loss function we'll be using is called the binary cross entropy. This function
fulfills our need to have a convex loss curve.
The plots here depict the shape of this loss function, with each class having its
own loss curve.
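As a sketch, the binary cross entropy for a single data point can be written as below. It uses the natural logarithm, which is the common convention but an assumption here; a different log base only rescales the loss.

```python
import math

def binary_cross_entropy(actual, predicted):
    """Loss for one data point; `predicted` is the sigmoid output (0 < p < 1)."""
    return -(actual * math.log(predicted)
             + (1 - actual) * math.log(1 - predicted))

# When the actual class is 1, the loss shrinks as the output approaches 1.
print(binary_cross_entropy(1, 0.9), binary_cross_entropy(1, 0.1))
# When the actual class is 0, the loss shrinks as the output approaches 0.
print(binary_cross_entropy(0, 0.1), binary_cross_entropy(0, 0.9))
```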
value = 1 value = 0
gradient descent
From here, we can perform the weight updates using the same gradient descent
approach as in Chapters 1 and 2.
training
We are now in a position to train the neural network. The training cycle follows
the same four steps as in the previous chapters.
actual value   predicted value   correct?
1              1                 y
0              0                 y
1              0                 n
0              1                 n
1              0                 n
...            ...               ...
0              0                 y
$$\text{accuracy} = \frac{\text{total correct}}{\text{total predictions}}$$
accuracy
There is one last change, which is the metric to measure performance.
In the regression task, we used the MSE both as the cost and the metric.
[Figure: actual vs. predicted classes (yes / no) against distance, training data; accuracy = 92%.]
training performance
Using the accuracy metric, we can see that the model does pretty well on the
training dataset.
The plot also shows the decision boundary of our trained neural network,
representing the predictions.
[Figure: actual vs. predicted classes (yes / no) against distance, test data; accuracy = 88%.]
test performance
The prediction on the test data also shows a respectable performance, given the
limited number of data points.
[Figure: actual vs. predicted classes (yes / no) against distance; accuracy = 100%.]
a bigger network
As in the regression task, we can further improve the performance by adding
more layers and units.
The plot here shows the decision boundary after modifying the neural network to
contain two hidden layers with five and ten units. The granularity of its
predictions increased, resulting in a more well-defined and accurate decision
boundary.
#   actual value   predicted value   correct?
1   0              0                 y
2   0              0                 y
3   0              0                 y
4   0              0                 y
5   0              0                 y
6   0              0                 y
$$\text{accuracy} = \frac{\text{total correct}}{\text{total predictions}} = 93\%$$
However, accuracy can become misleading in some cases. This is especially true
when dealing with imbalanced data, which is when certain classes contain way
more data points than the rest.
Let's take the example of predicting fraudulent credit card transactions. Suppose
we have a dataset of 100 data points, of which only 7 are fraud cases. If we simply
predicted all the outputs to be 0, we would still manage to get an accuracy value
of 93%! Clearly something isn't right.
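The 93% figure is easy to reproduce. This sketch builds the 100 labels described above (7 fraud cases, 93 non-fraud) and a model that always predicts 0.

```python
def accuracy(actual, predicted):
    """Fraction of predictions that match the actual values."""
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual)

actual = [0] * 93 + [1] * 7   # 93 non-fraud cases, 7 fraud cases
predicted = [0] * 100         # always predict "not fraud"

print(accuracy(actual, predicted))  # 0.93, yet not a single fraud case is caught
```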
                   predicted value 0   predicted value 1
actual value 0     true negative       false positive
actual value 1     false negative      true positive
confusion matrix
The confusion matrix provides a more balanced way to measure performance. It
shows the count of predictions falling into each of the following four categories:
True Negative (TN): when both the actual and predicted values are 0
False Positive (FP): when the actual value is 0 but the predicted value is 1
False Negative (FN): when the actual value is 1 but the predicted value is 0
True Positive (TP): when both the actual and predicted values are 1
For our example, 0 represents the not fraud class while 1 represents the fraud
class.
#          actual value   predicted value   correct?
1-90       0              0                 y
91-93      0              1                 n
94-98      1              0                 n
99-100     1              1                 y

                   predicted value 0   predicted value 1
actual value 0     TN = 90             FP = 3
actual value 1     FN = 5              TP = 2
applied to data
Applied to the credit card fraud dataset, we get 90, 3, 5, and 2 respectively for TN,
FP, FN, and TP.
precision
$$\text{precision} = \frac{TP}{TP + FP} = \frac{2}{2 + 3} = 40\%$$
recall
$$\text{recall} = \frac{TP}{TP + FN} = \frac{2}{2 + 5} = 29\%$$
The reason for the curiously high accuracy was the dominance of the true
negatives, diluting the rest. With precision and recall, we remove the focus on
these true negatives and instead give all the attention to the other three
categories.
We can see from the precision and recall scores that they offer a more reliable
performance indicator than accuracy.
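The counts and scores for the credit card example can be reproduced with a short sketch. The label lists are rebuilt from the counts given in the text (90 TN, 3 FP, 5 FN, 2 TP); the function name is illustrative.

```python
def confusion_counts(actual, predicted):
    """Return (TN, FP, FN, TP) for binary labels of 0 and 1."""
    pairs = list(zip(actual, predicted))
    tn = pairs.count((0, 0))
    fp = pairs.count((0, 1))
    fn = pairs.count((1, 0))
    tp = pairs.count((1, 1))
    return tn, fp, fn, tp

actual = [0] * 93 + [1] * 7
predicted = [0] * 90 + [1] * 3 + [0] * 5 + [1] * 2

tn, fp, fn, tp = confusion_counts(actual, predicted)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(tn, fp, fn, tp)                          # 90 3 5 2
print(round(precision, 2), round(recall, 2))   # 0.4 0.29
```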
$$F_1 = 2 \times \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}}$$
f1 score
However, we still need to strike a balance between precision and recall.
To illustrate its importance, suppose the model from the credit card fraud
prediction achieved high recall and low precision. This would lead to a high
number of false positives. This is good for detecting as many fraud cases as
possible, but comes at the expense of flagging non-fraud cases as frauds.
Conversely, if the model achieved high precision and low recall, this would result
in a high number of false negatives. This is good for correctly classifying the
non-fraud cases but will miss out on real fraud cases.
We can address this problem with the F1 Score, which provides a balanced
emphasis on precision and recall.
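As a quick sketch, here is the F1 Score applied to the credit card example, where precision is 2/5 and recall is 2/7.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * (precision * recall) / (precision + recall)

precision = 2 / (2 + 3)   # TP / (TP + FP)
recall = 2 / (2 + 5)      # TP / (TP + FN)
print(round(f1_score(precision, recall), 2))  # 0.33
```

A low score on either precision or recall pulls the F1 Score down, which is why it sits below the simple average of the two.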
                   predicted value 0   predicted value 1
actual value 0     TN = 11             FP = 2
actual value 1     FN = 1              TP = 10
performance
Going back to our hotels dataset, we get an accuracy score that’s comparable
with precision and recall, indicating that our dataset is balanced.
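4 - multi-class classification
[Figure: roadmap of the four tasks: linear regression, non-linear regression, binary classification, multi-class classification.]
[Figure: binary classification (car / not car) next to multi-class classification (car, bus, motorbike, bicycle).]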
Such a task is called multi-class classification. To make the neural network work
for this type of task, we'll need to modify its architecture slightly. This is what
this chapter is about.
the dataset
We'll use the same dataset as in Chapter 3 and, again, introduce a new label
called category. Suppose there's a certification agency that classifies hotels
into three categories: gold, silver, and bronze.
Our goal is to predict the category for a given hotel based on the features.
The label is a categorical variable, which means that order is not implied. Though
the class names suggest that some order may exist, we just want to classify them
without worrying about which class is better than which.
[Figure: the category (gold, silver, bronze) of each training data point, plotted against distance.]
the dataset
Shown here is the plot of the actual class for each training data point, along with
an example of hand-drawn boundaries separating the three classes. This is what
we want our neural network to produce.
category   gold   silver   bronze
silver     0      1        0
gold       1      0        0
silver     0      1        0
gold       1      0        0
bronze     0      0        1
...        ...    ...      ...
gold       1      0        0
gold       1      0        0
silver     0      1        0
...        ...    ...      ...
gold       1      0        0
one-hot encoding
First, we need to convert the classes into a numeric format. Since we have more
than two classes this time, converting them into 0s and 1s won’t work.
We need to use a method called one-hot encoding. Here, we create a new column
for each class. Then we treat each column as a binary classification output,
assigning the value 1 for yes and 0 for no.
Note that if we also had features of the categorical type, we would apply the same
method. Suppose we had a feature called view with possible values of pool, garden,
and none. This will translate into three one-hot encoded features, one for each
class.
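One-hot encoding can be sketched in a few lines. The column order (gold, silver, bronze) follows the table above; the function name is illustrative.

```python
classes = ["gold", "silver", "bronze"]

def one_hot(label, classes):
    """Return one column per class: 1 for the matching class, 0 elsewhere."""
    return [1 if label == c else 0 for c in classes]

for label in ["silver", "gold", "bronze"]:
    print(label, one_hot(label, classes))
# silver [0, 1, 0]
# gold [1, 0, 0]
# bronze [0, 0, 1]
```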
We'll also retain the same number of hidden layers and units, and keep ReLU as
the activation function.
output neuron   data point 1   2   3   4   5
gold            0              1   0   1   0
silver          1              0   1   0   0
bronze          0              0   0   0   1
output values
Here we have the labels of the first five data points—silver, gold, silver, gold, and
bronze.
Take the first data point as an example. For the silver class, the neural network
should ideally predict 0, 1, and 0 for the first, second, and third neurons.
In short, for each data point, the neuron of the actual class should output 1 while
other neurons should output 0.
[Figure: the softmax activation in the output layer, made up of two steps, exponent and normalize; an input of 3 gives an exponent output of 20.1.]
exponent
In the first step, it exponentiates the input, that is, it raises e (about 2.718)
to the power of the input.
exponent
The effect of exponentiation is to transform any number into a positive number.
Additionally, it amplifies the large inputs more than the small ones.
Here we have the three units of neurons, one for each class.
exponent
Each neuron performs the exponentiation on its input, which then becomes the
input for the normalization step.
[Figure: each neuron's value is divided by the sum over all three neurons, so that the three outputs add up to 1.]
normalize
In the normalization step, each input is divided by the sum of all inputs. This
becomes the output of the neural network.
As a result, the sum of all outputs will always be 1. This is a useful outcome
because we can now treat the outputs as probability values, as we did in Chapter
3.
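The two steps, exponent and normalize, can be sketched as follows. The three input values are hypothetical.

```python
import math

def softmax(inputs):
    """Exponentiate each input, then divide by the sum of the exponents."""
    exponents = [math.exp(z) for z in inputs]
    total = sum(exponents)
    return [e / total for e in exponents]

print(round(math.exp(3.0), 1))        # 20.1, the exponent step for an input of 3

outputs = softmax([3.0, 1.0, 2.0])    # hypothetical inputs to the three neurons
print([round(o, 2) for o in outputs])
print(round(sum(outputs), 6))         # 1.0
```

Library implementations usually subtract the largest input before exponentiating to avoid overflow; that refinement is left out here for clarity.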
class    actual   softmax output   predicted
gold     0        0.5              1
silver   1        0.2              0
bronze   0        0.3              0
example prediction
Let’s take an example where the actual class is silver. And suppose that each
neuron’s softmax activation produces 0.5, 0.2, and 0.3.
Treating them as probabilities, we assign 1 to the neuron with the largest output
and 0 to the other neurons.
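Picking the predicted class from the softmax outputs amounts to taking the largest value, as in this small sketch using the numbers from the example.

```python
softmax_outputs = {"gold": 0.5, "silver": 0.2, "bronze": 0.3}

# The class whose neuron has the largest output gets 1, the others get 0.
predicted_class = max(softmax_outputs, key=softmax_outputs.get)
predicted_one_hot = [1 if c == predicted_class else 0 for c in softmax_outputs]

print(predicted_class)     # gold
print(predicted_one_hot)   # [1, 0, 0]
```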
In this example, the predicted class does not match the actual class. This brings
us to the next discussion - the loss function.
loss function
The loss function we’ll be using is called categorical cross entropy. It is essentially
the same as the binary cross entropy loss function we used in Chapter 3, but a
generalized version. The categorical cross entropy works for any number of
classes, unlike its binary counterpart.
epoch     predicted         actual     loss
earlier   0.2, 0.3, 0.5     0, 1, 0    0.52
later     0.1, 0.6, 0.3     0, 1, 0    0.22
In one of the earlier training epochs, we can see that the output at the second
neuron is 0.3, which isn’t very good. This results in a high loss value.
In one of the later epochs, this neuron produces an output of 0.6. This indicates
an improvement, which is reflected in the decreasing loss value.
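A minimal sketch of the categorical cross entropy for one data point. The loss values shown in the figure (0.52 and 0.22) are reproduced when a base-10 logarithm is used, so this sketch uses `math.log10`; that choice is inferred from the numbers rather than stated in the text, and many libraries use the natural logarithm instead.

```python
import math

def categorical_cross_entropy(actual, predicted, log=math.log10):
    """Negative log of the output at the neuron of the actual class."""
    return -sum(a * log(p) for a, p in zip(actual, predicted) if a == 1)

actual = [0, 1, 0]
earlier = categorical_cross_entropy(actual, [0.2, 0.3, 0.5])
later = categorical_cross_entropy(actual, [0.1, 0.6, 0.3])
print(round(earlier, 2), round(later, 2))  # 0.52 0.22
```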
[Figure: actual vs. predicted classes (gold, silver, bronze) against distance, training data; accuracy = 96%.]
training performance
Using the accuracy metric, we can see that the model does pretty well on the
training dataset.
The plot also shows the decision boundary of our trained neural network,
representing the predictions.
[Figure: actual vs. predicted classes (gold, silver, bronze) against distance, test data; accuracy = 75%.]
test performance
The prediction on the test data also shows a respectable performance, given the
limited number of data points.
[Figure: actual vs. predicted classes (gold, silver, bronze) against distance; accuracy = 100%.]
a bigger network
As in Chapter 3, we can further improve the performance by adding more layers
and units.
The plot here shows the decision boundaries after modifying the neural network
to contain two hidden layers with five and ten units. The granularity of its
predictions increased, resulting in more well-defined and accurate decision
boundaries.
silver
confusion matrix
The confusion matrix for this task is now expanded to a 3-by-3 matrix. The
diagonal cells account for the correct predictions for each class, while the
remaining cells account for the incorrect predictions.
[Figure: precision (gold) and recall (gold), each computed from the confusion matrix as true / (true + false + false).]
For multi-class classification, each class will have its own set of precision and
recall metrics. Here we have an example for the gold class.
For the remainder of this chapter, we’ll look at various ways to improve our
prediction results. We can divide them into two groups—hyperparameters and
data.
First, let’s look at hyperparameters. You may not have noticed, but we have
covered some of them in the four tasks. Now let’s take a closer look.
parameters
Recall that parameters—weights and biases—are learned by the neural network
during training.
hyperparameters
On the other hand, hyperparameters are parameters that cannot be learned by
the neural network. Instead, we need to provide them.
Choosing the right hyperparameters requires experience and is both an art and a
science. They are no less important in successfully training a neural network.
There are many types of hyperparameters, so let’s cover some of the key ones.
units
size
This is obvious, but one of the most important hyperparameters is the size of the
neural network itself.
We change the size by increasing or decreasing the number of layers and units in
each layer. A larger network is capable of handling more complex tasks and data.
activation function
Another hyperparameter is the type of activation function in each neuron.
This depends on the nature of the task, but ReLU is typically used in the hidden
layers, or at least is a good option to start with.
As for the output layer, this largely depends on the task, for example, whether it’s
a regression or classification task.
[Figure: three learning rates compared: too small, just right, and too big.]
learning rate
In all of our tasks, we have kept to a learning rate (or alpha) of 0.08. This value
was chosen simply by trial and error, and there is no reason not to change it in
other scenarios.
As discussed earlier, we don’t want the learning rate to be too large or too small.
Too big and learning will be erratic. Too small and learning will be slow.
number of epochs
We chose 100 epochs for Chapter 1 and 800 epochs for Chapters 2, 3, and 4 by
trial and error.
An alternative approach is to set a very large number of epochs and configure
training to stop automatically once certain performance criteria are met. This is
called early stopping.
Finding the right number of epochs for a task is important because too many
epochs can lead to the model 'learning too much', including unwanted noise in the
data. On the other hand, too few epochs can lead to the model not capturing
enough information from the data.
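One possible stopping criterion is sketched below: stop once the monitored loss has not improved for a given number of epochs. The criterion, the `patience` parameter, and the loss values are assumptions for illustration; deep learning frameworks offer their own early stopping options.

```python
def early_stopping_epoch(losses, patience=3):
    """Return the epoch at which training would stop.

    `losses` holds the monitored loss after each epoch (hypothetical values).
    """
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best_loss:
            best_loss = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch   # no improvement for `patience` epochs in a row
    return len(losses)         # ran through every epoch without stopping early

print(early_stopping_epoch([5.0, 4.0, 3.0, 3.1, 3.2, 3.3, 2.0]))  # 6
```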
[Figure: the training data split into mini batches of a given size.]
batch size
In our tasks, we had a maximum of twenty-four training data points, which we
trained as a full batch. However, in a typical training dataset, the number of data
points is so large that we need to split them into mini batches. This
hyperparameter is called the batch size. In the diagram above, the batch size is
four.
This brings us to the other way to improve our prediction results—by working on
the data.
[Figure: parameter updates under mini batch and full batch training: with mini batches, an update follows each mini batch; with a full batch, an update follows each pass over the whole dataset.]
mini batch
Recall that in the four tasks, the model performed parameter adjustments only
after an entire epoch was complete. This is called full batch gradient descent.
At the other extreme, the model may also perform updates at every data point.
This is called stochastic gradient descent.
Mini batch gradient descent is somewhere in between. For example, a batch size of
four means that the model performs the updates after every four training data
points.
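The three options differ only in the batch size, as this sketch shows for 24 training data points. The helper name is illustrative.

```python
def mini_batches(data, batch_size):
    """Yield consecutive slices of `data`, each holding `batch_size` items."""
    for start in range(0, len(data), batch_size):
        yield data[start:start + batch_size]

training_data = list(range(24))   # stand-ins for 24 training data points

for batch_size in (24, 4, 1):     # full batch, mini batch, stochastic
    updates_per_epoch = len(list(mini_batches(training_data, batch_size)))
    print(batch_size, updates_per_epoch)
# 24 1
# 4 6
# 1 24
```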
[Figure: stochastic, mini batch, and full batch gradient descent compared in terms of stability.]
balance
How do these options affect performance?
Imagine having a huge dataset where each epoch takes a tremendous amount of
time to complete. With full batch gradient descent, the time between parameter
updates will be too long.
The mini batch gradient descent captures the best of both worlds, offering a
balance between speed and stability.
One way to address this is to use K-fold cross validation. Here, we cycle through
the training-test data split over many training rounds such that by the end, every
data point will have been used for training.
The performance metrics are then averaged over the number of rounds.
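A sketch of how the splits rotate, assuming 32 data points and K = 4 (both chosen for illustration). Each round holds out a different fold for testing and trains on the rest.

```python
def k_fold_splits(n_points, k):
    """Yield (training indices, test indices) for each of the k rounds."""
    indices = list(range(n_points))
    fold_size = n_points // k
    for fold in range(k):
        start = fold * fold_size
        # The last fold absorbs any leftover points.
        end = start + fold_size if fold < k - 1 else n_points
        test_idx = indices[start:end]
        train_idx = indices[:start] + indices[end:]
        yield train_idx, test_idx

for round_number, (train_idx, test_idx) in enumerate(k_fold_splits(32, 4), start=1):
    print(round_number, len(train_idx), len(test_idx), test_idx[0], test_idx[-1])
```

The metric from each round (for example, the MSE or accuracy) would then be averaged across the four rounds.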
[Figure: the dataset split into training data, validation data, and test data.]
validation set
In all our tasks, for simplicity, we have split the dataset into two sets: training
and test. However, in practice, it is common to split it into three sets, the
third being the validation set.
In this case, the role that the test set has played so far is taken over by the
validation set.
This means we can set aside the test set until the model is fully ready. This offers
a more accurate way to measure performance since these are fresh data points
that the model has never seen.
the bigger picture
[Figure: roadmap of the four tasks: linear regression, non-linear regression, binary classification, multi-class classification.]
In this final chapter, we’ll take a brief tour of deep learning architectures beyond
what’s been covered so far.
Let’s start with the feedforward neural network. This is the quintessential version of
neural networks, and it is the version we used in the four tasks. It’s called
feedforward because information flows through the layers only in the forward
direction.
[Figure: a 28 x 28 pixel image of a handwritten 2, flattened into 28 x 28 = 784 inputs.]
Take a look at the MNIST dataset. It is an image dataset where each data point
contains an image of a handwritten digit. This is a multi-class classification
task to predict the digit based on the image. The features are the individual
pixels (28 x 28), while the label is a number between 0 and 9.
There are 784 features, which means that 784 inputs are going into the first
hidden layer. Additionally, such a task will require even more hidden layers with
a substantial number of neurons in each.
filters
...
...
The idea with the convolutional neural network (CNN) is that, instead of feeding
all inputs to all neurons, we group neurons into smaller sets called filters.
These filters don’t take the inputs all at once. Instead, they scan through small
sections of the inputs sequentially.
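The scanning idea can be sketched in plain Python. This is a simplified illustration under stated assumptions: the function name `scan_with_filter` is made up, the filter moves one pixel at a time, and the filter values are hand-picked here, whereas in a trained CNN they are learned.

```python
def scan_with_filter(image, kernel):
    """Slide a small filter over the image, one section at a time."""
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = len(image) - k_rows + 1
    out_cols = len(image[0]) - k_cols + 1
    output = []
    for r in range(out_rows):
        row = []
        for c in range(out_cols):
            # Weighted sum of the small section currently under the filter.
            total = 0
            for i in range(k_rows):
                for j in range(k_cols):
                    total += image[r + i][c + j] * kernel[i][j]
            row.append(total)
        output.append(row)
    return output


# A 4 x 4 image: dark on the left, bright on the right.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]

# Hand-picked 2 x 2 filter that responds to a dark-to-bright vertical edge.
vertical_edge = [
    [-1, 1],
    [-1, 1],
]

feature_map = scan_with_filter(image, vertical_edge)
```

The resulting 3 x 3 output is non-zero only in its middle column, which is where the edge sits in the image. Note that the same four filter values are reused at every position, so the filter needs far fewer weights than a layer connecting every input to every neuron.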
[Figure: filters #1, #2, and #3]
Each filter learns to detect its own pattern, and together the filters capture a
richer description of the input than any single filter could.
This process is repeated over several layers, and the result is a compressed
version of the data. This information is finally fed into a standard layer to
produce the prediction.
image classification
Let’s look at some use cases of the CNN. The first is image classification, where
the task is to predict the class of an image. It’s the same type of task as the
MNIST example, but with a different architecture.
[Figure: an image with four labeled objects: tree, tree, pole, and car]
object detection
Another use case is object detection. It involves both classification and regression.
For classification, each image can now contain objects of more than one class.
Here is an example with four objects, each belonging to one of three classes.
As for regression, the task is to predict the position on the image where these
objects are located.
semantic segmentation
Another use case is semantic segmentation. This task is similar to image
classification, except that classification is performed at the pixel level instead of
at the image level.
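Pixel-level classification can be sketched as follows. The per-pixel scores and the class names are made up for illustration; in a real model the scores would be the network's output for each pixel.

```python
def segment(pixel_scores):
    """For every pixel, pick the index of the class with the highest score."""
    return [
        [max(range(len(scores)), key=scores.__getitem__) for scores in row]
        for row in pixel_scores
    ]


# Hypothetical classes and scores for a tiny 2 x 2 image.
class_names = ["background", "tree", "car"]
scores = [
    [[0.90, 0.05, 0.05], [0.20, 0.70, 0.10]],
    [[0.10, 0.10, 0.80], [0.60, 0.30, 0.10]],
]

mask = segment(scores)
labels = [[class_names[index] for index in row] for row in mask]
```

Instead of one label for the whole image, the result is a grid of labels with the same shape as the image.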
[Figure: RNN cells and their data inputs]
This diagram depicts one building block of the recurrent neural network (RNN). It
has two inputs. The first input is the data from the current step of a sequence,
while the second input is the output of the previous step’s RNN block.
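A single RNN block with these two inputs can be sketched as below. The names `rnn_step` and `run_rnn`, the tanh activation, and the weight values are all assumptions for illustration; in a trained network the weights are learned.

```python
import math


def rnn_step(x, h_prev, w_x, w_h, b):
    """One RNN block: combine the current input with the previous output."""
    h_new = []
    for k in range(len(b)):
        total = b[k]
        total += sum(w_x[k][i] * x[i] for i in range(len(x)))            # current data
        total += sum(w_h[k][j] * h_prev[j] for j in range(len(h_prev)))  # previous output
        h_new.append(math.tanh(total))  # tanh is an assumed activation choice
    return h_new


def run_rnn(sequence, w_x, w_h, b):
    """Apply the same block at every step, passing its output forward."""
    h = [0.0] * len(b)  # nothing has been seen before the first step
    for x in sequence:
        h = rnn_step(x, h, w_x, w_h, b)
    return h


# Hypothetical weights for a block with 1 input feature and 2 neurons.
w_x = [[0.5], [-0.5]]
w_h = [[0.1, 0.0], [0.0, 0.1]]
b = [0.0, 0.0]

final_state = run_rnn([[1.0], [0.0], [-1.0]], w_x, w_h, b)
reversed_state = run_rnn([[-1.0], [0.0], [1.0]], w_x, w_h, b)
```

Because each step depends on the previous output, feeding the same values in a different order produces a different final state, which is what lets the RNN take the order of a sequence into account.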
[Figure: a sentence classified as positive or negative]
text classification
Let’s look at a use case: text classification. Here we have an example of sentiment
analysis, where the task is to predict the sentiment of a sentence.
[Figure: translating “je m’appelle” into “my name”]
machine translation
Another use case is machine translation. Here, the RNN generates many outputs
instead of one.
[Figure: given “I am having”, the network predicts “lunch”]
text generation
Another interesting use case is text generation. Here, we can train the neural
network to predict the next word, given a sequence of words.
We can repeat this for as many words as we want. This means we can train it to
generate a complete sentence, an essay, or even an entire book!
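The repeat-and-append loop can be sketched as follows. The lookup table is a hypothetical stand-in for a trained network, used only so the example runs on its own; a real model would predict the next word from the whole sequence rather than from the last word alone.

```python
def generate(predict_next, prompt, n_words):
    """Predict the next word, append it, and repeat."""
    words = list(prompt)
    for _ in range(n_words):
        words.append(predict_next(words))
    return words


# Hypothetical stand-in for a trained network: a fixed lookup on the last word.
next_word_table = {"I": "am", "am": "having", "having": "lunch", "lunch": "now"}


def toy_predict_next(words):
    return next_word_table.get(words[-1], "<end>")


sentence = generate(toy_predict_next, ["I", "am", "having"], 1)
longer = generate(toy_predict_next, ["I"], 4)
```

Each predicted word becomes part of the input for the next prediction, so the same loop can produce one word or many.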
[Figure: the discriminator classifies new data as real or not real]
Given a training dataset, the generative adversarial network (GAN) can create new
data points that follow the same distribution yet are distinct from the originals.
The GAN consists of two competing neural networks, the generator and the
discriminator. The generator’s role is to generate new examples that look as real as
possible, while the discriminator’s role is to determine if they are real.
[Figure: the discriminator receives new data and real data, and labels each as real or not real]
The architecture is designed so that both networks will keep improving after
each training round. As more examples are given, the discriminator becomes
increasingly good at detecting the real ones. At the same time, the generator’s
output will become more and more similar to the real ones.
[Figure: graph]
other architectures
There are many other architectures out there, and new ones continue to emerge.
It is an exciting area of research where breakthroughs keep on coming.
By and large, deep learning is still a relatively new technology. The opportunities
are wide open for us to innovate and create solutions that make the world a better place.
[Figure: summary of key concepts. Neural network: neurons, layers, architecture. Adjust: weights, biases. Measure: cost, metrics. Feedback: gradients, backpropagation.]
There are many more resources out there than the ones listed here. Treat this as a
starting point, not an exhaustive list.
thank you
Thank you so much for reading my book! I hope you have enjoyed reading it.
If you have received value from it, I'd be so grateful to receive a short testimonial
from you, to be displayed on the book's product website. It's the best gift I can
get from a reader. And it will go a long way to support me in my endeavor.
Meor Amer
kDimensions
@kdimensions1