100% found this document useful (3 votes)
2K views134 pages

Machine Learning Notes

This document provides an introduction to machine learning. It defines machine learning as programming computers to optimize performance using example data or past experience. Machine learning is useful when human expertise does not exist, humans cannot explain their expertise, or solutions need to adapt over time. The document discusses applications of machine learning including association analysis, supervised learning techniques like classification and regression, and unsupervised learning. It provides examples of classification and regression problems.

Uploaded by

sasa
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PPTX, PDF, TXT or read online on Scribd
100% found this document useful (3 votes)
2K views134 pages

Machine Learning Notes

This document provides an introduction to machine learning. It defines machine learning as programming computers to optimize performance using example data or past experience. Machine learning is useful when human expertise does not exist, humans cannot explain their expertise, or solutions need to adapt over time. The document discusses applications of machine learning including association analysis, supervised learning techniques like classification and regression, and unsupervised learning. It provides examples of classification and regression problems.

Uploaded by

sasa
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PPTX, PDF, TXT or read online on Scribd
You are on page 1/ 134

Unit-I

Introduction to machine learning


Dr. G. Paavai Anand
• We have lots of data.
• What do we do with the data?

• We need automated methods to analyze this


abundant data and come out with valid
Why “Learn”?
• Machine learning is programming computers to
optimize a performance criterion using example
data or past experience.
• There is no need to “learn” to calculate payroll
• Learning is used when:
– Human expertise does not exist (navigating on Mars),
– Humans are unable to explain their expertise (speech
recognition)
– Solution changes in time (routing on a computer
network)
– Solution needs to be adapted to particular cases (user
biometrics)

4
What We Talk About When We Talk
About“Learning”
• Learning general models from a data of particular
examples
• Data is cheap and abundant (data warehouses,
data marts); knowledge is expensive and scarce.
• Example in retail: Customer transactions to
consumer behavior:
People who bought “Da Vinci Code” also bought “The
Five People You Meet in Heaven” (www.amazon.com)
• Build a model that is a good and useful
approximation to the data.
5
What is Machine Learning?
• Machine Learning
– Study of algorithms that
– improve their performance
– at some task
– with experience
• Optimize a performance criterion using example data
or past experience.
• Role of Statistics: Inference from a sample
• Role of Computer science: Efficient algorithms to
– Solve the optimization problem
– Representing and evaluating the model for inference

6
What is Machine Learning?
Definition by Kevin Murphy
• we define machine learning as a set of
methods that can
– automatically detect patterns in data,
– and then use the uncovered patterns to predict
future data,
– or to perform other kinds of decision making
under uncertainty (such as planning how to collect
more data!).
Applications
• Association Analysis
• Supervised Learning
– Classification
– Regression/Prediction
• Unsupervised Learning
• Reinforcement Learning

8
11/19/2019 Dr. G. Paavai Anand 9
Learning Associations
• Basket analysis:
P (Y | X ) probability that somebody who buys X also buys Y
where X and Y are products/services.

Example: P ( jam | bread ) = 0.7

Market-Basket transactions – is the input


Supervised Learning: Uses
Example: decision trees tools that create rules
• Prediction of future cases: Use the rule to
predict the output for future inputs
• Knowledge extraction: The rule is easy to
understand
• Compression: The rule is simpler than the data
it explains
• Outlier detection: Exceptions that are not
covered by the rule, e.g., fraud

11
Predictive or supervised learning
approach
• In the predictive or supervised learning
approach, the goal is to learn a mapping from
inputs x to outputs y
• given a labeled set of input-output pairs
𝐷 = { 𝑥𝑖 , 𝑦𝑖 }𝑖=1𝑡𝑜𝑁
• Here D is called the training set, and N is the
number of training examples.
• 𝑥𝑖 is called as feature, attribute or covariate.
• output or response variable can in principle be anything,
• but most methods assume that yi is a categorical or
nominal variable from some finite set,
• yi ∈{ 1,...,C} (such as male or female),
• or that yi is a real-valued scalar (such as income level).
When yi is categorical, the problem is known as
classification or pattern recognition,
• and when yi is real-valued, the problem is known as
regression.
• Another variant, known as ordinal regression, occurs
where label space Y has some natural ordering, such as
grades A–F.
Prediction: Regression

• Example: Price of a
used car
y = wx+w0
• x : car attributes
y : price
y = g (x | θ )
g ( ) model,
θ parameters

14
Regression
• we have a single real-valued input xi ∈ R
• and a single real-valued response yi ∈R

• Various extensions of this basic problem can


arise, such as
– having high-dimensional inputs
– Outliers
– non-smooth responses, etc.
.
Regression
Here are some examples of real-world regression problems.
• Predict tomorrow’s stock market price given current market
conditions and other possible side information.
• Predict the age of a viewer watching a given video on
YouTube.
• Predict the location in 3d space of a robot arm end effector,
given control signals (torques) sent to its various motors.
• Predict the amount of prostate specific antigen (PSA) in the
body as a function of a number of different clinical
measurements.
• Predict the temperature at any location inside a building
using weather data, time, door sensors, etc.
Classification: Applications
• Aka Pattern recognition
• Face recognition: Pose, lighting, occlusion
(glasses, beard), make-up, hair style
• Character recognition: Different handwriting
styles.
• Speech recognition: Temporal dependency.
– Use of a dictionary or the syntax of the language.
– Sensor fusion: Combine multiple modalities; eg, visual
(lip image) and acoustic for speech
• Medical diagnosis: From symptoms to illnesses
• Web Advertizing: Predict if a user clicks on an ad
on the Internet.
17
Classification
• Everyday, all the time we classify things.
• Eg crossing the street:
– Is there a car coming?
– At what speed?
– How far is it to the other side?
– Classification: Safe to walk or not!!!
Problems in classifying data

• Often high dimension of data.


• Hard to put up simple rules.
• Amount of data.
• Need automated ways to deal with the data.
• Use computers – data processing, statistical
analysis, try to learn patterns from the data
(Machine Learning)
Classification
• Here the goal is to learn a mapping from inputs x to outputs
y,
• Where y ∈{ 1,...,C}, with C being the number of classes.
• If C =2, this is called binary classification (in which case we
often assume y ∈{ 0,1});
• if C>2, this is called multiclass classification .
• If the class labels are not mutually exclusive (e.g.,
somebody may be classified as tall and strong), we call it
multi-label classification, but this is best viewed as
predicting multiple related binary class labels (a so-called
multiple output model).
• When we use the term “classification”, we will mean
multiclass classification with a single output, unless we
state otherwise.
Classification
• One way to formalize the problem is as function
approximation.
• We assume y = f(x) for some unknown function f, and
the goal of learning is to estimate the function f given a
labeled training set, and then to make predictions
using ˆ y = ˆ f(x).
• (We use the hat symbol to denote an estimate.)
• Our main goal is to make predictions on novel inputs,
meaning ones that we have not seen before (this is
called generalization),
• since predicting the response on the training set is
easy (we can just look up the answer).
Generalization is important
Need for probability
To handle ambiguous cases, such as the yellow
circle above, it is desirable to return a probability.
Given the input vector x and training set D by
p(y|x,D).
Eg: What is the probability of the current word
being a noun given that the previous word is an
adjective???
What is the POS tag of the given word??? (MAP
estimate)
Data
examples

Data

11/19/2019 Dr. G. Paavai Anand 24


Supervised learning
examples

label
label1

label3
labeled examples

label4

label5

Supervised learning: given labeled examples


11/19/2019 Dr. G. Paavai Anand 25
Supervised learning

label
label1
model/
label3 predictor

label4

label5

Supervised learning: given labeled examples

11/19/2019 Dr. G. Paavai Anand 26


Supervised learning: classification
label
apple

apple
Classification: a finite set of
labels
banana

banana

Supervised learning: given labeled examples


11/19/2019 Dr. G. Paavai Anand 27
Supervised learning

model/ predicted label


predictor

Supervised learning: learn to predict new example

11/19/2019 Dr. G. Paavai Anand 28


Classification tasks
• Learning Task (Phase-1)
– Given: Expression profiles of leukemia patients and healthy
persons.
– Compute: A model distinguishing if a person has leukemia from
expression data.
• Classification Task (Phase-2)
– Given: Expression profile of a new patient + a learned model
– Determine: If a patient has leukemia or not.
Other real world applications
1. Document classification
– p(y = c|x,D), where x is some representation of the text.
– Most classifiers assume that the input vector x has a fixed size.
– representing variable-length documents in feature-vector format using
bag of words representation.
2. Email Spam filtering
– A special case of document classification, where the classes are spam y
=1 or not y =0
3. Classifying flowers
– Four attributes (sepal length, sepal width, petal length, petal
width) and 3 classes (setosa, versicolor, virginica)
4. Image classification and handwriting recognition
5. Face detection and recognition – detecting objects in an
image
Flower classification
plot

Scatter plot
Image classification

Training examples of a person

Test images

AT&T Laboratories, Cambridge UK


http://www.uk.research.att.com/facedatabase.html

32
Face Detection
Graphical Example of Classification

34
Graphical Example of Classification

35
Graphical Example of Classification

36
Decision Boundaries

37
Descriptive or unsupervised learning
approach
• Here we are only given inputs,
𝐷 = { 𝑥𝑖 }𝑖=1𝑡𝑜𝑁
• and the goal is to find “interesting patterns” in
the data.
• This is sometimes called knowledge discovery.
• This is a much less well-defined problem, since
we are not told what kinds of patterns to look for,
and there is no obvious error metric to use
(unlike supervised learning, where we can
compare our prediction of y for a given x to the
observed value).
Unsupervised Learning
• Learning “what normally happens”
• No output
• Clustering: Grouping similar instances
• Other applications: Summarization,
Association Analysis

39
Unsupervised learning
• Calculate p(xi|θ)
• instead of p(yi|xi,θ)
• Unsupervised learning is more typical of human
and animal learning.
• It is also more widely applicable than supervised
learning, since it does not require a human expert
to manually label the data.
• Labeled data is not only expensive to acquire, but
it also contains relatively little information,
certainly not enough to reliably estimate the
parameters of complex models.
1. Clustering
• Let K denote the number of clusters.
• eg: representing the height and weight of a group of 210 people
• Our first goal is to estimate the distribution over the number of
clusters, p(K|D);
• this tells us if there are subpopulations within the data.
• For simplicity, we often approximate the distribution p(K|D) by its
mode,
• K∗ = argmax K p(K|D).
• In the supervised case, we were told that there are two classes
(male and female), but in the unsupervised case, we are free to
choose as many or few clusters as we like.
• Our second goal is to estimate which cluster each point belongs to.
Let zi ∈{ 1,...,K}
Clustering
• model based clustering, which means we fit a probabilistic model to
the data, rather than running some ad hoc algorithm.
• The advantages of the model-based approach are that one can
compare different kinds of models in an objective way (in terms of
the likelihood they assign to the data), we can combine them
together into larger systems, etc.
• real world applications of clustering.
• In astronomy, the autoclass system (Cheeseman et al. 1988) discovered
a new type of star, based on clustering astrophysical measurements.
• In e-commerce, it is common to cluster users into groups, based on
their purchasing or web-surfing behavior, and then to send customized
targeted advertising to each group (see e.g., (Berkhin 2006)).
• In biology, it is common to cluster flow-cytometry data into groups, to
discover different sub-populations of cells (see e.g., (Lo et al. 2009)).
2. Discovering latent factors
• When dealing with high dimensional data, reduce the
dimensionality by projecting the data to a lower dimensional
subspace which captures the “essence” of the data.
• This is called dimensionality reduction.
• The data may appear high dimensional but there may only
be a small number of degrees of variability, corresponding to
latent factors.
• For example, when modeling the appearance of face images,
there may only be a few underlying latent factors which
describe most of the variability, such as lighting, pose,
identity, etc
• Another example – text processing (stop word removal,
stemming)
Discovering latent factors
• low dimensional representations
– results in better predictive accuracy, because they
focus on the “essence” of the object, filtering out
inessential features.
– are useful for enabling fast nearest neighbor searches
– and two dimensional projections are very useful for
visualizing high dimensional data.
• The most common approach to dimensionality
reduction is called principal components analysis
or PCA.
Dimensionality reduction and PCA
applications
• In computer graphics, it is common to project motion
capture data to a low dimensional space, and use it to create
animations.
• In biology, it is common to use PCA to interpret gene
microarray data, to account for the fact that each
measurement is usually the result of many genes which are
correlated in their behavior by the fact that they belong to
different biological pathways.
• In natural language processing, it is common to use a
variant of PCA called latent semantic analysis (LSA) for
document retrieval
• In signal processing (e.g., of acoustic or neural signals), it is
common to use Independent Component Analysis (ICA -
which is a variant of PCA) to separate signals into their
different sources.
3. Discovering graph structure
• Sometimes we measure a set of correlated
variables, and we would like to discover which
ones are most correlated with which others.
• This can be represented by a graph G, in which
nodes represent variables, and edges
represent direct dependence between
variables
Discovering graph structure
• Another example is predicting traffic jams on
the freeway.
• Horvitz et al. (2005) describe a deployed
system called JamBayes for predicting traffic
flow in the Seattle area;
• predictions are made using a graphical model
whose structure was learned from data.
Other applications
Matrix completion
- conduct a survey – missing values – predict those values
Image inpainting
- The goal is to “fill in” holes (e.g., due to scratches or occlusions) in
an image with realistic texture.
Market Basket Analysis
- Many items are purchased together (e.g., bread and butter), so there
will be correlations amongst the bits.
Collaborative filtering
• A common example of this concerns predicting which movies
people will want to watch based on how they, and other people,
have rated movies which they have already seen.
• The key idea is that the prediction is not based on features of the
movie or user (although it could be), but merely on a ratings matrix.
Image inpainting
Collaborative filtering
Reinforcement Learning
• Topics:
– Policies: what actions should an agent take in a particular
situation
– Utility estimation: how good is a state (used by policy)
• No supervised output but delayed reward
• Credit assignment problem (what was responsible for
the outcome)
• Applications:
– Game playing
– Robot in a maze
– Multiple agents, partial observability, ...

52
Parametric
Vs
Non-Parametric
Models
Parametric Models
Learning a Function

Machine learning can be summarized as


learning a function (f) that maps input variables
(X) to output variables (Y).
Y = f(x)
An algorithm learns this target mapping function
from training data.
Parametric Models
 A learning model that summarizes data with a set of parameters of fixed
size (independent of the number of training examples) is called a parametric
model.
 No matter how much data you throw at a parametric model, it won’t
change its mind about how many parameters it needs.
Parametric machine learning algorithms are often called as “Linear machine
learning algorithms”
The algorithms involve two steps:
 Select a form for the function.
 Learn the coefficients for the function from the training data.
Parametric Models
Examples for Parametric Models :

 Linear Regression

Logistic Regression

Linear Discriminant Analysis

Perceptron

Naive Bayes

Simple Neural Networks


Parametric Models
Benefits :
• Simpler: These methods are easier to understand and interpret results.
• Speed: Parametric models are very fast to learn from data.
• Less Data: They do not require as much training data and can work well
even if the fit to the data is not perfect.
Limitations :
• Constrained: By choosing a functional form these methods are highly
constrained to the specified form.
• Limited Complexity: The methods are more suited to simpler problems.
• Poor Fit: In practice the methods are unlikely to match the underlying
mapping function.
LINEAR REGRESSION
• Response is a linear function of the input.
• Y(x)=𝑤 𝑇 𝑥 + Ɛ

• 𝑤 𝑇 𝑥 scalar product between input vector x


and the models weight vector w.
• Ɛ residual error between our linear predictions
and the true response.
• Residual error= estimated - observed value
Non-Parametric Models
 Algorithms that do not make strong assumptions about the form of the
mapping function are called nonparametric machine learning algorithms.
 By not making assumptions, they are free to learn any functional form from
the training data.
 Nonparametric methods are good when you have a lot of data and no prior
knowledge, and when you don’t want to worry too much about choosing just
the right features.
Examples :
 k-Nearest Neighbors
 Decision Trees like CART and C4.5
 Support Vector Machines
Non-Parametric Models
Benefits :
• Flexibility: Capable of fitting a large number of functional forms.
• Power: No assumptions (or weak assumptions) about the underlying
function.
• Performance: Can result in higher performance models for prediction.
Limitations :
• More data: Require a lot more training data to estimate the mapping
function.
• Slower: A lot slower to train as they often have far more parameters to
train.
• Overfitting: More of a risk to overfit the training data and it is harder to
explain why specific predictions are made.
KNN where K=1

all points in V (xi) are closer to xi than to any other point


K NEAREST NEIGHBOR
CLASSIFICATION
Nearest Neighbor Classifiers
• Basic idea:
– If it walks like a duck, quacks like a duck, then it’s
probably a duck
Compute
Distance Test
Record

Training Choose k of the


Records “nearest” records

11/19/2019 Dr. G. Paavai Anand 68


KNN-STEPS
1. Determine ‘K’ the number of neighbors to be considered.
2. Calculate the distance between the test-data and each of the
training data (Use any distance measure for this - next slide).
3. Sort the distance and determine the nearest neighbor based
on the value of ‘k’.
4. Gather the category of ‘k’ nearest neighbors.
5. Use the majority voting method to determine the class of
the test data.

11/19/2019 Dr. G. Paavai Anand 69


Distance Metrics
Exp 1 Exp 2 Exp 3 Exp 4 Exp 5 Exp n

Gene_A (X) x1 x2 x3 x4 x5 xn

Gene_B (Y) y1 y2 y3 y4 y5 yn

• Given two n-dimensional vectors x=(x1, x2,…,xn) and y=(y1, y2,…,yn) , the
distance between x and y can be computed according to:

 Euclidean distance Cosine similarity (Angle)


• squared Correlation distance
• standardized Mahalanobis distance
 Manhattan distance Minkowski distance
 Chebychev distance

11/19/2019 Dr. G. Paavai Anand 70


Nearest neighbor method
• Majority vote within the k nearest neighbors

new

K= 1: blue
K= 3: green

11/19/2019 Dr. G. Paavai Anand 71


Nearest-Neighbor Classifiers
Unknown record

l Requires three things


– The set of stored records
– Distance Metric to compute distance between
records
– The value of k, the number of nearest
neighbors to retrieve

l To classify an unknown record:


– Compute distance to other training records
– Identify k nearest neighbors
– Use class labels of nearest neighbors to
determine the class label of unknown record
(e.g., by taking majority vote)

11/19/2019 Dr. G. Paavai Anand 72


Definition of Nearest Neighbor

X X X

(a) 1-nearest neighbor (b) 2-nearest neighbor (c) 3-nearest neighbor


K-nearest neighbors of a record x are data points that have the k smallest distance to x
Choosing the value of k:
If k is too small, sensitive to noise points
If k is too large, neighborhood may include points from other classes
Choose an odd value for k, to eliminate ties
11/19/2019 Dr. G. Paavai Anand 73
KNN
• Scaling issues
– Attributes may have to be scaled to prevent distance measures from being
dominated by one of the attributes.
– Examples
• Height of a person may vary from 4’ to 6’
• Weight of a person may vary from 100lbs to 300lbs
• Income of a person may vary from $10k to $500k
• Nearest Neighbor classifiers are lazy learners
– No pre-constructed models for classification

11/19/2019 Dr. G. Paavai Anand 74


KNN - Advantages
• Simple technique that is easily implemented
• Building model is inexpensive
• Extremely flexible classification scheme
– does not involve preprocessing
• Well suited for
– Multi-modal classes (classes of multiple forms)
– Records with multiple class labels
• Can sometimes be the best method

11/19/2019 Dr. G. Paavai Anand 75


KNN - Disadvantages
• Classifying unknown records are relatively expensive
– Requires distance computation of k-nearest neighbors
– Computationally intensive, especially when the size of the training set
grows
• Accuracy can be severely degraded by the presence of noisy
or irrelevant features

11/19/2019 Dr. G. Paavai Anand 76


Problem on K-NN
The following problem is to classify whether a special paper tissue is good or
not. Given below are four training examples:
X1=Acid Durabilility X2=Strength Classification
(seconds) (kg/square meter)
7 7 Bad
7 4 Bad
3 4 Good
1 4 Good

Now there is a new type of tissue paper that has the following values X1=3
and X2=7.
i) Graph method
ii) Use KNN method to determine the classification of this tissue paper. (Use
K=1, and K=3). Use Euclidean distance measure for distance computation.

11/19/2019 Dr. G. Paavai Anand 77


Curse of Dimensionality
• The KNN classifier is simple and can work quite well, provided it is
given a good distance metric and has enough labeled training data.
• In fact, it can be shown that the KNN classifier can come within a
factor of 2 of the best possible performance if N →∞
• However, the main problem with KNN classifiers is that they do not
work well with high dimensional inputs.
• The poor performance in high dimensional settings is due to the
curse of dimensionality.
• Apply a KNN classifier to data where the inputs are uniformly
distributed in the D-dimensional unit cube.
• Suppose we estimate the density of class labels around a test point
x by “growing” a hyper-cube around x until it contains a desired
fraction f of the data points.
• The expected edge length of this cube will be 𝑒𝐷 (f) = 𝑓 1/𝐷 . If D = 10,
and we want to base our estimate on 10% of the data, we have
e10(0.1) = 0.8, so we need to extend the cube 80% along each
dimension around x.
• Even if we only use 1% of the data, we find e10(0.01) = 0.63
3. CURSE OF DIMENSIONALITY
• Complexity increases with dimension d.
• The complexity of methods are represented as
O(nd^2) .
• So as d increases, computation becomes
costly.
• Thus for accurate estimation n should be
much bigger than d^2. Otherwise the model is
said to be overfitting.
4. OVER FITTNG

• Example: In KNN
• If k=1, no errors (correctly fits).
• If k=5,smoother prediction because average
over a larger neighborhood.
• As K increases prediction becomes over-
smoother K-N,
Selection of Model to avoid over
fitting
• Using Linear model, quadratic and Join the
dots.
A Regression Problem
y = f(x) + noise
Can we learn f from this data?

Let’s consider three methods…


y

x
Linear Regression

x
Quadratic Regression

x
Join-the-dots
Also known as piecewise
linear nonparametric
regression if that makes you
feel better

x
Which is best?

y y

x x

Why not choose the method with the


best fit to the data?
What do we really want?

y y

x x

Why not choose the method with the


best fit to the data?

“How well are you going to predict future


data drawn from the same distribution?”
The test set method
1. Randomly choose 30%
of the data to be in a test
set
2. The remainder is a
training set

x
The test set method
1. Randomly choose 30%
of the data to be in a test
set
2. The remainder is a
training set
3. Perform your regression
y on the training set

(Linear regression example)


The test set method
1. Randomly choose 30%
of the data to be in a test
set
2. The remainder is a
training set
3. Perform your regression
y on the training set
4. Estimate your future
performance with the test
set
x

(Linear regression example)


Mean Squared Error = 2.4
The test set method
1. Randomly choose 30%
of the data to be in a test
set
2. The remainder is a
training set
3. Perform your regression
y on the training set
4. Estimate your future
performance with the test
set
x

(Quadratic regression example)


Mean Squared Error = 0.9
The test set method
1. Randomly choose 30%
of the data to be in a test
set
2. The remainder is a
training set
3. Perform your regression
y on the training set
4. Estimate your future
performance with the test
set
x

(Join the dots example)


Mean Squared Error = 2.2
Advantages:
•Very very simple
•Can then simply choose the method with the best
test-set score

Disadvantages:
•Wastes data: we get an estimate of the best
method to apply to 30% less data
•If we don’t have much data, our test-set might just
be lucky or unlucky
LOOCV (Leave-one-out Cross
Validation)
For k=1 to R
1. Let (xk,yk) be the kth
record

x
LOOCV (Leave-one-out Cross
Validation)
For k=1 to R
1. Let (xk,yk) be the kth
record
2. Temporarily
remove (xk,yk) from
y the dataset

x
LOOCV (Leave-one-out Cross
Validation)
For k=1 to R
1. Let (xk,yk) be the kth
record
2. Temporarily
remove (xk,yk) from
y the dataset
3. Train on the
remaining R-1
x datapoints
LOOCV (Leave-one-out Cross
Validation)
For k=1 to R
1. Let (xk,yk) be the kth
record
2. Temporarily
remove (xk,yk) from
y the dataset
3. Train on the
remaining R-1
x datapoints
4. Note your error
(xk,yk)
LOOCV (Leave-one-out Cross
Validation)
For k=1 to R
1. Let (xk,yk) be the kth
record
2. Temporarily remove
(xk,yk) from the dataset
y 3. Train on the remaining R-
1 datapoints
4. Note your error (xk,yk)
x
When you’ve done all points,
report the mean error.
LOOCV (Leave-one-out Cross
Validation) For k=1 to R
1. Let
(xk,yk)
be the
kth
y y y record
2.
Temp
x x x orarily
remov
e
(xk,yk)
from
the
y y y datas
et
3. Train on
x x x the
remai
MSELOOCV = ning
R-1
2.12 datap
oints

y y y 4. Note
your
error
(xk,yk)
x x x
When you’ve
done all
LOOCV for Quadratic
Regression For k=1 to R
1. Let
(xk,yk)
be the
kth
y y y record
2.
Temp
x x x orarily
remov
e
(xk,yk)
from
the
y y y datas
et
3. Train on
x x x the
remai
MSELOOCV
ning=
R-1
0.962 datap
oints

y y y 4. Note
your
error
(xk,yk)
x x x
When you’ve
done all
LOOCV for Join The Dots
For k=1 to R
1. Let
(xk,yk)
be the
kth
y y y record
2.
Temp
x x x orarily
remov
e
(xk,yk)
from
the
y y y datas
et
3. Train on
x x x the
remai
MSELOOCV
ning=
R-1
3.33 datap
oints

y y y 4. Note
your
error
(xk,yk)
x x x
When you’ve
done all
Which kind of Cross Validation?
Disadvantage Advantag
e
Test-set Variance: Cheap
unreliable
estimate of future
performance
Leave- Expensive. Doesn’t
one-out Has some weird waste data
behavior
..can we get the best of both worlds?
k-fold Cross Randomly break the dataset
into k partitions (in our
Validation example we’ll have k=3
partitions colored Red Green
and Blue)

x
k-fold Cross Randomly break the dataset
into k partitions (in our
Validation example we’ll have k=3
partitions colored Red Green
and Blue)
For the red partition: Train
on all the points not in the
red partition. Find the
test-set sum of errors on
y the red points.

x
k-fold Cross Randomly break the dataset
into k partitions (in our
Validation example we’ll have k=3
partitions colored Red Green
and Blue)
For the red partition: Train
on all the points not in the
red partition. Find the
test-set sum of errors on
y the red points.
For the green partition: Train
on all the points not in the
green partition. Find the
x test-set sum of errors on
the green points.
k-fold Cross Randomly break the dataset
into k partitions (in our
Validation example we’ll have k=3
partitions colored Red Green
and Blue)
For the red partition: Train
on all the points not in the
red partition. Find the
test-set sum of errors on
y the red points.
For the green partition: Train
on all the points not in the
green partition. Find the
x test-set sum of errors on
the green points.
For the blue partition: Train
on all the points not in the
blue partition. Find the
test-set sum of errors on
the blue points.
k-fold Cross Randomly break the dataset
into k partitions (in our
Validation example we’ll have k=3
partitions colored Red Green
and Blue)
For the red partition: Train
on all the points not in the
red partition. Find the
test-set sum of errors on
y the red points.
For the green partition: Train
on all the points not in the
green partition. Find the
x test-set sum of errors on
the green points.
Linear Regression
MSE3FOLD=2.05 For the blue partition: Train
on all the points not in the
blue partition. Find the
test-set sum of errors on
the blue points.
k-fold Cross Randomly break the dataset
into k partitions (in our
Validation example we’ll have k=3
partitions colored Red Green
and Blue)
For the red partition: Train
on all the points not in the
red partition. Find the
test-set sum of errors on
y the red points.
For the green partition: Train
on all the points not in the
green partition. Find the
x test-set sum of errors on
the green points.
Quadratic Regression
MSE3FOLD=1.11 For the blue partition: Train
on all the points not in the
blue partition. Find the
test-set sum of errors on
the blue points.
k-fold Cross Randomly break the dataset
into k partitions (in our
Validation example we’ll have k=3
partitions colored Red Green
and Blue)
For the red partition: Train
on all the points not in the
red partition. Find the
test-set sum of errors on
y the red points.
For the green partition: Train
on all the points not in the
green partition. Find the
x test-set sum of errors on
the green points.
Joint-the-dots MSE3FOLD=2.93
For the blue partition: Train
on all the points not in the
blue partition. Find the
test-set sum of errors on
the blue points.
Which kind of Cross Validation?
Disadvantage Advantage
Test-set Variance: unreliable estimate Cheap
of future performance

Leave- Expensive. Doesn’t waste data


one-out Has some weird behavior

10-fold Wastes 10% of the data. 10 Only wastes 10%. Only 10


times more expensive than times more expensive
test set instead of R times.

3-fold Wastier than 10-fold. Slightly better than test-set


Expensivier than test set

R-fold Identical to Leave-one-out


5. BIAS –VARIANCE TRADEOFF
Any learning algorithm has error that comes
from two sources:
• Bias
• Variance

error(X) = noise(X) + bias(X) + variance(X)


Bias is the difference between the average of
predicted values (also called the expected
value) of an estimator and the actual (true)
value at that point.
Variance on the other hand quantifies how
scattered or how much variation there is from
the true value.
Bias –variance
Bias is the algorithm's tendency to consistently
learn the wrong thing by not taking into account
all the information in the data (underfitting).

Variance is the algorithm's tendency to learn


random things irrespective of the real signal by
fitting highly flexible models that follow the
error/noise in the data too closely (overfitting).
Bias –variance
• ‘Bias' is also used to denote by how much the
average accuracy of the algorithm changes as
input/training data changes.
• High bias can cause an algorithm to miss the
relevant relations between features and target
outputs

• ‘Variance' is used to denote how sensitive the


algorithm is to the chosen input data.
• High variance can cause an algorithm to model
the random noise in the training data, rather than
the intended outputs
Underfitting/overfitting
A model that is underfit will have high training
and high testing error
while an overfit model will have extremely low
training error but a high testing error.
Bias-variance tradeoff
• The bias-variance tradeoff is a central problem in supervised learning.
• Ideally, one wants to choose a model that both accurately captures the
regularities in its training data, but also generalizes well to unseen data.
• Unfortunately, it is typically impossible to do both simultaneously.
• High-variance learning methods may be able to represent their training set
well but are at risk of overfitting to noisy or unrepresentative training data.
• In contrast, algorithms with low variance typically produce simpler models
that don't tend to overfit but may underfit their training data, failing to
capture important regularities.
• Models with low bias are usually more complex (e.g. higher-order
regression polynomials), enabling them to represent the training set more
accurately. In the process, however, they may also represent a
large noise component in the training set, making their predictions less
accurate - despite their added complexity.
• In contrast, models with higher bias tend to be relatively simple (low-order
or even linear regression polynomials) but may produce lower variance
predictions when applied beyond the training set.
The Bias-Variance Tradeoff. In the KNN algorithm the parameter K
controls the achieved (model) complexity
• A larger bias variance is preferred when data are
few/noisy to achieve a better control of the
variance,
• whereas the bias can be decreased as more data
become available, hence reducing the variance.
• For any given training set, the best choice for K
would be the one striking the optimal trade-off
between bias and variance (that is the value
minimizing their sum).
6. LEARNING CURVE
• Graph that compares the performance of a model
on training and testing data over a varying
number of training instances.
• Performance improve as the number of training
points increases.
• When we separate training and testing sets and
graph them individually
– We can get an idea of how well the model can
generalize to new data
• Learning curve allows us to verify when a model
has learning as much as it can about the data
LEARNING CURVE
• When it occurs
– The performances on the training and testing sets
reach a plateau
– There is a consistent gap between the two error rates
• The key is to find the sweet spot that minimizes
bias and variance by finding the right level of
model complexity
• Of course with more data any model can
improve, and different models may be optimal.
Types of learning curves

• Bad Learning Curve: High


Bias
– When training and testing
errors converge and are high
• No matter how much data we
feed the model, the model
cannot represent the underlying
relationship and has high
systematic errors
• Poor fit
• Poor generalization
• Bad Learning Curve: High Variance
– When there is a large gap between
the errors
• Require data to improve
• Can simplify the model with fewer or
less complex features

• Ideal Learning Curve


– Model that generalizes to new data
– Testing and training learning curves
converge at similar values
– Smaller the gap, the better our
model generalizes

You might also like