Machine Learning Notes
What We Talk About When We Talk
About “Learning”
• Learning general models from data consisting of particular
examples
• Data is cheap and abundant (data warehouses,
data marts); knowledge is expensive and scarce.
• Example in retail: Customer transactions to
consumer behavior:
People who bought “Da Vinci Code” also bought “The
Five People You Meet in Heaven” (www.amazon.com)
• Build a model that is a good and useful
approximation to the data.
What is Machine Learning?
• Machine Learning
– Study of algorithms that
– improve their performance
– at some task
– with experience
• Optimize a performance criterion using example data
or past experience.
• Role of Statistics: Inference from a sample
• Role of Computer Science: Efficient algorithms to
– Solve the optimization problem
– Represent and evaluate the model for inference
What is Machine Learning?
Definition by Kevin Murphy
• We define machine learning as a set of
methods that can
– automatically detect patterns in data,
– and then use the uncovered patterns to predict
future data,
– or to perform other kinds of decision making
under uncertainty (such as planning how to collect
more data!).
Applications
• Association Analysis
• Supervised Learning
– Classification
– Regression/Prediction
• Unsupervised Learning
• Reinforcement Learning
Learning Associations
• Basket analysis:
P(Y | X): the probability that somebody who buys X also buys Y,
where X and Y are products/services.
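A minimal sketch (not from the notes) of how such a conditional probability can be estimated from transaction data; the baskets and product names below are invented for illustration:

```python
# Estimate P(Y | X) from a list of market-basket transactions.
# Transactions and product names are hypothetical.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "butter", "eggs"},
]

def conditional_prob(transactions, x, y):
    """P(Y | X) = #{baskets containing both X and Y} / #{baskets containing X}."""
    with_x = [t for t in transactions if x in t]
    if not with_x:
        return 0.0
    return sum(1 for t in with_x if y in t) / len(with_x)

print(conditional_prob(transactions, "bread", "butter"))  # 1.0 on this toy data
```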
Predictive or supervised learning
approach
• In the predictive or supervised learning
approach, the goal is to learn a mapping from
inputs x to outputs y
• given a labeled set of input-output pairs
D = { (xi, yi) : i = 1, …, N }
• Here D is called the training set, and N is the
number of training examples.
• xi is called a feature, attribute, or covariate.
• The output or response variable yi can in principle be anything,
• but most methods assume that yi is a categorical or
nominal variable from some finite set,
• yi ∈ {1, ..., C} (such as male or female),
• or that yi is a real-valued scalar (such as income level).
When yi is categorical, the problem is known as
classification or pattern recognition,
• and when yi is real-valued, the problem is known as
regression.
• Another variant, known as ordinal regression, occurs
when the label space Y has some natural ordering, such as
grades A–F.
Prediction: Regression
• Example: Price of a
used car
y = wx+w0
• x : car attributes
y : price
y = g(x | θ)
g(·): the model
θ: the parameters
Regression
• we have a single real-valued input xi ∈ R
• and a single real-valued response yi ∈R
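A minimal sketch of fitting the linear model y = wx + w0 by least squares; the car-attribute and price numbers below are hypothetical, used only to illustrate learning θ = (w, w0) from data:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # e.g. car age in years (hypothetical)
y = np.array([9.0, 7.8, 6.9, 6.1, 5.2])   # e.g. price (hypothetical)

# Least-squares estimate of the parameters theta = (w, w0)
w, w0 = np.polyfit(x, y, deg=1)

def g(x_new, w=w, w0=w0):
    """The learned model g(x | theta) = w*x + w0."""
    return w * x_new + w0

print(w, w0, g(6.0))   # predicted price for a 6-year-old car
```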
Classification: a finite set of labels
[Figure: labeled examples (e.g., apple, banana) are used to build a model/predictor that assigns one of a finite set of labels to new data; shown alongside a scatter plot of the data.]
Image classification
[Figure: test images.]
Face Detection
[Figure: face detection examples.]
Graphical Example of Classification
[Figure-only slides.]
Decision Boundaries
[Figure-only slide.]
Descriptive or unsupervised learning
approach
• Here we are only given inputs,
D = { xi : i = 1, …, N }
• and the goal is to find “interesting patterns” in
the data.
• This is sometimes called knowledge discovery.
• This is a much less well-defined problem, since
we are not told what kinds of patterns to look for,
and there is no obvious error metric to use
(unlike supervised learning, where we can
compare our prediction of y for a given x to the
observed value).
Unsupervised Learning
• Learning “what normally happens”
• No output
• Clustering: Grouping similar instances
• Other applications: Summarization,
Association Analysis
Unsupervised learning
• Calculate p(xi|θ)
• instead of p(yi|xi,θ)
• Unsupervised learning is more typical of human
and animal learning.
• It is also more widely applicable than supervised
learning, since it does not require a human expert
to manually label the data.
• Labeled data is not only expensive to acquire, but
it also contains relatively little information,
certainly not enough to reliably estimate the
parameters of complex models.
1. Clustering
• Let K denote the number of clusters.
• e.g., data representing the height and weight of a group of 210 people
• Our first goal is to estimate the distribution over the number of
clusters, p(K|D);
• this tells us if there are subpopulations within the data.
• For simplicity, we often approximate the distribution p(K|D) by its
mode,
• K* = argmax_K p(K | D).
• In the supervised case, we were told that there are two classes
(male and female), but in the unsupervised case, we are free to
choose as many or as few clusters as we like.
• Our second goal is to estimate which cluster each point belongs to.
Let zi ∈ {1, ..., K} denote the cluster to which data point i is assigned.
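One possible sketch of this on height/weight data, assuming scikit-learn is available: a Gaussian mixture is fit for several candidate values of K, and BIC-based model selection is used as a rough stand-in for taking the mode K* of p(K|D). The data below are synthetic:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two synthetic subpopulations of 105 people each: (height cm, weight kg)
X = np.vstack([
    rng.normal([165, 60], [6, 7], size=(105, 2)),
    rng.normal([178, 80], [7, 9], size=(105, 2)),
])

# Fit a mixture for each candidate K and pick K* by minimising BIC
models = {k: GaussianMixture(n_components=k, random_state=0).fit(X) for k in range(1, 6)}
k_star = min(models, key=lambda k: models[k].bic(X))

# z_i in {1, ..., K*}: the cluster each point is assigned to
z = models[k_star].predict(X) + 1
print(k_star, np.bincount(z)[1:])
```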
Clustering
• Model-based clustering means we fit a probabilistic model to
the data, rather than running some ad hoc algorithm.
• The advantages of the model-based approach are that one can
compare different kinds of models in an objective way (in terms of
the likelihood they assign to the data), we can combine them
together into larger systems, etc.
• Real-world applications of clustering:
• In astronomy, the autoclass system (Cheeseman et al. 1988) discovered
a new type of star, based on clustering astrophysical measurements.
• In e-commerce, it is common to cluster users into groups, based on
their purchasing or web-surfing behavior, and then to send customized
targeted advertising to each group (see e.g., (Berkhin 2006)).
• In biology, it is common to cluster flow-cytometry data into groups, to
discover different sub-populations of cells (see e.g., (Lo et al. 2009)).
2. Discovering latent factors
• When dealing with high-dimensional data, it is often useful to reduce
the dimensionality by projecting the data to a lower-dimensional
subspace which captures the “essence” of the data.
• This is called dimensionality reduction.
• The data may appear high dimensional but there may only
be a small number of degrees of variability, corresponding to
latent factors.
• For example, when modeling the appearance of face images,
there may only be a few underlying latent factors which
describe most of the variability, such as lighting, pose,
identity, etc
• Another example – text processing (stop word removal,
stemming)
Discovering latent factors
• low dimensional representations
– result in better predictive accuracy, because they
focus on the “essence” of the object, filtering out
inessential features.
– are useful for enabling fast nearest neighbor searches
– and two dimensional projections are very useful for
visualizing high dimensional data.
• The most common approach to dimensionality
reduction is called principal components analysis
or PCA.
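A minimal numpy sketch of PCA-style dimensionality reduction (centre the data, then project onto the top principal directions obtained from an SVD); the 5-dimensional data below are synthetic, generated from two latent factors:

```python
import numpy as np

def pca_project(X, n_components=2):
    """Return the n_components-dimensional PCA projection of X (rows = examples)."""
    Xc = X - X.mean(axis=0)                        # centre the data
    # SVD of the centred data; rows of Vt are the principal directions
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

# Toy 5-dimensional data that really only varies along 2 latent directions
rng = np.random.default_rng(0)
latent = rng.normal(size=(100, 2))                           # 2 latent factors
X = latent @ rng.normal(size=(2, 5)) + 0.05 * rng.normal(size=(100, 5))
Z = pca_project(X, n_components=2)
print(Z.shape)   # (100, 2): a 2-D representation useful for visualisation
```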
Dimensionality reduction and PCA
applications
• In computer graphics, it is common to project motion
capture data to a low dimensional space, and use it to create
animations.
• In biology, it is common to use PCA to interpret gene
microarray data, to account for the fact that each
measurement is usually the result of many genes which are
correlated in their behavior by the fact that they belong to
different biological pathways.
• In natural language processing, it is common to use a
variant of PCA called latent semantic analysis (LSA) for
document retrieval
• In signal processing (e.g., of acoustic or neural signals), it is
common to use Independent Component Analysis (ICA -
which is a variant of PCA) to separate signals into their
different sources.
3. Discovering graph structure
• Sometimes we measure a set of correlated
variables, and we would like to discover which
ones are most correlated with which others.
• This can be represented by a graph G, in which
nodes represent variables, and edges
represent direct dependence between
variables
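As a rough illustration only (an assumption, not the method described in the notes): for jointly Gaussian variables, near-zero entries of the inverse covariance (precision) matrix correspond to absent edges, so thresholding it gives a crude dependence graph G. Real systems use dedicated structure-learning methods:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = a + 0.3 * rng.normal(size=500)        # b depends directly on a
c = b + 0.3 * rng.normal(size=500)        # c depends directly on b (on a only via b)
X = np.column_stack([a, b, c])

# Near-zero off-diagonal precision entries suggest no direct dependence
precision = np.linalg.inv(np.cov(X, rowvar=False))
edges = [(i, j) for i in range(3) for j in range(i + 1, 3)
         if abs(precision[i, j]) > 1.0]   # hypothetical threshold
print(edges)                              # expected: edges (0,1) and (1,2), but not (0,2)
```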
Discovering graph structure
• Another example is predicting traffic jams on
the freeway.
• Horvitz et al. (2005) describe a deployed
system called JamBayes for predicting traffic
flow in the Seattle area;
• predictions are made using a graphical model
whose structure was learned from data.
Other applications
Matrix completion
- e.g., in a survey with missing values, the goal is to predict those missing values.
Image inpainting
- The goal is to “fill in” holes (e.g., due to scratches or occlusions) in
an image with realistic texture.
Market Basket Analysis
- Many items are purchased together (e.g., bread and butter), so there
will be correlations amongst the bits.
Collaborative filtering
• A common example of this concerns predicting which movies
people will want to watch based on how they, and other people,
have rated movies which they have already seen.
• The key idea is that the prediction is not based on features of the
movie or user (although it could be), but merely on a ratings matrix.
[Figures: examples of image inpainting and collaborative filtering.]
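A minimal sketch of the ratings-matrix idea: predict a missing rating from a similarity-weighted average of other users' ratings, using no movie or user features. The small ratings matrix is invented, and 0 marks an unrated movie:

```python
import numpy as np

R = np.array([
    [5, 4, 0, 1],
    [4, 5, 4, 1],
    [1, 1, 0, 5],
    [1, 2, 1, 4],
], dtype=float)

def predict(R, user, movie):
    """Similarity-weighted average of other users' ratings for this movie."""
    rated = np.where(R[:, movie] > 0)[0]            # users who rated the movie
    rated = rated[rated != user]
    sims = []
    for other in rated:
        # cosine similarity on the movies both users have rated
        both = (R[user] > 0) & (R[other] > 0)
        a, b = R[user, both], R[other, both]
        sims.append(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))
    sims = np.array(sims)
    return sims @ R[rated, movie] / (sims.sum() + 1e-9)

print(predict(R, user=0, movie=2))   # estimate of user 0's rating for movie 2
```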
Reinforcement Learning
• Topics:
– Policies: what actions should an agent take in a particular
situation
– Utility estimation: how good is a state (used by policy)
• No supervised output but delayed reward
• Credit assignment problem (what was responsible for
the outcome)
• Applications:
– Game playing
– Robot in a maze
– Multiple agents, partial observability, ...
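A toy sketch (not from the notes) of these ideas: tabular Q-learning on a tiny one-dimensional "maze", showing a delayed reward at the goal being propagated back until a greedy policy emerges. All numbers (maze size, learning rate, discount, exploration rate) are arbitrary choices for illustration:

```python
import numpy as np

n_states, n_actions = 5, 2                # states 0..4, actions 0=left / 1=right
Q = np.zeros((n_states, n_actions))       # utility estimates Q(s, a)
alpha, gamma, eps = 0.5, 0.9, 0.5         # high exploration for this tiny example
rng = np.random.default_rng(0)

for episode in range(100):
    s = 0
    while s != 4:                                       # episode ends at the goal state
        a = rng.integers(n_actions) if rng.random() < eps else int(Q[s].argmax())
        s_next = max(0, s - 1) if a == 0 else min(4, s + 1)
        r = 1.0 if s_next == 4 else 0.0                 # reward is delayed until the goal
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next

policy = Q.argmax(axis=1)    # greedy policy: which action to take in each state
print(policy)                # expected: action 1 ("go right") in states 0..3
```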
Parametric vs Non-Parametric Models
Parametric Models
Learning a Function
Linear Regression
Logistic Regression
Perceptron
Naive Bayes
Gene_A (X): x1, x2, x3, x4, x5, …, xn
Gene_B (Y): y1, y2, y3, y4, y5, …, yn
• Given two n-dimensional vectors x = (x1, x2, …, xn) and y = (y1, y2, …, yn), the
distance between x and y can be computed, for example, with the Euclidean distance:
d(x, y) = sqrt( (x1 − y1)² + (x2 − y2)² + … + (xn − yn)² )
[Figure: KNN classification of a new point; K = 1 gives blue, K = 3 gives green.]
Now there is a new type of tissue paper with attribute values X1 = 3
and X2 = 7.
i) Graph method
ii) Use the KNN method to determine the classification of this tissue paper
(use K = 1 and K = 3, with the Euclidean distance measure for the distance computation).
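A sketch of solving part (ii) in code, including the Euclidean distance computation; the four training rows below are hypothetical stand-ins for the table that accompanies the exercise:

```python
import numpy as np
from collections import Counter

# (X1, X2) -> class label; invented training data for illustration only
train_X = np.array([[7, 7], [7, 4], [3, 4], [1, 4]], dtype=float)
train_y = ["Bad", "Bad", "Good", "Good"]

def knn_classify(x_new, X, y, k):
    d = np.sqrt(((X - x_new) ** 2).sum(axis=1))   # Euclidean distances to all rows
    nearest = np.argsort(d)[:k]                   # indices of the K closest rows
    return Counter(y[i] for i in nearest).most_common(1)[0][0]

x_new = np.array([3.0, 7.0])
print(knn_classify(x_new, train_X, train_y, k=1))
print(knn_classify(x_new, train_X, train_y, k=3))
```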
• Example: in KNN
• If k = 1, there are no training errors (the model fits the training data exactly).
• If k = 5, the prediction is smoother, because we average
over a larger neighborhood.
• As k increases, the prediction becomes increasingly over-smoothed;
at k = N it predicts the same value everywhere.
Selection of Model to Avoid Overfitting
• Compare a linear model, a quadratic model, and join-the-dots.
A Regression Problem
y = f(x) + noise
Can we learn f from this data?
[Figure: scatter plot of the data.]
Linear Regression
[Figure: a linear fit to the data.]
Quadratic Regression
[Figure: a quadratic fit to the data.]
Join-the-dots
Also known as piecewise linear nonparametric regression, if that makes you feel better.
[Figure: a join-the-dots fit through every data point.]
Which is best?
[Figure: the three fits shown side by side.]
The test set method
1. Randomly choose 30% of the data to be in a test set
2. The remainder is a training set
3. Perform your regression on the training set
4. Estimate your future performance with the test set
Disadvantages:
•Wastes data: we get an estimate of the best
method to apply to 30% less data
•If we don’t have much data, our test-set might just
be lucky or unlucky
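A minimal sketch of the test-set method on synthetic data: hold out a random 30%, fit the regression on the remaining 70%, and measure the mean squared error on the held-out points:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 2.0 * x + 1.0 + rng.normal(scale=2.0, size=100)      # y = f(x) + noise

idx = rng.permutation(len(x))
n_test = int(0.3 * len(x))
test, train = idx[:n_test], idx[n_test:]                  # random 30% / 70% split

w, w0 = np.polyfit(x[train], y[train], deg=1)             # regression on the training set
mse_test = np.mean((y[test] - (w * x[test] + w0)) ** 2)   # error on the test set
print(mse_test)
```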
LOOCV (Leave-one-out Cross Validation)
For k = 1 to R:
1. Let (xk, yk) be the kth record
2. Temporarily remove (xk, yk) from the dataset
3. Train on the remaining R−1 datapoints
4. Note your error on (xk, yk)
When you've done all points, report the mean error.
LOOCV for Linear Regression
MSE_LOOCV = 2.12
[Figure: the leave-one-out linear fits and their errors, one panel per left-out point.]
LOOCV for Quadratic Regression
MSE_LOOCV = 0.962
[Figure: the leave-one-out quadratic fits and their errors.]
LOOCV for Join-the-dots
MSE_LOOCV = 3.33
[Figure: the leave-one-out join-the-dots fits and their errors.]
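A sketch of the LOOCV procedure for polynomial models of a given degree, run on synthetic data (the MSE values 2.12, 0.962 and 3.33 above come from the slides' own dataset, which is not reproduced here):

```python
import numpy as np

def loocv_mse(x, y, degree):
    errors = []
    for k in range(len(x)):                               # for k = 1 to R
        mask = np.arange(len(x)) != k                     # temporarily remove (x_k, y_k)
        coeffs = np.polyfit(x[mask], y[mask], degree)     # train on the remaining R-1 points
        pred = np.polyval(coeffs, x[k])
        errors.append((y[k] - pred) ** 2)                 # note the error at (x_k, y_k)
    return np.mean(errors)                                # report the mean error

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=30)
y = 0.3 * (x - 5) ** 2 + rng.normal(scale=1.0, size=30)
print(loocv_mse(x, y, degree=1), loocv_mse(x, y, degree=2))
```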
Which kind of Cross Validation?
• Test-set — Disadvantage: variance (an unreliable estimate of future performance). Advantage: cheap.
• Leave-one-out — Disadvantage: expensive, and has some weird behavior. Advantage: doesn't waste data.
...can we get the best of both worlds?
k-fold Cross Validation
Randomly break the dataset into k partitions (in our example we'll have k = 3 partitions, colored Red, Green and Blue).
For the red partition: train on all the points not in the red partition. Find the test-set sum of errors on the red points.
For the green partition: train on all the points not in the green partition. Find the test-set sum of errors on the green points.
For the blue partition: train on all the points not in the blue partition. Find the test-set sum of errors on the blue points.
[Figures: the three fits, each trained with one colored partition held out.]
Results on this dataset:
Linear Regression: MSE_3FOLD = 2.05
Quadratic Regression: MSE_3FOLD = 1.11
Join-the-dots: MSE_3FOLD = 2.93
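A sketch of 3-fold cross validation on synthetic data, mirroring the procedure above (again, the MSE_3FOLD values quoted come from the slides' own dataset):

```python
import numpy as np

def kfold_mse(x, y, degree, k=3, seed=0):
    idx = np.random.default_rng(seed).permutation(len(x))
    folds = np.array_split(idx, k)                   # the k coloured partitions
    errors = []
    for fold in folds:
        train = np.setdiff1d(idx, fold)              # train on all points not in this partition
        coeffs = np.polyfit(x[train], y[train], degree)
        errors.extend((y[fold] - np.polyval(coeffs, x[fold])) ** 2)
    return np.mean(errors)                           # mean held-out error over all partitions

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=30)
y = 0.3 * (x - 5) ** 2 + rng.normal(scale=1.0, size=30)
print(kfold_mse(x, y, degree=1), kfold_mse(x, y, degree=2))
```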
Which kind of Cross Validation?
• Test-set — Disadvantage: variance (an unreliable estimate of future performance). Advantage: cheap.