Machine Learning Crashcourse
Adila Krisnadhi
[email protected]
December 12, 2020
Faculty of Computer Science
Universitas Indonesia
Acknowledgements
• A longer version of these slides was given in the Machine Learning course taught by
me, Ari Saptawijaya, and Raja Oktovin at the Faculty of Computer Science,
Universitas Indonesia.
• Parts of these slides are taken from Pedro Domingos’ slides (Introduction and
Inductive Learning) at
https://courses.cs.washington.edu/courses/cse446/15sp/
• Some figures are taken from Marsland’s “Machine Learning: An Algorithmic
Perspective” (MLAP) and Bishop’s “Pattern Recognition and Machine Learning”
(PRML).
1/147
Outline
Supervised Learning
Decision Trees
Linear Regression
Logistic Regression
2/147
Outline (cont.)
Nearest Neighbor Methods
K-Means Clustering
3/147
What is Machine Learning?
Artificial Intelligence
4/147
What is machine learning?
5/147
Traditional programming vs. machine learning
Traditional programming
Machine learning
6/147
Key elements of machine learning
7/147
Key elements of ML: Representation
8/147
Key elements of ML: Evaluation
9/147
Key elements of ML: Optimization
• “how do you get to the most preferable area”, i.e., search the space of represented
models to obtain better evaluations
• Examples:
• Combinatorial optimization, e.g., greedy search, genetic algorithm
• Convex optimization, e.g., gradient descent
• Constrained optimization, e.g., linear programming, quadratic programming
• Study theory of optimization for deeper discussion.
10/147
Types of Machine Learning
Question: how does a machine know that it is getting better at performing its task?
Different answers lead to different categories below:
11/147
Learning problems
• Classification aims to assign each input to one of a finite number of categories (target values
are discrete).
• Regression aims to assign each input to a value in a continuous set of possible target values
• Probability estimation is a special case of regression where target values range between 0
and 1 and represent probability values.
• Clustering aims to discover groups of similar examples within the input space.
• Density estimation aims to determine the distribution of data within the input space.
• Projection/dimensionality reduction aims to obtain a representation of data in a dimension
different (typically lower) than its original dimension.
• Credit assignment aims to determine a way to reward (or punish) every action the algorithm
provides so that at the end of the action sequence, it arrives at the best/correct answer.
12/147
Machine learning applications
• Web search: present which web pages you’ll likely click on, based on your past browsing
activities.
• Computational biology: predict a drug design (e.g., how likely the design will give a certain
pharmacological effect) based on various properties of the constituting molecules.
• Commerce/Finance: predict if an individual’s credit risk is high (so not to approve his credit card
application) based on information of his profile.
• Space exploration/astronomy: predict, from a picture of the surface of planet Venus, whether it
has a volcano.
• Robotics: self-driving car – predict whether it should accelerate or brake based on the data
about its surroundings from its sensors.
• Social networks: predict if you’ll likely send a friend request to an individual based on your
social network profile and the activities you do in the social network application.
• Debugging: based on a program's execution traces and on how people have debugged in the past,
predict the most likely locations of the bugs.
• Your example?
13/147
Machine learning pipeline
1. Data collection.
• Need to understand domain, prior knowledge, and goals
• Data may not exist.
• Large amount of relevant data exist, but are hard to collect.
• Supervised learning also requires target data.
• More data is better, preferably without too much noise. But a larger dataset means
increased computational cost.
2. Data preparation
• May include data integration, selection, cleaning, preprocessing, etc.
• Preprocessing may also involve feature selection, extraction, or engineering.
• Not all features are useful for the machine learning task at hand. Some may be redundant
or irrelevant.
• Collecting features should not be costly. Features should be robust to noise.
• Some machine learning representations may not even need feature engineering.
3. Algorithm choice.
14/147
Machine learning pipeline (cont.)
15/147
Supervised Learning
Supervised learning
16/147
Example Applications
17/147
When is supervised learning appropriate?
18/147
When is supervised learning appropriate? (cont.)
19/147
Example
22/147
Hypothesis spaces
22/147
Hypothesis spaces
23/147
Two observations about learning
Note that the following hypotheses are also consistent with the training data:
• x4 ∧ 1-of{x1 , x3 } ⇒ y
• x4 ∧ ¬x2 ⇒ y
If either of these is the unknown function, while our guess is the m-of-n one, what will
happen when new x values are given?
24/147
So, what are the strategies?
In either case, machine learning is about developing algorithms for finding a hypothesis
that fits the data.
25/147
Key issues
• What are good hypothesis spaces? Which ones have been useful in practice and
why?
• What algorithms can work in these spaces? Are there general design principles?
• How do we optimize accuracy on future/unseen data points? Also called the
overfitting problem. This distinguishes machine learning from the usual theory of
optimization.
• How can we have confidence in the results? The statistical question: how much
training data is needed to find accurate hypotheses?
• Are some learning problems computationally intractable? The computational
question: does my learning algorithm scale?
• How do we formulate application problems as machine learning problems?
(The engineering question)
26/147
Decision Trees
Example
Easy to understand (i.e., transparent), since it can be turned into a set of if-then rules
27/147
How to create decision tree: Basic principle
In decision tree literature, features are often called (input) attributes. Example algorithms:
• Quinlan’s ID3
• C4.5
• Classification and regression trees (CART)
28/147
Quantifying information content: Entropy
• Let S be a set of labeled examples whose labels come from the set
L = {a1, a2, . . . , aC}. Then the entropy of S is
Entropy(S) = − Σ_{x∈L} P(x) log2 P(x)
29/147
Example
30/147
From entropy to information gain
• Information gain: how much the entropy of the training set decreases if a particular
attribute is chosen for the next classification step.
• Let S be a set of examples, F be an attribute of S, v(F ) the set of possible values of
F , and Sf the subset of S containing all examples in S for which the value of F is f .
Also, |S| denotes the size of S and Entropy(S) the entropy of the set S.
• Average entropy of a split based on F w.r.t. a set S of examples:
I(S, F) = Σ_{f∈v(F)} (|Sf| / |S|) · Entropy(Sf)
31/147
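To make the entropy and information-gain formulas above concrete, here is a minimal Python/NumPy sketch; the attribute name and toy labels are hypothetical, and only the two formulas from the slides are implemented.

import numpy as np
from collections import Counter

def entropy(labels):
    """Entropy(S) = -sum_x P(x) log2 P(x), where x ranges over the observed labels."""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(examples, labels, attribute):
    """Gain(S, F) = Entropy(S) - sum_f (|S_f| / |S|) * Entropy(S_f)."""
    total = entropy(labels)
    avg = 0.0
    for f in set(ex[attribute] for ex in examples):
        subset = [lab for ex, lab in zip(examples, labels) if ex[attribute] == f]
        avg += (len(subset) / len(labels)) * entropy(subset)
    return total - avg

# Hypothetical toy data in the spirit of the weather example.
X = [{"outlook": "sunny"}, {"outlook": "sunny"}, {"outlook": "rain"}, {"outlook": "overcast"}]
y = ["no", "no", "yes", "yes"]
print(entropy(y), information_gain(X, y, "outlook"))   # 1.0 and 1.0 for this toy set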
The ID3 algorithm
Let I be a set of input attributes, C the set of all labels, and S the set of labeled examples.
32/147
Example
Construct a decision tree using the ID3 algorithm on the following weather data to decide whether to
play outdoors. Should we play outdoors if the outlook is overcast, the temperature is cool, the humidity
is high, and it is windy out there?
outlook temperature humidity windy play
33/147
Issues in Decision Tree Learning
34/147
Linear Regression
Motivation: House pricing
Area (m2 ) Price (1000 USD) Area (m2 ) Price (1000 USD) Area (m2 ) Price (1000 USD)
158.9 208.5 79.3 132 122.4 40
117.2 181.5 93.3 149 114.1 149.35
165.9 223.5 120.4 90 114.6 179.9
159.5 140 103.5 159 157.9 165.5
204.2 250 124.4 139 145.0 277.5
126.5 143 220.7 325.3 227.8 309
157.4 307 102.9 139.4 101.9 145
194.2 200 166.8 230 120.5 153
164.8 129.9 98.5 129.9 98.2 109
100.1 118 98.5 154 107.0 82
96.6 129.5 148.6 256.3 123.0 160
215.9 345 83.6 134.8 123.4 170
84.7 144 158.3 306 82.1 144
138.8 279.5 148.6 207.5 87.1 130.25
116.4 157 48.3 68.5 106.8 141
• Task: based on the above data, predict the house price given its size/area.
35/147
Regression problem
36/147
Regression problem
36/147
Regression problem
36/147
Regression problem
36/147
Let’s look at the data ...
37/147
Hypothesis
38/147
Hypothesis
38/147
Two hypotheses and errors at each example
39/147
Linear regression
h(x1 , . . . , xn ) = w0 + w1 x1 + · · · + wn xn
where the xi ’s are the inputs, wi ’s are the parameters or weights of the hypothesis,
and w0 is called the intercept term.
• Linear regression aims to learn the parameters wi ’s in the above linear form.
• Using vector notation (see later), let x = (x0, x1, . . . , xn)^T with x0 = 1, and
w = (w0, w1, . . . , wn)^T. Then the hypothesis can be written as a matrix operation:
hw(x) = Σ_{i=0}^{n} wi xi = w^T x
40/147
Gradient descent
• How do we learn w? Choose a linear hyperplane closest to all given data points
• Particular parameters w correspond to a particular linear hyperplane.
• We need a cost function J(w) that measures how close the linear hyperplane is to all the data,
i.e., J(w) is minimal when w makes the linear hyperplane closest to the data.
• We assume J is a paraboloid, a convex function: with suitable axes, J always has a
global minimum.
• Finding the linear hyperplane closest to the given data points ≡ finding the minimum of J.
41/147
Deriving gradient descent rule
• For simplicity, let w = (w0, w1), i.e., J(w) = J(w0, w1) is a convex paraboloid in 3D whose
vertical axis is parallel to the z-axis.
• Let A = (w0, w1, J(w0, w1)) be any point on the paraboloid. In which direction should we move
from A so that we get the maximum decrease in J?
• Let O be the origin and B = (w0′, w1′, J(w0′, w1′)) the destination of our move.
• Let P = (w0, w1) and Q = (w0′, w1′) be, respectively, the projections of A and B onto the xy-plane.
• The vector PQ is the projection of the vector of our move, and moving from A to B on the
paraboloid is equivalent to moving from P to Q in the xy-plane.
• The largest decrease of J happens in the direction of −∇J.
• Thus, we choose Q in the direction of −∇J:
OQ = OP + PQ  ⇔  (w0′, w1′)^T = (w0, w1)^T − δ ∇J/‖∇J‖
where δ is a step size (convergence is guaranteed if δ is sufficiently small).
• −∇J/‖∇J‖ is the unit vector in the direction opposite to ∇J (we only care about the direction).
• PQ is a small shift in the direction of −∇J, hence a multiple of that unit vector.
42/147
Gradient descent rule
The previous equation embodies the gradient descent algorithm, often written:
w := w − α∇J(w)
Here, α is called the learning rate; like δ, it suffices to choose α sufficiently small.
43/147
Gradient descent for linear regression
• The gradient descent for linear regression uses the sum of squares as the cost
function over m training examples (x(i) , y (i) ), i = 1, . . . , m:
J(w) = (1/2) Σ_{i=1}^{m} (hw(x^(i)) − y^(i))²
44/147
Concrete update rule for linear regression
Parameter update in gradient descent requires computing ∇J. We can derive a concrete
update rule such that the ∇J expression no longer appears.
46/147
Batch gradient descent
Applying the previous partial derivative to the gradient descent algorithm, we obtain for
each parameter wj :
wj := wj − α ∂J(w)/∂wj   ⇒   wj := wj + α Σ_{i=1}^{m} (y^(i) − hw(x^(i))) xj^(i)
The previous partial derivative allows us to derive the following batch gradient descent
algorithm:
47/147
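As a concrete illustration of the batch update above, here is a minimal NumPy sketch of batch gradient descent for linear regression; the learning rate, iteration count, and toy data are arbitrary illustrative choices, not values from the slides.

import numpy as np

def batch_gradient_descent(X, y, alpha=0.02, iters=5000):
    """X: (m, n) inputs without the bias column; y: (m,) targets."""
    m, n = X.shape
    Xb = np.hstack([np.ones((m, 1)), X])      # prepend x0 = 1
    w = np.zeros(n + 1)
    for _ in range(iters):
        errors = y - Xb @ w                   # y^(i) - h_w(x^(i)) for every example
        w = w + alpha * Xb.T @ errors         # w_j += alpha * sum_i error_i * x_j^(i)
    return w

# Toy 1D example: area -> price (arbitrary numbers, roughly y = 1 + 2x).
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([3.0, 5.0, 7.0, 9.0])
print(batch_gradient_descent(X, y))           # approximately [1, 2]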
Gradient descent contour
The ellipses are the contours of the quadratic function J. Also shown is the trajectory
taken by gradient descent initialized at (48, 30).
Image source: Andrew Ng, CS229 Lecture Notes for CS229 Fall 2018 course, Part I: Linear Regression, page 6.
48/147
Termination condition
• Fixed number of iterations: for a fixed step size and under certain conditions, if w^(k) is
w at the k-th iteration and w∗ is the optimal parameter vector, then J(w^(k)) − J(w∗) ≤ ε
after O(1/ε) iterations.
• ‖w′ − w‖ (change in parameter values) or |J(w′) − J(w)| (change in error value) is
small enough, where w′ is w after a single update.
• ‖∇J(w)‖ is small enough: the closer it is to the minimum, the smaller the gradient is.
49/147
Stochastic/incremental gradient descent
• Batch gradient descent must scan through the entire training set before making a
single update.
• May be expensive if the size of training set is very large.
• Alternative: stochastic gradient descent that performs update to the parameters
according to the gradient of the error/loss at each single example.
• SGD may get close to the minimum faster than batch gradient descent, but may never
converge to it (the parameters keep oscillating around the minimum).
50/147
Closed form solution to linear regression
• Besides using an iterative algorithm, we can find a closed-form solution for the
optimal weight w.
• Most regression problems (especially nonlinear ones) do not have a closed-form solution.
• Let X be the design matrix containing all training examples as its rows and y the
vector containing all target values from the training set.
X = [ (x^(1))^T ; (x^(2))^T ; . . . ; (x^(m))^T ],   y = (y^(1), y^(2), . . . , y^(m))^T
• If J(w) is the sum-of-squares cost function, then it can be shown that
∇J = X^T X w − X^T y, which is minimized when X^T X w = X^T y.
• Thus, the optimal w is w = (X^T X)^{−1} X^T y.
51/147
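A minimal NumPy sketch of the closed-form (normal equation) solution above; solving the linear system instead of forming an explicit inverse is a common numerical choice, not something prescribed by the slides, and the toy data is arbitrary.

import numpy as np

def normal_equation(X, y):
    """Solve X^T X w = X^T y for w (X already includes the bias column x0 = 1)."""
    return np.linalg.solve(X.T @ X, X.T @ y)

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])  # first column is x0 = 1
y = np.array([3.0, 5.0, 7.0, 9.0])
print(normal_equation(X, y))   # approximately [1, 2], i.e., y = 1 + 2x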
Why not just use the closed-form solution?
52/147
Logistic Regression
Motivation
• Linear regression algorithms can be used, but it is quite simple to construct examples
where the algorithm performs very poorly.
• Intuitively, it makes no sense to have target values larger than 1 or smaller than 0.
• Can we still use the same linear form, but ensure the output is neither below 0 nor
above 1?
• Idea: scale the output of the linear form, which is R, into the range [0, 1] using some
function (called the link function).
• Intuition: the weights/parameters transform the input into a point in an intermediate space,
and the link function transforms that point into the probability that the input belongs to
the class (given by the label).
53/147
Logistic function
Consider the following continuous function, called the logistic function or sigmoid function:
g(z) = 1 / (1 + e^{−z}) = e^z / (e^z + 1)
What properties are satisfied by g(z)?
• 0 < g(z) < 1 for every z ∈ R. Why? Because e^z > 0 for all z ∈ R.
• g(z) is symmetric with respect to the point (0, 0.5). Why?
For every z ∈ R, g(−z) = e^{−z} / (e^{−z} + 1) = (e^{−z} + 1 − 1) / (e^{−z} + 1) = 1 − 1/(e^{−z} + 1) = 1 − g(z). So, let h(z) = g(z) − 0.5.
Then h(−z) = g(−z) − 0.5 = 1 − g(z) − 0.5 = 0.5 − g(z) = −(g(z) − 0.5) = −h(z), i.e., h(z) is
an odd function, which always has rotational symmetry with respect to (0, 0). Thus,
g(z) = h(z) + 0.5 must have rotational symmetry with respect to (0, 0.5).
• g(z) is strictly increasing. Why?
For z1, z2 ∈ R, if z1 < z2, then e^{−z1} > e^{−z2}. Thus, g(z1) = 1/(1 + e^{−z1}) < 1/(1 + e^{−z2}) = g(z2).
• The derivative g′(z) of g(z) satisfies g′(z) = g(z)(1 − g(z)). (Prove it!)
54/147
Logistic function
Image source: Andrew Ng, CS229 Lecture Notes for CS229 Fall 2018 course, Part II: Classification and Logistic Regression, page 17.
55/147
Logistic regression
• Logistic function is a natural choice (Why? Read the literature on Generalized Linear
Model) for our classification setting (though other functions that smoothly increase
from 0 to 1 can also be used).
• With input x of n features (x0, x1, . . . , xn), we have a hypothesis of the form:
hw(x) = g(w^T x) = 1 / (1 + e^{−w^T x})
• The model parameter is w like in linear regression.
• Classification is then simply done by taking the threshold of 0.5, i.e., x is labeled 1 if
hw (x) > 0.5 and 0 otherwise.
56/147
Likelihood function for logistic regression
How do we learn w? Or which value of w is best supported by the available data, i.e., the
training examples?
• The data is given by the design matrix X (all training inputs as its rows) and the vector y (all
target values in the training set).
• For a particular value of w, the likelihood function LX,y (w) gives the probability (or probability
density) of the data X, y to occur when the parameter of the model is w.
• Our model represents the (probabilistic) relationship between the known input X and target y.
That is, the conditional distribution pw (y|X) of y given X, which is parameterized by w.
• Thus, the likelihood of w given the data: LX,y (w) = pw (y, X) = pw (y|X) where the last
equality is because in our model the input (for which X is a sample) is always known.
Our aim is thus finding the value of w that maximizes the conditional probability above,
i.e., maximizes the likelihood.
57/147
Maximizing likelihood for logistic regression
For logistic regression, we assume that the prediction hw(x) is the expected value
of the target variable y given the data X = x, which follows the conditional distribution
pw(y | x). Since the target value can only be 1 or 0 (binary classification), we write:
pw(y = 1 | x) = hw(x) and pw(y = 0 | x) = 1 − hw(x), i.e., pw(y | x) = hw(x)^y (1 − hw(x))^{1−y}
• This corresponds to our earlier intuition that w brings x to w^T x and then the logistic
function brings w^T x to the appropriate probability value.
58/147
Maximizing likelihood for logistic regression (2)
• Next, we assume that X and y give m training examples (x(i) , y (i) ), each generated
independently.
• Then, the likelihood LX,y (w) corresponds to joint distribution over all training
examples (x(i) , y (i) ).
LX,y(w) = Π_{i=1}^{m} pw(y^(i) | x^(i)) = Π_{i=1}^{m} hw(x^(i))^{y^(i)} (1 − hw(x^(i)))^{1−y^(i)}
59/147
Maximizing likelihood for logistic regression (3)
w := w + α ∇ℓ(w), where ℓ(w) = log LX,y(w) is the log-likelihood (gradient ascent, since we maximize ℓ)
60/147
Maximizing likelihood for logistic regression (4)
∂ℓ(w)/∂wj = Σ_{i=1}^{m} (y^(i) − hw(x^(i))) xj^(i)   (the derivation uses the identity g′(z) = g(z)(1 − g(z)))
61/147
Batch & stochastic gradient ascent for logistic regression
The gradient ascent update turns out to be the same as the gradient descent update in
linear regression, except that hw is no longer linear in x^(i). This is actually not a coincidence
– for a deeper discussion, read the topic of Generalized Linear Models.
wj := wj + α Σ_{i=1}^{m} (y^(i) − hw(x^(i))) xj^(i)
62/147
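A minimal NumPy sketch of the gradient-ascent update above for logistic regression; the learning rate, iteration count, and toy data are arbitrary illustrative choices.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_regression(X, y, alpha=0.1, iters=10000):
    """X: (m, n) inputs without the bias column; y: (m,) labels in {0, 1}."""
    m, n = X.shape
    Xb = np.hstack([np.ones((m, 1)), X])          # x0 = 1
    w = np.zeros(n + 1)
    for _ in range(iters):
        h = sigmoid(Xb @ w)                       # h_w(x^(i)) for every example
        w = w + alpha * Xb.T @ (y - h)            # gradient ascent on the log-likelihood
    return w

def predict(w, X):
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    return (sigmoid(Xb @ w) >= 0.5).astype(int)   # threshold at 0.5

X = np.array([[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]])
y = np.array([0, 0, 0, 1, 1, 1])
w = logistic_regression(X, y)
print(w, predict(w, np.array([[1.2], [3.2]])))    # expect labels [0, 1]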
Naive Bayes Classifier
Roles of Bayesian Method
63/147
Bayes Theorem
P(h | D) = P(D | h) P(h) / P(D)
• P(h | D): the posterior probability of hypothesis h given the data D
64/147
Naive Bayes Classifier
Along with decision trees, neural networks, and nearest neighbor methods, this is one of the most
practical learning methods.
When to use:
Successful applications:
• Diagnosis
• Classifying text documents
65/147
Naive Bayes Classifier
which gives
Naive Bayes classifier: v_NB = arg max_{vj ∈ V} P(vj) Π_i P(ai | vj)
66/147
Naive Bayes Algorithm
Naive_Bayes_Learn(examples): estimate P̂(vj) and P̂(ai | vj) from the examples.
Classify_New_Instance(x):
v_NB = arg max_{vj ∈ V} P̂(vj) Π_{ai ∈ x} P̂(ai | vj)
67/147
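A minimal plain-Python sketch of the learner and classifier above for categorical attributes; the attribute names and the tiny dataset are hypothetical, and the estimates are raw relative frequencies (no smoothing).

from collections import Counter, defaultdict

def naive_bayes_learn(examples, labels):
    """Estimate P(v_j) and P(a_i | v_j) by relative frequencies."""
    prior = {v: c / len(labels) for v, c in Counter(labels).items()}
    cond = defaultdict(lambda: defaultdict(Counter))    # cond[v][attr][value] = count
    class_count = Counter(labels)
    for ex, v in zip(examples, labels):
        for attr, value in ex.items():
            cond[v][attr][value] += 1
    return prior, cond, class_count

def classify(x, prior, cond, class_count):
    """v_NB = argmax_v P(v) * prod_i P(a_i | v)."""
    best, best_score = None, -1.0
    for v, p in prior.items():
        score = p
        for attr, value in x.items():
            score *= cond[v][attr][value] / class_count[v]
        if score > best_score:
            best, best_score = v, score
    return best

examples = [{"deadline": "urgent", "party": "yes"}, {"deadline": "none", "party": "yes"},
            {"deadline": "none", "party": "no"},   {"deadline": "urgent", "party": "no"}]
labels = ["go_party", "go_party", "study", "study"]
prior, cond, counts = naive_bayes_learn(examples, labels)
print(classify({"deadline": "none", "party": "no"}, prior, cond, counts))   # "study"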
Example
Determine what to do in the evening given that you have deadlines looming but none are
particularly urgent, there is no party currently going on, and you are currently feeling lazy,
based on the following data.
68/147
Naive Bayes: Subtleties
Consider logistic regression with hw (x) = g(wT x) with g the logistic function.
70/147
Motivation (2)
We consider linear classifiers (those with a linear decision boundary) for the binary classification
problem.
• Here, we directly predict a “1” or “-1”, instead of employing the logistic regression
approach of estimating the probability that y = 1
72/147
Which classification line is better than the others?
Which decision boundary is better than the others? Why? (MLAP, Fig. 8.1.)
73/147
Margin: Optimal separation
• Measure the distance from the line before hitting a datapoint: imagine a “(wide)
street” around the line.
• Any point on the street is too close to the line to be accurately classified.
• The street is symmetric about the line: a region in 2D, a cuboid in 3D, a hyperrectangle for D > 3.
• The width of this street around a separating hyperplane is called the margin.
Which of the three examples of separating hyperplanes in the previous slide seem to
capture the maximum margin, i.e., optimal separating hyperplane?
74/147
Support vectors
• Assuming linearly separable data, the classifier with the largest margin is called the
maximum margin (linear) classifier.
• Datapoints of each class that lie closest to the classification line: support vectors.
• Maximum margin classifier is one that goes through the middle of the “street”
• margin should be as large as possible;
• support vectors are most useful datapoints since they are the ones that might be
misclassified
• once maximum margin is found, classification of a test point only depends on the support
vectors.
• Support vector machine (SVM) [Vapnik, 1992] performs binary classification by
finding a maximum margin classifier in:
• the original input space if the data is linearly separable
• a higher dimensional space to which data is transformed (using kernel functions) if the
data is not linearly separable.
75/147
Linear decision boundary (1)
Assume data is linearly separable into two classes (region Ry>0 and Ry<0 ). The decision
boundary is a linear hyperplane: y = y(x) = wT x + b = 0 where w: weight vector of the
features, b: intercept term. E.g., for 2 features: the line w1 x1 + w2 x2 + b = 0.
76/147
Linear decision boundary (2)
Assume data is linearly separable into two classes (region Ry>0 and Ry<0 ). The decision
boundary is a linear hyperplane: y = y(x) = wT x + b = 0 where w: weight vector of the
features, b: intercept term. E.g., for 2 features: the line w1 x1 + w2 x2 + b = 0.
77/147
Distance from decision hyperplane to any point (1)
78/147
Distance from decision hyperplane to any point (2)
79/147
Distance from decision hyperplane to any point (3)
80/147
Distance from decision hyperplane to any point (4)
81/147
Maximum margin solution: Optimization (1)
• Recall that for every input x(i) we have a target t(i) ∈ {−1, 1}.
• If data is linearly separable, there is at least one choice of w and b such that y(x(i) ) > 0
whenever t(i) = +1 and y(x(i) ) < 0 whenever t(i) = −1. Thus, for every point x(i) ,
t(i) y(x(i) ) > 0.
• The unsigned distance between a point x^(i) and the decision hyperplane is
t^(i) y(x^(i)) / ‖w‖ = t^(i)(w^T x^(i) + b) / ‖w‖.
The margin is the unsigned distance between the decision hyperplane and the closest points to it.
• We want to maximize the margin, thus we solve
arg max_{w,b} { (1/‖w‖) min_i [ t^(i)(w^T x^(i) + b) ] }
82/147
Maximum margin solution: Optimization (2)
• Suppose the “inner” minimization gives a solution min_i [t^(i)(w^T x^(i) + b)] = γ̂. Then, the
maximum margin solution can be written:
arg max_{w,b} γ̂ / ‖w‖   subject to the constraints t^(i)(w^T x^(i) + b) ≥ γ̂, for i = 1, . . . , m
• Rescaling (w, b) to (cw, cb) with c > 0 does not change the distances:
t^(i)((cw)^T x^(i) + cb) / ‖cw‖ = c t^(i)(w^T x^(i) + b) / (c‖w‖) = t^(i)(w^T x^(i) + b) / ‖w‖
so we are free to rescale so that γ̂ = 1.
83/147
Graph of 1/‖w‖ with w = (w1, w2)^T
Maximum margin solution: Optimization (3)
• Maximizing 1/‖w‖ is equivalent to minimizing ‖w‖², which is convex.
• Margin maximization thus reduces to solving a minimization problem, a quadratic
programming problem with linear constraints:
arg min_{w,b} (1/2) ‖w‖²   subject to t^(i)(w^T x^(i) + b) ≥ 1, for all i = 1, . . . , m
85/147
Maximum margin solution: Optimization (4)
86/147
Brief intro on Lagrangian (1)
The Lagrangian combines the objective function and the constraints into a single objective function:
L(x, λ) = f(x) + Σ_{i=1}^{m} λi gi(x)
where the λi ≥ 0 are the Lagrange multipliers (here for the problem of minimizing f(x) subject to
gi(x) ≤ 0, i = 1, . . . , m).
87/147
Brief intro on Lagrangian (2)
• For each feasible x (those that satisfy all constraints), L(x, λ) is a lower bound of f(x) for all λ ≥ 0:
• The 2nd term of the Lagrangian satisfies Σ_{i=1}^{m} λi gi(x) ≤ 0 due to the fact that, for all i,
λi ≥ 0 and gi(x) ≤ 0.
• Thus, f(x) ≥ f(x) + Σ_{i=1}^{m} λi gi(x) = L(x, λ).
• The greater Σ_{i=1}^{m} λi gi(x) (i.e., the closer to 0), the closer L(x, λ) is to f(x). So we can write
max_λ L(x, λ) = f(x) if gi(x) ≤ 0 for i = 1, . . . , m, and ∞ otherwise.
Note that the max operator is over λ, not x. Also, the 2nd case is for x that is not feasible: we
penalize L(x, λ) by defining it to be very large (hence it won't be returned as a minimum over x).
• The primal minimization problem can then be written in terms of the Lagrangian as:
min_x f(x) = min_x max_λ L(x, λ) = min_x max_λ [ f(x) + Σ_{i=1}^{m} λi gi(x) ]   subject to λi ≥ 0, i = 1, . . . , m
88/147
Brief intro on Lagrangian (3)
• Computing min_x max_λ L(x, λ) may be hard. What if we compute max_λ min_x L(x, λ)
instead?
• We define the function ℓ(λ) = min_x L(x, λ), called the Lagrange dual function. Hence,
we focus on solving the Lagrange dual problem: max_λ ℓ(λ) subject to λi ≥ 0, i = 1, . . . , m.
• The solution of the above corresponds to the tightest/greatest lower bound of f(x).
• What is the relation between min_x max_λ L(x, λ) and max_λ min_x L(x, λ)?
⇒ max_λ ℓ(λ) = max_λ min_x L(x, λ) ≤ max_λ L(x, λ) (for all feasible x)
⇒ max_λ ℓ(λ) = max_λ min_x L(x, λ) ≤ min_x max_λ L(x, λ) = min_x f(x)
89/147
• The problem: min f(x) = x³ + 3x² − x + 1 subject to −1 ≤ x ≤ 2, or equivalently in
standard form, subject to (x + 1)(x − 2) ≤ 0.
• Lagrangian: L(x, λ) = x³ + 3x² − x + 1 + λ(x² − x − 2), which will be below f(x) in the
feasible region −1 ≤ x ≤ 2 when λ ≥ 0.
• The corresponding dual function ℓ(λ) never exceeds min f(x) whenever x is feasible.
Brief intro on Lagrangian (4)
• Our inequality max_λ ℓ(λ) ≤ min_x f(x) is called weak duality, which holds even when f
is non-convex.
• The difference min_x f(x) − max_λ ℓ(λ) ≥ 0 gives the duality gap between the optimal
value of the primal problem and the best/greatest lower bound on it that can be
obtained from the dual function.
• In general, the gap may be positive/nonzero. However, under certain conditions, the
gap is zero, i.e., max_λ ℓ(λ) = min_x f(x), which implies that a solution to the dual
problem is exactly a solution to the primal problem.
91/147
Brief intro on Lagrangian (5)
For a primal optimization problem of minimizing a not necessarily convex f(x) subject to
gi(x) ≤ 0 for i = 1, . . . , m, assume that the gi's are all differentiable.
Let ℓ(λ) be the dual function for the above primal problem. Further, suppose x∗ and λ∗ are
such that f(x∗) is the minimum and ℓ(λ∗) is the maximum, respectively. That is, x∗ is primal
optimal and λ∗ is dual optimal.
Theorem
If x∗ and λ∗ are primal and dual optimal with zero duality gap, then all of the following
conditions, called the Karush-Kuhn-Tucker (KKT) conditions, hold:
1. gi(x∗) ≤ 0 for i = 1, . . . , m
2. λ∗i ≥ 0 for i = 1, . . . , m
3. λ∗i gi(x∗) = 0 for i = 1, . . . , m
4. ∇x f(x∗) + Σ_{i=1}^{m} λ∗i ∇x gi(x∗) = 0
That is, the KKT conditions are necessary conditions for the primal and dual optimal points to
attain the minimum (of the primal problem) and the maximum (of the dual problem) at the same value,
i.e., min f(x) = max ℓ(λ).
92/147
Brief intro to Lagrangian (6)
That is, KKT conditions are also sufficient conditions for the primal and dual optimal points
to give minima and maxima at the same value.
• These theorems mean that to solve the primal problem, we can simply solve the
dual problem and ensure that the KKT conditions are satisfied.
93/147
Dual representation (1)
The Lagrangian of the primal problem (minimize (1/2)‖w‖² subject to t^(i)(w^T x^(i) + b) ≥ 1) is
L(w, b, λ) = (1/2)‖w‖² − Σ_{i=1}^{m} λi [t^(i)(w^T x^(i) + b) − 1], with λi ≥ 0 for i = 1, . . . , m.
94/147
Dual representation (2)
Next, we minimize L(w, b, λ) over w and b, i.e., we compute its derivatives w.r.t. w and b. Any solution
for w and b obtained by setting the derivatives to zero gives us w and b that satisfy the fourth KKT
condition.
• Setting the derivative w.r.t. w to zero gives w = Σ_{i=1}^{m} λi t^(i) x^(i).
• Setting the derivative w.r.t. b to zero gives Σ_{i=1}^{m} λi t^(i) = 0.
95/147
Dual representation (3)
‖w‖² = w^T w = ( Σ_{j=1}^{m} λj t^(j) (x^(j))^T ) ( Σ_{i=1}^{m} λi t^(i) x^(i) ) = Σ_{1≤i,j≤m} λj λi t^(j) t^(i) (x^(j))^T x^(i)
w^T x^(i) = Σ_{j=1}^{m} λj t^(j) (x^(j))^T x^(i)
We substitute w = Σ_{i=1}^{m} λi t^(i) x^(i) and Σ_{i=1}^{m} λi t^(i) = 0 into the Lagrangian function to obtain the dual form:
W(λ) = L(w, b, λ) = (1/2)‖w‖² − Σ_{i=1}^{m} λi [t^(i)(w^T x^(i) + b) − 1]
= (1/2)‖w‖² + Σ_{i=1}^{m} λi − Σ_{i=1}^{m} λi t^(i) w^T x^(i) − b Σ_{i=1}^{m} λi t^(i)
= Σ_{i=1}^{m} λi + (1/2) Σ_{1≤i,j≤m} λj λi t^(j) t^(i) (x^(j))^T x^(i) − Σ_{i=1}^{m} λi t^(i) Σ_{j=1}^{m} λj t^(j) (x^(j))^T x^(i)
= Σ_{i=1}^{m} λi + (1/2) Σ_{1≤i,j≤m} λj λi t^(j) t^(i) (x^(j))^T x^(i) − Σ_{1≤i,j≤m} λj λi t^(j) t^(i) (x^(j))^T x^(i)
= Σ_{i=1}^{m} λi − (1/2) Σ_{1≤i,j≤m} λj λi t^(j) t^(i) (x^(j))^T x^(i)
96/147
Dual representation (4)
Our dual form is then maximized (to get the tightest lower bound of the original objective
function) giving us the following dual optimization problem:
max_λ W(λ) = Σ_{i=1}^{m} λi − (1/2) Σ_{1≤i,j≤m} λi λj t^(i) t^(j) (x^(i))^T x^(j)
           = Σ_{i=1}^{m} λi − (1/2) Σ_{1≤i,j≤m} λi λj t^(i) t^(j) ⟨x^(i), x^(j)⟩
subject to λi ≥ 0, i = 1, . . . , m, and Σ_{i=1}^{m} λi t^(i) = 0
where we express the objective function using inner product, which will be useful when
working with kernel functions later.
97/147
Satisfaction of KKT conditions
Our primal optimization problem satisfies the KKT conditions for all points x^(i), i = 1, . . . , m,
when w = Σ_{i=1}^{m} λi t^(i) x^(i) and Σ_{i=1}^{m} λi t^(i) = 0:
98/147
Support vectors
Because w, b, and λ satisfy the third KKT condition, λi (1 − t^(i)(w^T x^(i) + b)) = 0, it follows that for
every data point x^(i), i = 1, . . . , m, either
• λi = 0; or
• t^(i)(w^T x^(i) + b) = t^(i) y(x^(i)) = 1.
Moreover:
• Solving the dual problem gives us λ and, from it, w = Σ_{i=1}^{m} λi t^(i) x^(i), which is optimal for the
primal problem since the KKT conditions are satisfied.
• So, for any x, y(x) = w^T x + b = Σ_{i=1}^{m} λi t^(i) (x^(i))^T x + b.
• Points x^(i) for which λi = 0 will not appear in the sum.
• The remaining data points x^(k) must satisfy t^(k)(w^T x^(k) + b) = t^(k) y(x^(k)) = 1 and will appear in
the sum. Such points are the support vectors, which lie closest to the decision hyperplane
because they satisfy t^(k) y(x^(k)) = 1.
• After the SVM model is trained, only the support vectors need to be kept; the rest are
discarded, and the hypothesis for any x becomes y(x) = Σ_{k∈S} λk t^(k) (x^(k))^T x + b, where S
is the set of indices of the support vectors.
99/147
The intercept term
• We can solve for b above using any arbitrarily chosen support vector, but a more numerically
stable solution is obtained as follows:
For any support vector x^(r), r ∈ S:
t^(r) ( b + Σ_{k∈S} λk t^(k) (x^(k))^T x^(r) ) = 1
⇔ (t^(r))² b + (t^(r))² Σ_{k∈S} λk t^(k) (x^(k))^T x^(r) = t^(r)
⇔ b + Σ_{k∈S} λk t^(k) (x^(k))^T x^(r) = t^(r)   because (t^(r))² = 1 for all r ∈ S
Summing over all r ∈ S:
|S| b + Σ_{r∈S} Σ_{k∈S} λk t^(k) (x^(k))^T x^(r) = Σ_{r∈S} t^(r)
⇔ b = (1/|S|) Σ_{r∈S} ( t^(r) − Σ_{k∈S} λk t^(k) (x^(k))^T x^(r) )
100/147
SVM: How to classify a new point with trained model
Given a new datapoint x, the trained model (i.e., after w, b, and the support vectors are found)
can classify x by evaluating the sign of y(x), given by
y(x) = w^T x + b = b + Σ_{i=1}^{m} λi t^(i) (x^(i))^T x = b + Σ_{i=1}^{m} λi t^(i) ⟨x^(i), x⟩ = b + Σ_{k∈S} λk t^(k) ⟨x^(k), x⟩
101/147
SVM algorithm (linearly separable case)
Initialization: Assume the data are the column vectors x^(i) of X and let K = X^T X. Then K_{i,j} contains
the result of the inner product ⟨x^(i), x^(j)⟩.
Training
• Assemble matrices from the dual optimization and pass them to a quadratic programming
solver:
arg max_λ W(λ) = Σ_{i=1}^{m} λi − (1/2) Σ_{1≤i,j≤m} λi λj t^(i) t^(j) ⟨x^(i), x^(j)⟩
subject to λi ≥ 0, i = 1, . . . , m, and Σ_{i=1}^{m} λi t^(i) = 0
The right sum can be written as a matrix operation, while λi ’s and t(i) ’s can be written as
vectors (how?).
• Identify the support vectors.
• Compute optimal value of b.
102/147
SVM algorithm (linearly separable case) (cont.)
Test/prediction
• For the given test data z, use the support vectors to classify the data.
• Compute y(z) = b + Σ_{k∈S} λk t^(k) ⟨x^(k), z⟩ where the sum ranges over the indices of the
support vectors only.
• Return the label corresponding to the sign of the above expression.
103/147
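The test-time computation above depends only on the support vectors. Here is a minimal NumPy sketch of that decision function; the multipliers lambda, the labels t, the support vectors, and b are assumed to have been produced by a quadratic programming solver, which is not shown.

import numpy as np

def svm_decision(z, support_x, support_t, support_lam, b, kernel=np.dot):
    """y(z) = b + sum_{k in S} lambda_k * t^(k) * <x^(k), z>; its sign gives the class."""
    s = b
    for x_k, t_k, lam_k in zip(support_x, support_t, support_lam):
        s += lam_k * t_k * kernel(x_k, z)
    return s

def svm_classify(z, *args, **kwargs):
    return 1 if svm_decision(z, *args, **kwargs) >= 0 else -1

# Hypothetical values, as if produced by solving the dual QP for two support vectors.
support_x = [np.array([1.0, 1.0]), np.array([2.0, 2.0])]
support_t = [-1, +1]
support_lam = [1.0, 1.0]
b = -3.0
print(svm_classify(np.array([0.0, 0.0]), support_x, support_t, support_lam, b))   # -1 side
print(svm_classify(np.array([3.0, 3.0]), support_x, support_t, support_lam, b))   # +1 side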
What if data is not linearly separable?
If data is not linearly separable, then the constraints of the primal problem cannot be
satisfied.
• Relax the (hard) constraints into a soft one using slack variables. [Read Bishop’s
PRML, Section 7.1]
• Transform the data into a higher dimensional space using kernel functions (with the
hope that the data becomes linearly separable in the chosen higher dimensional
space).
104/147
Kernel functions for SVM
If we wish to transform the data into a higher dimensional space, we wish to retain the
inner product in the hypothesis, namely the hypothesis should be of the form:
y(x) = b + Σ_{k∈S} λk t^(k) ⟨φ(x^(k)), φ(x)⟩
105/147
Kernel functions for SVM
Let n be the dimension of the input x. Then, the following is an example of a feature map φ whose
kernel is the polynomial kernel of degree 2:
φ(x) = φ(x1, . . . , xn) = (1, √2 x1, . . . , √2 xn, x1², . . . , xn², √2 x1x2, √2 x1x3, . . . , √2 xi xj, . . . , √2 xn−1 xn)^T
with 1 ≤ i < j ≤ n.
• If n = 1, φ(x) = (1, √2 x1, x1²)^T
• If n = 3, φ(x) = (1, √2 x1, √2 x2, √2 x3, x1², x2², x3², √2 x1x2, √2 x1x3, √2 x2x3)^T
Unfortunately, φ(x) has 1 + 2n + n(n − 1)/2 = O(n²) elements, which makes computing
⟨φ(x^(i)), φ(x^(j))⟩ directly costly. Is there a cheaper way? It turns out that
⟨φ(x), φ(z)⟩ = (1 + x^T z)², which takes only O(n) time to compute.
106/147
Kernel trick
• So, the inner product in the higher dimensional space need not be actually computed
in the higher dimensional space!
• Simply compute the inner product in the original input space, and perform the
transformation afterwards.
• This is called the kernel trick.
107/147
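A quick numerical check of the kernel trick for the degree-2 polynomial feature map above: the inner product of the O(n²)-dimensional feature vectors equals (1 + x·z)², which costs only O(n). This NumPy sketch uses arbitrary test vectors.

import numpy as np

def phi(x):
    """Explicit degree-2 polynomial feature map from the slides."""
    n = len(x)
    sq2 = np.sqrt(2.0)
    feats = [1.0] + [sq2 * xi for xi in x] + [xi * xi for xi in x]
    feats += [sq2 * x[i] * x[j] for i in range(n) for j in range(i + 1, n)]
    return np.array(feats)

def poly2_kernel(x, z):
    """Kernel trick: the same inner product computed in the original input space."""
    return (1.0 + np.dot(x, z)) ** 2

x = np.array([1.0, 2.0, 3.0])
z = np.array([0.5, -1.0, 2.0])
print(np.dot(phi(x), phi(z)), poly2_kernel(x, z))   # both print 30.25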
Examples of kernels
108/147
Example
From PRML Fig. 7.2. The support vectors are the circled points.
109/147
Multiclass SVMs
110/147
Multiclass SVMs (cont.)
111/147
Multiclass SVMs (cont.)
• (Weston and Watkins, 1999): Define a single objective function for training K SVMs
simultaneously, based on maximizing the margin from each class to the remaining classes.
• Slower training: instead of the O(KN²) cost of solving K optimization problems over N data
points, we solve a single optimization problem of size (K − 1)N with cost O(K²N²).
• One-versus-one approach. Train K(K − 1)/2 binary SVMs on all possible pairs of
classes and classify datapoints according to which class has the highest number of
“votes”.
• Problem 1: ambiguities in resulting classification.
• For large K, more training time than one-versus-the-rest approach.
• More computation required for evaluating test points; can be somewhat alleviated by
arranging pairwise classifiers into a directed acyclic graph (DAGSVM) with K(K − 1)/2
classifiers and only K − 1 pairwise classifiers evaluated when testing points.
112/147
Nearest Neighbor Methods
Motivation
113/147
Histogram approach
114/147
Histogram approach: Lesson learned
115/147
Histogram approach: Lesson learned (cont.)
116/147
General form of density estimate
Bin(K | N, P) = N! / (K!(N − K)!) · P^K (1 − P)^{N−K}
117/147
General form of density estimate (cont.)
• For large N, the binomial distribution peaks sharply around the mean, i.e., K ≈ NP.
• But if R is assumed to be so small that p(x) is roughly constant within it, then
P ≈ p(x)V where V is the (spherical) volume of R.
• So, for sufficiently large N, the density estimate is:
p(x) = K / (NV)
• Hence, estimating the density can be approached in two ways:
• Fixing K and determining V from the data: K-nearest neighbor
• Fixing V and determining K from the data: kernel estimator
118/147
Nearest neighbor methods
• We estimate p(x) by considering a small sphere of volume V centered on the point x and letting
the radius of the sphere grow until it covers K data points.
• K governs degree of smoothing (see figure): there is an optimum K.
• Model produced by K-nearest neighbor is not a true density model (integral diverges).
119/147
K-nearest neighbor for classification
• Let Nk be the number of points in class Ck, with N = Σk Nk.
• To classify a point x, draw a sphere centered on x containing precisely K points
(irrespective of their class).
• The shape of the sphere depends on the distance measure used, e.g., Euclidean distance,
Mahalanobis distance, etc.
• Suppose the sphere has volume V and contains Kk points from class Ck. Then, the
density estimate of each class is:
p(x | Ck) = Kk / (Nk V)
• Unconditional density:
p(x) = K / (NV)
• Class priors:
p(Ck) = Nk / N
120/147
K-nearest neighbor for classification (cont.)
p(Ck | x) = p(x | Ck) p(Ck) / p(x) = Kk / K
121/147
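A minimal NumPy sketch of K-nearest-neighbor classification using the posterior Kk/K above, with Euclidean distance; K and the toy points are arbitrary illustrative choices.

import numpy as np
from collections import Counter

def knn_classify(x, X_train, y_train, K=3):
    """Assign x to the class with the largest K_k among its K nearest neighbors."""
    dists = np.linalg.norm(X_train - x, axis=1)       # Euclidean distances to every training point
    nearest = np.argsort(dists)[:K]                   # indices of the K closest points
    votes = Counter(y_train[i] for i in nearest)      # K_k for each class C_k
    return votes.most_common(1)[0][0]

X_train = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                    [1.0, 1.0], [1.1, 0.9], [0.9, 1.1]])
y_train = ["A", "A", "A", "B", "B", "B"]
print(knn_classify(np.array([0.15, 0.1]), X_train, y_train, K=3))   # expect "A"
print(knn_classify(np.array([0.95, 1.0]), X_train, y_train, K=3))   # expect "B"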
Distance measure
122/147
Decision boundary
(a) A new point (black diamond) is classified according to the majority class membership of its 3
nearest neighbors.
(b) The decision boundary of 1-nearest neighbor is composed of hyperplanes forming the
perpendicular bisectors of pairs of points from different classes.
123/147
Example plot
124/147
Dimensionality Reduction:
Principal Component Analysis
Dimensionality reduction: Why?
• When plotting and inspecting data, can never go beyond 3D, and 2D is easier to
interpret.
• Curse of dimensionality: more dimensions → more training data needed.
• Dimensionality is explicit factor in computational cost.
• Can remove noise.
• Significantly improve performance of learning.
• Make dataset easier to work with.
• Make results easier to understand.
125/147
Dimensionality reduction: How?
• Feature selection: looking through available features and seeing if they are actually
useful (i.e., correlated to output variables).
• Even when using a neural network (which was supposed to avoid “getting your hands dirty”),
results can be much improved with feature selection.
• Feature derivation: deriving new features from the old ones through applying some
transform(s) to the dataset that simply change the axes (moving/rotating).
• Allows combining features and identifying useful ones.
• Clustering: group similar datapoints and see if this allows fewer features to be used.
126/147
Principal Component Analysis (PCA)
127/147
Intuitive reading of PCA algorithm
• Intuition:
• Center the data by subtracting off the mean.
• Choose direction with the largest variation
• Place an axis in that direction.
• Look at the remaining variation and find another axis that is orthogonal to the first axis and
covers as much of the remaining variation as possible.
• Iterate until no possible axes found.
• End result: all variation is along the axes, the covariance matrix is diagonal (each new
variable is uncorrelated with every variable except itself).
• Some axes found last have very little variation and can be removed without affecting
variability of data.
128/147
Formal derivation
Definition
Given an M × M matrix A, an eigenvector of A is a column vector v such that Av = λv
for some scalar λ. This equation is called the eigenvalue equation for the matrix A, and
the scalar λ is the corresponding eigenvalue.
Eigenvalues and eigenvectors appear in many fields of science, e.g., classical and
quantum mechanics, geology, epidemiology, any science that uses principal component
analysis (bioinformatics, data mining, chemistry, psychology, image and signal
processing).
130/147
Eigenvectors and eigenvalues (cont.)
The eigenvalue equation of A can be written (A − λI)v = 0 where I is the identity matrix.
• The equation has a nonzero solution in v if and only if the determinant |A − λI| = 0.
• An eigenvalue is thus a root of the polynomial of degree M induced by the determinant
expression. This polynomial is called the characteristic polynomial of A.
For a 2×2 matrix [[a, b], [c, d]], the determinant is ad − bc.
If A = [[2, 3], [2, 1]], then A − λI = [[2 − λ, 3], [2, 1 − λ]], whose determinant is
(2 − λ)(1 − λ) − 6 = λ² − 3λ − 4 = (λ − 4)(λ + 1). The eigenvalues are therefore λ = 4 and λ = −1.
131/147
Eigenvectors and eigenvalues (cont.)
To find an eigenvector, we substitute the eigenvalue back into the eigenvalue equation:
[[2, 3], [2, 1]] (x, y)^T = 4 (x, y)^T
giving us two linear equations, 2x + 3y = 4x and 2x + y = 4y. Solving these equations yields
x = (3/2) y. In fact, there are infinitely many solutions of the form x = 3t and y = 2t for
any real number t. So, there are infinitely many eigenvectors for the eigenvalue λ = 4,
including (3, 2)^T.
132/147
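The same eigenvalues and eigenvectors can be checked numerically with NumPy; note that np.linalg.eig returns unit-length eigenvectors, so the eigenvector for λ = 4 appears as a scaled version of (3, 2)^T, and the eigenvalues may come back in either order.

import numpy as np

A = np.array([[2.0, 3.0],
              [2.0, 1.0]])
eigenvalues, eigenvectors = np.linalg.eig(A)    # columns of `eigenvectors` are the eigenvectors
print(eigenvalues)                              # approximately 4 and -1 (in some order)
for lam, v in zip(eigenvalues, eigenvectors.T):
    print(lam, v, np.allclose(A @ v, lam * v))  # verify the eigenvalue equation A v = lam v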
Properties of eigenvalues and eigenvectors
• Eigenvectors are found only on square matrices, but not all square matrices have
eigenvectors.
• An N × N matrix can have at least d and at most N linearly independent eigenvectors
where d is the number of distinct eigenvalues of the matrix. Note that d may be
smaller than N .
• Each eigenvalue λ of a matrix has at least one corresponding eigenvector.
• Eigenvectors with distinct eigenvalues are linearly independent.
133/147
Eigendecomposition
An N × N matrix A with N linearly independent eigenvectors q1, . . . , qN can be factorized as
A = E D E^{−1}
where
• E is an N × N matrix whose i-th column is the eigenvector qi of A
• D a diagonal matrix whose diagonal elements Dii = λi is the eigenvalue
corresponding to qi .
The eigendecomposition above can also be written E^{−1} A E = D. Also, the eigenvectors
can be chosen so that they are normalized to have length one.
Special case: for every real symmetric matrix, the eigenvalues are real numbers and the
eigenvectors can be chosen so that they are mutually orthogonal.
• The decomposition becomes A = E D E^T with E orthogonal.
• Note that orthogonality implies linear independence (but not vice versa).
134/147
Eigenvectors and eigenvalues for PCA
135/147
PCA algorithm
136/147
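A minimal NumPy sketch of the PCA procedure described above: center the data, eigendecompose the covariance matrix, and project onto the leading eigenvectors. The toy data and the choice of retaining a single component are arbitrary illustrations.

import numpy as np

def pca(X, n_components):
    """X: (N, d) data matrix. Returns projected data, the components, and the mean."""
    mean = X.mean(axis=0)
    Xc = X - mean                                   # center the data
    cov = np.cov(Xc, rowvar=False)                  # d x d covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)          # symmetric matrix -> real, orthogonal eigenvectors
    order = np.argsort(eigvals)[::-1]               # sort directions by decreasing variance
    components = eigvecs[:, order[:n_components]]   # leading eigenvectors as columns
    return Xc @ components, components, mean

rng = np.random.default_rng(0)
t = rng.normal(size=100)
X = np.column_stack([t, 2 * t + 0.1 * rng.normal(size=100)])   # nearly 1-D data embedded in 2-D
Z, components, mean = pca(X, n_components=1)
X_reconstructed = Z @ components.T + mean           # reconstruction from the first component only
print(components.T, np.abs(X - X_reconstructed).max())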
Example
Original 2D data, and the reconstruction after PCA using only the first component.
137/147
Kernel PCA
• Problem with PCA: it assumes the directions of all variations are straight lines (often not
true).
• Idea: use a possibly nonlinear feature map Φ(·) via a kernel. The steps are:
• Choose a kernel and apply it to all pairs of data points to get the matrix K of inner products
between pairs of points in the transformed space.
• Compute the eigenvalues and eigenvectors of K.
• Normalize the eigenvectors by the square root of the eigenvalues.
• Retain the eigenvectors corresponding to the largest eigenvalues.
138/147
K-Means Clustering
Unsupervised learning
139/147
Error criterion for unsupervised learning
• Unsupervised learning cannot use an external error criterion that relies on targets or
other outside information.
• Need to find internal measure
• Measure must be task-independent (otherwise, algorithm must be changed every time
task is changed).
• Note: error criterion in supervised learning is task-specific due to the targets provided.
• Which internal measure?
• Distance between inputs: small means similar, hence should be in the same cluster.
140/147
K-means: Overview
141/147
K-means algorithm
Initialization: choose a value of K, choose K random positions in the input space, and assign
the cluster centers µk, k = 1, . . . , K, to those positions.
Learning:
Repeat until cluster centers stop moving:
• For each datapoint xn , compute distances to each cluster center and assign it to the
nearest cluster center.
• For each cluster Ck, move the position of its center µk to the mean of the points in that
cluster:
µk = (1/Nk) Σ_{x∈Ck} x
where Nk is the number of points currently assigned to Ck.
Testing/Usage: for each testing point, compute its distance to each cluster center and
assign it to the nearest cluster center.
142/147
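A minimal NumPy sketch of the K-means learning loop above; random initialization from the data points and the exact stopping test are implementation choices, not prescribed by the slides, and the toy data is arbitrary.

import numpy as np

def kmeans(X, K, max_iters=100, seed=0):
    """X: (N, d) data. Returns cluster centers (K, d) and assignments (N,)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=K, replace=False)]   # initial centers drawn from the data
    assign = np.zeros(len(X), dtype=int)
    for _ in range(max_iters):
        # Assign each point to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # Move each center to the mean of its assigned points.
        new_centers = np.array([X[assign == k].mean(axis=0) if np.any(assign == k)
                                else centers[k] for k in range(K)])
        if np.allclose(new_centers, centers):                 # centers stopped moving
            break
        centers = new_centers
    return centers, assign

X = np.vstack([np.random.default_rng(1).normal(0.0, 0.1, size=(20, 2)),
               np.random.default_rng(2).normal(1.0, 0.1, size=(20, 2))])
centers, assign = kmeans(X, K=2)
print(centers)    # roughly one center near (0, 0) and one near (1, 1)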
K-means as optimization
143/147
Illustration
144/147
Computational consideration
145/147
K-medoids algorithm
146/147
Partitioning around medoids (PAM) algorithm
Training:
Repeat until there is no change in assignment:
147/147