
Machine Learning - A Crash Course

Adila Krisnadhi
[email protected]
December 12, 2020
Faculty of Computer Science
Universitas Indonesia
Acknowledgements

• A longer version of these slides was given in the Machine Learning course taught by
me, Ari Saptawijaya, and Raja Oktovin at the Faculty of Computer Science,
Universitas Indonesia.
• Parts of these slides are taken from Pedro Domingos’ slides (Introduction and
Inductive Learning) at
https://courses.cs.washington.edu/courses/cse446/15sp/
• Some figures are taken from Marsland’s “Machine Learning: An Algorithmic
Perspective” (MLAP) and Bishop’s “Pattern Recognition and Machine Learning”
(PRML).

1/147
Outline

What is Machine Learning?

Supervised Learning

Decision Trees

Linear Regression

Logistic Regression

Naive Bayes Classifier

Support Vector Machine

2/147
Outline (cont.)
Nearest Neighbor Methods

Dimensionality Reduction: Principal Component Analysis

K-Means Clustering

3/147
What is Machine Learning?
Artificial Intelligence

Image tweeted by Shawn Riley (@CyberIntelSME) Feb 11, 2019, https://twitter.com/CyberIntelSME/status/1094962204669964288

4/147
What is machine learning?

• Learning is one of the characteristics of intelligence.


• Important parts of learning:
• remembering
• adapting
• generalizing
• Machine learning methods are subsymbolic because no symbols or symbolic
manipulation is involved, as opposed to symbolic methods in AI such as reasoning
and logical deduction.
• Machine learning is about automating automation: making computers modify or adapt
their actions (e.g., making predictions, controlling a robot, etc.) by capturing patterns
in the data so that these actions are as “good” as possible.
• goodness: how well the chosen actions reflect the most appropriate ones.

5/147
Traditional programming vs. machine learning

Traditional programming

Machine learning

6/147
Key elements of machine learning

7/147
Key elements of ML: Representation

Representation refers to the type of model under consideration. It can be understood as
the set/space/landscape of all possible models with respect to a given formal language.

• Different types of model mean different representation.


• 3-layer feedforward neural networks form one type, while support vector machines
form another.
• Examples: decision trees, rules/logic programs, instances, graphical models
(Bayes/Markov nets), neural networks, SVM, model ensembles, etc.

8/147
Key elements of ML: Evaluation

Evaluation concerns judgement or preference of one model over the others.


• Expressed in terms of some metrics such as utility function, loss function, scoring
function, or fitness function.
• Examples: accuracy, precision and recall, squared error, likelihood, posterior
probability, cost/utility, margin, entropy, etc.
• For each given model, evaluation gives the height of the landscape.
• Which height is preferred depends on the evaluation functions, e.g., using squared
error means going to lower areas, while likelihood may mean going to higher areas.

9/147
Key elements of ML: Optimization

Optimization refers to the strategy to satisfy the preference.

• “how do you get to the most preferable area”, i.e., search the space of represented
models to obtain better evaluations
• Examples:
• Combinatorial optimization, e.g., greedy search, genetic algorithm
• Convex optimization, e.g., gradient descent
• Constrained optimization, e.g., linear programming, quadratic programming
• Study theory of optimization for deeper discussion.

10/147
Types of Machine Learning

Question: how does a machine know that it is getting better at performing its task?
Different answers lead to different categories below:

• Supervised (inductive) learning.


• Given training examples, correct responses (targets) are given to the algorithm.
• The algorithm generalizes to respond correctly to all possible inputs, i.e., learning from
examples.
• Unsupervised learning.
• Given training examples, no correct response is given to the algorithm.
• The algorithm identifies similarities between inputs so that inputs that have something in
common are categorized/clustered together.
• Reinforcement learning.
• Given training examples, the algorithm gets told when the answer is wrong, but not how to
correct it, i.e., learning with a critic.
• The algorithm explores different possibilities until it knows how to give a correct answer.

11/147
Learning problems

Supervised learning problems:

• Classification aims to assign each input to one of a finite number of categories (target values
are discrete).
• Regression aims to assign each input to a value in a continuous set of possible target values
• Probability estimation is a special case of regression where target values range between 0
and 1 and represent probability values.

Unsupervised learning problems:

• Clustering aims to discover groups of similar examples within the input space.
• Density estimation aims to determine the distribution of data within the input space.
• Projection/dimensionality reduction aims to obtain a representation of data in a dimension
different (typically lower) than its original dimension.

Reinforcement learning problem:

• Credit assignment aims to determine a way to reward (or punish) every action the algorithm
provides so that at the end of the action sequence, it arrives at the best/correct answer.

12/147
Machine learning applications

• Web search: present which web pages you’ll likely click on, based on your past browsing
activities.
• Computational biology: predict a drug design (e.g., how likely the design will give a certain
pharmacological effect) based on various properties of the constituting molecules.
• Commerce/Finance: predict whether an individual’s credit risk is high (so as not to approve
their credit card application) based on their profile information.
• Space exploration/astronomy: predict, from a picture of the surface of planet Venus, whether it
has a volcano.
• Robotics: self-driving car – predict whether it should accelerate or brake based on the data
about its surroundings from its sensors.
• Social networks: predict if you’ll likely send a friend request to an individual based on your
social network profile and the activities you do in the social network application.
• Debugging: based on program traces and how people have debugged in the past,
predict the most likely locations of bugs.
• Your example?

13/147
Machine learning pipeline

1. Data collection.
• Need to understand domain, prior knowledge, and goals
• Data may not exist.
• Large amount of relevant data exist, but are hard to collect.
• Supervised learning also requires target data.
• More data is better, preferably without too much noise. But, large dataset means
increased computational cost.
2. Data preparation
• May include data integration, selection, cleaning, preprocessing, etc.
• Preprocessing may also involve feature selection, extraction, or engineering.
• Not all features are useful for the machine learning task at hand. Some may be redundant
or irrelevant.
• Collecting features should not be costly. Features should be robust to noise.
• Some machine learning representation may not even need feature engineering.
3. Algorithm choice.

14/147
Machine learning pipeline (cont.)

4. Hyperparameter and model selection.


• In many algorithms, parameters have to be set manually or require experiments to
determine their appropriate values.
5. Training. Learn the models
6. Evaluation.
• The trained model should be evaluated on data it was not trained on.
• May involve comparison with human experts and the selection of appropriate metrics for
such comparison.
7. Model deployment on the actual system.
8. Loop if necessary.

15/147
Supervised Learning
Supervised learning

• Input x is a vector of values called features and the


length of x is called the dimension
• Output y is a target variable/vector
• x comes from the space X of input values, while y
comes from the space Y of output values.
• We are given a training set consisting of finitely
many pairs input-output variables (x(i) , y (i) ),
i = 1, . . . , m, of training examples such that
y (i) = f (x(i) ) for some unknown target function f .
• Aim is to find a function h : X → Y such that h is a
good approximation of f . The function h is called a
hypothesis

16/147
Example Applications

• Credit risk assessment


x: properties of customer and proposed purchase
f (x): approve purchase or not
• Disease diagnosis
x: properties of patient (symptoms, lab tests)
f (x): disease (or maybe, a recommended therapy)
• Face recognition
x: bitmap picture of person’s face
f (x): name of the person
• Automatic steering
x: bitmap picture of road surface in front of a car
f (x): degrees to turn the steering wheel

17/147
When is supervised learning appropriate?

1. ... when there is no human expert


• i.e., when nobody knows how to do it
• machine has a lot of power that human brains don’t possess
• Example: Given the bond graph for a new molecule, predict its binding strength to the AIDS
protease molecule; strong binding can prevent the protease from cleaving, hence preventing
maturation of HIV virions into their infectious form.
2. ... when humans can perform the task but are unable to describe how to do it
• i.e., we know how to do it, but don’t know how to program it.
• Example: Given a bitmap picture of hand-written character, predict the ASCII code of the
character.
• Example: learning how to ride a bike.
• Ideal for supervised learning: we can generate the supervision by having people do it;
e.g., in the car driving task, simply ask someone to drive and record both the road during
driving (video) and the degree the steering wheel turns at each moment.

18/147
When is supervised learning appropriate? (cont.)

3. ... when the desired function is changing frequently


• We know how to do it and can program it, but it is too expensive to program it because the
program has to change very frequently.
• Example: given information about stock prices and trades in the last 10 days, predict
good stock transactions to make.
4. ... when every user requires a customized function
• The target function doesn’t change (i.e., stable) for a user, but different users need
different functions.
• Example: given an incoming mail message, predict how important the mail is for presenting
it to the user (or deleting it without presenting).

19/147
Example

Suppose you are asked to predict an unknown Boolean function f (output y is 0 or 1)


from inputs consisting of 4 Boolean features (x1 , x2 , x3 , x4 ). That is, y = f (x1 , x2 , x3 , x4 ).
• The Boolean function f is called concept and our task here is concept learning. Input
x for which f (x) = 1 is called a positive instance, and if f (x) = 0, x is called a
negative instance.
• Concept is a special case of classifier. A classifier is a discrete-valued function and
the possible target values f (x) ∈ {1, . . . , K} are the classes or class labels.
• Given examples below, how would you induce an approximation of f so that when
new examples of (x1 , x2 , x3 , x4 ) come along in the future the approximation can also
give you the correct result according to the new examples?
Example  x1  x2  x3  x4  y
1        0   0   1   0   0
2        0   1   0   0   0
3        0   0   1   1   1
4        1   0   0   1   1
5        0   1   1   0   0
6        1   1   0   0   0
7        0   1   0   1   0

20/147
Hypothesis spaces

• Hypothesis space: the space of all hypotheses that can, in principle, be produced by a learning algorithm.
• How many Boolean functions are possible in total? 2^16 = 65536 (how?)
• Version space: the space of all hypotheses that have not yet been ruled out by a training example.
• If 7 examples are known, how many Boolean functions remain possible? Still 2^9 possibilities.
• In machine learning, we’re dealing with a double-exponential problem: if you have n features, you need to search for a correct function from 2^(2^n) possibilities.
• If we’re in a state of complete ignorance, i.e., we don’t know which function is correct until we know every possible input-output pair, can you do much better than random guessing?
• Lesson learned: bias-free induction (i.e., complete ignorance) has no rational basis to classify any unseen example.

x1  x2  x3  x4  y
0   0   0   0   ?
0   0   0   1   ?
0   0   1   0   0
0   0   1   1   1
0   1   0   0   0
0   1   0   1   0
0   1   1   0   0
0   1   1   1   ?
1   0   0   0   ?
1   0   0   1   1
1   0   1   0   ?
1   0   1   1   ?
1   1   0   0   0
1   1   0   1   ?
1   1   1   0   ?
1   1   1   1   ?
21/147
Hypothesis spaces

So, maybe introduce some bias by choosing a language for representation,
e.g., (conjunction over “true” features) ⇔ output.

• How many hypotheses, i.e., conjunctive rules, are possible? 16.

No hypothesis fits the whole of the given data!

Rule                          Counterexample no.
⇔ y                           3, 4
x1 ⇔ y                        3, 6
x2 ⇔ y                        2, 3, 4, 5, 6, 7
x3 ⇔ y                        1, 4, 5
x4 ⇔ y                        7
x1 ∧ x2 ⇔ y                   3, 4, 6
x1 ∧ x3 ⇔ y                   3, 4
x1 ∧ x4 ⇔ y                   3
x2 ∧ x3 ⇔ y                   3, 4, 5
x2 ∧ x4 ⇔ y                   3, 4, 7
x3 ∧ x4 ⇔ y                   4
x1 ∧ x2 ∧ x3 ⇔ y              3, 4
x1 ∧ x2 ∧ x4 ⇔ y              3, 4
x1 ∧ x3 ∧ x4 ⇔ y              3, 4
x2 ∧ x3 ∧ x4 ⇔ y              3, 4
x1 ∧ x2 ∧ x3 ∧ x4 ⇔ y         3, 4

22/147
Hypothesis spaces

Try m-of-n rules, e.g., for features {x1, x2, x3}:

• 1-of: if at least one of x1, x2, x3 is true then y is true, else y is false.
• 2-of: if at least two of x1, x2, x3 are true then y is true, else y is false.

How many hypotheses are there? 32.

• Aha! We found one hypothesis that fits all the given data.
• But, does that hypothesis fit the unseen examples (not in the given data)?

                     Counterexample no.
Variables            1-of           2-of      3-of  4-of
{x1}                 3,6            -         -     -
{x2}                 2,3,4,5,6,7    -         -     -
{x3}                 1,4,5          -         -     -
{x4}                 7              -         -     -
{x1,x2}              2,3,5,6,7      3,4,6     -     -
{x1,x3}              1,5,6          3,4       -     -
{x1,x4}              6,7            3         -     -
{x2,x3}              1,2,4,5,6,7    3,4,5     -     -
{x2,x4}              2,5,6,7        3,4,7     -     -
{x3,x4}              1,5,7          4         -     -
{x1,x2,x3}           1,2,5,6,7      3,4,5,6   3,4   -
{x1,x2,x4}           2,5,6,7        3,6,7     3,4   -
{x1,x3,x4}           1,5,6,7        ***       3,4   -
{x2,x3,x4}           1,2,5,6,7      4,5,7     3,4   -
{x1,x2,x3,x4}        1,2,5,6,7      5,6,7     3,4   3,4

(*** marks the hypothesis with no counterexamples: 2-of {x1, x3, x4}.)

23/147
Two observations about learning

• Learning = removal of the remaining uncertainty.


• If we “already” knew that the unknown function is some m-of-n function, then the training
examples can be used to infer which function it is.
• But, our prior knowledge could be wrong: what are the odds that some m-of-n function is
actually the unknown function?
• Learning requires guessing a good, small hypothesis class.
• Our guess could be wrong: good guess often depends on domain knowledge or intuition.
• Smaller hypothesis class: easier to learn, but less likely to contain the correct hypothesis.
• Start with a small hypothesis class, then enlarge it until it contains the correct hypothesis
(e.g., this is how decision tree learning works).

Note that the following hypotheses are also consistent with the training data:

• x4 ∧ 1-of{x1 , x3 } ⇒ y
• x4 ∧ ¬x2 ⇒ y

If either of these is the unknown function, while our guess is the m-of-n one, what will
happen when new x values are given?
24/147
So, what are the strategies?

• Develop languages for expressing prior knowledge


• Guiding criteria: which representation makes it easiest to encode prior knowledge?
• E.g., should we use rules or stochastic models?
• Develop flexible hypothesis spaces.
• E.g., decision trees learning

In either case, machine learning is about developing algorithms for finding a hypothesis
that fits the data.

25/147
Key issues

• What are good hypothesis spaces? Which ones have been useful in practice and
why?
• What algorithms can work in these spaces? Are there general design principles?
• How do we optimize accuracy on future/unseen data points? Also called the
overfitting problem. This distinguishes machine learning from the usual theory of
optimization.
• How can we have confidence in the results? The statistical question: how much
training data is needed to find accurate hypotheses?
• Are some learning problems computationally intractable? The computational
question: does my learning algorithm scale?
• How do we formulate application problems as machine learning problems?
(The engineering question)

26/147
Decision Trees
Example

• if there is party, go to party


• if there is no party and deadline
is urgent, then study
• if there is no party and no
deadline, then go to mall
• if there is no party and deadline
is near and you’re lazy, then
watch TV
• if there is no party and deadline
is near and you’re not lazy,
then study

Easy to understand (i.e., transparent), since it can be turned into a set of if-then rules

27/147
How to create decision tree: Basic principle

Basic principle (followed by most decision tree algorithms)


Build tree in a greedy manner, starting at the root, choosing the most informative feature
at each step.

In decision tree literature, features are often called (input) attributes. Example algorithms:

• Quinlan’s ID3
• C4.5
• Classification and regression trees (CART)

28/147
Quantifying information content: Entropy

• From Claude Shannon’s 1948 paper: “A Mathematical Theory of Communication”


• Information entropy describes the amount of impurity in a set of attributes.
• Given a random variable x, the amount of information received when a value of this
variable is observed can be seen as the “degree of surprise” on learning it.
• Suppose a sender wishes to transmit the value of random variable x to a receiver.
The entropy is the average amount of information transmitted, which is the
expectation w.r.t. p(x) (where we take p(x) log2 p(x) = 0 whenever p(x) = 0):

Entropy[x] = − Σ_x p(x) log2 p(x)

• Let S be a set of labeled examples whose labels come from the set
L = {a1, a2, ..., aC}; then the entropy of S is

Entropy(S) = − Σ_{x∈L} P(x) log2 P(x)

where P (x) is the proportion of S labeled as x.

29/147
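To make the formula concrete, here is a minimal Python sketch (not part of the original slides; the function name entropy and the use of collections.Counter are our own choices):

import math
from collections import Counter

def entropy(labels):
    # Entropy(S) = - sum over labels x of P(x) * log2(P(x)),
    # where P(x) is the proportion of S labeled as x.
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

# The weather data on the next slide has 9 "yes" and 5 "no" labels:
print(entropy(["yes"] * 9 + ["no"] * 5))  # ~0.94029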
Example

Given the following weather data S, decide whether to play outdoors.

outlook   temperature  humidity  windy  play
rainy     cool         normal    false  yes
rainy     cool         normal    true   no
overcast  cool         normal    true   yes
sunny     cool         normal    false  yes
rainy     mild         high      false  yes
sunny     mild         high      false  no
rainy     mild         normal    false  yes
sunny     mild         normal    true   yes
overcast  mild         high      true   yes
rainy     mild         high      true   no
sunny     hot          high      false  no
sunny     hot          high      true   no
overcast  hot          high      false  yes
overcast  hot          normal    false  yes

Entropy(S) = − (9/14) log2 (9/14) − (5/14) log2 (5/14) = 0.94029

30/147
From entropy to information gain

• Information gain: how much the entropy of the training set decreases if a particular
attribute is chosen for the next classification step.
• Let S be a set of examples, F be an attribute of S, v(F ) the set of possible values of
F , and Sf the subset of S containing all examples in S for which the value of F is f .
Also, |S| denotes the size of S and Entropy(S) the entropy of the set S.
• Average entropy of a split based on F w.r.t. a set S of examples:

I(S, F) = Σ_{f ∈ v(F)} (|S_f| / |S|) · Entropy(S_f)

• Information gain G(S, F) of S when attribute F is chosen is defined by

G(S, F) = Entropy(S) − I(S, F)

31/147
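A small Python sketch of I(S, F) and G(S, F), continuing from the entropy function above; representing examples as dicts and the attribute/label argument names are our own assumptions:

def information_gain(examples, attr, label):
    # G(S, F) = Entropy(S) - sum over f in v(F) of |S_f|/|S| * Entropy(S_f)
    n = len(examples)
    total = entropy([e[label] for e in examples])
    split = 0.0
    for f in set(e[attr] for e in examples):
        subset = [e[label] for e in examples if e[attr] == f]
        split += len(subset) / n * entropy(subset)  # weighted entropy I(S, F)
    return total - split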
The ID3 algorithm

Let I be a set of input attributes, C the set of all labels, and S the set of labeled examples.

Function ID3(I, C, S):

• If S is empty, return a single node with value “Failure”


• If all examples s ∈ S have the same label o, then return a single node with value o
• If I is empty, then return a single node with value the most frequent label in S.
• Compute information gain G(S, F ) for each attribute F ∈ I w.r.t. S
• Let F̂ be attribute with the largest gain
• Let f1 , . . . , fm be possible values for attribute F̂
• Let Sf with f ∈ {f1 , . . . , fm } be subsets of S when S is partitioned according to the
value of F̂ .
• Return a tree with root node labeled F̂ and edges labeled f1, ..., fm, where the edges
go to the m trees obtained from ID3(I − {F̂}, C, Sf1), ID3(I − {F̂}, C, Sf2), ...,
ID3(I − {F̂}, C, Sfm).

32/147
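A compact Python sketch along the lines of the pseudocode above (again our own naming; it reuses information_gain from the previous sketch, represents a tree as nested dicts {attribute: {value: subtree}}, and folds the label set C into the data):

from collections import Counter

def id3(attrs, examples, label):
    if not examples:
        return "Failure"
    labels = [e[label] for e in examples]
    if len(set(labels)) == 1:            # all examples share one label
        return labels[0]
    if not attrs:                        # no attributes left: majority label
        return Counter(labels).most_common(1)[0][0]
    # Choose the attribute F with the largest information gain.
    best = max(attrs, key=lambda a: information_gain(examples, a, label))
    rest = [a for a in attrs if a != best]
    return {best: {f: id3(rest, [e for e in examples if e[best] == f], label)
                   for f in set(e[best] for e in examples)}}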
Example

Construct a decision tree using the ID3 algorithm on the following weather data to decide whether to
play outdoors. Should we play outdoors if the outlook is overcast, the temperature is cool, the humidity
is high, and it is windy out there?

outlook   temperature  humidity  windy  play
rainy     cool         normal    false  yes
rainy     cool         normal    true   no
overcast  cool         normal    true   yes
sunny     cool         normal    false  yes
rainy     mild         high      false  yes
sunny     mild         high      false  no
rainy     mild         normal    false  yes
sunny     mild         normal    true   yes
overcast  mild         high      true   yes
rainy     mild         high      true   no
sunny     hot          high      false  no
sunny     hot          high      true   no
overcast  hot          high      false  yes
overcast  hot          normal    false  yes

33/147
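One way to run the sketch above on this table (the classify helper is hypothetical, not from the slides). Since all four overcast examples are labeled yes, the overcast branch of the induced tree is pure, so the query comes out yes:

rows = [("rainy","cool","normal","false","yes"), ("rainy","cool","normal","true","no"),
        ("overcast","cool","normal","true","yes"), ("sunny","cool","normal","false","yes"),
        ("rainy","mild","high","false","yes"), ("sunny","mild","high","false","no"),
        ("rainy","mild","normal","false","yes"), ("sunny","mild","normal","true","yes"),
        ("overcast","mild","high","true","yes"), ("rainy","mild","high","true","no"),
        ("sunny","hot","high","false","no"), ("sunny","hot","high","true","no"),
        ("overcast","hot","high","false","yes"), ("overcast","hot","normal","false","yes")]
cols = ["outlook", "temperature", "humidity", "windy", "play"]
data = [dict(zip(cols, r)) for r in rows]
tree = id3(cols[:-1], data, "play")

def classify(tree, x):
    # Walk the nested {attribute: {value: subtree}} dicts down to a leaf label.
    while isinstance(tree, dict):
        attr = next(iter(tree))
        tree = tree[attr][x[attr]]
    return tree

print(classify(tree, {"outlook": "overcast", "temperature": "cool",
                      "humidity": "high", "windy": "true"}))  # yes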
Issues in Decision Tree Learning

• Overfitting: pre-pruning or post-pruning of tree


• Random forest and XGBoost.
• Highly-branching attributes
• Numeric (continuous-valued) attributes
• Unknown attribute values

34/147
Linear Regression
Motivation: House pricing

Area (m²)  Price (1000 USD)   Area (m²)  Price (1000 USD)   Area (m²)  Price (1000 USD)
158.9 208.5 79.3 132 122.4 40
117.2 181.5 93.3 149 114.1 149.35
165.9 223.5 120.4 90 114.6 179.9
159.5 140 103.5 159 157.9 165.5
204.2 250 124.4 139 145.0 277.5
126.5 143 220.7 325.3 227.8 309
157.4 307 102.9 139.4 101.9 145
194.2 200 166.8 230 120.5 153
164.8 129.9 98.5 129.9 98.2 109
100.1 118 98.5 154 107.0 82
96.6 129.5 148.6 256.3 123.0 160
215.9 345 83.6 134.8 123.4 170
84.7 144 158.3 306 82.1 144
138.8 279.5 148.6 207.5 87.1 130.25
116.4 157 48.3 68.5 106.8 141

• Task: based on the above data, predict the house price given its size/area.

35/147
Regression problem

• Is the previous task a supervised learning problem?


• Yes, since we have (input, output) pairs as the given examples.
• Can we use a decision tree to learn the prediction?
• The target value can be any (real) number, hence the task is regression.
• The usual decision tree may be inappropriate.

36/147
Let’s look at the data ...

37/147
Hypothesis

Which representation gives a reasonable hypothesis? Straight lines? Which one?

38/147
Two hypotheses and errors at each example

Dashed/dotted lines represent the magnitude of error at each example.

39/147
Linear regression

• For n features, the hypotheses h are linear hyperplanes of dimension n within an (n + 1)-dimensional
space (if n = 1, it’s a straight line):

h(x1, ..., xn) = w0 + w1 x1 + · · · + wn xn

where the xi’s are the inputs, the wi’s are the parameters or weights of the hypothesis,
and w0 is called the intercept term.
• Linear regression aims to learn the parameters wi in the above linear form.
• Using vector notation (see later), let x = (x0, x1, ..., xn)^T with x0 = 1, and
w = (w0, w1, ..., wn)^T. Then, the hypothesis can be written as a matrix operation:

h_w(x) = Σ_{i=0}^{n} wi xi = w^T x

40/147
Gradient descent

• How do we learn w? Choose a linear hyperplane closest to all given data points
• Particular parameters w correspond to a particular linear hyperplane.
• Need a cost function J(w) of w that measures how close the linear hyperplane to all data,
i.e., J(w) is minimal when w makes the linear hyperplane closest to the data.
• We assume J is a paraboloid, convex function: with suitable axes, J always has a
global minimum.
• Finding linear hyperplane closest to given data points ≡ finding minima of J.

Left, middle: convex paraboloids. Right: non-convex paraboloid.


Image source: Ag2gaeh (https://commons.wikimedia.org/wiki/File:Parabol-el-zy-hy-s.svg), https://creativecommons.org/licenses/by-sa/4.0/legalcode

41/147
Deriving gradient descent rule

• For simplicity, let w = (w0, w1), i.e., J(w) = J(w0, w1) is a convex paraboloid in 3D whose
vertical axis is parallel to the z-axis.
• Let A = (w0, w1, J(w0, w1)) be any point on the paraboloid. Which direction should we move
from A so that we get a maximum decrease in J?
• Let O be the origin point and B = (w0′, w1′, J(w0′, w1′)) the destination of our move.
• Let P = (w0, w1) and Q = (w0′, w1′) be, resp., the projections of A and B on the xy-plane.
• The vector PQ→ is a projection of the vector of our move, and moving from A to B on the
paraboloid is equivalent to moving from P to Q in the xy-plane.
• The largest decrease of J happens in the direction of −∇J.
• Thus, we choose Q in the direction of −∇J:

OQ→ = OP→ + PQ→  ⇔  (w0′, w1′)^T = (w0, w1)^T − δ ∇J/‖∇J‖

where δ is a step size (convergence is guaranteed if δ is sufficiently small).
• −∇J/‖∇J‖ is the unit vector in the direction opposite to ∇J (we only care about the direction).
• PQ→ is a small shift in the direction of −∇J, hence a multiple of that unit vector.

42/147
Gradient descent rule

The previous equation embodies the gradient descent algorithm, often written:

w := w − α∇J(w)

or, in the case of two variables:

w0 := w0 − α ∂J(w0, w1)/∂w0
w1 := w1 − α ∂J(w0, w1)/∂w1

Here, α is called the learning rate and like δ, it is enough if we choose α to be sufficiently
small.

43/147
Gradient descent for linear regression

• Gradient descent for linear regression uses the sum of squares as the cost
function over the m training examples (x^(i), y^(i)), i = 1, ..., m:

J(w) = (1/2) Σ_{i=1}^{m} (h_w(x^(i)) − y^(i))²

• Why the quadratic form? It’s a convex paraboloid.
• Why multiply by 1/2? Just for convenience in the derivative operation later :-).
• The sum of squares cost function leads to the ordinary least squares regression model.
• The aim becomes finding w such that J(w) is minimal!

44/147
Concrete update rule for linear regression

Parameter update in gradient descent requires computing ∇J. We can derive a concrete
update rule such that the ∇J expression no longer appears.

Let m be the number of training examples. We compute each partial derivative
component of ∇J. For the component corresponding to weight wj:

∂J(w)/∂wj = ∂/∂wj (1/2) Σ_{i=1}^{m} (h_w(x^(i)) − y^(i))²
          = Σ_{i=1}^{m} ∂/∂wj (1/2) (h_w(x^(i)) − y^(i))²
          = Σ_{i=1}^{m} 2 · (1/2) (h_w(x^(i)) − y^(i)) · ∂/∂wj (h_w(x^(i)) − y^(i))     (by chain rule)
          = Σ_{i=1}^{m} (h_w(x^(i)) − y^(i)) · ∂/∂wj (h_w(x^(i)) − y^(i))
          = Σ_{i=1}^{m} (h_w(x^(i)) − y^(i)) · ∂/∂wj (Σ_{k=0}^{n} wk x_k^(i) − y^(i))
          = Σ_{i=1}^{m} (h_w(x^(i)) − y^(i)) x_j^(i)

46/147
Batch gradient descent

Applying the previous partial derivative to the gradient descent algorithm, we obtain for
each parameter wj:

wj := wj − α ∂J(w)/∂wj   ⇒   wj := wj + α Σ_{i=1}^{m} (y^(i) − h_w(x^(i))) x_j^(i)

The previous partial derivative allows us to derive the following batch gradient descent
algorithm:

1. Start with an initial value of w = (w0, ..., wn)^T.
2. Repeat until convergence:
   • For every j: wj := wj + α Σ_{i=1}^{m} (y^(i) − h_w(x^(i))) x_j^(i)

47/147
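A minimal NumPy sketch of batch gradient descent for least-squares linear regression (the function name, step size, iteration count, and toy data are our own choices):

import numpy as np

def batch_gradient_descent(X, y, alpha=0.01, iters=1000):
    # X is m x (n+1) with a leading column of ones (x0 = 1).
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        # Vectorized form of: for every j,
        #   w_j := w_j + alpha * sum_i (y^(i) - h_w(x^(i))) * x_j^(i)
        w += alpha * X.T @ (y - X @ w)
    return w

# Toy usage on data from the line y = 1 + 2x:
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])
print(batch_gradient_descent(X, y))  # approaches [1. 2.]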
Gradient descent contour

The ellipses are the contours of the quadratic function J. Also shown is the trajectory
taken by gradient descent initialized at (48, 30).
Image source: Andrew Ng, CS229 Lecture Notes for CS229 Fall 2018 course, Part I: Linear Regression, page 6.

48/147
Termination condition

Choices for termination condition (convergence):

• Fixed number of iterations: for a fixed step size and under certain conditions, if w^(k) is
w at the k-th iteration and w* is the optimal parameter vector, then J(w^(k)) − J(w*) ≤ ε
after O(1/ε) iterations.
• ‖w′ − w‖ (change in parameter values) or |J(w′) − J(w)| (change in error value) is
small enough, where w′ is w after a single update.
• ‖∇J(w)‖ is small enough: the closer we are to the minimum, the smaller the gradient is.

49/147
Stochastic/incremental gradient descent

• Batch gradient descent must scan through the entire training set before making a
single update.
• May be expensive if the size of training set is very large.
• Alternative: stochastic gradient descent that performs update to the parameters
according to the gradient of the error/loss at each single example.

1. Start with an initial value of w = (w0 , . . . , wn )T .


2. Loop until convergence:
   • Loop over every training example x^(i):
     • For every j: wj := wj + α (y^(i) − h_w(x^(i))) x_j^(i)

• SGD may get close to the minimum faster than batch gradient descent, but may never
converge to the minimum (the parameters keep oscillating around it).

50/147
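For comparison, a stochastic variant of the sketch above, updating after every single example (reusing numpy and the toy X, y from the previous sketch; the epoch count is our own choice):

def stochastic_gradient_descent(X, y, alpha=0.01, epochs=2000):
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            # Update from the gradient of the loss at a single example.
            w += alpha * (y_i - x_i @ w) * x_i
    return w

print(stochastic_gradient_descent(X, y))  # also approaches [1. 2.]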
Closed form solution to linear regression

• Besides using an iterative algorithm, we can find a closed-form solution for the
optimal weight w.
• Most regression problems (esp. nonlinear ones) don’t have a closed-form solution.
• Let X be the design matrix containing all training examples as its rows and y the
vector containing all target values from the training set:

X = [ (x^(1))^T ; (x^(2))^T ; ... ; (x^(m))^T ]        y = (y^(1), y^(2), ..., y^(m))^T

• If J(w) is the sum of squares cost function, then it can be shown that
∇J = X^T X w − X^T y, which is minimized when X^T X w = X^T y.
• Thus, the optimal w is w = (X^T X)^{−1} X^T y.

51/147
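The same toy problem solved in closed form (a sketch, reusing X and y from the gradient descent example; in practice a least-squares solver is preferred over forming the inverse explicitly):

# Normal equations: solve X^T X w = X^T y instead of inverting X^T X.
w_closed = np.linalg.solve(X.T @ X, X.T @ y)
# Equivalently: w_closed, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w_closed)  # exactly [1. 2.] for the toy data above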
Why not just use the closed-form solution?

• If possible, it’s better to use the closed-form solution.


• But, the closed-form solution may be impractical:
• Suppose X is a very large, sparse matrix (e.g., 10^6 examples with 10^5 features), but X^T X is
fairly dense.
• The dimension of X^T X is then 10^5 × 10^5, so we would have to store 10^10 floating-point
numbers (at 8 bytes per number, roughly 80 gigabytes of space), which may be impractical.
• Matrix operations in the closed-form solution may have numerical accuracy issues
(refer to the Numerical Analysis course).

52/147
Logistic Regression
Motivation

Let’s consider a classification problem with target y ∈ {0, 1}

• Linear regression algorithms can be used, but it is quite simple to construct examples
where the algorithm performs very poorly.
• Intuitively, it makes no sense to have target values larger than 1 or smaller than 0.
• Can we still use the same linear form, but ensure the output is neither below 0 nor
above 1?
• Idea: scale the output of the linear form, which is R, into the range [0, 1] using some
function (called the link function).
• Intuition: the weights/parameters transform the input into a point in an intermediate space
and the link function transform that point into a probability value that the input belongs to
the class (given by the label).

53/147
Logistic function

Consider the following continuous function, called the logistic function or sigmoid function:

g(z) = 1 / (1 + e^(−z)) = e^z / (e^z + 1)

What properties are satisfied by g(z)?

• 0 < g(z) < 1 for every z ∈ R. Why? Because e^z > 0 for all z ∈ R.
• g(z) is symmetric with respect to the point (0, 0.5). Why?
For every z ∈ R, g(−z) = e^(−z) / (e^(−z) + 1) = (e^(−z) + 1 − 1) / (e^(−z) + 1) = 1 − 1/(e^(−z) + 1) = 1 − g(z).
So, let h(z) = g(z) − 0.5. Then h(−z) = g(−z) − 0.5 = 1 − g(z) − 0.5 = 0.5 − g(z) = −(g(z) − 0.5) = −h(z),
i.e., h(z) is an odd function, which always has a rotational symmetry with respect to (0, 0). Thus,
g(z) = h(z) + 0.5 must have a rotational symmetry with respect to (0, 0.5).
• g(z) is strictly increasing. Why?
For z1, z2 ∈ R, if z1 < z2, then e^(−z1) > e^(−z2). Thus, g(z1) = 1/(1 + e^(−z1)) < 1/(1 + e^(−z2)) = g(z2).
• The derivative g′(z) of g(z) satisfies the following: g′(z) = g(z)(1 − g(z)). (Prove it!)

54/147
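A quick numerical check of the last property, g′(z) = g(z)(1 − g(z)), via a finite difference (our own sketch, not a proof):

import math

def g(z):
    return 1.0 / (1.0 + math.exp(-z))  # logistic function

z, eps = 0.7, 1e-6
print((g(z + eps) - g(z - eps)) / (2 * eps))  # ~0.2217
print(g(z) * (1 - g(z)))                      # ~0.2217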
Logistic function

Image source: Andrew Ng, CS229 Lecture Notes for CS229 Fall 2018 course, Part II: Classification and Logistic Regression, page 17.

55/147
Logistic regression

• Logistic function is a natural choice (Why? Read the literature on Generalized Linear
Model) for our classification setting (though other functions that smoothly increase
from 0 to 1 can also be used).
• With input x of n features (x0, x1, ..., xn), we have a hypothesis of the form:

h_w(x) = g(w^T x) = 1 / (1 + e^(−w^T x))
• The model parameter is w like in linear regression.
• Classification is then simply done by taking the threshold of 0.5, i.e., x is labeled 1 if
hw (x) > 0.5 and 0 otherwise.

56/147
Likelihood function for logistic regression

How do we learn w? Or which value of w is best supported by the available data, i.e., the
training examples?

• The data is given by the design matrix X (all training inputs as its rows) and the vector y (all
target values in the training set).
• For a particular value of w, the likelihood function LX,y (w) gives the probability (or probability
density) of the data X, y to occur when the parameter of the model is w.
• Our model represents the (probabilistic) relationship between the known input X and target y.
That is, the conditional distribution pw (y|X) of y given X, which is parameterized by w.
• Thus, the likelihood of w given the data: LX,y (w) = pw (y, X) = pw (y|X) where the last
equality is because in our model the input (for which X is a sample) is always known.

Our aim is thus finding the value of w that maximizes the conditional probability above,
i.e., maximizes the likelihood.

• To realize it, we first need to make some assumption about pw (y|X).

57/147
Maximizing likelihood for logistic regression

For logistic regression, we assume that the result of prediction hw (x) is the expected value
of the target variable y given the data X = x, which follows the conditional distribution
pw (y|x). Since the target value can only be 1 or 0 (binary classification), we write:

p_w(y=1|x) = h_w(x)        p_w(y=0|x) = 1 − h_w(x)

or more compactly as a single (mass) function:

p_w(y|x) = (h_w(x))^y (1 − h_w(x))^(1−y)

• This corresponds to our earlier intuition that w brings x to wT x and then the logistic
function brings wT x to the appropriate probability value.

58/147
Maximizing likelihood for logistic regression (2)

• Next, we assume that X and y give m training examples (x^(i), y^(i)), each generated
independently.
• Then, the likelihood L_{X,y}(w) corresponds to the joint distribution over all training
examples (x^(i), y^(i)):

L_{X,y}(w) = Π_{i=1}^{m} p_w(y^(i) | x^(i)) = Π_{i=1}^{m} (h_w(x^(i)))^(y^(i)) (1 − h_w(x^(i)))^(1−y^(i))

• Maximizing the likelihood is equivalent to maximizing the log-likelihood ℓ(w), whose
form is easier for computing derivatives (the log is the natural logarithm):

ℓ(w) = log L(w) = Σ_{i=1}^{m} [ y^(i) log h_w(x^(i)) + (1 − y^(i)) log(1 − h_w(x^(i))) ]

59/147
Maximizing likelihood for logistic regression (3)

For maximization, we employ gradient ascent instead of gradient descent:

w := w + α ∇ℓ(w)

We compute the partial derivative for each weight wj:

∂ℓ(w)/∂wj = ∂/∂wj Σ_{i=1}^{m} [ y^(i) log h_w(x^(i)) + (1 − y^(i)) log(1 − h_w(x^(i))) ]
          = Σ_{i=1}^{m} ∂/∂wj [ y^(i) log h_w(x^(i)) + (1 − y^(i)) log(1 − h_w(x^(i))) ]
          = Σ_{i=1}^{m} ( y^(i)/h_w(x^(i)) − (1 − y^(i))/(1 − h_w(x^(i))) ) · ∂h_w(x^(i))/∂wj     (because d/dx log x = 1/x)
          = Σ_{i=1}^{m} ( y^(i)/g(w^T x^(i)) − (1 − y^(i))/(1 − g(w^T x^(i))) ) · ∂g(w^T x^(i))/∂wj

60/147
Maximizing likelihood for logistic regression (4)

Since g is the logistic function, using the chain rule we obtain:

∂ℓ(w)/∂wj = Σ_{i=1}^{m} ( y^(i)/g(w^T x^(i)) − (1 − y^(i))/(1 − g(w^T x^(i))) ) · g(w^T x^(i))(1 − g(w^T x^(i))) · ∂(w^T x^(i))/∂wj
          = Σ_{i=1}^{m} ( y^(i)(1 − g(w^T x^(i))) − (1 − y^(i)) g(w^T x^(i)) ) x_j^(i)
          = Σ_{i=1}^{m} ( y^(i) − g(w^T x^(i)) ) x_j^(i)
          = Σ_{i=1}^{m} ( y^(i) − h_w(x^(i)) ) x_j^(i)

61/147
Batch & stochastic gradient ascent for logistic regression

The gradient ascent update turns out to be the same as the gradient descent update in
linear regression, except that h_w is no longer linear in x^(i). This is actually not a
coincidence; for a deeper discussion, read up on the Generalized Linear Model.

wj := wj + α Σ_{i=1}^{m} (y^(i) − h_w(x^(i))) x_j^(i)

Thus, analogous to linear regression, we have two versions of gradient ascent.

Batch gradient ascent from an initial value of w = (w0, ..., wn)^T:

• Repeat until convergence:
  • For every j: wj := wj + α Σ_{i=1}^{m} (y^(i) − h_w(x^(i))) x_j^(i)

Stochastic gradient ascent from an initial value of w = (w0, ..., wn)^T:

• Loop until convergence:
  • Loop over every training example x^(i):
    • For every j: wj := wj + α (y^(i) − h_w(x^(i))) x_j^(i)
62/147
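A minimal NumPy sketch of batch gradient ascent for logistic regression (the names, step size, and toy data are our own choices):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_regression(X, y, alpha=0.1, iters=1000):
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        # Vectorized form of: for every j,
        #   w_j := w_j + alpha * sum_i (y^(i) - h_w(x^(i))) * x_j^(i)
        w += alpha * X.T @ (y - sigmoid(X @ w))
    return w

# Toy usage: 1D inputs (with intercept column x0 = 1), labels split around x = 2.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 3.0], [1.0, 4.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
w = logistic_regression(X, y)
print(sigmoid(X @ w) > 0.5)  # [False False  True  True]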
Naive Bayes Classifier
Roles of Bayesian Method

• Provides practical learning algorithms


• Naive Bayes learning
• Bayesian belief network learning
• Combine prior knowledge (prior probabilities) with observed data
• Requires prior probabilities

• Provides useful conceptual framework for evaluating other learning algorithms


• maximum likelihood and least-squared error hypotheses (e.g., in linear regression)
• maximum likelihood for predicting probabilities (e.g., in logistic regression)

63/147
Bayes Theorem

$$P(h \mid D) = \frac{P(D \mid h)\, P(h)}{P(D)}$$

• P(h) = prior probability of hypothesis h
• P(D) = prior probability of training data D
• P(h|D) = probability of h given D
• P(D|h) = probability of D given h

P(h|D):

• increases as P(D|h) and P(h) increase
• decreases as P(D) increases: the more probable it is that D is observed independently of h,

64/147
Naive Bayes Classifier

Along with decision trees, neural networks, nearest neighbor, one of the most practical
learning methods.

When to use:

• Moderate or large training set available


• Attributes that describe instances are conditionally independent given classification

Successful applications:

• Diagnosis
• Classifying text documents

65/147
Naive Bayes Classifier

Assume target function $f : X \to V$, where each instance x is described by attributes
$\langle a_1, a_2, \ldots, a_n \rangle$.

The most probable value of f(x) is:

$$
\begin{aligned}
v_{MAP} &= \arg\max_{v_j \in V} P(v_j \mid a_1, a_2, \ldots, a_n) \\
        &= \arg\max_{v_j \in V} \frac{P(a_1, a_2, \ldots, a_n \mid v_j)\, P(v_j)}{P(a_1, a_2, \ldots, a_n)} \\
        &= \arg\max_{v_j \in V} P(a_1, a_2, \ldots, a_n \mid v_j)\, P(v_j)
\end{aligned}
$$

Naive Bayes assumption:

$$P(a_1, a_2, \ldots, a_n \mid v_j) = \prod_i P(a_i \mid v_j)$$

which gives the

Naive Bayes classifier: $v_{NB} = \arg\max_{v_j \in V} P(v_j) \prod_i P(a_i \mid v_j)$

66/147
Naive Bayes Algorithm

Naive_Bayes_Learn(examples):

  For each target value vj:
    P̂(vj) ← estimate P(vj)

  For each attribute value ai of each attribute a:
    P̂(ai|vj) ← estimate P(ai|vj)

Classify_New_Instance(x):

$$v_{NB} = \arg\max_{v_j \in V} \hat{P}(v_j) \prod_{a_i \in x} \hat{P}(a_i \mid v_j)$$

67/147
Example

Determine what to do in the evening if you have deadlines looming (but none particularly
urgent), there is no party currently going on, and you are currently feeling lazy, based on
the following data.

Deadline? Is there a party? Lazy? Activity


Urgent Yes Yes Party
Urgent No Yes Study
Near Yes Yes Party
None Yes No Party
None No Yes Go to mall
None Yes No Party
Near No No Study
Near No Yes Watch TV
Near Yes Yes Party
Urgent No No Study

68/147
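As a sketch of how the classifier resolves this exercise, the following Python snippet (all
names are our own) estimates the probabilities by simple counting and evaluates $v_{NB}$ for
the query (Near, No, Yes); it predicts watching TV:

from collections import Counter, defaultdict

data = [  # (deadline, party, lazy) -> activity
    ("urgent", "yes", "yes", "party"), ("urgent", "no", "yes", "study"),
    ("near", "yes", "yes", "party"),   ("none", "yes", "no", "party"),
    ("none", "no", "yes", "mall"),     ("none", "yes", "no", "party"),
    ("near", "no", "no", "study"),     ("near", "no", "yes", "tv"),
    ("near", "yes", "yes", "party"),   ("urgent", "no", "no", "study"),
]

# Estimate P(v_j) and P(a_i | v_j) by relative frequencies.
class_counts = Counter(row[-1] for row in data)
cond_counts = defaultdict(Counter)  # cond_counts[(attr_index, value)][class]
for *attrs, label in data:
    for i, a in enumerate(attrs):
        cond_counts[(i, a)][label] += 1

def v_nb(query):
    scores = {}
    for v, n in class_counts.items():
        p = n / len(data)                        # P(v_j)
        for i, a in enumerate(query):
            p *= cond_counts[(i, a)][v] / n      # P(a_i | v_j)
        scores[v] = p
    return max(scores, key=scores.get), scores

print(v_nb(("near", "no", "yes")))  # -> ('tv', ...): watch TV

Note how "Party" scores zero here because no party is going on yet every Party example
has Party = yes; the next slide discusses this zero-count problem.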
Naive Bayes: Subtleties

1. The conditional independence assumption is often violated

   $$P(a_1, a_2, \ldots, a_n \mid v_j) = \prod_i P(a_i \mid v_j)$$

   ...but it works surprisingly well anyway.

2. What if none of the training instances with target value vj have attribute value ai?
   Then
   $$\hat{P}(a_i \mid v_j) = 0, \quad \text{and} \quad \hat{P}(v_j) \prod_i \hat{P}(a_i \mid v_j) = 0$$

   The typical solution is a Bayesian estimate for $\hat{P}(a_i \mid v_j)$:

   $$\hat{P}(a_i \mid v_j) \leftarrow \frac{n_c + mp}{n + m}$$

   • n is the number of training examples for which v = vj,
   • nc is the number of examples for which v = vj and a = ai,
   • p is the prior estimate for $\hat{P}(a_i \mid v_j)$, typically a uniform prior, i.e., p = 1/k for k possible
     values of the attribute,
   • m is the weight given to the prior (i.e., the number of “virtual” examples).

69/147
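A one-line sketch of this m-estimate (the helper name is hypothetical):

def m_estimate(n_c, n, p, m):
    """Bayesian (m-)estimate of P(a_i | v_j): blends observed frequency n_c/n with prior p."""
    return (n_c + m * p) / (n + m)

# e.g., an attribute with k = 3 possible values (uniform prior p = 1/3), m = 3 virtual examples
print(m_estimate(n_c=0, n=5, p=1/3, m=3))  # 0.125 instead of 0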
Support Vector Machine
Motivation

Consider logistic regression with $h_w(x) = g(w^T x)$, with g the logistic function.

• The prediction $h_w(x)$ models the probability $p_w(y{=}1 \mid x)$.
• We predict “1” for x iff $h_w(x) > 0.5$, i.e., $w^T x > 0$.
• For a positive training example (y = 1), the larger $w^T x$ is, the larger is the probability
  $p_w(y{=}1 \mid x)$, hence the higher the degree of “confidence” that the label is 1.
• w is a good fit for the training set if $w^T x^{(i)} \gg 0$ whenever $y^{(i)} = 1$ and $w^T x^{(i)} \ll 0$
  whenever $y^{(i)} = 0$.
  • $\gg$ is read “much greater than” and $\ll$ is read “much less than”.

70/147
Motivation (2)

• Class × vs ◦ (data with 2 features).


• Linear decision boundary (the line) satisfies
wT x = 0
• A is very far from decision boundary: very
high confidence that A is ×
• C is very close to boundary: predicting that
C is × has low confidence; small change to
w can change it to ◦
• B is in between the above two cases.
• The farther the point from the decision
boundary, the more confident our prediction
is.
• Aim: find decision boundary such that all
predictions are correct and confident on the
training examples.
71/147
Notation

We consider linear classifiers (those with a linear decision boundary) for the binary classification
problem.

• Input features x with n dimensions
• Labels t ∈ {−1, 1} (instead of {0, 1})
• Parameter w with the intercept term b, i.e., $w = [w_1, \ldots, w_n]^T$.
• We drop the extra constant feature $x_0 = 1$, i.e., $x = [x_1, \ldots, x_n]^T$.
• Hypotheses of the form:
  $$y_{w,b}(x) = g(w^T x + b) = \begin{cases} 1 & \text{if } w^T x + b > 0, \\ -1 & \text{otherwise} \end{cases}$$
• Here, we directly predict a “1” or “-1”, instead of employing the logistic regression
  approach of estimating the probability that y = 1.

72/147
Which classification line is better than the others?

Which decision boundary is better than the others? Why? (MLAP, Fig. 8.1.)

73/147
Margin: Optimal separation

• Measure the distance from the line before hitting a datapoint: imagine a “(wide)
  street” around the line.
• Any point on the street is too close to the line to be accurately classified.
• Symmetric about the line: a region in 2D, a cuboid in 3D, a hyperrectangle for D > 3.
• The width of this street around a separating hyperplane is called the margin.

Which of the three examples of separating hyperplanes in the previous slide seems to
capture the maximum margin, i.e., the optimal separating hyperplane?

MLAP, Fig. 8.2.

74/147
Support vectors

• Assuming linearly separable data, the classifier with the largest margin is called the
maximum margin (linear) classifier.
• Datapoints of each class that lie closest to the classification line: support vectors.
• Maximum margin classifier is one that goes through the middle of the “street”
• margin should be as large as possible;
• support vectors are most useful datapoints since they are the ones that might be
misclassified
• once maximum margin is found, classification of a test point only depends on the support
vectors.
• Support vector machine (SVM) [Vapnik, 1992] performs binary classification by
finding a maximum margin classifier in:
• the original input space if the data is linearly separable
• a higher dimensional space to which data is transformed (using kernel functions) if the
data is not linearly separable.

75/147
Linear decision boundary

Assume data is linearly separable into two classes (regions $R_{y>0}$ and $R_{y<0}$). The decision
boundary is a linear hyperplane, $y = y(x) = w^T x + b = 0$, where w is the weight vector of the
features and b is the intercept term. E.g., for 2 features: the line $w_1 x_1 + w_2 x_2 + b = 0$.

76/147
Distance from decision hyperplane to any point (1)

w is orthogonal to every vector lying on the decision hyperplane y = 0. Why?

• For any two points A and B on the decision hyperplane, let $x_A$, $x_B$ be the
  vectors from the origin to A and B, resp.
• Vector $\overrightarrow{AB} = x_B - x_A$ lies on the decision hyperplane.
• So, $y(x_A) = 0$ and $y(x_B) = 0$ since A and B are on the decision hyperplane.
• Inner product of w and $x_B - x_A$:
  $$
  \begin{aligned}
  \langle w, x_B - x_A \rangle &= w^T (x_B - x_A) \\
  &= w^T x_B - w^T x_A \\
  &= w^T x_B + b - (w^T x_A + b) \\
  &= y(x_B) - y(x_A) = 0
  \end{aligned}
  $$
• Thus, w is orthogonal to $x_B - x_A$.

PRML, Fig. 4.1.

78/147
Distance from decision hyperplane to any point (2)

We compute the distance from any point x to the decision hyperplane.

• Let $x_\perp$ be the projection of x onto the decision hyperplane.
• The distance vector from $x_\perp$ to x is $x - x_\perp$.
• $x - x_\perp$ is parallel to w.
• So, $x - x_\perp = r \frac{w}{\|w\|}$ where
  • $\frac{w}{\|w\|}$ is the unit vector along w,
  • r is the signed distance from the decision hyperplane ($x_\perp$) to x.
• So we can write:
  $$x = x_\perp + (x - x_\perp) = x_\perp + r \frac{w}{\|w\|}$$

PRML, Fig. 4.1.

79/147
Distance from decision hyperplane to any point (3)

What’s the relation between y(x) and r?

$$
\begin{aligned}
y(x) = w^T x + b &= w^T \left( x_\perp + r \frac{w}{\|w\|} \right) + b \qquad \text{(*)} \\
&= w^T x_\perp + b + r \frac{w^T w}{\|w\|} \\
&= y(x_\perp) + r \frac{\|w\|^2}{\|w\|} \\
&= r \|w\| \qquad \text{(**)}
\end{aligned}
$$

(*) due to the result from the previous slide
(**) since $y(x_\perp) = w^T x_\perp + b = 0$ as $x_\perp$ lies on the decision hyperplane

PRML, Fig. 4.1.

80/147
Distance from decision hyperplane to any point (4)

• The signed distance from the decision hyperplane to x is $r = \frac{y(x)}{\|w\|}$.
• x is in the region $R_{y>0}$ iff r > 0
• x is in the region $R_{y<0}$ iff r < 0
• Note: the signed distance from x to the decision hyperplane is −r.
• In particular, if x is the origin, its signed distance to the decision hyperplane is
  $-\frac{y(x)}{\|w\|} = -\frac{b}{\|w\|}$, which is negative when the origin is in $R_{y>0}$ and positive when it
  is in $R_{y<0}$.

PRML, Fig. 4.1.

81/147
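A quick NumPy sketch of this distance computation (the hyperplane and the points are
made-up values):

import numpy as np

def signed_distance(w, b, x):
    """Signed distance r = y(x) / ||w|| from the hyperplane w^T x + b = 0 to x."""
    return (w @ x + b) / np.linalg.norm(w)

# Hypothetical 2D hyperplane x1 + x2 - 1 = 0
w, b = np.array([1.0, 1.0]), -1.0
print(signed_distance(w, b, np.array([1.0, 1.0])))  # positive side:  1/sqrt(2)
print(signed_distance(w, b, np.array([0.0, 0.0])))  # the origin:    -1/sqrt(2) = -b/||w||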
Maximum margin solution: Optimization (1)

• Recall that for every input $x^{(i)}$ we have a target $t^{(i)} \in \{-1, 1\}$.
• If data is linearly separable, there is at least one choice of w and b such that $y(x^{(i)}) > 0$
  whenever $t^{(i)} = +1$ and $y(x^{(i)}) < 0$ whenever $t^{(i)} = -1$. Thus, for every point $x^{(i)}$,
  $t^{(i)} y(x^{(i)}) > 0$.
• The unsigned distance between a point $x^{(i)}$ and the decision hyperplane is:
  $$\frac{t^{(i)} y(x^{(i)})}{\|w\|} = \frac{t^{(i)} (w^T x^{(i)} + b)}{\|w\|}$$
  The margin is the unsigned distance between the decision hyperplane and the closest points to it.
• We want to maximize the margin, thus we solve
  $$\arg\max_{w,b} \left\{ \min_i \frac{t^{(i)} (w^T x^{(i)} + b)}{\|w\|} \right\} = \arg\max_{w,b} \frac{1}{\|w\|} \min_i \left[ t^{(i)} (w^T x^{(i)} + b) \right]$$
  with $\frac{1}{\|w\|}$ outside the optimization over i since it is independent of i.

82/147
Maximum margin solution: Optimization (2)

• Suppose the “inner” minimization gives a solution $\min_i [t^{(i)} (w^T x^{(i)} + b)] = \hat{\gamma}$. Then the
  maximum margin solution can be written:
  $$\arg\max_{w,b} \frac{\hat{\gamma}}{\|w\|} \quad \text{subject to } t^{(i)} (w^T x^{(i)} + b) \geq \hat{\gamma}, \text{ for } i = 1, \ldots, m$$
  where m is the number of data points.
• Moreover, if we change w to cw and b to cb for some constant c,
  $$\frac{t^{(i)} ((cw)^T x^{(i)} + cb)}{\|cw\|} = \frac{c\, t^{(i)} (w^T x^{(i)} + b)}{c \|w\|} = \frac{t^{(i)} (w^T x^{(i)} + b)}{\|w\|}$$
  the distance of $x^{(i)}$ to the decision hyperplane remains the same.
• So, we can restate the maximum margin solution by “scaling” w and b by a factor of $1/\hat{\gamma}$:
  $$\arg\max_{w,b} \frac{1}{\|w\|} \quad \text{subject to } t^{(i)} (w^T x^{(i)} + b) \geq 1, \text{ for } i = 1, \ldots, m$$
• The above is a constrained optimization problem, but unfortunately not convex, because $\frac{1}{\|w\|}$ is
  not convex.

83/147
(Figure: graph of $1/\|w\|$ with $w = (w_1, w_2)^T$.)
Maximum margin solution: Optimization (3)

• Maximizing $\frac{1}{\|w\|}$ is equivalent to minimizing $\|w\|^2$, which is convex.
• Margin maximization thus reduces to solving a minimization problem called a quadratic
  programming problem with linear constraints:
  $$\arg\min_{w,b} \frac{1}{2} \|w\|^2 \quad \text{subject to } t^{(i)} (w^T x^{(i)} + b) \geq 1, \text{ for all } i = 1, \ldots, m$$
• The expression $\frac{1}{2}\|w\|^2$ is the objective function to be optimized (minimized or
  maximized), while $t^{(i)} (w^T x^{(i)} + b) \geq 1$, $i = 1, \ldots, m$ are the constraints, which
  determine the feasible set C containing all possible values of the variables w and b
  that satisfy all of the constraints.
• b is implicit in the objective function, since changes to $\|w\|$ need to be compensated by
  changes to b.
• This particular problem is convex, since both the objective and constraint functions
  are convex. In convex optimization, a local optimum is also a global optimum.

85/147
Maximum margin solution: Optimization (4)

• Our previous minimization problem can be rewritten into a convex optimization
  standard form:
  $$\arg\min_{w,b} \frac{1}{2} \|w\|^2 \quad \text{subject to } -t^{(i)} (w^T x^{(i)} + b) + 1 \leq 0, \text{ for all } i = 1, \ldots, m$$
• We employ a technique using Lagrange multipliers, commonly used to solve convex
  constrained optimization problems.
  • The Lagrangian allows us to incorporate b, which only appears in the constraints, into the
    objective function.
  • As a bonus, this also allows us to obtain an efficient way of working in a higher dimension.

86/147
Brief intro on Lagrangian (1)

Consider a (primal) optimization problem without equality constraints in a standard form:

$$\min_{x \in \mathbb{R}^n} f(x) \quad \text{subject to } g_i(x) \leq 0 \text{ for } i = 1, \ldots, m$$

The Lagrangian combines the objective function and constraints into a single objective function:

$$L(x, \lambda) = f(x) + \sum_{i=1}^m \lambda_i g_i(x)$$

where $\lambda = (\lambda_1, \ldots, \lambda_m)^T \in \mathbb{R}^m$ is a vector of Lagrange multipliers, and $\lambda_i \geq 0$ for
$i = 1, \ldots, m$.

87/147
Brief intro on Lagrangian (2)

• For each feasible x (those that satisfy all constraints), L(x, λ) is a lower bound of f(x) for all λ:
  • The 2nd term of the Lagrangian satisfies $\sum_{i=1}^m \lambda_i g_i(x) \leq 0$ due to the fact that for all i,
    $\lambda_i \geq 0$ and $g_i(x) \leq 0$.
  • Thus, $f(x) \geq f(x) + \sum_{i=1}^m \lambda_i g_i(x) = L(x, \lambda)$.
• The greater $\sum_{i=1}^m \lambda_i g_i(x)$ (i.e., the closer to 0), the closer L(x, λ) is to f(x). So we can write
  $$\max_\lambda L(x, \lambda) = \begin{cases} f(x) & \text{if } g_i(x) \leq 0 \text{ for } i = 1, \ldots, m \\ \infty & \text{otherwise} \end{cases}$$
  Note that the max operator is over λ, not x. Also, the 2nd case is for x that is not feasible: we
  penalize L(x, λ) by defining it to be very large (hence it won’t be returned as a minimum over x).
• The primal minimization problem can then be written in terms of the Lagrangian as:
  $$\min_x f(x) = \min_x \max_\lambda L(x, \lambda) = \min_x \max_\lambda f(x) + \sum_{i=1}^m \lambda_i g_i(x) \quad \text{subject to } \lambda_i \geq 0,\ i = 1, \ldots, m$$

88/147
Brief intro on Lagrangian (3)

• Computing $\min_x \max_\lambda L(x, \lambda)$ may be hard. What if we compute $\max_\lambda \min_x L(x, \lambda)$
  instead?
• We define the function $\ell(\lambda) = \min_x L(x, \lambda)$, called the Lagrange dual function. Hence,
  we focus on solving the Lagrange dual problem:
  $$\max_\lambda \min_x L(x, \lambda) = \max_\lambda \ell(\lambda) \quad \text{subject to } \lambda_i \geq 0,\ i = 1, \ldots, m$$
• The solution of the above corresponds to the tightest/greatest lower bound of f(x).
• What is the relation between $\min_x \max_\lambda L(x, \lambda)$ and $\max_\lambda \min_x L(x, \lambda)$?
  $$
  \begin{aligned}
  \ell(\lambda) = \min_x L(x, \lambda) &\leq L(x, \lambda) \quad \text{(for all feasible } x, \lambda\text{)} \\
  \Rightarrow\ \max_\lambda \ell(\lambda) = \max_\lambda \min_x L(x, \lambda) &\leq \max_\lambda L(x, \lambda) \quad \text{(for all feasible } x\text{)} \\
  \Rightarrow\ \max_\lambda \ell(\lambda) = \max_\lambda \min_x L(x, \lambda) &\leq \min_x \max_\lambda L(x, \lambda) = \min_x f(x)
  \end{aligned}
  $$

89/147
• The problem: $\min f(x) = x^3 + 3x^2 - x + 1$ subject to $-1 \leq x \leq 2$, or equivalently in
  standard form, subject to $(x + 1)(x - 2) \leq 0$.
• Lagrangian: $L(x, \lambda) = x^3 + 3x^2 - x + 1 + \lambda(x^2 - x - 2)$, which will be below f(x) in the
  feasible region $-1 \leq x \leq 2$ when $\lambda \geq 0$.
• The corresponding dual function ℓ(λ) never exceeds min f(x) whenever x is feasible.

• Our inequality maxλ `(λ) 6 minx f (x) is called weak duality, which holds even when f
is non-convex.
• The difference minx f (x) − maxλ `(λ) > 0 gives the duality gap between the optimal
value of the primal problem and the best/greatest lower bound on it that can be
obtained from the dual function.
• In general, the gap may be positive/nonzero. However, under certain conditions, the
gap is zero, i.e., maxλ `(λ) = minx f (x), which implies that a solution to the dual
problem becomes exactly a solution to the primal problem.

91/147
Brief intro on Lagrangian (5)

For a primal optimization problem of minimizing a not necessarily convex f(x) subject to
$g_i(x) \leq 0$ for $i = 1, \ldots, m$, assume that the $g_i$’s are all differentiable.

Let ℓ(λ) be the dual function for the above primal problem. Further, let $x^*$, $\lambda^*$ be
such that $f(x^*)$ is minimal and $\ell(\lambda^*)$ is maximal, resp. That is, $x^*$ is primal optimal and
$\lambda^*$ is dual optimal.

Theorem
If $x^*$ and $\lambda^*$ are primal and dual optimal with zero duality gap, then all of the following
conditions, called the Karush-Kuhn-Tucker (KKT) conditions, hold:

1. $g_i(x^*) \leq 0$ for $i = 1, \ldots, m$
2. $\lambda_i^* \geq 0$ for $i = 1, \ldots, m$
3. $\lambda_i^* g_i(x^*) = 0$ for $i = 1, \ldots, m$
4. $\nabla_x f(x^*) + \sum_{i=1}^m \lambda_i^* \nabla_x g_i(x^*) = 0$

That is, the KKT conditions are necessary conditions for the primal and dual optimal points to
give minima (of the primal problem) and maxima (of the dual problem) at the same value,
i.e., $\min f(x) = \max \ell(\lambda)$.

92/147
Brief intro on Lagrangian (6)

Moreover, the following also holds:

Theorem
If f and all $g_i$ are convex, and x, λ are any points that satisfy the KKT conditions, then x
and λ are primal and dual optimal with zero duality gap.

That is, the KKT conditions are also sufficient conditions for the primal and dual optimal points
to give minima and maxima at the same value.

• The last theorem means that to solve the primal problem, we can simply solve the
  dual problem and ensure that the KKT conditions are satisfied.

93/147
Dual representation (1)

Let’s return to our original minimization problem for SVM:

$$\arg\min_{w,b} \frac{1}{2} \|w\|^2 \quad \text{subject to } t^{(i)} (w^T x^{(i)} + b) \geq 1, \text{ for all } i = 1, \ldots, m$$

which corresponds to the standard form:

$$\min_{w,b} \frac{1}{2} \|w\|^2 \quad \text{subject to } -t^{(i)} (w^T x^{(i)} + b) + 1 \leq 0, \text{ for all } i = 1, \ldots, m$$

The Lagrangian function:

$$L(w, b, \lambda) = \frac{1}{2} \|w\|^2 - \sum_{i=1}^m \lambda_i [t^{(i)} (w^T x^{(i)} + b) - 1]$$

where $\lambda = (\lambda_1, \ldots, \lambda_m)^T$ gives m Lagrange multipliers $\lambda_i \geq 0$.

Note that the objective and constraints are all convex. Our aim is to ensure that the KKT conditions
are satisfied. Trivially, the first and second KKT conditions are satisfied.

94/147
Dual representation (2)

Next, we minimize L(w, b, λ) over w, b, i.e., we compute its derivatives w.r.t. w and b. Any solution for
w and b obtained by setting the derivatives to zero gives us w and b that satisfy the fourth KKT
condition.

Setting the derivative of L w.r.t. w to 0:

$$0 = \nabla_w L(w, b, \lambda) = \nabla_w \left[ \frac{1}{2} \|w\|^2 - \sum_{i=1}^m \lambda_i [t^{(i)} (w^T x^{(i)} + b) - 1] \right] = w - \sum_{i=1}^m \lambda_i t^{(i)} x^{(i)}$$

Thus, $w = \sum_{i=1}^m \lambda_i t^{(i)} x^{(i)}$.

Setting the derivative of L w.r.t. b to 0:

$$0 = \frac{\partial}{\partial b} L(w, b, \lambda) = \frac{\partial}{\partial b} \left[ \frac{1}{2} \|w\|^2 - \sum_{i=1}^m \lambda_i [t^{(i)} (w^T x^{(i)} + b) - 1] \right] = -\sum_{i=1}^m \lambda_i t^{(i)}$$

Thus, $\sum_{i=1}^m \lambda_i t^{(i)} = 0$.

95/147
Dual representation (3)

$$\|w\|^2 = w^T w = \sum_{j=1}^m \lambda_j t^{(j)} (x^{(j)})^T \sum_{i=1}^m \lambda_i t^{(i)} x^{(i)} = \sum_{1 \leq i,j \leq m} \lambda_j \lambda_i t^{(j)} t^{(i)} (x^{(j)})^T x^{(i)}$$

$$w^T x^{(i)} = \sum_{j=1}^m \lambda_j t^{(j)} (x^{(j)})^T x^{(i)}$$

We substitute $w = \sum_{i=1}^m \lambda_i t^{(i)} x^{(i)}$ and $\sum_{i=1}^m \lambda_i t^{(i)} = 0$ into the Lagrangian function to obtain the dual form:

$$
\begin{aligned}
W(\lambda) = L(w, b, \lambda) &= \frac{1}{2} \|w\|^2 - \sum_{i=1}^m \lambda_i [t^{(i)} (w^T x^{(i)} + b) - 1]
  = \frac{1}{2} \|w\|^2 + \sum_{i=1}^m \lambda_i - \sum_{i=1}^m \lambda_i t^{(i)} w^T x^{(i)} - b \sum_{i=1}^m \lambda_i t^{(i)} \\
&= \sum_{i=1}^m \lambda_i + \frac{1}{2} \sum_{1 \leq i,j \leq m} \lambda_j \lambda_i t^{(j)} t^{(i)} (x^{(j)})^T x^{(i)} - \sum_{i=1}^m \lambda_i t^{(i)} \sum_{j=1}^m \lambda_j t^{(j)} (x^{(j)})^T x^{(i)} \\
&= \sum_{i=1}^m \lambda_i + \frac{1}{2} \sum_{1 \leq i,j \leq m} \lambda_j \lambda_i t^{(j)} t^{(i)} (x^{(j)})^T x^{(i)} - \sum_{1 \leq i,j \leq m} \lambda_j \lambda_i t^{(j)} t^{(i)} (x^{(j)})^T x^{(i)} \\
&= \sum_{i=1}^m \lambda_i - \frac{1}{2} \sum_{1 \leq i,j \leq m} \lambda_j \lambda_i t^{(j)} t^{(i)} (x^{(j)})^T x^{(i)}
\end{aligned}
$$

96/147
Dual representation (4)

The dual form is then maximized (to get the tightest lower bound of the original objective
function), giving us the following dual optimization problem:

$$
\begin{aligned}
\max_\lambda W(\lambda) &= \sum_{i=1}^m \lambda_i - \frac{1}{2} \sum_{1 \leq i,j \leq m} \lambda_i \lambda_j t^{(i)} t^{(j)} (x^{(i)})^T x^{(j)} \\
&= \sum_{i=1}^m \lambda_i - \frac{1}{2} \sum_{1 \leq i,j \leq m} \lambda_i \lambda_j t^{(i)} t^{(j)} \langle x^{(i)}, x^{(j)} \rangle \\
&\text{subject to } \lambda_i \geq 0,\ i = 1, \ldots, m \ \text{ and } \ \sum_{i=1}^m \lambda_i t^{(i)} = 0
\end{aligned}
$$

where we express the objective function using the inner product, which will be useful when
working with kernel functions later.

97/147
Satisfaction of KKT conditions

Our primal optimization problem satisfies the KKT conditions for all points $x^{(i)}$, $i = 1, \ldots, m$,
when $w = \sum_{i=1}^m \lambda_i t^{(i)} x^{(i)}$ and $\sum_{i=1}^m \lambda_i t^{(i)} = 0$:

• KKT condition (1): $g_i(x) = -t^{(i)} (w^T x^{(i)} + b) + 1 \leq 0$ for $i = 1, \ldots, m$ by definition of
  the primal problem.
• KKT condition (2): $\lambda_i \geq 0$ for $i = 1, \ldots, m$ by definition of the primal problem.
• KKT condition (3): Exercise!
• KKT condition (4): The solution for w above is obtained exactly when
  $\nabla_{w,b} \frac{1}{2} \|w\|^2 + \sum_{i=1}^m \lambda_i \nabla_{w,b} [-t^{(i)} (w^T x^{(i)} + b) + 1] = 0$

98/147
Support vectors

Because w, b, and λ satisfy the third KKT condition, $\lambda_i (1 - t^{(i)} (w^T x^{(i)} + b)) = 0$, it follows that for
every data point $x^{(i)}$, $i = 1, \ldots, m$, either

• $\lambda_i = 0$; or
• $t^{(i)} (w^T x^{(i)} + b) = t^{(i)} y(x^{(i)}) = 1$.

Recall our original hypothesis form: $y(x) = w^T x + b$

• Solving the dual problem gives us λ and, from it, $w = \sum_{i=1}^m \lambda_i t^{(i)} x^{(i)}$, which is optimal for the
  primal problem since the KKT conditions are satisfied.
• So, for any x, $y(x) = w^T x + b = \left( \sum_{i=1}^m \lambda_i t^{(i)} x^{(i)} \right)^T x + b = \sum_{i=1}^m \lambda_i t^{(i)} (x^{(i)})^T x + b$
• Points $x^{(i)}$ for which $\lambda_i = 0$ will not appear in the sum.
• The remaining data points $x^{(k)}$ must satisfy $t^{(k)} (w^T x^{(k)} + b) = t^{(k)} y(x^{(k)}) = 1$ and will appear in
  the sum. Such points are the support vectors, which lie closest to the decision hyperplane
  because they satisfy $t^{(i)} y(x^{(i)}) = 1$.
• After the SVM model is trained, only the support vectors need to be kept; the rest are
  discarded, and the hypothesis for any x becomes $y(x) = \sum_{k \in S} \lambda_k t^{(k)} (x^{(k)})^T x + b$ where S
  is the set of indices of the support vectors.

99/147
The intercept term

• Let S be the set of indices of the support vectors.
• Each support vector $x^{(r)}$ satisfies $t^{(r)} y(x^{(r)}) = t^{(r)} \left( b + \sum_{k \in S} \lambda_k t^{(k)} (x^{(k)})^T x^{(r)} \right) = 1$
• We can solve for b above using any arbitrarily chosen support vector, but a more numerically
  stable solution is obtained as follows:
  $$
  \begin{aligned}
  & t^{(r)} \left( b + \sum_{k \in S} \lambda_k t^{(k)} (x^{(k)})^T x^{(r)} \right) = 1 \\
  \Leftrightarrow\ & (t^{(r)})^2 \left( b + \sum_{k \in S} \lambda_k t^{(k)} (x^{(k)})^T x^{(r)} \right) = t^{(r)} \\
  \Leftrightarrow\ & b + \sum_{k \in S} \lambda_k t^{(k)} (x^{(k)})^T x^{(r)} = t^{(r)} \qquad \text{because } (t^{(r)})^2 = 1 \text{ for all } r \in S \\
  \Leftrightarrow\ & \sum_{r \in S} \left( b + \sum_{k \in S} \lambda_k t^{(k)} (x^{(k)})^T x^{(r)} \right) = |S|\, b + \sum_{r \in S} \sum_{k \in S} \lambda_k t^{(k)} (x^{(k)})^T x^{(r)} = \sum_{r \in S} t^{(r)} \\
  \Leftrightarrow\ & b = \frac{1}{|S|} \sum_{r \in S} \left( t^{(r)} - \sum_{k \in S} \lambda_k t^{(k)} (x^{(k)})^T x^{(r)} \right)
  \end{aligned}
  $$

100/147
SVM: How to classify a new point with a trained model

Given a new datapoint x, the trained model (i.e., after w, b, and the support vectors are found)
can classify x by evaluating the sign of y(x), given by

$$y(x) = w^T x + b = b + \sum_{i=1}^m \lambda_i t^{(i)} (x^{(i)})^T x = b + \sum_{i=1}^m \lambda_i t^{(i)} \langle x^{(i)}, x \rangle = b + \sum_{k \in S} \lambda_k t^{(k)} \langle x^{(k)}, x \rangle$$

where S is the set of indices of the support vectors.

101/147
SVM algorithm (linearly separable case)

Initialization: Assume the data are the column vectors $x^{(i)}$ of X and let $K = X^T X$. Then $K_{i,j}$ contains
the result of the inner product $\langle x^{(i)}, x^{(j)} \rangle$.

Training

• Assemble matrices from the dual optimization and pass them to a quadratic programming
  solver:
  $$\arg\max_\lambda W(\lambda) = \sum_{i=1}^m \lambda_i - \frac{1}{2} \sum_{1 \leq i,j \leq m} \lambda_i \lambda_j t^{(i)} t^{(j)} \langle x^{(i)}, x^{(j)} \rangle$$
  $$\text{subject to } \lambda_i \geq 0,\ i = 1, \ldots, m \ \text{ and } \ \sum_{i=1}^m \lambda_i t^{(i)} = 0$$
  The right sum can be written as a matrix operation, while the $\lambda_i$’s and $t^{(i)}$’s can be written as
  vectors (how?).
• Identify the support vectors.
• Compute the optimal value of b.

102/147
SVM algorithm (linearly separable case) (cont.)

Test/prediction

• For the given test data z, use the support vectors to classify the data:
  • Compute $y(z) = b + \sum_{k \in S} \lambda_k t^{(k)} \langle x^{(k)}, z \rangle$ where the indices are for the support vectors
    only.
  • Return the label corresponding to the sign of the above expression.

103/147
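Below is a minimal sketch of the training step using NumPy and the cvxopt QP solver
(assuming linearly separable data; function names and thresholds are our own). It builds
the matrices of the dual problem, solves for λ, and recovers w, b, and the support vectors:

import numpy as np
from cvxopt import matrix, solvers

def train_hard_margin_svm(X, t):
    """Solve the dual as the QP: min (1/2) lam^T P lam - 1^T lam
    s.t. lam >= 0 and t^T lam = 0, where P_ij = t_i t_j <x_i, x_j>.
    X: (m, n), rows are datapoints; t: (m,) labels in {-1, +1}.
    """
    m = X.shape[0]
    K = X @ X.T                                   # Gram matrix of inner products
    P = matrix(np.outer(t, t) * K)
    q = matrix(-np.ones(m))
    G = matrix(-np.eye(m))                        # -lam_i <= 0, i.e. lam_i >= 0
    h = matrix(np.zeros(m))
    A = matrix(t.reshape(1, -1).astype(float))    # sum_i lam_i t_i = 0
    b = matrix(np.zeros(1))
    lam = np.ravel(solvers.qp(P, q, G, h, A, b)['x'])
    S = lam > 1e-6                                # support vectors: lam_k > 0
    w = ((lam[S] * t[S])[:, None] * X[S]).sum(axis=0)
    b_opt = np.mean(t[S] - X[S] @ w)              # numerically stable intercept
    return w, b_opt, S

# Prediction for a test point z is then the sign of w @ z + b_opt.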
What if data is not linearly separable?

If data is not linearly separable, then the constraints of the primal problem cannot be
satisfied.

Two (not mutually exclusive) ways to deal with this problem.

• Relax the (hard) constraints into a soft one using slack variables. [Read Bishop’s
PRML, Section 7.1]
• Transform the data into a higher dimensional space using kernel functions (with the
hope that the data becomes linearly separable in the chosen higher dimensional
space).

We focus on the second case.

104/147
Kernel functions for SVM

Recall the hypothesis is of the form:

$$y(x) = b + w^T x = b + \sum_{i=1}^m \lambda_i t^{(i)} \langle x^{(i)}, x \rangle = b + \sum_{k \in S} \lambda_k t^{(k)} \langle x^{(k)}, x \rangle$$

where S contains the indices of the support vectors only.

If we wish to transform the data into a higher dimensional space, we wish to retain the
inner product in the hypothesis, namely the hypothesis should be of the form:

$$y(x) = b + \sum_{k \in S} \lambda_k t^{(k)} \langle \phi(x^{(k)}), \phi(x) \rangle$$

where φ is the transformation function that takes x to a higher dimensional space.

105/147
Kernel functions for SVM

Let n be the dimension of the input x. Then the following is an example of φ, called a
polynomial kernel of degree 2:

$$\phi(x) = \phi(x_1, \ldots, x_n) = \left(1,\ x_1\sqrt{2}, \ldots, x_n\sqrt{2},\ x_1^2, \ldots, x_n^2,\ x_1 x_2 \sqrt{2},\ x_1 x_3 \sqrt{2}, \ldots, x_i x_j \sqrt{2}, \ldots, x_{n-1} x_n \sqrt{2}\right)^T$$

with $1 \leq i < j \leq n$.

• If n = 1, $\phi(x) = (1,\ x_1\sqrt{2},\ x_1^2)^T$
• If n = 3, $\phi(x) = (1,\ x_1\sqrt{2},\ x_2\sqrt{2},\ x_3\sqrt{2},\ x_1^2,\ x_2^2,\ x_3^2,\ x_1 x_2 \sqrt{2},\ x_1 x_3 \sqrt{2},\ x_2 x_3 \sqrt{2})^T$

Unfortunately, φ(x) has $1 + 2n + \frac{n(n-1)}{2} = O(n^2)$ elements, which makes computing
$\langle \phi(x^{(i)}), \phi(x^{(j)}) \rangle$ expensive. Is there a way to lessen the cost?

106/147
Kernel trick

Consider a polynomial kernel function φ of degree 2 applied to input of n dimensions.
Note that

$$
\begin{aligned}
\langle \phi(x), \phi(z) \rangle
&= \left(1,\ x_1\sqrt{2}, \ldots, x_n\sqrt{2},\ x_1^2, \ldots, x_n^2,\ x_1 x_2 \sqrt{2},\ x_1 x_3 \sqrt{2}, \ldots, x_{n-1} x_n \sqrt{2}\right)^T \\
&\quad \cdot \left(1,\ z_1\sqrt{2}, \ldots, z_n\sqrt{2},\ z_1^2, \ldots, z_n^2,\ z_1 z_2 \sqrt{2},\ z_1 z_3 \sqrt{2}, \ldots, z_{n-1} z_n \sqrt{2}\right)^T \\
&= 1 + 2 \sum_{i=1}^n x_i z_i + \sum_{i=1}^n x_i^2 z_i^2 + 2 \sum_{1 \leq i < j \leq n} x_i x_j z_i z_j \\
&= (1 + x^T z)^2 = (1 + \langle x, z \rangle)^2
\end{aligned}
$$

• So, the inner product in the higher dimensional space need not actually be computed
  in the higher dimensional space!
• Simply compute the inner product in the original input space, and perform the
  transformation afterwards.
• This is called the kernel trick.

107/147
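A small numeric check of this identity (a sketch; phi explicitly builds the degree-2 feature
map from the previous slide, and the data is random):

import numpy as np
from itertools import combinations

def phi(x):
    """Explicit degree-2 polynomial feature map."""
    n = len(x)
    cross = [x[i] * x[j] * np.sqrt(2) for i, j in combinations(range(n), 2)]
    return np.concatenate(([1.0], x * np.sqrt(2), x ** 2, cross))

rng = np.random.default_rng(0)
x, z = rng.normal(size=3), rng.normal(size=3)

lhs = phi(x) @ phi(z)        # inner product in the transformed space
rhs = (1 + x @ z) ** 2       # kernel trick: computed in the original space
print(np.isclose(lhs, rhs))  # True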
Examples of kernels

• Linear kernel: $k(x, y) = x^T y$ or $k(x, y) = 1 + x^T y$
• Polynomial kernel of degree k: $k(x, y) = (c + x^T y)^k$ with c > 0.
• Sigmoid kernel with parameters κ and δ: $k(x, y) = \tanh(\kappa x^T y - \delta)$
• Gaussian or radial basis function (RBF) kernel with parameter σ:
  $k(x, y) = \exp(-\|x - y\|^2 / 2\sigma^2)$, where $\|x - y\|^2$ is the squared Euclidean distance
  between x and y. The resulting feature space is infinite-dimensional.
• and many others...

Which one to choose? Experiment!

The prediction/testing for x uses:

$$y(x) = b + \sum_{k \in S} \lambda_k t^{(k)} \langle \phi(x^{(k)}), \phi(x) \rangle = b + \sum_{k \in S} \lambda_k t^{(k)} k(x^{(k)}, x)$$

108/147
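As an illustration, a sketch of a kernelized prediction function with the RBF kernel (all
names are ours; lam, t_sv, and X_sv would come from training on the dual problem):

import numpy as np

def rbf(x, y, sigma=1.0):
    # Gaussian / RBF kernel: exp(-||x - y||^2 / (2 sigma^2))
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))

def svm_predict(x, b, lam, t_sv, X_sv, kernel=rbf):
    # y(x) = b + sum_{k in S} lam_k t_k k(x_k, x); the class is sign(y(x))
    return b + sum(l * tk * kernel(xk, x) for l, tk, xk in zip(lam, t_sv, X_sv))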
Example

From PRML Fig. 7.2. The support vectors are the circled points.

109/147
Multiclass SVMs

SVM is a binary classifier, so we need different approaches for K > 2 classes:

• (Vapnik, 1998) One-versus-the-rest approach. Construct K separate SVMs:
  • The kth SVM model yk(x) is trained using data from class Ck as positive examples and
    the rest as negative examples.
  • Prediction of a new datapoint x uses $y(x) = \max_k y_k(x)$ to overcome the inconsistent results
    of an input being assigned to multiple classes simultaneously.
  • Problem 1: different classifiers are trained on different tasks; there is no guarantee that the
    real-valued yk(x) have appropriate scales across different classifiers.
  • Problem 2: imbalanced training sets (lost symmetry).
  • Proposed variant (Lee et al. 2001): modify target values to +1 for the positive class and
    −1/(K − 1) for the negative.

110/147
Multiclass SVMs (cont.)

Ambiguities in multiclass classification (PRML, Fig. 4.2)

111/147
Multiclass SVMs (cont.)

• (Weston and Watkins, 1999): Define a single objective function for training K SVMs
  simultaneously, based on maximizing the margin from each class to the remaining classes.
  • Slower training: instead of the O(KN²) cost of solving K optimization problems over N data
    points, we solve a single optimization problem of size (K − 1)N with cost O(K²N²).
• One-versus-one approach. Train K(K − 1)/2 binary SVMs on all possible pairs of
  classes and classify datapoints according to which class has the highest number of
  “votes”.
  • Problem 1: ambiguities in the resulting classification.
  • For large K, more training time than the one-versus-the-rest approach.
  • More computation is required for evaluating test points; this can be somewhat alleviated by
    arranging the pairwise classifiers into a directed acyclic graph (DAGSVM) with K(K − 1)/2
    classifiers, where only K − 1 pairwise classifiers are evaluated when testing points.

112/147
Nearest Neighbor Methods
Motivation

• Methods employing probability distributions (e.g., mixtures of Gaussians) are called the
  parametric approach to density modeling.
  • The forms of the distributions are governed by parameters whose values are determined from a dataset.
  • The chosen parameter values (and thus the density) may be a poor model of the actual data
    distribution, hence poor predictive performance.
• Alternative: nonparametric approaches.
  • Make only a few assumptions about the form of the distribution (instead of setting parameter
    values).
  • Examples: histograms, kernel estimators, nearest neighbors.

113/147
Histogram approach

• Let x be a continuous variable.
• A histogram partitions x into distinct bins of width ∆i and counts the number of
  observations ni of x falling in bin i.
• To normalize the probability density, divide by the total number N of observations and by
  the width ∆i, i.e.,
  $$p_i = \frac{n_i}{N \Delta_i}$$
  Note: $\int p(x)\,dx = 1$, so this captures a density p(x) that is constant over the width of each
  bin, and often the widths are chosen to be the same, $\Delta_i = \Delta$, for all bins.

114/147
Histogram approach: Lesson learned

(Figure: three histogram density estimates with different widths ∆; the green curve is the
actual distribution from which the data was drawn.)

• Small ∆: the density model is very spiky, with a lot of structure that is actually not
  present in the underlying distribution that generated the data set.
• Large ∆: the density model is too smooth and thus fails to capture the bimodal property
  of the green curve.
• Best results: some intermediate value of ∆.
• The histogram also depends on the choice of edge locations of the bins, though the
  influence is typically much less significant than ∆.
• Once the histogram is computed, the data can be discarded.
• The histogram is easy to apply if data arrives sequentially.
• The estimated density has discontinuities due to bin edges (not from the data).
• Not scalable: $M^D$ bins for D-dimensional data (curse of dimensionality).

115/147
Histogram approach: Lesson learned (cont.)

• Locality: Estimating the probability density at a particular location needs to consider data
  points lying within some local neighborhood of the point.
• Locality requires a distance measure, e.g., Euclidean, etc.
• The smoothing parameter (i.e., bin width) can be neither too large nor too small.

116/147
General form of density estimate

• Observations are drawn from some unknown probability density p(x) in some
  D-dimensional space.
• Aim: estimate the value of p(x).
• Consider some small region R containing x. The probability mass for this region:
  $$P = \int_R p(x)\,dx$$
• Suppose we are sampling N observations of p(x), where P is the probability that a
  sampled point lies in R.
• The total number K of points that lie inside R follows a binomial distribution:
  $$\text{Bin}(K \mid N, P) = \frac{N!}{K!(N - K)!} P^K (1 - P)^{N-K}$$
  where $E[K] = NP$ and $\text{var}[K] = NP(1 - P)$.

117/147
General form of density estimate (cont.)

• For large N, the binomial distribution peaks sharply around the mean, i.e., $K \simeq NP$.
• But if R is assumed to be so small that p(x) is roughly constant in it, then
  $P \simeq p(x) V$, where V is the (spherical) volume of R.
• So, for sufficiently large N, the density estimate is:
  $$p(x) = \frac{K}{NV}$$
• Hence, estimating density can be approached in two ways:
  • Fixing K and determining V from data → K-nearest neighbor
  • Fixing V and determining K from data → kernel estimator

118/147
Nearest neighbor methods

• We estimate p(x) by considering a small sphere V centered on point x and letting the
radius of the sphere to grow until it covers K data points.
• K governs degree of smoothing (see figure): there is an optimum K.
• Model produced by K-nearest neighbor is not a true density model (integral diverges).

119/147
K-nearest neighbor for classification

• Let Nk be the number of points of class Ck, with $N = \sum_k N_k$.
• To classify a point x, draw a sphere centered on x containing precisely K points
  (irrespective of their class).
• The sphere's form depends on the distance measure used, e.g., Euclidean distance,
  Mahalanobis, etc.
• Suppose the sphere has volume V and contains Kk points from class Ck. Then the
  density estimate of each class:
  $$p(x \mid C_k) = \frac{K_k}{N_k V}$$
• Unconditional density:
  $$p(x) = \frac{K}{NV}$$
• Class priors:
  $$p(C_k) = \frac{N_k}{N}$$

120/147
K-nearest neighbor for classification (cont.)

• Using Bayes' theorem, the posterior probability of class membership is:
  $$p(C_k \mid x) = \frac{p(x \mid C_k)\, p(C_k)}{p(x)} = \frac{K_k}{K}$$
• To minimize the probability of misclassification:

  K-nearest neighbor rule
  Assign x to the class having the largest posterior probability, i.e., the one with the largest
  Kk.

• Ties can be broken at random.

121/147
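A minimal sketch of this rule with Euclidean distance (toy data is ours; note that ties here
fall to the first class seen rather than being broken at random):

import numpy as np
from collections import Counter

def knn_classify(X_train, t_train, x, K=3):
    """Assign x to the majority class among its K nearest neighbors."""
    dists = np.linalg.norm(X_train - x, axis=1)   # O(ND) distance computation
    nearest = np.argsort(dists)[:K]               # indices of the K closest points
    votes = Counter(t_train[nearest])
    return votes.most_common(1)[0][0]

# Toy usage (illustrative data only)
X = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], dtype=float)
t = np.array([0, 0, 0, 1, 1, 1])
print(knn_classify(X, t, np.array([4.5, 5.0])))   # -> 1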
Distance measure

Distance between two points $x = (x_1, \ldots, x_n)$ and $y = (y_1, \ldots, y_n)$:

• Euclidean distance, straight-line distance, L2 distance:
  $$d(x, y) = \|y - x\|_2 = \sqrt{\sum_{i=1}^n (y_i - x_i)^2}$$
• Manhattan distance, city block distance, rectilinear distance, L1 distance:
  $$d(x, y) = \|y - x\|_1 = \sum_{i=1}^n |y_i - x_i|$$
• Minkowski distance, Lp distance:
  $$d(x, y) = L_p(x, y) = \left( \sum_{i=1}^n |y_i - x_i|^p \right)^{1/p}$$
• Mahalanobis distance, if x and y are from the same distribution with covariance matrix S:
  $$d(x, y) = \sqrt{(x - y)^T S^{-1} (x - y)}$$

122/147
Decision boundary

(a) New point (black diamond) classified according to majority class membership of 3
nearest neighbor.
(b) Decision boundary of 1-nearest neighbor is composed of hyperplanes forming
perpendicular bisector of pairs of points from different classes

123/147
Example plot

• Small K: many small regions of each class.
• Large K: few large regions of each class.
• For K = 1, as N → ∞, the error rate is at most 2× the minimum achievable error rate of the optimal
  classifier (one that uses the true distributions).
• Expensive: requires computing the distance to the entire training set; if computing a single distance needs
  O(D) time, then computing the distance of a point to all points in the training set is O(ND). If we don’t store
  the distances for each k = 1, . . . , K, we obtain O(NDK) complexity. If we precompute the distances, we get
  O(ND + NK) complexity. Using efficient data structures (such as a KD-tree) or median computation
  algorithms can reduce the complexity further.

124/147
Dimensionality Reduction:
Principal Component Analysis
Dimensionality reduction: Why?

• When plotting and inspecting data, can never go beyond 3D, and 2D is easier to
interpret.
• Curse of dimensionality: more dimension → more training data.
• Dimensionality is explicit factor in computational cost.
• Can remove noise.
• Significantly improve performance of learning.
• Make dataset easier to work with.
• Make results easier to understand.

125/147
Dimensionality reduction: How?

• Feature selection: looking through available features and seeing if they are actually
useful (i.e., correlated to output variables).
• Even when using neural network (which was to avoid “getting your hands dirty”), results
can be much improved with feature selection.
• Feature derivation: deriving new features from the old ones through applying some
transform(s) to the dataset that simply change the axes (moving/rotating).
• Allows combining features and identifying useful ones.
• Clustering: group similar datapoints and see if this allows fewer features to be used.

126/147
Principal Component Analysis (PCA)

(Figure: the original data and the axes found by PCA.)

• Invented by Karl Pearson (1901) as an analogue of the principal axis theorem in
  mechanics. Independently developed by Harold Hotelling (1930s).
  • Aliases: Karhunen-Loeve transform (signal processing), Hotelling transform (multivariate
    quality control), proper orthogonal decomposition (mechanical engineering), etc.
• Unsupervised, since labels in the data are not needed.
• Intuition: fit an n-dimensional ellipsoid to the data.
• Principal components: the axes of the ellipsoid, each a direction in the data with the largest
  variation.
• If the variance along an axis is small, the axis can be ignored (with only a small loss).

127/147
Intuitive reading of PCA algorithm

• Intuition:
• Center the data by subtracting off the mean.
• Choose direction with the largest variation
• Place an axis in that direction.
• Look at the remaining variation and find another axis that is orthogonal to the first axis and
covers as much of the remaining variation as possible.
• Iterate until no possible axes found.
• End result: all variation is along the axes, the covariance matrix is diagonal (each new
variable is uncorrelated with every variable except itself).
• Some axes found last have very little variation and can be removed without affecting
variability of data.

128/147
Formal derivation

Let X be the data matrix (with datapoints as rows).

• We wish to rotate X with a rotation matrix $P^T$, i.e., to obtain Y with $P^T X = Y$, where P is
  chosen such that cov(Y) is diagonal (why?):
  $$\text{cov}(Y) = \text{cov}(P^T X) = \begin{pmatrix} \lambda_1 & 0 & 0 & \cdots & 0 \\ 0 & \lambda_2 & 0 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & \lambda_N \end{pmatrix}$$
• Then $\text{cov}(Y) = E[Y Y^T] = E[(P^T X)(P^T X)^T] = E[(P^T X)(X^T P)] = P^T E[X X^T] P = P^T \text{cov}(X) P$,
  because E[P] = P as it is independent of the data.
• Thus, $P\,\text{cov}(Y) = P P^T \text{cov}(X) P = \text{cov}(X) P$, because for every rotation matrix $P^T = P^{-1}$
  (inverting a rotation = rotating the same amount in the opposite direction).
• Writing P as a set of column vectors, i.e., $P = [p_1, \ldots, p_N]$, we have
  $P\,\text{cov}(Y) = [\lambda_1 p_1, \ldots, \lambda_N p_N]$.
• Let Z = cov(X). Then, for each i, $\lambda_i p_i = Z p_i$. Thus, the matrix P is found so that for the
  directions given by P, Z just rescales them. These directions are special and called
  eigenvectors, and the scale amounts are called eigenvalues.

129/147
Eigenvectors and eigenvalues

Definition
Given an M × M matrix A, an eigenvector of A is a column vector v such that $Av = \lambda v$
for some scalar λ. The equation just mentioned is called the eigenvalue equation for the
matrix A, and the scalar λ is the corresponding eigenvalue.

Example: since
$$\begin{pmatrix} 2 & 3 \\ 2 & 1 \end{pmatrix} \begin{pmatrix} 3 \\ 2 \end{pmatrix} = \begin{pmatrix} 12 \\ 8 \end{pmatrix} = 4 \times \begin{pmatrix} 3 \\ 2 \end{pmatrix}$$
the vector $\begin{pmatrix} 3 \\ 2 \end{pmatrix}$ is an eigenvector of $\begin{pmatrix} 2 & 3 \\ 2 & 1 \end{pmatrix}$ with 4 as the corresponding eigenvalue.

Eigenvalues and eigenvectors appear in many fields of science, e.g., classical and
quantum mechanics, geology, epidemiology, and any science that uses principal component
analysis (bioinformatics, data mining, chemistry, psychology, image and signal
processing).

130/147
Eigenvectors and eigenvalues (cont.)

The eigenvalue equation of A can be written $(A - \lambda I)v = 0$, where I is the identity matrix.

• The equation has a nonzero solution in v if and only if the determinant $|A - \lambda I| = 0$.
• An eigenvalue is thus a root of the polynomial of degree n induced by the determinant
  expression. This polynomial is called the characteristic polynomial of A.

For a 2×2 matrix $\begin{pmatrix} a & b \\ c & d \end{pmatrix}$, the determinant is $ad - bc$.

If $A = \begin{pmatrix} 2 & 3 \\ 2 & 1 \end{pmatrix}$, then $A - \lambda I = \begin{pmatrix} 2 - \lambda & 3 \\ 2 & 1 - \lambda \end{pmatrix}$

and the characteristic polynomial is $(2 - \lambda)(1 - \lambda) - 6 = \lambda^2 - 3\lambda - 4$, which can be
factored into $(\lambda - 4)(\lambda + 1)$, giving 4 and −1 as the eigenvalues of the matrix.

131/147
Eigenvectors and eigenvalues (cont.)

To find the eigenvector, we substitute the eigenvalue back into the eigenvalue equation:
$$\begin{pmatrix} 2 & 3 \\ 2 & 1 \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} = 4 \begin{pmatrix} x \\ y \end{pmatrix}$$
giving us two linear equations, $2x + 3y = 4x$ and $2x + y = 4y$. Solving these equations yields
$x = \frac{3}{2} y$. In fact, there is an infinite number of solutions of the form $x = 3t$ and $y = 2t$ for
any real number t. So, there are infinitely many eigenvectors for the eigenvalue λ = 4,
including $(3, 2)^T$.

In general, computing eigenvalues (and thus eigenvectors) must be done by approximate
numerical methods.

132/147
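A quick check of this example with an off-the-shelf numerical routine (numpy.linalg.eig):

import numpy as np

A = np.array([[2.0, 3.0],
              [2.0, 1.0]])

# numpy solves the eigenvalue equation A v = lambda v numerically
eigvals, eigvecs = np.linalg.eig(A)
print(eigvals)                     # approximately [ 4. -1.]

# Columns of eigvecs are unit-length eigenvectors; take the one for lambda = 4
v = eigvecs[:, np.argmax(eigvals)]
print(v / v[1] * 2)                # rescaled: proportional to (3, 2)^T
print(np.allclose(A @ v, 4 * v))   # True: A v = 4 v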
Properties of eigenvalues and eigenvectors

• Eigenvectors are found only on square matrices, but not all square matrices have
eigenvectors.
• An N × N matrix can have at least d and at most N linearly independent eigenvectors
where d is the number of distinct eigenvalues of the matrix. Note that d may be
smaller than N .
• Each eigenvalue λ of a matrix has at least one corresponding eigenvector.
• Eigenvectors with distinct eigenvalues are linearly independent.

133/147
Eigendecomposition

Every N × N matrix with N linearly independent eigenvectors qi , i = 1, . . . , N , can be


decomposed/factorized as follows:

$$A = E D E^{-1}$$

where
• E is an N × N matrix whose i-th column is the eigenvector qi of A
• D a diagonal matrix whose diagonal elements Dii = λi is the eigenvalue
corresponding to qi .
The eigendecomposition above can also be written: E −1 AE = D. Also, the eigenvectors
can be chosen so that they are normalized to have length one.

Special case: for every real symmetric matrix, the eigenvalues are real numbers and the
eigenvectors can be chosen so that they are mutually orthogonal.
• The decomposition becomes A = EDE T with E orthogonal
• Note that orthogonality implies linear independence (but not vice versa).

134/147
Eigenvectors and eigenvalues for PCA

• Eigenvectors tell us the directions.


• Eigenvalues tell us the amount of rescaling needed along their corresponding
eigenvector dimensions.
• The more rescaling needed, the larger variation along that dimension.
• Dimensions with large eigenvalues have lots of variation (thus useful)
• For dimensions with small eigenvalues, datapoints are tightly bunched together (not much
variation), so can be dropped.

135/147
PCA algorithm

1. Write the N datapoints (of M dimensions) $x_i = (x_{1i}, x_{2i}, \ldots, x_{Mi})$ as row vectors.
2. Put these vectors into a matrix X of size N × M.
3. Center the data by subtracting off the mean of each column and put the result into the matrix
   B. That is, define the mean vector $\mu = \frac{1}{N} \sum_{i=1}^N x_i$ and the N × M matrix M whose rows are
   identical, namely the vector µ. Then B = X − M.
4. Compute the M × M covariance matrix $C = \frac{1}{N} B^T B$.
   • Some literature suggests $C = \frac{1}{N-1} B^T B$.
5. Compute the eigenvalues and eigenvectors of C, i.e., the eigendecomposition $V^{-1} C V = D$.
   • This can be done numerically using many off-the-shelf matrix algebra systems.
6. Sort the eigenvalues in D in decreasing order, and apply the same order to the columns of
   (eigenvectors in) V.
7. Choose the L principal components corresponding to the L largest eigenvalues to obtain an M × L
   feature matrix W.
8. A datapoint x can be projected onto the new bases: y = xW. So, the transformed data is
   L-dimensional.

136/147
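The steps above translate almost line-for-line into NumPy; a minimal sketch (function name
and toy data are ours):

import numpy as np

def pca(X, L):
    """Project the rows of X onto the top-L principal components."""
    B = X - X.mean(axis=0)                 # step 3: center each column
    C = B.T @ B / X.shape[0]               # step 4: covariance matrix
    eigvals, V = np.linalg.eigh(C)         # step 5: eigh, since C is symmetric
    order = np.argsort(eigvals)[::-1]      # step 6: sort by decreasing eigenvalue
    W = V[:, order[:L]]                    # step 7: M x L feature matrix
    return B @ W                           # step 8: transformed, L-dimensional data

# Toy usage: correlated 2D data reduced to 1D
rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
X = np.column_stack([x1, 2 * x1 + 0.1 * rng.normal(size=100)])
print(pca(X, 1).shape)  # (100, 1)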
Example

(Figure: the original 2D data, and its reconstruction after PCA using only the first component.)

137/147
Kernel PCA

• Problem with PCA: assume the directions of all variations are straight lines (often not
true).
• Idea: use a possibly nonlinear function Φ(·) as kernel. So steps are:
• Choose a kernel and apply it to all pairs of data points to get matrix K of distances
between pairs of points in the transformed space.
• Compute eigenvalues and eigenvectors of K
• Normalize the eigenvectors by the square root of eigenvalues.
• Retain the eigenvectors corresponding to the largest eigenvalues.

138/147
K-Means Clustering
Unsupervised learning

• Supervised learning: training set contains labeled target data


• Reinforcement and evolutionary learning: training includes scoring system indicating
the goodness of prediction.
• What if we don’t have target or score information?
• Learning must spot similarity between different inputs by itself.
• Unsupervised learning: training finds clusters of similar inputs (without being told
explicitly which class they belong to).
• Some dimensionality reduction is unsupervised, e.g., PCA

139/147
Error criterion for unsupervised learning

• Unsupervised learning cannot use an external error criterion that relies on targets or
  other outside information.
• Need to find internal measure
• Measure must be task-independent (otherwise, algorithm must be changed every time
task is changed).
• Note: error criterion in supervised learning is task-specific due to the targets provided.
• Which internal measure?
• Distance between inputs: small means similar, hence should be in the same cluster.

140/147
K-means: Overview

• Value k is given: the number of clusters/categories.

Aim of K-means algorithms


Find k cluster centers and position them so that there is one cluster center in the middle
of each cluster.

Definition of “middle” depends on:

• distance measure, e.g., Euclidean, or other distances


• mean average, determining central point of a set of datapoints

141/147
K-means algorithm

Initialization: choose a value of K, choose K random positions in the input space, and assign
the cluster centers µk, k = 1, . . . , K, to those positions.

Learning:
Repeat until the cluster centers stop moving:

• For each datapoint xn, compute the distances to each cluster center and assign it to the
  nearest cluster center.
• For each cluster Ck, move the position of its center µk to the mean of the points in that
  cluster:
  $$\mu_k = \frac{1}{N_k} \sum_{x \in C_k} x$$

Testing/Usage: for each testing point, compute its distance to each cluster center and
assign it to the nearest cluster center.

142/147
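A direct sketch of the algorithm in NumPy (here the centers are initialized from a random
subset of the data points; names and toy data are ours):

import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Plain K-means with Euclidean distance."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=K, replace=False)]   # initial centers
    for _ in range(n_iters):
        # Assignment: each point goes to its nearest center
        d = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)  # (N, K)
        r = d.argmin(axis=1)
        # Update: move each center to the mean of its assigned points
        new_mu = np.array([X[r == k].mean(axis=0) if np.any(r == k) else mu[k]
                           for k in range(K)])
        if np.allclose(new_mu, mu):                     # centers stopped moving
            break
        mu = new_mu
    return mu, r

# Toy usage: two well-separated blobs
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
mu, r = kmeans(X, K=2)
print(np.round(mu, 1))  # approximately the two blob centers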
K-means as optimization

• K-means aims to minimize a distortion measure J:
  $$J = \sum_{n=1}^N \sum_{k=1}^K r_{nk} \|x_n - \mu_k\|^2$$
  where $r_{nk} = 1$ if $x_n$ is assigned to cluster k and 0 otherwise.
• J is the sum of squared distances of each data point to its assigned vector.
• Each iteration consists of
  • E-step: updating $r_{nk}$ while $\mu_k$ is fixed
  • M-step: updating $\mu_k$ while $r_{nk}$ is fixed
• The solution for $r_{nk}$ is straightforward: J is linear in $r_{nk}$, and for different n, the $r_{nk}$’s are independent.
  • So, set $r_{nk} = 1$ for the k that gives the minimum value of $\|x_n - \mu_k\|^2$, i.e., for the nearest cluster
    center.
• The solution for $\mu_k$: J is quadratic in $\mu_k$, so we set its derivative to zero, $2 \sum_{n=1}^N r_{nk} (x_n - \mu_k) = 0$,
  yielding
  $$\mu_k = \frac{\sum_n r_{nk} x_n}{\sum_n r_{nk}}$$
  That is, $\mu_k$ is the mean of all data points assigned to cluster k.

143/147
Illustration

144/147
Computational consideration

• Poor initial values for µk lead to more steps before convergence


• In practice, better to choose cluster centers from random subset of K data points.
• K-means is often used to initialize parameters for a Gaussian mixture model prior to
applying EM algorithm.
• Direct implementation is slow since each E-step requires computing distance of every
data point to every cluster center.
• Use of tree data structure (tree of nearby points) or triangle inequality can reduce the
amount of computation.
• Distance measure may imply that non-numerical features cannot be used.
• Use of means can make clustering not robust to outliers.
• Imagine what happen to cluster containing points: {1, 2, 1, 2, 100}.
• Instead of means, one can use median.
• Other approach: K-medoids

145/147
K-medoids algorithm

• Minimizes sum of general pairwise dissimilarities, instead of sum of squared


distances.
• Dissimilarity is more general than distance: every distance can essentially be seen as
dissimilarity measure
• Dissimilarity doesn’t have to be symmetric.
• A medoid of a set of points is a point in the set whose average dissimilarity to all data
points in the set is minimal.
• In other words, medoid is the most centrally located point in the set.
• Unlike mean, medoid is always a point found in the set.

146/147
Partitioning around medoids (PAM) algorithm

Initialization: Choose K of the N data points as medoids.

Training:
Repeat until there is no change in assignment:

• Assignment step: Associate each data point with the closest medoid.
• Update step: For each medoid m and each data point o, swap m and o and compute
  the average dissimilarity of o to all data points associated with m. Then select the point
  o with the lowest average dissimilarity as the new medoid.

Testing: perform the assignment step on the testing point.

147/147
