
Machine Learning - A Crash Course

Adila Krisnadhi
[email protected]
December 12, 2020
Faculty of Computer Science
Universitas Indonesia
Acknowledgements

• A longer version of these slides was given in the Machine Learning course taught by
me, Ari Saptawijaya, and Raja Oktovin at the Faculty of Computer Science,
Universitas Indonesia.
• Parts of these slides are taken from Pedro Domingos’ slides (Introduction and
Inductive Learning) at
https://courses.cs.washington.edu/courses/cse446/15sp/
• Some figures are taken from Marsland’s “Machine Learning: An Algorithmic
Perspective” (MLAP) and Bishop’s “Pattern Recognition and Machine Learning”
(PRML).

1/147
Outline

What is Machine Learning?

Supervised Learning

Decision Trees

Linear Regression

Logistic Regression

Naive Bayes Classifier

Support Vector Machine

2/147
Outline (cont.)
Nearest Neighbor Methods

Dimensionality Reduction: Principal Component Analysis

K-Means Clustering

3/147
What is Machine Learning?
Artificial Intelligence

Image tweeted by Shawn Riley (@CyberIntelSME) Feb 11, 2019, https://twitter.com/CyberIntelSME/status/1094962204669964288

4/147
What is machine learning?

• Learning is one of the characteristics of intelligence.


• Important parts of learning:
• remembering
• adapting
• generalizing
• Machine learning methods are subsymbolic because no symbols or symbolic
manipulation is involved, as opposed to symbolic methods in AI such as reasoning
and logical deduction.
• Machine learning is about automating automation: making computers modify or adapt
their actions (e.g., making predictions, controlling a robot, etc.) by capturing patterns
in the data so that these actions are as “good” as possible.
• goodness: how well the chosen actions reflect the most appropriate ones.

5/147
Traditional programming vs. machine learning

Traditional programming

Machine learning

6/147
Key elements of machine learning

7/147
Key elements of ML: Representation

Representation refers to the type of model under consideration. It can be understood as
the set/space/landscape of all possible models with respect to a given formal language.

• Different types of model mean different representation.


• 3-layer feedforward neural networks form one type, while support vector machines
form another.
• Examples: decision trees, rules/logic programs, instances, graphical models
(Bayes/Markov nets), neural networks, SVM, model ensembles, etc.

8/147
Key elements of ML: Evaluation

Evaluation concerns judgement or preference of one model over the others.


• Expressed in terms of some metrics such as utility function, loss function, scoring
function, or fitness function.
• Examples: accuracy, precision and recall, squared error, likelihood, posterior
probability, cost/utility, margin, entropy, etc.
• For each given model, evaluation gives the height of the landscape.
• Which height is preferred depends on the evaluation functions, e.g., using squared
error means going to lower areas, while likelihood may mean going to higher areas.

9/147
Key elements of ML: Optimization

Optimization refers to the strategy to satisfy the preference.

• “how do you get to the most preferable area”, i.e., search the space of represented
models to obtain better evaluations
• Examples:
• Combinatorial optimization, e.g., greedy search, genetic algorithm
• Convex optimization, e.g., gradient descent
• Constrained optimization, e.g., linear programming, quadratic programming
• Study theory of optimization for deeper discussion.

10/147
Types of Machine Learning

Question: how does a machine know that it is getting better at performing its task?
Different answers lead to different categories below:

• Supervised (inductive) learning.


• Given training examples, correct responses (targets) are given to the algorithm.
• The algorithm generalizes to respond correctly to all possible inputs, i.e., learning from
examples.
• Unsupervised learning.
• Given training examples, no correct response is given to the algorithm.
• The algorithm identifies similarities between inputs so that inputs that have something in
common are categorized/clustered together.
• Reinforcement learning.
• Given training examples, the algorithm gets told when the answer is wrong, but not how to
correct it, i.e., learning with a critic.
• The algorithm explores different possibilities until it knows how to give a correct answer.

11/147
Learning problems

Supervised learning problems:

• Classification aims to assign each input to one of a finite number of categories (target values
are discrete).
• Regression aims to assign each input to a value in a continuous set of possible target values
• Probability estimation is a special case of regression where target values range between 0
and 1 and represent probability values.

Unsupervised learning problems:

• Clustering aims to discover groups of similar examples within the input space.
• Density estimation aims to determine the distribution of data within the input space.
• Projection/dimensionality reduction aims to obtain a representation of data in a dimension
different (typically lower) than its original dimension.

Reinforcement learning problem:

• Credit assignment aims to determine a way to reward (or punish) every action the algorithm
provides so that at the end of the action sequence, it arrives at the best/correct answer.

12/147
Machine learning applications

• Web search: present which web pages you’ll likely click on, based on your past browsing
activities.
• Computational biology: predict a drug design (e.g., how likely the design will give a certain
pharmacological effect) based on various properties of the constituting molecules.
• Commerce/Finance: predict whether an individual’s credit risk is high (so as not to approve
their credit card application) based on their profile information.
• Space exploration/astronomy: predict, from a picture of the surface of planet Venus, whether it
has a volcano.
• Robotics: self-driving car – predict whether it should accelerate or brake based on the data
about its surroundings from its sensors.
• Social networks: predict if you’ll likely send a friend request to an individual based on your
social network profile and the activities you do in the social network application.
• Debugging: based on program traces and how people have debugged in the past,
predict the most likely locations of bugs.
• Your example?

13/147
Machine learning pipeline

1. Data collection.
• Need to understand domain, prior knowledge, and goals
• Data may not exist.
• Large amount of relevant data exist, but are hard to collect.
• Supervised learning also requires target data.
• More data is better, preferably without too much noise. But, large dataset means
increased computational cost.
2. Data preparation
• May include data integration, selection, cleaning, preprocessing, etc.
• Preprocessing may also involve feature selection, extraction, or engineering.
• Not all features are useful for the machine learning task at hand. Some may be redundant
or irrelevant.
• Collecting features should not be costly. Features should be robust to noise.
• Some machine learning representation may not even need feature engineering.
3. Algorithm choice.

14/147
Machine learning pipeline (cont.)

4. Hyperparameter and model selection.


• In many algorithms, parameters have to be set manually or require experiments to
determine their appropriate values.
5. Training. Learn the models
6. Evaluation.
• The trained model should be evaluated on data it was not trained on.
• May involve comparison with human experts and the selection of appropriate metrics for
such comparison.
7. Model deployment on the actual system.
8. Loop if necessary.

15/147
Supervised Learning
Supervised learning

• Input x is a vector of values called features and the


length of x is called the dimension
• Output y is a target variable/vector
• x comes from the space X of input values, while y
comes from the space Y of output values.
• We are given a training set consisting of finitely
many pairs input-output variables (x(i) , y (i) ),
i = 1, . . . , m, of training examples such that
y (i) = f (x(i) ) for some unknown target function f .
• Aim is to find a function h : X → Y such that h is a
good approximation of f . The function h is called a
hypothesis

16/147
Example Applications

• Credit risk assessment


x: properties of customer and proposed purchase
f (x): approve purchase or not
• Disease diagnosis
x: properties of patient (symptoms, lab tests)
f (x): disease (or maybe, a recommended therapy)
• Face recognition
x: bitmap picture of person’s face
f (x): name of the person
• Automatic steering
x: bitmap picture of road surface in front of a car
f (x): degrees to turn the steering wheel

17/147
When is supervised learning appropriate?

1. ... when there is no human expert


• i.e., when nobody knows how to do it
• machine has a lot of power that human brains don’t possess
• Example: Given the bond graph for a new molecule, predict its binding strength to the AIDS
protease molecule; strong binding can prevent the protease from cleaving, hence preventing
maturation of HIV virions into their infectious form.
2. ... when humans can perform the task but are unable to describe how to do it
• i.e., we know how to do it, but don’t know how to program it.
• Example: Given a bitmap picture of hand-written character, predict the ASCII code of the
character.
• Example: learning how to ride a bike.
• Ideal for supervised learning: we can generate the supervision by having people do it;
e.g., in the car driving task, simply ask someone to drive and record both the road during
driving (video) and the degree the steering wheel turns at each moment.

18/147
When is supervised learning appropriate? (cont.)

3. ... when the desired function is changing frequently


• We know how to do it and can program it, but it is too expensive to program it because the
program has to change very frequently.
• Example: given information about stock prices and trades in the last 10 days, predict
good stock transactions to make.
4. ... when every user requires a customized function
• The target function doesn’t change (i.e., stable) for a user, but different users need
different functions.
• Example: given an incoming mail message, predict how important the mail is for presenting
it to the user (or deleting it without presenting).

19/147
Example

Suppose you are asked to predict an unknown Boolean function f (output y is 0 or 1)


from inputs consisting of 4 Boolean features (x1 , x2 , x3 , x4 ). That is, y = f (x1 , x2 , x3 , x4 ).
• The Boolean function f is called concept and our task here is concept learning. Input
x for which f (x) = 1 is called a positive instance, and if f (x) = 0, x is called a
negative instance.
• Concept is a special case of classifier. A classifier is a discrete-valued function and
the possible target values f (x) ∈ {1, . . . , K} are the classes or class labels.
• Given examples below, how would you induce an approximation of f so that when
new examples of (x1 , x2 , x3 , x4 ) come along in the future the approximation can also
give you the correct result according to the new examples?
Example  x1  x2  x3  x4  y
1        0   0   1   0   0
2        0   1   0   0   0
3        0   0   1   1   1
4        1   0   0   1   1
5        0   1   1   0   0
6        1   1   0   0   0
7        0   1   0   1   0

20/147
Hypothesis spaces

• Hypothesis space: the space of all hypotheses that can, in principle, be produced by a learning algorithm.
• How many Boolean functions are possible in total? 2^16 = 65536 (how?)
• Version space: the space of all hypotheses that have not yet been ruled out by a training example.
• If 7 examples are known, how many Boolean functions remain possible? Still 2^9 possibilities.
• In machine learning, we’re dealing with a double-exponential problem: if you have n features, you need to search for a correct function from 2^(2^n) possibilities.
• If we’re in a state of complete ignorance, i.e., we don’t know which function is correct until we know every possible input-output pair, can you do much better than random guessing?
• Lesson learned: bias-free induction (i.e., complete ignorance) has no rational basis to classify any unseen example.

x1  x2  x3  x4  y
0   0   0   0   ?
0   0   0   1   ?
0   0   1   0   0
0   0   1   1   1
0   1   0   0   0
0   1   0   1   0
0   1   1   0   0
0   1   1   1   ?
1   0   0   0   ?
1   0   0   1   1
1   0   1   0   ?
1   0   1   1   ?
1   1   0   0   0
1   1   0   1   ?
1   1   1   0   ?
1   1   1   1   ?
21/147
Hypothesis spaces

So, maybe introduce some bias by choosing a language for representation,
e.g., (conjunction over “true” features) ⇔ output.

• How many hypotheses, i.e., conjunctive rules, are possible? 16.

No hypothesis fits the whole of the given data!

Rule                          Counterexample no.
⇔ y                           3, 4
x1 ⇔ y                        3, 6
x2 ⇔ y                        2, 3, 4, 5, 6, 7
x3 ⇔ y                        1, 4, 5
x4 ⇔ y                        7
x1 ∧ x2 ⇔ y                   3, 4, 6
x1 ∧ x3 ⇔ y                   3, 4
x1 ∧ x4 ⇔ y                   3
x2 ∧ x3 ⇔ y                   3, 4, 5
x2 ∧ x4 ⇔ y                   3, 4, 7
x3 ∧ x4 ⇔ y                   4
x1 ∧ x2 ∧ x3 ⇔ y              3, 4
x1 ∧ x2 ∧ x4 ⇔ y              3, 4
x1 ∧ x3 ∧ x4 ⇔ y              3, 4
x2 ∧ x3 ∧ x4 ⇔ y              3, 4
x1 ∧ x2 ∧ x3 ∧ x4 ⇔ y         3, 4

22/147
Hypothesis spaces

Try m-of-n rules, e.g., for features {x1, x2, x3}:

• 1-of: if at least one of x1, x2, x3 is true then y is true, else y is false.
• 2-of: if at least two of x1, x2, x3 are true then y is true, else y is false.

How many hypotheses are there? 32.

• Aha! We found one hypothesis that fits all the given data.
• But, does that hypothesis fit the unseen examples (not in the given data)?

                     Counterexample no.
Variables            1-of           2-of      3-of  4-of
{x1}                 3,6            -         -     -
{x2}                 2,3,4,5,6,7    -         -     -
{x3}                 1,4,5          -         -     -
{x4}                 7              -         -     -
{x1,x2}              2,3,5,6,7      3,4,6     -     -
{x1,x3}              1,5,6          3,4       -     -
{x1,x4}              6,7            3         -     -
{x2,x3}              1,2,4,5,6,7    3,4,5     -     -
{x2,x4}              2,5,6,7        3,4,7     -     -
{x3,x4}              1,5,7          4         -     -
{x1,x2,x3}           1,2,5,6,7      3,4,5,6   3,4   -
{x1,x2,x4}           2,5,6,7        3,6,7     3,4   -
{x1,x3,x4}           1,5,6,7        ***       3,4   -
{x2,x3,x4}           1,2,5,6,7      4,5,7     3,4   -
{x1,x2,x3,x4}        1,2,5,6,7      5,6,7     3,4   3,4

(*** marks the hypothesis with no counterexamples: 2-of {x1, x3, x4}.)

23/147
Two observations about learning

• Learning = removal of the remaining uncertainty.


• If we “already” knew that the unknown function is some m-of-n function, then the training
examples can be used to infer which function it is.
• But, our prior knowledge could be wrong: what are the odds that some m-of-n function is
actually the unknown function?
• Learning requires guessing a good, small hypothesis class.
• Our guess could be wrong: good guess often depends on domain knowledge or intuition.
• Smaller hypothesis class: easier to learn, but less likely to contain the correct hypothesis.
• Start with a small hypothesis class, then enlarge it until it contains the correct hypothesis
(e.g., this is how decision tree learning works).

Note that the following hypotheses are also consistent with the training data:

• x4 ∧ 1-of{x1 , x3 } ⇒ y
• x4 ∧ ¬x2 ⇒ y

If either of these is the unknown function, while our guess is the m-of-n one, what will
happen when new x values are given?
24/147
So, what are the strategies?

• Develop languages for expressing prior knowledge


• Guiding criteria: which representation makes it easiest to encode prior knowledge?
• E.g., should we use rules or stochastic models?
• Develop flexible hypothesis spaces.
• E.g., decision trees learning

In either case, machine learning is about developing algorithms for finding a hypothesis
that fits the data.

25/147
Key issues

• What are good hypothesis spaces? Which ones have been useful in practice and
why?
• What algorithms can work in these spaces? Are there general design principles?
• How do we optimize accuracy on future/unseen data points? Also called the
overfitting problem. This distinguishes machine learning from the usual theory of
optimization.
• How can we have confidence in the results? The statistical question: how much
training data is needed to find accurate hypotheses?
• Are some learning problems computationally intractable? The computational
question: does my learning algorithm scale?
• How do we formulate application problems as machine learning problems?
(The engineering question)

26/147
Decision Trees
Example

• if there is party, go to party


• if there is no party and deadline
is urgent, then study
• if there is no party and no
deadline, then go to mall
• if there is no party and deadline
is near and you’re lazy, then
watch TV
• if there is no party and deadline
is near and you’re not lazy,
then study

Easy to understand (i.e., transparent), since it can be turned into a set of if-then rules

27/147
How to create decision tree: Basic principle

Basic principle (followed by most decision tree algorithms)


Build tree in a greedy manner, starting at the root, choosing the most informative feature
at each step.

In decision tree literature, features are often called (input) attributes. Example algorithms:

• Quinlan’s ID3
• C4.5
• Classification and regression trees (CART)

28/147
Quantifying information content: Entropy

• From Claude Shannon’s 1948 paper: “A Mathematical Theory of Communication”


• Information entropy describes the amount of impurity in a set of attributes.
• Given a random variable x, the amount of information received when a value of this
variable is observed can be seen as the “degree of surprise” on learning it.
• Suppose a sender wishes to transmit the value of random variable x to a receiver.
The entropy is the average amount of information transmitted, which is the
expectation w.r.t. p(x) (where we take p(x) log2 p(x) = 0 whenever p(x) = 0):

Entropy[x] = − Σ_x p(x) log2 p(x)

• Let S be a set of labeled examples whose labels come from the set
L = {a1, a2, ..., aC}; then the entropy of S is

Entropy(S) = − Σ_{x∈L} P(x) log2 P(x)

where P (x) is the proportion of S labeled as x.

29/147
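To make the formula concrete, here is a minimal Python sketch (not part of the original slides; the function name entropy and the use of collections.Counter are our own choices):

import math
from collections import Counter

def entropy(labels):
    # Entropy(S) = - sum over labels x of P(x) * log2(P(x)),
    # where P(x) is the proportion of S labeled as x.
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

# The weather data on the next slide has 9 "yes" and 5 "no" labels:
print(entropy(["yes"] * 9 + ["no"] * 5))  # ~0.94029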
Example

Given the following weather data S, decide whether to play outdoors.

outlook   temperature  humidity  windy  play
rainy     cool         normal    false  yes
rainy     cool         normal    true   no
overcast  cool         normal    true   yes
sunny     cool         normal    false  yes
rainy     mild         high      false  yes
sunny     mild         high      false  no
rainy     mild         normal    false  yes
sunny     mild         normal    true   yes
overcast  mild         high      true   yes
rainy     mild         high      true   no
sunny     hot          high      false  no
sunny     hot          high      true   no
overcast  hot          high      false  yes
overcast  hot          normal    false  yes

Entropy(S) = − (9/14) log2 (9/14) − (5/14) log2 (5/14) = 0.94029

30/147
From entropy to information gain

• Information gain: how much the entropy of the training set decreases if a particular
attribute is chosen for the next classification step.
• Let S be a set of examples, F be an attribute of S, v(F ) the set of possible values of
F , and Sf the subset of S containing all examples in S for which the value of F is f .
Also, |S| denotes the size of S and Entropy(S) the entropy of the set S.
• Average entropy of a split based on F w.r.t. a set S of examples:

I(S, F) = Σ_{f ∈ v(F)} (|S_f| / |S|) · Entropy(S_f)

• Information gain G(S, F) of S when attribute F is chosen is defined by

G(S, F) = Entropy(S) − I(S, F)

31/147
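A small Python sketch of I(S, F) and G(S, F), continuing from the entropy function above; representing examples as dicts and the attribute/label argument names are our own assumptions:

def information_gain(examples, attr, label):
    # G(S, F) = Entropy(S) - sum over f in v(F) of |S_f|/|S| * Entropy(S_f)
    n = len(examples)
    total = entropy([e[label] for e in examples])
    split = 0.0
    for f in set(e[attr] for e in examples):
        subset = [e[label] for e in examples if e[attr] == f]
        split += len(subset) / n * entropy(subset)  # weighted entropy I(S, F)
    return total - split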
The ID3 algorithm

Let I be a set of input attributes, C the set of all labels, and S the set of labeled examples.

Function ID3(I, C, S):

• If S is empty, return a single node with value “Failure”


• If all examples s ∈ S have the same label o, then return a single node with value o
• If I is empty, then return a single node with value the most frequent label in S.
• Compute information gain G(S, F ) for each attribute F ∈ I w.r.t. S
• Let F̂ be attribute with the largest gain
• Let f1 , . . . , fm be possible values for attribute F̂
• Let Sf with f ∈ {f1 , . . . , fm } be subsets of S when S is partitioned according to the
value of F̂ .
• Return a tree with root node labeled F̂ and edges labeled f1, ..., fm, where the edges
go to the m trees obtained from ID3(I − {F̂}, C, Sf1), ID3(I − {F̂}, C, Sf2), ...,
ID3(I − {F̂}, C, Sfm).

32/147
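A compact Python sketch along the lines of the pseudocode above (again our own naming; it reuses information_gain from the previous sketch, represents a tree as nested dicts {attribute: {value: subtree}}, and folds the label set C into the data):

from collections import Counter

def id3(attrs, examples, label):
    if not examples:
        return "Failure"
    labels = [e[label] for e in examples]
    if len(set(labels)) == 1:            # all examples share one label
        return labels[0]
    if not attrs:                        # no attributes left: majority label
        return Counter(labels).most_common(1)[0][0]
    # Choose the attribute F with the largest information gain.
    best = max(attrs, key=lambda a: information_gain(examples, a, label))
    rest = [a for a in attrs if a != best]
    return {best: {f: id3(rest, [e for e in examples if e[best] == f], label)
                   for f in set(e[best] for e in examples)}}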
Example

Construct a decision tree using the ID3 algorithm on the following weather data to decide whether to
play outdoors. Should we play outdoors if the outlook is overcast, the temperature is cool, the humidity
is high, and it is windy out there?

outlook   temperature  humidity  windy  play
rainy     cool         normal    false  yes
rainy     cool         normal    true   no
overcast  cool         normal    true   yes
sunny     cool         normal    false  yes
rainy     mild         high      false  yes
sunny     mild         high      false  no
rainy     mild         normal    false  yes
sunny     mild         normal    true   yes
overcast  mild         high      true   yes
rainy     mild         high      true   no
sunny     hot          high      false  no
sunny     hot          high      true   no
overcast  hot          high      false  yes
overcast  hot          normal    false  yes

33/147
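One way to run the sketch above on this table (the classify helper is hypothetical, not from the slides). Since all four overcast examples are labeled yes, the overcast branch of the induced tree is pure, so the query comes out yes:

rows = [("rainy","cool","normal","false","yes"), ("rainy","cool","normal","true","no"),
        ("overcast","cool","normal","true","yes"), ("sunny","cool","normal","false","yes"),
        ("rainy","mild","high","false","yes"), ("sunny","mild","high","false","no"),
        ("rainy","mild","normal","false","yes"), ("sunny","mild","normal","true","yes"),
        ("overcast","mild","high","true","yes"), ("rainy","mild","high","true","no"),
        ("sunny","hot","high","false","no"), ("sunny","hot","high","true","no"),
        ("overcast","hot","high","false","yes"), ("overcast","hot","normal","false","yes")]
cols = ["outlook", "temperature", "humidity", "windy", "play"]
data = [dict(zip(cols, r)) for r in rows]
tree = id3(cols[:-1], data, "play")

def classify(tree, x):
    # Walk the nested {attribute: {value: subtree}} dicts down to a leaf label.
    while isinstance(tree, dict):
        attr = next(iter(tree))
        tree = tree[attr][x[attr]]
    return tree

print(classify(tree, {"outlook": "overcast", "temperature": "cool",
                      "humidity": "high", "windy": "true"}))  # yes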
Issues in Decision Tree Learning

• Overfitting: pre-pruning or post-pruning of tree


• Random forest and XGBoost.
• Highly-branching attributes
• Numeric (continuous-valued) attributes
• Unknown attribute values

34/147
Linear Regression
Motivation: House pricing

Area (m²)  Price (1000 USD)   Area (m²)  Price (1000 USD)   Area (m²)  Price (1000 USD)
158.9 208.5 79.3 132 122.4 40
117.2 181.5 93.3 149 114.1 149.35
165.9 223.5 120.4 90 114.6 179.9
159.5 140 103.5 159 157.9 165.5
204.2 250 124.4 139 145.0 277.5
126.5 143 220.7 325.3 227.8 309
157.4 307 102.9 139.4 101.9 145
194.2 200 166.8 230 120.5 153
164.8 129.9 98.5 129.9 98.2 109
100.1 118 98.5 154 107.0 82
96.6 129.5 148.6 256.3 123.0 160
215.9 345 83.6 134.8 123.4 170
84.7 144 158.3 306 82.1 144
138.8 279.5 148.6 207.5 87.1 130.25
116.4 157 48.3 68.5 106.8 141

• Task: based on the above data, predict the house price given its size/area.

35/147
Regression problem

• Is the previous task a supervised learning problem?


• Yes, since we have (input, output) pairs as the given examples.
• Can we use a decision tree to learn the prediction?
• The target value can be any (real) number, hence the task is regression.
• The usual decision tree may be inappropriate.

36/147
Let’s look at the data ...

37/147
Hypothesis

Which representation gives a reasonable hypothesis? Straight lines? Which one?

38/147
Two hypotheses and errors at each example

Dashed/dotted lines represent the magnitude of error at each example.

39/147
Linear regression

• For n features, the hypotheses h are linear hyperplanes of dimension n within an (n + 1)-dimensional
space (if n = 1, it’s a straight line):

h(x1, ..., xn) = w0 + w1 x1 + · · · + wn xn

where the xi’s are the inputs, the wi’s are the parameters or weights of the hypothesis,
and w0 is called the intercept term.
• Linear regression aims to learn the parameters wi in the above linear form.
• Using vector notation (see later), let x = (x0, x1, ..., xn)^T with x0 = 1, and
w = (w0, w1, ..., wn)^T. Then, the hypothesis can be written as a matrix operation:

h_w(x) = Σ_{i=0}^{n} wi xi = w^T x

40/147
Gradient descent

• How do we learn w? Choose a linear hyperplane closest to all given data points
• Particular parameters w correspond to a particular linear hyperplane.
• Need a cost function J(w) of w that measures how close the linear hyperplane to all data,
i.e., J(w) is minimal when w makes the linear hyperplane closest to the data.
• We assume J is a paraboloid, convex function: with suitable axes, J always has a
global minimum.
• Finding linear hyperplane closest to given data points ≡ finding minima of J.

Left, middle: convex paraboloids. Right: non-convex paraboloid.


Image source: Ag2gaeh (https://commons.wikimedia.org/wiki/File:Parabol-el-zy-hy-s.svg), https://creativecommons.org/licenses/by-sa/4.0/legalcode

41/147
Deriving gradient descent rule

• For simplicity, let w = (w0, w1), i.e., J(w) = J(w0, w1) is a convex paraboloid in 3D whose
vertical axis is parallel to the z-axis.
• Let A = (w0, w1, J(w0, w1)) be any point on the paraboloid. Which direction should we move
from A so that we get a maximum decrease in J?
• Let O be the origin point and B = (w0′, w1′, J(w0′, w1′)) the destination of our move.
• Let P = (w0, w1) and Q = (w0′, w1′) be, resp., the projections of A and B on the xy-plane.
• The vector PQ→ is a projection of the vector of our move, and moving from A to B on the
paraboloid is equivalent to moving from P to Q in the xy-plane.
• The largest decrease of J happens in the direction of −∇J.
• Thus, we choose Q in the direction of −∇J:

OQ→ = OP→ + PQ→  ⇔  (w0′, w1′)^T = (w0, w1)^T − δ ∇J/‖∇J‖

where δ is a step size (convergence is guaranteed if δ is sufficiently small).
• −∇J/‖∇J‖ is the unit vector in the direction opposite to ∇J (we only care about the direction).
• PQ→ is a small shift in the direction of −∇J, hence a multiple of that unit vector.

42/147
Gradient descent rule

The previous equation embodies the gradient descent algorithm, often written:

w := w − α∇J(w)

or, in the case of two variables:

w0 := w0 − α ∂J(w0, w1)/∂w0
w1 := w1 − α ∂J(w0, w1)/∂w1

Here, α is called the learning rate and like δ, it is enough if we choose α to be sufficiently
small.

43/147
Gradient descent for linear regression

• Gradient descent for linear regression uses the sum of squares as the cost
function over the m training examples (x^(i), y^(i)), i = 1, ..., m:

J(w) = (1/2) Σ_{i=1}^{m} (h_w(x^(i)) − y^(i))²

• Why the quadratic form? It’s a convex paraboloid.
• Why multiply by 1/2? Just for convenience in the derivative operation later :-).
• The sum of squares cost function leads to the ordinary least squares regression model.
• The aim becomes finding w such that J(w) is minimal!

44/147
Concrete update rule for linear regression

Parameter update in gradient descent requires computing ∇J. We can derive a concrete
update rule such that the ∇J expression no longer appears.

Let m be the number of training examples. We compute each partial derivative
component of ∇J. For the component corresponding to weight wj:

∂J(w)/∂wj = ∂/∂wj (1/2) Σ_{i=1}^{m} (h_w(x^(i)) − y^(i))²
          = Σ_{i=1}^{m} ∂/∂wj (1/2) (h_w(x^(i)) − y^(i))²
          = Σ_{i=1}^{m} 2 · (1/2) (h_w(x^(i)) − y^(i)) · ∂/∂wj (h_w(x^(i)) − y^(i))     (by chain rule)
          = Σ_{i=1}^{m} (h_w(x^(i)) − y^(i)) · ∂/∂wj (h_w(x^(i)) − y^(i))
          = Σ_{i=1}^{m} (h_w(x^(i)) − y^(i)) · ∂/∂wj (Σ_{k=0}^{n} wk x_k^(i) − y^(i))
          = Σ_{i=1}^{m} (h_w(x^(i)) − y^(i)) x_j^(i)

46/147
Batch gradient descent

Applying the previous partial derivative to the gradient descent algorithm, we obtain for
each parameter wj:

wj := wj − α ∂J(w)/∂wj   ⇒   wj := wj + α Σ_{i=1}^{m} (y^(i) − h_w(x^(i))) x_j^(i)

The previous partial derivative allows us to derive the following batch gradient descent
algorithm:

1. Start with an initial value of w = (w0, ..., wn)^T.
2. Repeat until convergence:
   • For every j: wj := wj + α Σ_{i=1}^{m} (y^(i) − h_w(x^(i))) x_j^(i)

47/147
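A minimal NumPy sketch of batch gradient descent for least-squares linear regression (the function name, step size, iteration count, and toy data are our own choices):

import numpy as np

def batch_gradient_descent(X, y, alpha=0.01, iters=1000):
    # X is m x (n+1) with a leading column of ones (x0 = 1).
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        # Vectorized form of: for every j,
        #   w_j := w_j + alpha * sum_i (y^(i) - h_w(x^(i))) * x_j^(i)
        w += alpha * X.T @ (y - X @ w)
    return w

# Toy usage on data from the line y = 1 + 2x:
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])
print(batch_gradient_descent(X, y))  # approaches [1. 2.]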
Gradient descent contour

The ellipses are the contours of the quadratic function J. Also shown is the trajectory
taken by gradient descent initialized at (48, 30).
Image source: Andrew Ng, CS229 Lecture Notes for CS229 Fall 2018 course, Part I: Linear Regression, page 6.

48/147
Termination condition

Choices for termination condition (convergence):

• Fixed number of iterations: for a fixed step size and under certain conditions, if w^(k) is
w at the k-th iteration and w* is the optimal parameter vector, then J(w^(k)) − J(w*) ≤ ε
after O(1/ε) iterations.
• ‖w′ − w‖ (change in parameter values) or |J(w′) − J(w)| (change in error value) is
small enough, where w′ is w after a single update.
• ‖∇J(w)‖ is small enough: the closer we are to the minimum, the smaller the gradient is.

49/147
Stochastic/incremental gradient descent

• Batch gradient descent must scan through the entire training set before making a
single update.
• May be expensive if the size of training set is very large.
• Alternative: stochastic gradient descent that performs update to the parameters
according to the gradient of the error/loss at each single example.

1. Start with an initial value of w = (w0 , . . . , wn )T .


2. Loop until convergence:
   • Loop over every training example x^(i):
     • For every j: wj := wj + α (y^(i) − h_w(x^(i))) x_j^(i)

• SGD may get close to the minimum faster than batch gradient descent, but may never
converge to the minimum (the parameters keep oscillating around it).

50/147
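For comparison, a stochastic variant of the sketch above, updating after every single example (reusing numpy and the toy X, y from the previous sketch; the epoch count is our own choice):

def stochastic_gradient_descent(X, y, alpha=0.01, epochs=2000):
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            # Update from the gradient of the loss at a single example.
            w += alpha * (y_i - x_i @ w) * x_i
    return w

print(stochastic_gradient_descent(X, y))  # also approaches [1. 2.]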
Closed form solution to linear regression

• Besides using an iterative algorithm, we can find a closed-form solution for the
optimal weight w.
• Most regression problems (esp. nonlinear ones) don’t have a closed-form solution.
• Let X be the design matrix containing all training examples as its rows and y the
vector containing all target values from the training set:

X = [ (x^(1))^T ; (x^(2))^T ; ... ; (x^(m))^T ]        y = (y^(1), y^(2), ..., y^(m))^T

• If J(w) is the sum of squares cost function, then it can be shown that
∇J = X^T X w − X^T y, which is minimized when X^T X w = X^T y.
• Thus, the optimal w is w = (X^T X)^{−1} X^T y.

51/147
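The same toy problem solved in closed form (a sketch, reusing X and y from the gradient descent example; in practice a least-squares solver is preferred over forming the inverse explicitly):

# Normal equations: solve X^T X w = X^T y instead of inverting X^T X.
w_closed = np.linalg.solve(X.T @ X, X.T @ y)
# Equivalently: w_closed, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w_closed)  # exactly [1. 2.] for the toy data above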
Why not just use the closed-form solution?

• If possible, it’s better to use the closed-form solution.


• But, the closed-form solution may be impractical:
• Suppose X is a very large, sparse matrix (e.g., 10^6 examples with 10^5 features), but X^T X is
fairly dense.
• The dimension of X^T X is then 10^5 × 10^5, so we would have to store 10^10 floating-point
numbers (at 8 bytes per number, roughly 80 gigabytes of space), which may be impractical.
• Matrix operations in the closed-form solution may have numerical accuracy issues
(refer to the Numerical Analysis course).

52/147
Logistic Regression
Motivation

Let’s consider a classification problem with target y ∈ {0, 1}

• Linear regression algorithms can be used, but it is quite simple to construct examples
where the algorithm performs very poorly.
• Intuitively, it makes no sense to have target values larger than 1 or smaller than 0.
• Can we still use the same linear form, but ensure the output is neither below 0 nor
above 1?
• Idea: scale the output of the linear form, which is R, into the range [0, 1] using some
function (called the link function).
• Intuition: the weights/parameters transform the input into a point in an intermediate space
and the link function transform that point into a probability value that the input belongs to
the class (given by the label).

53/147
Logistic function

Consider the following continuous function, called the logistic function or sigmoid function:

g(z) = 1 / (1 + e^(−z)) = e^z / (e^z + 1)

What properties are satisfied by g(z)?

• 0 < g(z) < 1 for every z ∈ R. Why? Because e^z > 0 for all z ∈ R.
• g(z) is symmetric with respect to the point (0, 0.5). Why?
For every z ∈ R, g(−z) = e^(−z) / (e^(−z) + 1) = (e^(−z) + 1 − 1) / (e^(−z) + 1) = 1 − 1/(e^(−z) + 1) = 1 − g(z).
So, let h(z) = g(z) − 0.5. Then h(−z) = g(−z) − 0.5 = 1 − g(z) − 0.5 = 0.5 − g(z) = −(g(z) − 0.5) = −h(z),
i.e., h(z) is an odd function, which always has a rotational symmetry with respect to (0, 0). Thus,
g(z) = h(z) + 0.5 must have a rotational symmetry with respect to (0, 0.5).
• g(z) is strictly increasing. Why?
For z1, z2 ∈ R, if z1 < z2, then e^(−z1) > e^(−z2). Thus, g(z1) = 1/(1 + e^(−z1)) < 1/(1 + e^(−z2)) = g(z2).
• The derivative g′(z) of g(z) satisfies the following: g′(z) = g(z)(1 − g(z)). (Prove it!)

54/147
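A quick numerical check of the last property, g′(z) = g(z)(1 − g(z)), via a finite difference (our own sketch, not a proof):

import math

def g(z):
    return 1.0 / (1.0 + math.exp(-z))  # logistic function

z, eps = 0.7, 1e-6
print((g(z + eps) - g(z - eps)) / (2 * eps))  # ~0.2217
print(g(z) * (1 - g(z)))                      # ~0.2217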
Logistic function

Image source: Andrew Ng, CS229 Lecture Notes for CS229 Fall 2018 course, Part II: Classification and Logistic Regression, page 17.

55/147
Logistic regression

• Logistic function is a natural choice (Why? Read the literature on Generalized Linear
Model) for our classification setting (though other functions that smoothly increase
from 0 to 1 can also be used).
• With input x of n features (x0, x1, ..., xn), we have a hypothesis of the form:

h_w(x) = g(w^T x) = 1 / (1 + e^(−w^T x))
• The model parameter is w like in linear regression.
• Classification is then simply done by taking the threshold of 0.5, i.e., x is labeled 1 if
hw (x) > 0.5 and 0 otherwise.

56/147
Likelihood function for logistic regression

How do we learn w? Or which value of w is best supported by the available data, i.e., the
training examples?

• The data is given by the design matrix X (all training inputs as its rows) and the vector y (all
target values in the training set).
• For a particular value of w, the likelihood function LX,y (w) gives the probability (or probability
density) of the data X, y to occur when the parameter of the model is w.
• Our model represents the (probabilistic) relationship between the known input X and target y.
That is, the conditional distribution pw (y|X) of y given X, which is parameterized by w.
• Thus, the likelihood of w given the data: LX,y (w) = pw (y, X) = pw (y|X) where the last
equality is because in our model the input (for which X is a sample) is always known.

Our aim is thus finding the value of w that maximizes the conditional probability above,
i.e., maximizes the likelihood.

• To realize it, we first need to make some assumption about pw (y|X).

57/147
Maximizing likelihood for logistic regression

For logistic regression, we assume that the result of prediction hw (x) is the expected value
of the target variable y given the data X = x, which follows the conditional distribution
pw (y|x). Since the target value can only be 1 or 0 (binary classification), we write:

p_w(y=1|x) = h_w(x)        p_w(y=0|x) = 1 − h_w(x)

or more compactly as a single (mass) function:

p_w(y|x) = (h_w(x))^y (1 − h_w(x))^(1−y)

• This corresponds to our earlier intuition that w brings x to wT x and then the logistic
function brings wT x to the appropriate probability value.

58/147
Maximizing likelihood for logistic regression (2)

• Next, we assume that X and y give m training examples (x^(i), y^(i)), each generated
independently.
• Then, the likelihood L_{X,y}(w) corresponds to the joint distribution over all training
examples (x^(i), y^(i)):

L_{X,y}(w) = Π_{i=1}^{m} p_w(y^(i) | x^(i)) = Π_{i=1}^{m} (h_w(x^(i)))^(y^(i)) (1 − h_w(x^(i)))^(1−y^(i))

• Maximizing the likelihood is equivalent to maximizing the log-likelihood ℓ(w), whose
form is easier for computing derivatives (the log is the natural logarithm):

ℓ(w) = log L(w) = Σ_{i=1}^{m} [ y^(i) log h_w(x^(i)) + (1 − y^(i)) log(1 − h_w(x^(i))) ]

59/147
Maximizing likelihood for logistic regression (3)

For maximization, we employ gradient ascent instead of gradient descent:

w := w + α ∇ℓ(w)

We compute the partial derivative for each weight wj:

∂ℓ(w)/∂wj = ∂/∂wj Σ_{i=1}^{m} [ y^(i) log h_w(x^(i)) + (1 − y^(i)) log(1 − h_w(x^(i))) ]
          = Σ_{i=1}^{m} ∂/∂wj [ y^(i) log h_w(x^(i)) + (1 − y^(i)) log(1 − h_w(x^(i))) ]
          = Σ_{i=1}^{m} ( y^(i)/h_w(x^(i)) − (1 − y^(i))/(1 − h_w(x^(i))) ) · ∂h_w(x^(i))/∂wj     (because d/dx log x = 1/x)
          = Σ_{i=1}^{m} ( y^(i)/g(w^T x^(i)) − (1 − y^(i))/(1 − g(w^T x^(i))) ) · ∂g(w^T x^(i))/∂wj

60/147
Maximizing likelihood for logistic regression (4)

Since g is the logistic function, using the chain rule we obtain:

∂ℓ(w)/∂wj = Σ_{i=1}^{m} ( y^(i)/g(w^T x^(i)) − (1 − y^(i))/(1 − g(w^T x^(i))) ) · g(w^T x^(i))(1 − g(w^T x^(i))) · ∂(w^T x^(i))/∂wj
          = Σ_{i=1}^{m} ( y^(i)(1 − g(w^T x^(i))) − (1 − y^(i)) g(w^T x^(i)) ) x_j^(i)
          = Σ_{i=1}^{m} ( y^(i) − g(w^T x^(i)) ) x_j^(i)
          = Σ_{i=1}^{m} ( y^(i) − h_w(x^(i)) ) x_j^(i)

61/147
Batch & stochastic gradient ascent for logistic regression

The gradient ascent update turns out to be the same as the gradient descent update in
linear regression, except that h_w is no longer linear in x^(i). This is actually not a
coincidence; for a deeper discussion, read up on the Generalized Linear Model.

wj := wj + α Σ_{i=1}^{m} (y^(i) − h_w(x^(i))) x_j^(i)

Thus, analogous to linear regression, we have two versions of gradient ascent.

Batch gradient ascent from an initial value of w = (w0, ..., wn)^T:

• Repeat until convergence:
  • For every j: wj := wj + α Σ_{i=1}^{m} (y^(i) − h_w(x^(i))) x_j^(i)

Stochastic gradient ascent from an initial value of w = (w0, ..., wn)^T:

• Loop until convergence:
  • Loop over every training example x^(i):
    • For every j: wj := wj + α (y^(i) − h_w(x^(i))) x_j^(i)
62/147
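A minimal NumPy sketch of batch gradient ascent for logistic regression (the names, step size, and toy data are our own choices):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_regression(X, y, alpha=0.1, iters=1000):
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        # Vectorized form of: for every j,
        #   w_j := w_j + alpha * sum_i (y^(i) - h_w(x^(i))) * x_j^(i)
        w += alpha * X.T @ (y - sigmoid(X @ w))
    return w

# Toy usage: 1D inputs (with intercept column x0 = 1), labels split around x = 2.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 3.0], [1.0, 4.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
w = logistic_regression(X, y)
print(sigmoid(X @ w) > 0.5)  # [False False  True  True]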
Naive Bayes Classifier
Roles of Bayesian Method

• Provides practical learning algorithms


• Naive Bayes learning
• Bayesian belief network learning
• Combine prior knowledge (prior probabilities) with observed data
• Requires prior probabilities

• Provides useful conceptual framework for evaluating other learning algorithms


• maximum likelihood and least-squared error hypotheses (e.g., in linear regression)
• maximum likelihood for predicting probabilities (e.g., in logistic regression)

63/147
Bayes Theorem

$$P(h \mid D) = \frac{P(D \mid h)\, P(h)}{P(D)}$$

• P(h) = prior probability of hypothesis h
• P(D) = prior probability of training data D
• P(h|D) = probability of h given D
• P(D|h) = probability of D given h

P(h|D):

• increases as P(D|h) and P(h) increase
• decreases as P(D) increases: the more probable it is that D is observed independently of h,

64/147
Naive Bayes Classifier

Along with decision trees, neural networks, nearest neighbor, one of the most practical
learning methods.

When to use:

• Moderate or large training set available


• Attributes that describe instances are conditionally independent given classification

Successful applications:

• Diagnosis
• Classifying text documents

65/147
Naive Bayes Classifier

Assume target function $f : X \to V$, where each instance x is described by attributes
$\langle a_1, a_2, \ldots, a_n \rangle$.

The most probable value of f(x) is:

$$
\begin{aligned}
v_{MAP} &= \arg\max_{v_j \in V} P(v_j \mid a_1, a_2, \ldots, a_n) \\
        &= \arg\max_{v_j \in V} \frac{P(a_1, a_2, \ldots, a_n \mid v_j)\, P(v_j)}{P(a_1, a_2, \ldots, a_n)} \\
        &= \arg\max_{v_j \in V} P(a_1, a_2, \ldots, a_n \mid v_j)\, P(v_j)
\end{aligned}
$$

Naive Bayes assumption:

$$P(a_1, a_2, \ldots, a_n \mid v_j) = \prod_i P(a_i \mid v_j)$$

which gives the

Naive Bayes classifier: $v_{NB} = \arg\max_{v_j \in V} P(v_j) \prod_i P(a_i \mid v_j)$

66/147
Naive Bayes Algorithm

Naive_Bayes_Learn(examples):

  For each target value vj:
    P̂(vj) ← estimate P(vj)

  For each attribute value ai of each attribute a:
    P̂(ai|vj) ← estimate P(ai|vj)

Classify_New_Instance(x):

$$v_{NB} = \arg\max_{v_j \in V} \hat{P}(v_j) \prod_{a_i \in x} \hat{P}(a_i \mid v_j)$$

67/147
Example

Determine what to do in the evening if you have deadlines looming (but none particularly
urgent), there is no party currently going on, and you are currently feeling lazy, based on
the following data.

Deadline? Is there a party? Lazy? Activity


Urgent Yes Yes Party
Urgent No Yes Study
Near Yes Yes Party
None Yes No Party
None No Yes Go to mall
None Yes No Party
Near No No Study
Near No Yes Watch TV
Near Yes Yes Party
Urgent No No Study

68/147
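As a sketch of how the classifier resolves this exercise, the following Python snippet (all
names are our own) estimates the probabilities by simple counting and evaluates $v_{NB}$ for
the query (Near, No, Yes); it predicts watching TV:

from collections import Counter, defaultdict

data = [  # (deadline, party, lazy) -> activity
    ("urgent", "yes", "yes", "party"), ("urgent", "no", "yes", "study"),
    ("near", "yes", "yes", "party"),   ("none", "yes", "no", "party"),
    ("none", "no", "yes", "mall"),     ("none", "yes", "no", "party"),
    ("near", "no", "no", "study"),     ("near", "no", "yes", "tv"),
    ("near", "yes", "yes", "party"),   ("urgent", "no", "no", "study"),
]

# Estimate P(v_j) and P(a_i | v_j) by relative frequencies.
class_counts = Counter(row[-1] for row in data)
cond_counts = defaultdict(Counter)  # cond_counts[(attr_index, value)][class]
for *attrs, label in data:
    for i, a in enumerate(attrs):
        cond_counts[(i, a)][label] += 1

def v_nb(query):
    scores = {}
    for v, n in class_counts.items():
        p = n / len(data)                        # P(v_j)
        for i, a in enumerate(query):
            p *= cond_counts[(i, a)][v] / n      # P(a_i | v_j)
        scores[v] = p
    return max(scores, key=scores.get), scores

print(v_nb(("near", "no", "yes")))  # -> ('tv', ...): watch TV

Note how "Party" scores zero here because no party is going on yet every Party example
has Party = yes; the next slide discusses this zero-count problem.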
Naive Bayes: Subtleties

1. The conditional independence assumption is often violated

   $$P(a_1, a_2, \ldots, a_n \mid v_j) = \prod_i P(a_i \mid v_j)$$

   ...but it works surprisingly well anyway.

2. What if none of the training instances with target value vj have attribute value ai?
   Then
   $$\hat{P}(a_i \mid v_j) = 0, \quad \text{and} \quad \hat{P}(v_j) \prod_i \hat{P}(a_i \mid v_j) = 0$$

   The typical solution is a Bayesian estimate for $\hat{P}(a_i \mid v_j)$:

   $$\hat{P}(a_i \mid v_j) \leftarrow \frac{n_c + mp}{n + m}$$

   • n is the number of training examples for which v = vj,
   • nc is the number of examples for which v = vj and a = ai,
   • p is the prior estimate for $\hat{P}(a_i \mid v_j)$, typically a uniform prior, i.e., p = 1/k for k possible
     values of the attribute,
   • m is the weight given to the prior (i.e., the number of “virtual” examples).

69/147
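A one-line sketch of this m-estimate (the helper name is hypothetical):

def m_estimate(n_c, n, p, m):
    """Bayesian (m-)estimate of P(a_i | v_j): blends observed frequency n_c/n with prior p."""
    return (n_c + m * p) / (n + m)

# e.g., an attribute with k = 3 possible values (uniform prior p = 1/3), m = 3 virtual examples
print(m_estimate(n_c=0, n=5, p=1/3, m=3))  # 0.125 instead of 0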
Support Vector Machine
Motivation

Consider logistic regression with $h_w(x) = g(w^T x)$, with g the logistic function.

• The prediction $h_w(x)$ models the probability $p_w(y{=}1 \mid x)$.
• We predict “1” for x iff $h_w(x) > 0.5$, i.e., $w^T x > 0$.
• For a positive training example (y = 1), the larger $w^T x$ is, the larger is the probability
  $p_w(y{=}1 \mid x)$, hence the higher the degree of “confidence” that the label is 1.
• w is a good fit for the training set if $w^T x^{(i)} \gg 0$ whenever $y^{(i)} = 1$ and $w^T x^{(i)} \ll 0$
  whenever $y^{(i)} = 0$.
  • $\gg$ is read “much greater than” and $\ll$ is read “much less than”.

70/147
Motivation (2)

• Class × vs ◦ (data with 2 features).


• Linear decision boundary (the line) satisfies
wT x = 0
• A is very far from decision boundary: very
high confidence that A is ×
• C is very close to boundary: predicting that
C is × has low confidence; small change to
w can change it to ◦
• B is in between the above two cases.
• The farther the point from the decision
boundary, the more confident our prediction
is.
• Aim: find decision boundary such that all
predictions are correct and confident on the
training examples.
71/147
Notation

We consider linear classifiers (those with a linear decision boundary) for the binary classification
problem.

• Input features x with n dimensions
• Labels t ∈ {−1, 1} (instead of {0, 1})
• Parameter w with the intercept term b, i.e., $w = [w_1, \ldots, w_n]^T$.
• We drop the extra constant feature $x_0 = 1$, i.e., $x = [x_1, \ldots, x_n]^T$.
• Hypotheses of the form:
  $$y_{w,b}(x) = g(w^T x + b) = \begin{cases} 1 & \text{if } w^T x + b > 0, \\ -1 & \text{otherwise} \end{cases}$$
• Here, we directly predict a “1” or “-1”, instead of employing the logistic regression
  approach of estimating the probability that y = 1.

72/147
Which classification line is better than the others?

Which decision boundary is better than the others? Why? (MLAP, Fig. 8.1.)

73/147
Margin: Optimal separation

• Measure the distance from the line before hitting a datapoint: imagine a “(wide)
  street” around the line.
• Any point on the street is too close to the line to be accurately classified.
• Symmetric about the line: a region in 2D, a cuboid in 3D, a hyperrectangle for D > 3.
• The width of this street around a separating hyperplane is called the margin.

Which of the three examples of separating hyperplanes in the previous slide seems to
capture the maximum margin, i.e., the optimal separating hyperplane?

MLAP, Fig. 8.2.

74/147
Support vectors

• Assuming linearly separable data, the classifier with the largest margin is called the
maximum margin (linear) classifier.
• Datapoints of each class that lie closest to the classification line: support vectors.
• Maximum margin classifier is one that goes through the middle of the “street”
• margin should be as large as possible;
• support vectors are most useful datapoints since they are the ones that might be
misclassified
• once maximum margin is found, classification of a test point only depends on the support
vectors.
• Support vector machine (SVM) [Vapnik, 1992] performs binary classification by
finding a maximum margin classifier in:
• the original input space if the data is linearly separable
• a higher dimensional space to which data is transformed (using kernel functions) if the
data is not linearly separable.

75/147
Linear decision boundary

Assume data is linearly separable into two classes (regions $R_{y>0}$ and $R_{y<0}$). The decision
boundary is a linear hyperplane, $y = y(x) = w^T x + b = 0$, where w is the weight vector of the
features and b is the intercept term. E.g., for 2 features: the line $w_1 x_1 + w_2 x_2 + b = 0$.

76/147
Distance from decision hyperplane to any point (1)

w is orthogonal to every vector lying on the decision hyperplane y = 0. Why?

• For any two points A and B on the decision hyperplane, let $x_A$, $x_B$ be the
  vectors from the origin to A and B, resp.
• Vector $\overrightarrow{AB} = x_B - x_A$ lies on the decision hyperplane.
• So, $y(x_A) = 0$ and $y(x_B) = 0$ since A and B are on the decision hyperplane.
• Inner product of w and $x_B - x_A$:
  $$
  \begin{aligned}
  \langle w, x_B - x_A \rangle &= w^T (x_B - x_A) \\
  &= w^T x_B - w^T x_A \\
  &= w^T x_B + b - (w^T x_A + b) \\
  &= y(x_B) - y(x_A) = 0
  \end{aligned}
  $$
• Thus, w is orthogonal to $x_B - x_A$.

PRML, Fig. 4.1.

78/147
Distance from decision hyperplane to any point (2)

We compute the distance from any point x to the decision hyperplane.

• Let $x_\perp$ be the projection of x onto the decision hyperplane.
• The distance vector from $x_\perp$ to x is $x - x_\perp$.
• $x - x_\perp$ is parallel to w.
• So, $x - x_\perp = r \frac{w}{\|w\|}$ where
  • $\frac{w}{\|w\|}$ is the unit vector along w,
  • r is the signed distance from the decision hyperplane ($x_\perp$) to x.
• So we can write:
  $$x = x_\perp + (x - x_\perp) = x_\perp + r \frac{w}{\|w\|}$$

PRML, Fig. 4.1.

79/147
Distance from decision hyperplane to any point (3)

What’s the relation between y(x) and r?

$$
\begin{aligned}
y(x) = w^T x + b &= w^T \left( x_\perp + r \frac{w}{\|w\|} \right) + b \qquad \text{(*)} \\
&= w^T x_\perp + b + r \frac{w^T w}{\|w\|} \\
&= y(x_\perp) + r \frac{\|w\|^2}{\|w\|} \\
&= r \|w\| \qquad \text{(**)}
\end{aligned}
$$

(*) due to the result from the previous slide
(**) since $y(x_\perp) = w^T x_\perp + b = 0$ as $x_\perp$ lies on the decision hyperplane

PRML, Fig. 4.1.

80/147
Distance from decision hyperplane to any point (4)

• The signed distance from the decision hyperplane to x is $r = \frac{y(x)}{\|w\|}$.
• x is in the region $R_{y>0}$ iff r > 0
• x is in the region $R_{y<0}$ iff r < 0
• Note: the signed distance from x to the decision hyperplane is −r.
• In particular, if x is the origin, its signed distance to the decision hyperplane is
  $-\frac{y(x)}{\|w\|} = -\frac{b}{\|w\|}$, which is negative when the origin is in $R_{y>0}$ and positive when it
  is in $R_{y<0}$.

PRML, Fig. 4.1.

81/147
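A quick NumPy sketch of this distance computation (the hyperplane and the points are
made-up values):

import numpy as np

def signed_distance(w, b, x):
    """Signed distance r = y(x) / ||w|| from the hyperplane w^T x + b = 0 to x."""
    return (w @ x + b) / np.linalg.norm(w)

# Hypothetical 2D hyperplane x1 + x2 - 1 = 0
w, b = np.array([1.0, 1.0]), -1.0
print(signed_distance(w, b, np.array([1.0, 1.0])))  # positive side:  1/sqrt(2)
print(signed_distance(w, b, np.array([0.0, 0.0])))  # the origin:    -1/sqrt(2) = -b/||w||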
Maximum margin solution: Optimization (1)

• Recall that for every input $x^{(i)}$ we have a target $t^{(i)} \in \{-1, 1\}$.
• If data is linearly separable, there is at least one choice of w and b such that $y(x^{(i)}) > 0$
  whenever $t^{(i)} = +1$ and $y(x^{(i)}) < 0$ whenever $t^{(i)} = -1$. Thus, for every point $x^{(i)}$,
  $t^{(i)} y(x^{(i)}) > 0$.
• The unsigned distance between a point $x^{(i)}$ and the decision hyperplane is:
  $$\frac{t^{(i)} y(x^{(i)})}{\|w\|} = \frac{t^{(i)} (w^T x^{(i)} + b)}{\|w\|}$$
  The margin is the unsigned distance between the decision hyperplane and the closest points to it.
• We want to maximize the margin, thus we solve
  $$\arg\max_{w,b} \left\{ \min_i \frac{t^{(i)} (w^T x^{(i)} + b)}{\|w\|} \right\} = \arg\max_{w,b} \frac{1}{\|w\|} \min_i \left[ t^{(i)} (w^T x^{(i)} + b) \right]$$
  with $\frac{1}{\|w\|}$ outside the optimization over i since it is independent of i.

82/147
Maximum margin solution: Optimization (2)

• Suppose the “inner” minimization gives a solution $\min_i [t^{(i)} (w^T x^{(i)} + b)] = \hat{\gamma}$. Then the
  maximum margin solution can be written:
  $$\arg\max_{w,b} \frac{\hat{\gamma}}{\|w\|} \quad \text{subject to } t^{(i)} (w^T x^{(i)} + b) \geq \hat{\gamma}, \text{ for } i = 1, \ldots, m$$
  where m is the number of data points.
• Moreover, if we change w to cw and b to cb for some constant c,
  $$\frac{t^{(i)} ((cw)^T x^{(i)} + cb)}{\|cw\|} = \frac{c\, t^{(i)} (w^T x^{(i)} + b)}{c \|w\|} = \frac{t^{(i)} (w^T x^{(i)} + b)}{\|w\|}$$
  the distance of $x^{(i)}$ to the decision hyperplane remains the same.
• So, we can restate the maximum margin solution by “scaling” w and b by a factor of $1/\hat{\gamma}$:
  $$\arg\max_{w,b} \frac{1}{\|w\|} \quad \text{subject to } t^{(i)} (w^T x^{(i)} + b) \geq 1, \text{ for } i = 1, \ldots, m$$
• The above is a constrained optimization problem, but unfortunately not convex, because $\frac{1}{\|w\|}$ is
  not convex.

83/147
(Figure: graph of $1/\|w\|$ with $w = (w_1, w_2)^T$.)
Maximum margin solution: Optimization (3)

• Maximizing $\frac{1}{\|w\|}$ is equivalent to minimizing $\|w\|^2$, which is convex.
• Margin maximization thus reduces to solving a minimization problem called a quadratic
  programming problem with linear constraints:
  $$\arg\min_{w,b} \frac{1}{2} \|w\|^2 \quad \text{subject to } t^{(i)} (w^T x^{(i)} + b) \geq 1, \text{ for all } i = 1, \ldots, m$$
• The expression $\frac{1}{2}\|w\|^2$ is the objective function to be optimized (minimized or
  maximized), while $t^{(i)} (w^T x^{(i)} + b) \geq 1$, $i = 1, \ldots, m$ are the constraints, which
  determine the feasible set C containing all possible values of the variables w and b
  that satisfy all of the constraints.
• b is implicit in the objective function, since changes to $\|w\|$ need to be compensated by
  changes to b.
• This particular problem is convex, since both the objective and constraint functions
  are convex. In convex optimization, a local optimum is also a global optimum.

85/147
Maximum margin solution: Optimization (4)

• Our previous minimization problem can be rewritten into a convex optimization
  standard form:
  $$\arg\min_{w,b} \frac{1}{2} \|w\|^2 \quad \text{subject to } -t^{(i)} (w^T x^{(i)} + b) + 1 \leq 0, \text{ for all } i = 1, \ldots, m$$
• We employ a technique using Lagrange multipliers, commonly used to solve convex
  constrained optimization problems.
  • The Lagrangian allows us to incorporate b, which only appears in the constraints, into the
    objective function.
  • As a bonus, this also allows us to obtain an efficient way of working in a higher dimension.

86/147
Brief intro on Lagrangian (1)

Consider a (primal) optimization problem without equality constraints in a standard form:

$$\min_{x \in \mathbb{R}^n} f(x) \quad \text{subject to } g_i(x) \leq 0 \text{ for } i = 1, \ldots, m$$

The Lagrangian combines the objective function and constraints into a single objective function:

$$L(x, \lambda) = f(x) + \sum_{i=1}^m \lambda_i g_i(x)$$

where $\lambda = (\lambda_1, \ldots, \lambda_m)^T \in \mathbb{R}^m$ is a vector of Lagrange multipliers, and $\lambda_i \geq 0$ for
$i = 1, \ldots, m$.

87/147
Brief intro on Lagrangian (2)

• For each feasible x (those that satisfy all constraints), L(x, λ) is a lower bound of f(x) for all λ:
  • The 2nd term of the Lagrangian satisfies $\sum_{i=1}^m \lambda_i g_i(x) \leq 0$ due to the fact that for all i,
    $\lambda_i \geq 0$ and $g_i(x) \leq 0$.
  • Thus, $f(x) \geq f(x) + \sum_{i=1}^m \lambda_i g_i(x) = L(x, \lambda)$.
• The greater $\sum_{i=1}^m \lambda_i g_i(x)$ (i.e., the closer to 0), the closer L(x, λ) is to f(x). So we can write
  $$\max_\lambda L(x, \lambda) = \begin{cases} f(x) & \text{if } g_i(x) \leq 0 \text{ for } i = 1, \ldots, m \\ \infty & \text{otherwise} \end{cases}$$
  Note that the max operator is over λ, not x. Also, the 2nd case is for x that is not feasible: we
  penalize L(x, λ) by defining it to be very large (hence it won’t be returned as a minimum over x).
• The primal minimization problem can then be written in terms of the Lagrangian as:
  $$\min_x f(x) = \min_x \max_\lambda L(x, \lambda) = \min_x \max_\lambda f(x) + \sum_{i=1}^m \lambda_i g_i(x) \quad \text{subject to } \lambda_i \geq 0,\ i = 1, \ldots, m$$

88/147
Brief intro on Lagrangian (3)

• Computing $\min_x \max_\lambda L(x, \lambda)$ may be hard. What if we compute $\max_\lambda \min_x L(x, \lambda)$
  instead?
• We define the function $\ell(\lambda) = \min_x L(x, \lambda)$, called the Lagrange dual function. Hence,
  we focus on solving the Lagrange dual problem:
  $$\max_\lambda \min_x L(x, \lambda) = \max_\lambda \ell(\lambda) \quad \text{subject to } \lambda_i \geq 0,\ i = 1, \ldots, m$$
• The solution of the above corresponds to the tightest/greatest lower bound of f(x).
• What is the relation between $\min_x \max_\lambda L(x, \lambda)$ and $\max_\lambda \min_x L(x, \lambda)$?
  $$
  \begin{aligned}
  \ell(\lambda) = \min_x L(x, \lambda) &\leq L(x, \lambda) \quad \text{(for all feasible } x, \lambda\text{)} \\
  \Rightarrow\ \max_\lambda \ell(\lambda) = \max_\lambda \min_x L(x, \lambda) &\leq \max_\lambda L(x, \lambda) \quad \text{(for all feasible } x\text{)} \\
  \Rightarrow\ \max_\lambda \ell(\lambda) = \max_\lambda \min_x L(x, \lambda) &\leq \min_x \max_\lambda L(x, \lambda) = \min_x f(x)
  \end{aligned}
  $$

89/147
• The problem: $\min f(x) = x^3 + 3x^2 - x + 1$ subject to $-1 \leq x \leq 2$, or equivalently in
  standard form, subject to $(x + 1)(x - 2) \leq 0$.
• Lagrangian: $L(x, \lambda) = x^3 + 3x^2 - x + 1 + \lambda(x^2 - x - 2)$, which will be below f(x) in the
  feasible region $-1 \leq x \leq 2$ when $\lambda \geq 0$.
• The corresponding dual function ℓ(λ) never exceeds min f(x) whenever x is feasible.

• Our inequality maxλ `(λ) 6 minx f (x) is called weak duality, which holds even when f
is non-convex.
• The difference minx f (x) − maxλ `(λ) > 0 gives the duality gap between the optimal
value of the primal problem and the best/greatest lower bound on it that can be
obtained from the dual function.
• In general, the gap may be positive/nonzero. However, under certain conditions, the
gap is zero, i.e., maxλ `(λ) = minx f (x), which implies that a solution to the dual
problem becomes exactly a solution to the primal problem.

91/147
Brief intro on Lagrangian (5)

For a primal optimization problem of minimizing a not necessarily convex f(x) subject to
$g_i(x) \leq 0$ for $i = 1, \ldots, m$, assume that the $g_i$’s are all differentiable.

Let ℓ(λ) be the dual function for the above primal problem. Further, let $x^*$, $\lambda^*$ be
such that $f(x^*)$ is minimal and $\ell(\lambda^*)$ is maximal, resp. That is, $x^*$ is primal optimal and
$\lambda^*$ is dual optimal.

Theorem
If $x^*$ and $\lambda^*$ are primal and dual optimal with zero duality gap, then all of the following
conditions, called the Karush-Kuhn-Tucker (KKT) conditions, hold:

1. $g_i(x^*) \leq 0$ for $i = 1, \ldots, m$
2. $\lambda_i^* \geq 0$ for $i = 1, \ldots, m$
3. $\lambda_i^* g_i(x^*) = 0$ for $i = 1, \ldots, m$
4. $\nabla_x f(x^*) + \sum_{i=1}^m \lambda_i^* \nabla_x g_i(x^*) = 0$

That is, the KKT conditions are necessary conditions for the primal and dual optimal points to
give minima (of the primal problem) and maxima (of the dual problem) at the same value,
i.e., $\min f(x) = \max \ell(\lambda)$.

92/147
Brief intro on Lagrangian (6)

Moreover, the following also holds:

Theorem
If f and all $g_i$ are convex, and x, λ are any points that satisfy the KKT conditions, then x
and λ are primal and dual optimal with zero duality gap.

That is, the KKT conditions are also sufficient conditions for the primal and dual optimal points
to give minima and maxima at the same value.

• The last theorem means that to solve the primal problem, we can simply solve the
  dual problem and ensure that the KKT conditions are satisfied.

93/147
Dual representation (1)

Let’s return to our original minimization problem for SVM:

$$\arg\min_{w,b} \frac{1}{2} \|w\|^2 \quad \text{subject to } t^{(i)} (w^T x^{(i)} + b) \geq 1, \text{ for all } i = 1, \ldots, m$$

which corresponds to the standard form:

$$\min_{w,b} \frac{1}{2} \|w\|^2 \quad \text{subject to } -t^{(i)} (w^T x^{(i)} + b) + 1 \leq 0, \text{ for all } i = 1, \ldots, m$$

The Lagrangian function:

$$L(w, b, \lambda) = \frac{1}{2} \|w\|^2 - \sum_{i=1}^m \lambda_i [t^{(i)} (w^T x^{(i)} + b) - 1]$$

where $\lambda = (\lambda_1, \ldots, \lambda_m)^T$ gives m Lagrange multipliers $\lambda_i \geq 0$.

Note that the objective and constraints are all convex. Our aim is to ensure that the KKT conditions
are satisfied. Trivially, the first and second KKT conditions are satisfied.

94/147
Dual representation (2)

Next, we minimize L(w, b, λ) over w, b, i.e., we compute its derivatives w.r.t. w and b. Any solution for
w and b obtained by setting the derivatives to zero gives us w and b that satisfy the fourth KKT
condition.

Setting the derivative of L w.r.t. w to 0:

$$0 = \nabla_w L(w, b, \lambda) = \nabla_w \left[ \frac{1}{2} \|w\|^2 - \sum_{i=1}^m \lambda_i [t^{(i)} (w^T x^{(i)} + b) - 1] \right] = w - \sum_{i=1}^m \lambda_i t^{(i)} x^{(i)}$$

Thus, $w = \sum_{i=1}^m \lambda_i t^{(i)} x^{(i)}$.

Setting the derivative of L w.r.t. b to 0:

$$0 = \frac{\partial}{\partial b} L(w, b, \lambda) = \frac{\partial}{\partial b} \left[ \frac{1}{2} \|w\|^2 - \sum_{i=1}^m \lambda_i [t^{(i)} (w^T x^{(i)} + b) - 1] \right] = -\sum_{i=1}^m \lambda_i t^{(i)}$$

Thus, $\sum_{i=1}^m \lambda_i t^{(i)} = 0$.

95/147
Dual representation (3)

$$\|w\|^2 = w^T w = \sum_{j=1}^m \lambda_j t^{(j)} (x^{(j)})^T \sum_{i=1}^m \lambda_i t^{(i)} x^{(i)} = \sum_{1 \leq i,j \leq m} \lambda_j \lambda_i t^{(j)} t^{(i)} (x^{(j)})^T x^{(i)}$$

$$w^T x^{(i)} = \sum_{j=1}^m \lambda_j t^{(j)} (x^{(j)})^T x^{(i)}$$

We substitute $w = \sum_{i=1}^m \lambda_i t^{(i)} x^{(i)}$ and $\sum_{i=1}^m \lambda_i t^{(i)} = 0$ into the Lagrangian function to obtain the dual form:

$$
\begin{aligned}
W(\lambda) = L(w, b, \lambda) &= \frac{1}{2} \|w\|^2 - \sum_{i=1}^m \lambda_i [t^{(i)} (w^T x^{(i)} + b) - 1]
  = \frac{1}{2} \|w\|^2 + \sum_{i=1}^m \lambda_i - \sum_{i=1}^m \lambda_i t^{(i)} w^T x^{(i)} - b \sum_{i=1}^m \lambda_i t^{(i)} \\
&= \sum_{i=1}^m \lambda_i + \frac{1}{2} \sum_{1 \leq i,j \leq m} \lambda_j \lambda_i t^{(j)} t^{(i)} (x^{(j)})^T x^{(i)} - \sum_{i=1}^m \lambda_i t^{(i)} \sum_{j=1}^m \lambda_j t^{(j)} (x^{(j)})^T x^{(i)} \\
&= \sum_{i=1}^m \lambda_i + \frac{1}{2} \sum_{1 \leq i,j \leq m} \lambda_j \lambda_i t^{(j)} t^{(i)} (x^{(j)})^T x^{(i)} - \sum_{1 \leq i,j \leq m} \lambda_j \lambda_i t^{(j)} t^{(i)} (x^{(j)})^T x^{(i)} \\
&= \sum_{i=1}^m \lambda_i - \frac{1}{2} \sum_{1 \leq i,j \leq m} \lambda_j \lambda_i t^{(j)} t^{(i)} (x^{(j)})^T x^{(i)}
\end{aligned}
$$

96/147
Dual representation (4)

The dual form is then maximized (to get the tightest lower bound of the original objective
function), giving us the following dual optimization problem:

$$
\begin{aligned}
\max_\lambda W(\lambda) &= \sum_{i=1}^m \lambda_i - \frac{1}{2} \sum_{1 \leq i,j \leq m} \lambda_i \lambda_j t^{(i)} t^{(j)} (x^{(i)})^T x^{(j)} \\
&= \sum_{i=1}^m \lambda_i - \frac{1}{2} \sum_{1 \leq i,j \leq m} \lambda_i \lambda_j t^{(i)} t^{(j)} \langle x^{(i)}, x^{(j)} \rangle \\
&\text{subject to } \lambda_i \geq 0,\ i = 1, \ldots, m \ \text{ and } \ \sum_{i=1}^m \lambda_i t^{(i)} = 0
\end{aligned}
$$

where we express the objective function using the inner product, which will be useful when
working with kernel functions later.

97/147
Satisfaction of KKT conditions

Our primal optimization problem satisfies the KKT conditions for all points $x^{(i)}$, $i = 1, \ldots, m$,
when $w = \sum_{i=1}^m \lambda_i t^{(i)} x^{(i)}$ and $\sum_{i=1}^m \lambda_i t^{(i)} = 0$:

• KKT condition (1): $g_i(x) = -t^{(i)} (w^T x^{(i)} + b) + 1 \leq 0$ for $i = 1, \ldots, m$ by definition of
  the primal problem.
• KKT condition (2): $\lambda_i \geq 0$ for $i = 1, \ldots, m$ by definition of the primal problem.
• KKT condition (3): Exercise!
• KKT condition (4): The solution for w above is obtained exactly when
  $\nabla_{w,b} \frac{1}{2} \|w\|^2 + \sum_{i=1}^m \lambda_i \nabla_{w,b} [-t^{(i)} (w^T x^{(i)} + b) + 1] = 0$

98/147
Support vectors

Because w, b, and λ satisfy the third KKT condition, $\lambda_i (1 - t^{(i)} (w^T x^{(i)} + b)) = 0$, it follows that for
every data point $x^{(i)}$, $i = 1, \ldots, m$, either

• $\lambda_i = 0$; or
• $t^{(i)} (w^T x^{(i)} + b) = t^{(i)} y(x^{(i)}) = 1$.

Recall our original hypothesis form: $y(x) = w^T x + b$

• Solving the dual problem gives us λ and, from it, $w = \sum_{i=1}^m \lambda_i t^{(i)} x^{(i)}$, which is optimal for the
  primal problem since the KKT conditions are satisfied.
• So, for any x, $y(x) = w^T x + b = \left( \sum_{i=1}^m \lambda_i t^{(i)} x^{(i)} \right)^T x + b = \sum_{i=1}^m \lambda_i t^{(i)} (x^{(i)})^T x + b$
• Points $x^{(i)}$ for which $\lambda_i = 0$ will not appear in the sum.
• The remaining data points $x^{(k)}$ must satisfy $t^{(k)} (w^T x^{(k)} + b) = t^{(k)} y(x^{(k)}) = 1$ and will appear in
  the sum. Such points are the support vectors, which lie closest to the decision hyperplane
  because they satisfy $t^{(i)} y(x^{(i)}) = 1$.
• After the SVM model is trained, only the support vectors need to be kept; the rest are
  discarded, and the hypothesis for any x becomes $y(x) = \sum_{k \in S} \lambda_k t^{(k)} (x^{(k)})^T x + b$ where S
  is the set of indices of the support vectors.

99/147
The intercept term

• Let S be the set of indices of the support vectors.
• Each support vector $x^{(r)}$ satisfies $t^{(r)} y(x^{(r)}) = t^{(r)} \left( b + \sum_{k \in S} \lambda_k t^{(k)} (x^{(k)})^T x^{(r)} \right) = 1$
• We can solve for b above using any arbitrarily chosen support vector, but a more numerically
  stable solution is obtained as follows:
  $$
  \begin{aligned}
  & t^{(r)} \left( b + \sum_{k \in S} \lambda_k t^{(k)} (x^{(k)})^T x^{(r)} \right) = 1 \\
  \Leftrightarrow\ & (t^{(r)})^2 \left( b + \sum_{k \in S} \lambda_k t^{(k)} (x^{(k)})^T x^{(r)} \right) = t^{(r)} \\
  \Leftrightarrow\ & b + \sum_{k \in S} \lambda_k t^{(k)} (x^{(k)})^T x^{(r)} = t^{(r)} \qquad \text{because } (t^{(r)})^2 = 1 \text{ for all } r \in S \\
  \Leftrightarrow\ & \sum_{r \in S} \left( b + \sum_{k \in S} \lambda_k t^{(k)} (x^{(k)})^T x^{(r)} \right) = |S|\, b + \sum_{r \in S} \sum_{k \in S} \lambda_k t^{(k)} (x^{(k)})^T x^{(r)} = \sum_{r \in S} t^{(r)} \\
  \Leftrightarrow\ & b = \frac{1}{|S|} \sum_{r \in S} \left( t^{(r)} - \sum_{k \in S} \lambda_k t^{(k)} (x^{(k)})^T x^{(r)} \right)
  \end{aligned}
  $$

100/147
SVM: How to classify a new point with a trained model

Given a new datapoint x, the trained model (i.e., after w, b, and the support vectors are found)
can classify x by evaluating the sign of y(x), given by

$$y(x) = w^T x + b = b + \sum_{i=1}^m \lambda_i t^{(i)} (x^{(i)})^T x = b + \sum_{i=1}^m \lambda_i t^{(i)} \langle x^{(i)}, x \rangle = b + \sum_{k \in S} \lambda_k t^{(k)} \langle x^{(k)}, x \rangle$$

where S is the set of indices of the support vectors.

101/147
SVM algorithm (linearly separable case)

Initialization: Assume the data are the column vectors $x^{(i)}$ of X and let $K = X^T X$. Then $K_{i,j}$ contains
the result of the inner product $\langle x^{(i)}, x^{(j)} \rangle$.

Training

• Assemble matrices from the dual optimization and pass them to a quadratic programming
  solver:
  $$\arg\max_\lambda W(\lambda) = \sum_{i=1}^m \lambda_i - \frac{1}{2} \sum_{1 \leq i,j \leq m} \lambda_i \lambda_j t^{(i)} t^{(j)} \langle x^{(i)}, x^{(j)} \rangle$$
  $$\text{subject to } \lambda_i \geq 0,\ i = 1, \ldots, m \ \text{ and } \ \sum_{i=1}^m \lambda_i t^{(i)} = 0$$
  The right sum can be written as a matrix operation, while the $\lambda_i$’s and $t^{(i)}$’s can be written as
  vectors (how?).
• Identify the support vectors.
• Compute the optimal value of b.

102/147
SVM algorithm (linearly separable case) (cont.)

Test/prediction

• For the given test data z, use the support vectors to classify the data:
  • Compute $y(z) = b + \sum_{k \in S} \lambda_k t^{(k)} \langle x^{(k)}, z \rangle$ where the indices are for the support vectors
    only.
  • Return the label corresponding to the sign of the above expression.

103/147
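Below is a minimal sketch of the training step using NumPy and the cvxopt QP solver
(assuming linearly separable data; function names and thresholds are our own). It builds
the matrices of the dual problem, solves for λ, and recovers w, b, and the support vectors:

import numpy as np
from cvxopt import matrix, solvers

def train_hard_margin_svm(X, t):
    """Solve the dual as the QP: min (1/2) lam^T P lam - 1^T lam
    s.t. lam >= 0 and t^T lam = 0, where P_ij = t_i t_j <x_i, x_j>.
    X: (m, n), rows are datapoints; t: (m,) labels in {-1, +1}.
    """
    m = X.shape[0]
    K = X @ X.T                                   # Gram matrix of inner products
    P = matrix(np.outer(t, t) * K)
    q = matrix(-np.ones(m))
    G = matrix(-np.eye(m))                        # -lam_i <= 0, i.e. lam_i >= 0
    h = matrix(np.zeros(m))
    A = matrix(t.reshape(1, -1).astype(float))    # sum_i lam_i t_i = 0
    b = matrix(np.zeros(1))
    lam = np.ravel(solvers.qp(P, q, G, h, A, b)['x'])
    S = lam > 1e-6                                # support vectors: lam_k > 0
    w = ((lam[S] * t[S])[:, None] * X[S]).sum(axis=0)
    b_opt = np.mean(t[S] - X[S] @ w)              # numerically stable intercept
    return w, b_opt, S

# Prediction for a test point z is then the sign of w @ z + b_opt.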
What if data is not linearly separable?

If data is not linearly separable, then the constraints of the primal problem cannot be
satisfied.

Two (not mutually exclusive) ways to deal with this problem.

• Relax the (hard) constraints into a soft one using slack variables. [Read Bishop’s
PRML, Section 7.1]
• Transform the data into a higher dimensional space using kernel functions (with the
hope that the data becomes linearly separable in the chosen higher dimensional
space).

We focus on the second case.

104/147
Kernel functions for SVM

Recall the hypothesis is of the form:

$$y(x) = b + w^T x = b + \sum_{i=1}^m \lambda_i t^{(i)} \langle x^{(i)}, x \rangle = b + \sum_{k \in S} \lambda_k t^{(k)} \langle x^{(k)}, x \rangle$$

where S contains the indices of the support vectors only.

If we wish to transform the data into a higher dimensional space, we wish to retain the
inner product in the hypothesis, namely the hypothesis should be of the form:

$$y(x) = b + \sum_{k \in S} \lambda_k t^{(k)} \langle \phi(x^{(k)}), \phi(x) \rangle$$

where φ is the transformation function that takes x to a higher dimensional space.

105/147
Kernel functions for SVM

Let n be the dimension of the input x. Then the following is an example of φ, called a
polynomial kernel of degree 2:

$$\phi(x) = \phi(x_1, \ldots, x_n) = \left(1,\ x_1\sqrt{2}, \ldots, x_n\sqrt{2},\ x_1^2, \ldots, x_n^2,\ x_1 x_2 \sqrt{2},\ x_1 x_3 \sqrt{2}, \ldots, x_i x_j \sqrt{2}, \ldots, x_{n-1} x_n \sqrt{2}\right)^T$$

with $1 \leq i < j \leq n$.

• If n = 1, $\phi(x) = (1,\ x_1\sqrt{2},\ x_1^2)^T$
• If n = 3, $\phi(x) = (1,\ x_1\sqrt{2},\ x_2\sqrt{2},\ x_3\sqrt{2},\ x_1^2,\ x_2^2,\ x_3^2,\ x_1 x_2 \sqrt{2},\ x_1 x_3 \sqrt{2},\ x_2 x_3 \sqrt{2})^T$

Unfortunately, φ(x) has $1 + 2n + \frac{n(n-1)}{2} = O(n^2)$ elements, which makes computing
$\langle \phi(x^{(i)}), \phi(x^{(j)}) \rangle$ expensive. Is there a way to lessen the cost?

106/147
Kernel trick

Consider a polynomial kernel function φ of degree 2 applied to input of n dimensions.
Note that

$$
\begin{aligned}
\langle \phi(x), \phi(z) \rangle
&= \left(1,\ x_1\sqrt{2}, \ldots, x_n\sqrt{2},\ x_1^2, \ldots, x_n^2,\ x_1 x_2 \sqrt{2},\ x_1 x_3 \sqrt{2}, \ldots, x_{n-1} x_n \sqrt{2}\right)^T \\
&\quad \cdot \left(1,\ z_1\sqrt{2}, \ldots, z_n\sqrt{2},\ z_1^2, \ldots, z_n^2,\ z_1 z_2 \sqrt{2},\ z_1 z_3 \sqrt{2}, \ldots, z_{n-1} z_n \sqrt{2}\right)^T \\
&= 1 + 2 \sum_{i=1}^n x_i z_i + \sum_{i=1}^n x_i^2 z_i^2 + 2 \sum_{1 \leq i < j \leq n} x_i x_j z_i z_j \\
&= (1 + x^T z)^2 = (1 + \langle x, z \rangle)^2
\end{aligned}
$$

• So, the inner product in the higher dimensional space need not actually be computed
  in the higher dimensional space!
• Simply compute the inner product in the original input space, and perform the
  transformation afterwards.
• This is called the kernel trick.

107/147
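A small numeric check of this identity (a sketch; phi explicitly builds the degree-2 feature
map from the previous slide, and the data is random):

import numpy as np
from itertools import combinations

def phi(x):
    """Explicit degree-2 polynomial feature map."""
    n = len(x)
    cross = [x[i] * x[j] * np.sqrt(2) for i, j in combinations(range(n), 2)]
    return np.concatenate(([1.0], x * np.sqrt(2), x ** 2, cross))

rng = np.random.default_rng(0)
x, z = rng.normal(size=3), rng.normal(size=3)

lhs = phi(x) @ phi(z)        # inner product in the transformed space
rhs = (1 + x @ z) ** 2       # kernel trick: computed in the original space
print(np.isclose(lhs, rhs))  # True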
Examples of kernels

• Linear kernel: $k(x, y) = x^T y$ or $k(x, y) = 1 + x^T y$
• Polynomial kernel of degree k: $k(x, y) = (c + x^T y)^k$ with c > 0.
• Sigmoid kernel with parameters κ and δ: $k(x, y) = \tanh(\kappa x^T y - \delta)$
• Gaussian or radial basis function (RBF) kernel with parameter σ:
  $k(x, y) = \exp(-\|x - y\|^2 / 2\sigma^2)$, where $\|x - y\|^2$ is the squared Euclidean distance
  between x and y. The resulting feature space is infinite-dimensional.
• and many others...

Which one to choose? Experiment!

The prediction/testing for x uses:

$$y(x) = b + \sum_{k \in S} \lambda_k t^{(k)} \langle \phi(x^{(k)}), \phi(x) \rangle = b + \sum_{k \in S} \lambda_k t^{(k)} k(x^{(k)}, x)$$

108/147
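As an illustration, a sketch of a kernelized prediction function with the RBF kernel (all
names are ours; lam, t_sv, and X_sv would come from training on the dual problem):

import numpy as np

def rbf(x, y, sigma=1.0):
    # Gaussian / RBF kernel: exp(-||x - y||^2 / (2 sigma^2))
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))

def svm_predict(x, b, lam, t_sv, X_sv, kernel=rbf):
    # y(x) = b + sum_{k in S} lam_k t_k k(x_k, x); the class is sign(y(x))
    return b + sum(l * tk * kernel(xk, x) for l, tk, xk in zip(lam, t_sv, X_sv))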
Example

From PRML Fig. 7.2. The support vectors are the circled points.

109/147
Multiclass SVMs

SVM is a binary classifier, so we need different approaches for K > 2 classes:

• (Vapnik, 1998) One-versus-the-rest approach. Construct K separate SVMs:
  • The kth SVM model yk(x) is trained using data from class Ck as positive examples and
    the rest as negative examples.
  • Prediction of a new datapoint x uses $y(x) = \max_k y_k(x)$ to overcome the inconsistent results
    of an input being assigned to multiple classes simultaneously.
  • Problem 1: different classifiers are trained on different tasks; there is no guarantee that the
    real-valued yk(x) have appropriate scales across different classifiers.
  • Problem 2: imbalanced training sets (lost symmetry).
  • Proposed variant (Lee et al. 2001): modify target values to +1 for the positive class and
    −1/(K − 1) for the negative.

110/147
Multiclass SVMs (cont.)

Ambiguities in multiclass classification (PRML, Fig. 4.2)

111/147
Multiclass SVMs (cont.)

• (Weston and Watkins, 1999): Define a single objective function for training K SVMs
  simultaneously, based on maximizing the margin from each class to the remaining classes.
  • Slower training: instead of the O(KN²) cost of solving K optimization problems over N data
    points, we solve a single optimization problem of size (K − 1)N with cost O(K²N²).
• One-versus-one approach. Train K(K − 1)/2 binary SVMs on all possible pairs of
  classes and classify datapoints according to which class has the highest number of
  “votes”.
  • Problem 1: ambiguities in the resulting classification.
  • For large K, more training time than the one-versus-the-rest approach.
  • More computation is required for evaluating test points; this can be somewhat alleviated by
    arranging the pairwise classifiers into a directed acyclic graph (DAGSVM) with K(K − 1)/2
    classifiers, where only K − 1 pairwise classifiers are evaluated when testing points.

112/147
Nearest Neighbor Methods
Motivation

• Methods employing probability distributions (e.g., mixtures of Gaussians) are called the
  parametric approach to density modeling.
  • The forms of the distributions are governed by parameters whose values are determined from a dataset.
  • The chosen parameter values (and thus the density) may be a poor model of the actual data
    distribution, hence poor predictive performance.
• Alternative: nonparametric approaches.
  • Make only a few assumptions about the form of the distribution (instead of setting parameter
    values).
  • Examples: histograms, kernel estimators, nearest neighbors.

113/147
Histogram approach

• Let x be a continuous variable.
• A histogram partitions x into distinct bins of width ∆i and counts the number of
  observations ni of x falling in bin i.
• To normalize the probability density, divide by the total number N of observations and by
  the width ∆i, i.e.,
  $$p_i = \frac{n_i}{N \Delta_i}$$
  Note: $\int p(x)\,dx = 1$, so this captures a density p(x) that is constant over the width of each
  bin, and often the widths are chosen to be the same, $\Delta_i = \Delta$, for all bins.

114/147
Histogram approach: Lesson learned

(Figure: three histogram density estimates with different widths ∆; the green curve is the
actual distribution from which the data was drawn.)

• Small ∆: the density model is very spiky, with a lot of structure that is actually not
  present in the underlying distribution that generated the data set.
• Large ∆: the density model is too smooth and thus fails to capture the bimodal property
  of the green curve.
• Best results: some intermediate value of ∆.
• The histogram also depends on the choice of edge locations of the bins, though the
  influence is typically much less significant than ∆.
• Once the histogram is computed, the data can be discarded.
• The histogram is easy to apply if data arrives sequentially.
• The estimated density has discontinuities due to bin edges (not from the data).
• Not scalable: $M^D$ bins for D-dimensional data (curse of dimensionality).

115/147
Histogram approach: Lesson learned (cont.)

• Locality: Estimating the probability density at a particular location needs to consider data
  points lying within some local neighborhood of the point.
• Locality requires a distance measure, e.g., Euclidean, etc.
• The smoothing parameter (i.e., bin width) can be neither too large nor too small.

116/147
General form of density estimate

• Observations are drawn from some unknown probability density p(x) in some
  D-dimensional space.
• Aim: estimate the value of p(x).
• Consider some small region R containing x. The probability mass for this region:
  $$P = \int_R p(x)\,dx$$
• Suppose we are sampling N observations of p(x), where P is the probability that a
  sampled point lies in R.
• The total number K of points that lie inside R follows a binomial distribution:
  $$\text{Bin}(K \mid N, P) = \frac{N!}{K!(N - K)!} P^K (1 - P)^{N-K}$$
  where $E[K] = NP$ and $\text{var}[K] = NP(1 - P)$.

117/147
General form of density estimate (cont.)

• For large N, the binomial distribution peaks sharply around the mean, i.e., $K \simeq NP$.
• But if R is assumed to be so small that p(x) is roughly constant in it, then
  $P \simeq p(x) V$, where V is the (spherical) volume of R.
• So, for sufficiently large N, the density estimate is:
  $$p(x) = \frac{K}{NV}$$
• Hence, estimating density can be approached in two ways:
  • Fixing K and determining V from data → K-nearest neighbor
  • Fixing V and determining K from data → kernel estimator

118/147
Nearest neighbor methods

• We estimate p(x) by considering a small sphere V centered on point x and letting the
radius of the sphere to grow until it covers K data points.
• K governs degree of smoothing (see figure): there is an optimum K.
• Model produced by K-nearest neighbor is not a true density model (integral diverges).

119/147
K-nearest neighbor for classification

• Let Nk be the number of points of class Ck, with $N = \sum_k N_k$.
• To classify a point x, draw a sphere centered on x containing precisely K points
  (irrespective of their class).
• The sphere's form depends on the distance measure used, e.g., Euclidean distance,
  Mahalanobis, etc.
• Suppose the sphere has volume V and contains Kk points from class Ck. Then the
  density estimate of each class:
  $$p(x \mid C_k) = \frac{K_k}{N_k V}$$
• Unconditional density:
  $$p(x) = \frac{K}{NV}$$
• Class priors:
  $$p(C_k) = \frac{N_k}{N}$$

120/147
K-nearest neighbor for classification (cont.)

• Using Bayes' theorem, the posterior probability of class membership is:
  $$p(C_k \mid x) = \frac{p(x \mid C_k)\, p(C_k)}{p(x)} = \frac{K_k}{K}$$
• To minimize the probability of misclassification:

  K-nearest neighbor rule
  Assign x to the class having the largest posterior probability, i.e., the one with the largest
  Kk.

• Ties can be broken at random.

121/147
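A minimal sketch of this rule with Euclidean distance (toy data is ours; note that ties here
fall to the first class seen rather than being broken at random):

import numpy as np
from collections import Counter

def knn_classify(X_train, t_train, x, K=3):
    """Assign x to the majority class among its K nearest neighbors."""
    dists = np.linalg.norm(X_train - x, axis=1)   # O(ND) distance computation
    nearest = np.argsort(dists)[:K]               # indices of the K closest points
    votes = Counter(t_train[nearest])
    return votes.most_common(1)[0][0]

# Toy usage (illustrative data only)
X = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], dtype=float)
t = np.array([0, 0, 0, 1, 1, 1])
print(knn_classify(X, t, np.array([4.5, 5.0])))   # -> 1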
Distance measure

Distance between two points $x = (x_1, \ldots, x_n)$ and $y = (y_1, \ldots, y_n)$:

• Euclidean distance, straight-line distance, L2 distance:
  $$d(x, y) = \|y - x\|_2 = \sqrt{\sum_{i=1}^n (y_i - x_i)^2}$$
• Manhattan distance, city block distance, rectilinear distance, L1 distance:
  $$d(x, y) = \|y - x\|_1 = \sum_{i=1}^n |y_i - x_i|$$
• Minkowski distance, Lp distance:
  $$d(x, y) = L_p(x, y) = \left( \sum_{i=1}^n |y_i - x_i|^p \right)^{1/p}$$
• Mahalanobis distance, if x and y are from the same distribution with covariance matrix S:
  $$d(x, y) = \sqrt{(x - y)^T S^{-1} (x - y)}$$

122/147
Decision boundary

(a) New point (black diamond) classified according to majority class membership of 3
nearest neighbor.
(b) Decision boundary of 1-nearest neighbor is composed of hyperplanes forming
perpendicular bisector of pairs of points from different classes

123/147
Example plot

• Small K: many small regions of each class.
• Large K: few large regions of each class.
• For K = 1, as N → ∞, the error rate is at most 2× the minimum achievable error rate of the optimal
  classifier (one that uses the true distributions).
• Expensive: requires computing the distance to the entire training set; if computing a single distance needs
  O(D) time, then computing the distance of a point to all points in the training set is O(ND). If we don’t store
  the distances for each k = 1, . . . , K, we obtain O(NDK) complexity. If we precompute the distances, we get
  O(ND + NK) complexity. Using efficient data structures (such as a KD-tree) or median computation
  algorithms can reduce the complexity further.

124/147
Dimensionality Reduction:
Principal Component Analysis
Dimensionality reduction: Why?

• When plotting and inspecting data, can never go beyond 3D, and 2D is easier to
interpret.
• Curse of dimensionality: more dimension → more training data.
• Dimensionality is explicit factor in computational cost.
• Can remove noise.
• Significantly improve performance of learning.
• Make dataset easier to work with.
• Make results easier to understand.

125/147
Dimensionality reduction: How?

• Feature selection: looking through available features and seeing if they are actually
useful (i.e., correlated to output variables).
• Even when using neural network (which was to avoid “getting your hands dirty”), results
can be much improved with feature selection.
• Feature derivation: deriving new features from the old ones through applying some
transform(s) to the dataset that simply change the axes (moving/rotating).
• Allows combining features and identifying useful ones.
• Clustering: group similar datapoints and see if this allows fewer features to be used.

126/147
Principal Component Analysis (PCA)

(Figure: the original data and the axes found by PCA.)

• Invented by Karl Pearson (1901) as an analogue of the principal axis theorem in
  mechanics. Independently developed by Harold Hotelling (1930s).
  • Aliases: Karhunen-Loeve transform (signal processing), Hotelling transform (multivariate
    quality control), proper orthogonal decomposition (mechanical engineering), etc.
• Unsupervised, since labels in the data are not needed.
• Intuition: fit an n-dimensional ellipsoid to the data.
• Principal components: the axes of the ellipsoid, each a direction in the data with the largest
  variation.
• If the variance along an axis is small, the axis can be ignored (with only a small loss).

127/147
Intuitive reading of PCA algorithm

• Intuition:
• Center the data by subtracting off the mean.
• Choose direction with the largest variation
• Place an axis in that direction.
• Look at the remaining variation and find another axis that is orthogonal to the first axis and
covers as much of the remaining variation as possible.
• Iterate until no possible axes found.
• End result: all variation is along the axes, the covariance matrix is diagonal (each new
variable is uncorrelated with every variable except itself).
• Some axes found last have very little variation and can be removed without affecting
variability of data.

128/147
Formal derivation

Let X be the data matrix (with datapoints as rows).

• We wish to rotate X with a rotation matrix $P^T$, i.e., to obtain Y with $P^T X = Y$, where P is
  chosen such that cov(Y) is diagonal (why?):
  $$\text{cov}(Y) = \text{cov}(P^T X) = \begin{pmatrix} \lambda_1 & 0 & 0 & \cdots & 0 \\ 0 & \lambda_2 & 0 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & \lambda_N \end{pmatrix}$$
• Then $\text{cov}(Y) = E[Y Y^T] = E[(P^T X)(P^T X)^T] = E[(P^T X)(X^T P)] = P^T E[X X^T] P = P^T \text{cov}(X) P$,
  because E[P] = P as it is independent of the data.
• Thus, $P\,\text{cov}(Y) = P P^T \text{cov}(X) P = \text{cov}(X) P$, because for every rotation matrix $P^T = P^{-1}$
  (inverting a rotation = rotating the same amount in the opposite direction).
• Writing P as a set of column vectors, i.e., $P = [p_1, \ldots, p_N]$, we have
  $P\,\text{cov}(Y) = [\lambda_1 p_1, \ldots, \lambda_N p_N]$.
• Let Z = cov(X). Then, for each i, $\lambda_i p_i = Z p_i$. Thus, the matrix P is found so that for the
  directions given by P, Z just rescales them. These directions are special and called
  eigenvectors, and the scale amounts are called eigenvalues.

129/147
Eigenvectors and eigenvalues

Definition
Given an M × M matrix A, an eigenvector of A is a column vector v such that $Av = \lambda v$
for some scalar λ. The equation just mentioned is called the eigenvalue equation for the
matrix A, and the scalar λ is the corresponding eigenvalue.

Example: since
$$\begin{pmatrix} 2 & 3 \\ 2 & 1 \end{pmatrix} \begin{pmatrix} 3 \\ 2 \end{pmatrix} = \begin{pmatrix} 12 \\ 8 \end{pmatrix} = 4 \times \begin{pmatrix} 3 \\ 2 \end{pmatrix}$$
the vector $\begin{pmatrix} 3 \\ 2 \end{pmatrix}$ is an eigenvector of $\begin{pmatrix} 2 & 3 \\ 2 & 1 \end{pmatrix}$ with 4 as the corresponding eigenvalue.

Eigenvalues and eigenvectors appear in many fields of science, e.g., classical and
quantum mechanics, geology, epidemiology, and any science that uses principal component
analysis (bioinformatics, data mining, chemistry, psychology, image and signal
processing).

130/147
Eigenvectors and eigenvalues (cont.)

The eigenvalue equation of A can be written $(A - \lambda I)v = 0$, where I is the identity matrix.

• The equation has a nonzero solution in v if and only if the determinant $|A - \lambda I| = 0$.
• An eigenvalue is thus a root of the polynomial of degree n induced by the determinant
  expression. This polynomial is called the characteristic polynomial of A.

For a 2×2 matrix $\begin{pmatrix} a & b \\ c & d \end{pmatrix}$, the determinant is $ad - bc$.

If $A = \begin{pmatrix} 2 & 3 \\ 2 & 1 \end{pmatrix}$, then $A - \lambda I = \begin{pmatrix} 2 - \lambda & 3 \\ 2 & 1 - \lambda \end{pmatrix}$

and the characteristic polynomial is $(2 - \lambda)(1 - \lambda) - 6 = \lambda^2 - 3\lambda - 4$, which can be
factored into $(\lambda - 4)(\lambda + 1)$, giving 4 and −1 as the eigenvalues of the matrix.

131/147
Eigenvectors and eigenvalues (cont.)

To find the eigenvector, we substitute the eigenvalue back into the eigenvalue equation:
$$\begin{pmatrix} 2 & 3 \\ 2 & 1 \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} = 4 \begin{pmatrix} x \\ y \end{pmatrix}$$
giving us two linear equations, $2x + 3y = 4x$ and $2x + y = 4y$. Solving these equations yields
$x = \frac{3}{2} y$. In fact, there is an infinite number of solutions of the form $x = 3t$ and $y = 2t$ for
any real number t. So, there are infinitely many eigenvectors for the eigenvalue λ = 4,
including $(3, 2)^T$.

In general, computing eigenvalues (and thus eigenvectors) must be done by approximate
numerical methods.

132/147
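A quick check of this example with an off-the-shelf numerical routine (numpy.linalg.eig):

import numpy as np

A = np.array([[2.0, 3.0],
              [2.0, 1.0]])

# numpy solves the eigenvalue equation A v = lambda v numerically
eigvals, eigvecs = np.linalg.eig(A)
print(eigvals)                     # approximately [ 4. -1.]

# Columns of eigvecs are unit-length eigenvectors; take the one for lambda = 4
v = eigvecs[:, np.argmax(eigvals)]
print(v / v[1] * 2)                # rescaled: proportional to (3, 2)^T
print(np.allclose(A @ v, 4 * v))   # True: A v = 4 v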
Properties of eigenvalues and eigenvectors

• Eigenvectors are found only on square matrices, but not all square matrices have
eigenvectors.
• An N × N matrix can have at least d and at most N linearly independent eigenvectors
where d is the number of distinct eigenvalues of the matrix. Note that d may be
smaller than N .
• Each eigenvalue λ of a matrix has at least one corresponding eigenvector.
• Eigenvectors with distinct eigenvalues are linearly independent.

133/147
Eigendecomposition

Every N × N matrix with N linearly independent eigenvectors qi , i = 1, . . . , N , can be


decomposed/factorized as follows:

$$A = E D E^{-1}$$

where
• E is an N × N matrix whose i-th column is the eigenvector qi of A
• D a diagonal matrix whose diagonal elements Dii = λi is the eigenvalue
corresponding to qi .
The eigendecomposition above can also be written: E −1 AE = D. Also, the eigenvectors
can be chosen so that they are normalized to have length one.

Special case: for every real symmetric matrix, the eigenvalues are real numbers and the
eigenvectors can be chosen so that they are mutually orthogonal.
• The decomposition becomes A = EDE T with E orthogonal
• Note that orthogonality implies linear independence (but not vice versa).

134/147
Eigenvectors and eigenvalues for PCA

• Eigenvectors tell us the directions.


• Eigenvalues tell us the amount of rescaling needed along their corresponding
eigenvector dimensions.
• The more rescaling needed, the larger variation along that dimension.
• Dimensions with large eigenvalues have lots of variation (thus useful)
• For dimensions with small eigenvalues, datapoints are tightly bunched together (not much
variation), so can be dropped.

135/147
PCA algorithm

1. Write the N datapoints (of M dimensions) $x_i = (x_{1i}, x_{2i}, \ldots, x_{Mi})$ as row vectors.
2. Put these vectors into a matrix X of size N × M.
3. Center the data by subtracting off the mean of each column and put the result into the matrix
   B. That is, define the mean vector $\mu = \frac{1}{N} \sum_{i=1}^N x_i$ and the N × M matrix M whose rows are
   identical, namely the vector µ. Then B = X − M.
4. Compute the M × M covariance matrix $C = \frac{1}{N} B^T B$.
   • Some literature suggests $C = \frac{1}{N-1} B^T B$.
5. Compute the eigenvalues and eigenvectors of C, i.e., the eigendecomposition $V^{-1} C V = D$.
   • This can be done numerically using many off-the-shelf matrix algebra systems.
6. Sort the eigenvalues in D in decreasing order, and apply the same order to the columns of
   (eigenvectors in) V.
7. Choose the L principal components corresponding to the L largest eigenvalues to obtain an M × L
   feature matrix W.
8. A datapoint x can be projected onto the new bases: y = xW. So, the transformed data is
   L-dimensional.

136/147
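The steps above translate almost line-for-line into NumPy; a minimal sketch (function name
and toy data are ours):

import numpy as np

def pca(X, L):
    """Project the rows of X onto the top-L principal components."""
    B = X - X.mean(axis=0)                 # step 3: center each column
    C = B.T @ B / X.shape[0]               # step 4: covariance matrix
    eigvals, V = np.linalg.eigh(C)         # step 5: eigh, since C is symmetric
    order = np.argsort(eigvals)[::-1]      # step 6: sort by decreasing eigenvalue
    W = V[:, order[:L]]                    # step 7: M x L feature matrix
    return B @ W                           # step 8: transformed, L-dimensional data

# Toy usage: correlated 2D data reduced to 1D
rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
X = np.column_stack([x1, 2 * x1 + 0.1 * rng.normal(size=100)])
print(pca(X, 1).shape)  # (100, 1)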
Example

(Figure: the original 2D data, and its reconstruction after PCA using only the first component.)

137/147
Kernel PCA

• Problem with PCA: assume the directions of all variations are straight lines (often not
true).
• Idea: use a possibly nonlinear function Φ(·) as kernel. So steps are:
• Choose a kernel and apply it to all pairs of data points to get matrix K of distances
between pairs of points in the transformed space.
• Compute eigenvalues and eigenvectors of K
• Normalize the eigenvectors by the square root of eigenvalues.
• Retain the eigenvectors corresponding to the largest eigenvalues.

138/147
K-Means Clustering
Unsupervised learning

• Supervised learning: training set contains labeled target data


• Reinforcement and evolutionary learning: training includes scoring system indicating
the goodness of prediction.
• What if we don’t have target or score information?
• Learning must spot similarity between different inputs by itself.
• Unsupervised learning: training finds clusters of similar inputs (without being told
explicitly which class they belong to).
• Some dimensionality reduction is unsupervised, e.g., PCA

139/147
Error criterion for unsupervised learning

• Unsupervised learning cannot use an external error criterion that relies on targets or
  other outside information.
• Need to find internal measure
• Measure must be task-independent (otherwise, algorithm must be changed every time
task is changed).
• Note: error criterion in supervised learning is task-specific due to the targets provided.
• Which internal measure?
• Distance between inputs: small means similar, hence should be in the same cluster.

140/147
K-means: Overview

• Value k is given: the number of clusters/categories.

Aim of K-means algorithms


Find k cluster centers and position them so that there is one cluster center in the middle
of each cluster.

Definition of “middle” depends on:

• distance measure, e.g., Euclidean, or other distances


• mean average, determining central point of a set of datapoints

141/147
K-means algorithm

Initialization: choose a value of K, choose K random positions in the input space, and assign
the cluster centers µk, k = 1, . . . , K, to those positions.

Learning:
Repeat until the cluster centers stop moving:

• For each datapoint xn, compute the distances to each cluster center and assign it to the
  nearest cluster center.
• For each cluster Ck, move the position of its center µk to the mean of the points in that
  cluster:
  $$\mu_k = \frac{1}{N_k} \sum_{x \in C_k} x$$

Testing/Usage: for each testing point, compute its distance to each cluster center and
assign it to the nearest cluster center.

142/147
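A direct sketch of the algorithm in NumPy (here the centers are initialized from a random
subset of the data points; names and toy data are ours):

import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Plain K-means with Euclidean distance."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=K, replace=False)]   # initial centers
    for _ in range(n_iters):
        # Assignment: each point goes to its nearest center
        d = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)  # (N, K)
        r = d.argmin(axis=1)
        # Update: move each center to the mean of its assigned points
        new_mu = np.array([X[r == k].mean(axis=0) if np.any(r == k) else mu[k]
                           for k in range(K)])
        if np.allclose(new_mu, mu):                     # centers stopped moving
            break
        mu = new_mu
    return mu, r

# Toy usage: two well-separated blobs
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
mu, r = kmeans(X, K=2)
print(np.round(mu, 1))  # approximately the two blob centers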
K-means as optimization

• K-means aims to minimize a distortion measure J:
  $$J = \sum_{n=1}^N \sum_{k=1}^K r_{nk} \|x_n - \mu_k\|^2$$
  where $r_{nk} = 1$ if $x_n$ is assigned to cluster k and 0 otherwise.
• J is the sum of squared distances of each data point to its assigned vector.
• Each iteration consists of
  • E-step: updating $r_{nk}$ while $\mu_k$ is fixed
  • M-step: updating $\mu_k$ while $r_{nk}$ is fixed
• The solution for $r_{nk}$ is straightforward: J is linear in $r_{nk}$, and for different n, the $r_{nk}$’s are independent.
  • So, set $r_{nk} = 1$ for the k that gives the minimum value of $\|x_n - \mu_k\|^2$, i.e., for the nearest cluster
    center.
• The solution for $\mu_k$: J is quadratic in $\mu_k$, so we set its derivative to zero, $2 \sum_{n=1}^N r_{nk} (x_n - \mu_k) = 0$,
  yielding
  $$\mu_k = \frac{\sum_n r_{nk} x_n}{\sum_n r_{nk}}$$
  That is, $\mu_k$ is the mean of all data points assigned to cluster k.

143/147
Illustration

144/147
Computational consideration

• Poor initial values for µk lead to more steps before convergence


• In practice, better to choose cluster centers from random subset of K data points.
• K-means is often used to initialize parameters for a Gaussian mixture model prior to
applying EM algorithm.
• Direct implementation is slow since each E-step requires computing distance of every
data point to every cluster center.
• Use of tree data structure (tree of nearby points) or triangle inequality can reduce the
amount of computation.
• Distance measure may imply that non-numerical features cannot be used.
• Use of means can make clustering not robust to outliers.
• Imagine what happen to cluster containing points: {1, 2, 1, 2, 100}.
• Instead of means, one can use median.
• Other approach: K-medoids

145/147
K-medoids algorithm

• Minimizes sum of general pairwise dissimilarities, instead of sum of squared


distances.
• Dissimilarity is more general than distance: every distance can essentially be seen as
dissimilarity measure
• Dissimilarity doesn’t have to be symmetric.
• A medoid of a set of points is a point in the set whose average dissimilarity to all data
points in the set is minimal.
• In other words, medoid is the most centrally located point in the set.
• Unlike mean, medoid is always a point found in the set.

146/147
Partitioning around medoids (PAM) algorithm

Initialization: Choose K of the N data points as medoids.

Training:
Repeat until there is no change in assignment:

• Assignment step: Associate each data point with the closest medoid.
• Update step: For each medoid m and each data point o, swap m and o and compute
  the average dissimilarity of o to all data points associated with m. Then select the point
  o with the lowest average dissimilarity as the new medoid.

Testing: perform the assignment step on the testing point.

147/147
