
Machine Learning Srihari

Probability Theory

Sargur N. Srihari
[email protected]

1
Machine Learning Srihari

Probability Theory in Machine Learning

•  Probability is a key concept in dealing with uncertainty
–  Arises due to the finite size of data sets and noise on measurements
•  Probability Theory
–  Framework for quantification and manipulation of uncertainty
–  One of the central foundations of machine learning

2
Machine Learning Srihari

Random Variable (R.V.)


•  Takes values subject to chance
–  E.g., X is the result of a coin toss with values Head and Tail, which are non-numeric
•  X can be represented by an r.v. x which takes values 1 and 0
–  Each value of x has an associated probability
•  Probability Distribution
–  Mathematical function that describes
1. the possible values of an r.v.
2. and their associated probabilities

3
Machine Learning Srihari

Probability with Two Variables


•  Key concepts:
–  conditional & joint probabilities of variables
•  Random Variables: B and F
–  Box B, Fruit F
•  F has two values orange (o) or apple (a)
•  B has values red (r) or blue (b)
[Figure: red box containing 2 apples and 6 oranges; blue box containing 3 apples and 1 orange]
Let p(B=r)=4/10 and p(B=b)=6/10
(Within the red box, p(F=o|B=r)=3/4 and p(F=a|B=r)=1/4)

Given the above data we are interested in several probabilities:
marginal, conditional and joint, described next

4
Machine Learning Srihari

Probabilities of Interest
•  Marginal Probability
–  What is the probability of an apple? P(F=a)
•  Note that we have to consider P(B)
•  Conditional Probability
–  Given that we have an orange, what is the probability that we chose the blue box? P(B=b|F=o)
•  Joint Probability
–  What is the probability of orange AND blue box? P(B=b,F=o)
[Figure: red box with 2 apples and 6 oranges; blue box with 3 apples and 1 orange]
5
Machine Learning Srihari

Sum Rule of Probability Theory

•  Consider two random variables
•  X can take on values x_i, i = 1,…,M
•  Y can take on values y_j, j = 1,…,L
•  N trials sampling both X and Y
•  No. of trials with X = x_i and Y = y_j is n_ij
•  Joint Probability
   p(X = x_i, Y = y_j) = n_ij / N
•  Marginal Probability
   p(X = x_i) = c_i / N
   Since c_i = Σ_j n_ij ,   p(X = x_i) = Σ_{j=1}^{L} p(X = x_i, Y = y_j)
6
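A minimal numerical sketch of these definitions (not from the slides): the counts n_ij below are made up, and the joint and marginal probabilities are computed exactly as defined above.

```python
# Hypothetical counts n_ij for X taking 3 values and Y taking 2 values
import numpy as np

n = np.array([[10, 5],
              [20, 10],
              [ 9, 6]])
N = n.sum()                      # total number of trials

p_joint = n / N                  # p(X = x_i, Y = y_j) = n_ij / N
p_x = p_joint.sum(axis=1)        # sum rule: p(X = x_i) = sum_j p(x_i, y_j) = c_i / N
print(p_joint)
print(p_x, p_x.sum())            # marginals sum to 1
```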
Machine Learning Srihari

Product Rule of Probability Theory


•  Consider only those instances for which X=xi
•  Then fraction of those instances for which Y=yj is
written as p(Y=yj|X=xi)
•  Called conditional probability
•  Relationship between joint and conditional probability:

   p(Y = y_j | X = x_i) = n_ij / c_i
   p(X = x_i, Y = y_j) = n_ij / N = (n_ij / c_i) · (c_i / N)
                       = p(Y = y_j | X = x_i) p(X = x_i)

7
Machine Learning Srihari

Bayes Theorem
•  From the product rule together with the symmetry
property p(X,Y)=p(Y,X) we get
   p(Y | X) = p(X | Y) p(Y) / p(X)
•  Which is called Bayes' theorem
•  Using the sum rule, the denominator is expressed as
   p(X) = Σ_Y p(X | Y) p(Y)
   (Normalization constant that ensures the conditional probability on the LHS sums to 1 over all values of Y)

8
Machine Learning Srihari

Rules of Probability
•  Given random variables X and Y
•  Sum Rule gives Marginal Probability
   p(X = x_i) = Σ_{j=1}^{L} p(X = x_i, Y = y_j) = c_i / N

•  Product Rule: joint probability in terms of conditional and marginal
   p(X, Y) = p(Y | X) p(X) = (n_ij / c_i) × (c_i / N) = n_ij / N

•  Combining we get Bayes' Rule
   p(Y | X) = p(X | Y) p(Y) / p(X)   where p(X) = Σ_Y p(X | Y) p(Y)
   Viewed as: posterior ∝ likelihood × prior
9
Machine Learning Srihari

Ex: Joint Distribution over two Variables


X takes nine possible values, Y takes two values

N = 60 data points
[Figure: histogram of Y (fraction of data points having each value of Y); histogram of X; histogram of X given Y=1]

Fractions would equal the probabilities as N → ∞


10
Machine Learning Srihari

Bayes rule applied to Fruit Problem


•  Probability that box is red given that fruit picked is orange
   p(B=r | F=o) = p(F=o | B=r) p(B=r) / p(F=o)
                = (3/4 × 4/10) / (9/20) = 2/3 ≈ 0.67
   The a posteriori probability of 2/3 is different from the a priori probability of 0.4

•  Probability that fruit is orange
–  From sum and product rules
   p(F=o) = p(F=o, B=r) + p(F=o, B=b)
          = p(F=o | B=r) p(B=r) + p(F=o | B=b) p(B=b)
          = 6/8 × 4/10 + 1/4 × 6/10 = 9/20 = 0.45
   The marginal probability of 0.45 is lower than the average probability of 7/12 = 0.58
11
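The same calculation can be checked with a few lines of code; the box contents and priors are those stated above, everything else is illustrative.

```python
p_B = {'r': 4/10, 'b': 6/10}               # prior over boxes
p_F_given_B = {'r': {'o': 6/8, 'a': 2/8},  # red box: 6 oranges, 2 apples
               'b': {'o': 1/4, 'a': 3/4}}  # blue box: 1 orange, 3 apples

# Sum and product rules: p(F=o) = sum_B p(F=o|B) p(B)
p_F_o = sum(p_F_given_B[b]['o'] * p_B[b] for b in p_B)

# Bayes' theorem: p(B=r|F=o) = p(F=o|B=r) p(B=r) / p(F=o)
p_r_given_o = p_F_given_B['r']['o'] * p_B['r'] / p_F_o

print(p_F_o)        # 0.45
print(p_r_given_o)  # 0.666...
```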
Machine Learning Srihari

Independent Variables
•  If p(X,Y)=p(X)p(Y) then X and Y are said to be
independent
•  Why?
–  From the product rule: p(Y | X) = p(X, Y) / p(X) = p(X) p(Y) / p(X) = p(Y)

•  In the fruit example, if each box contained the same fraction of apples and oranges, then p(F|B) = p(F)

12
Machine Learning Srihari

Probability Density Function (pdf)


•  Continuous Variables
•  If the probability that x falls in the interval (x, x+δx) is given by p(x)δx for δx → 0, then p(x) is a pdf of x
•  Probability that x lies in interval (a,b) is
   p(x ∈ (a,b)) = ∫_a^b p(x) dx
•  Probability that x lies in interval (−∞, z) is the Cumulative Distribution Function
   P(z) = ∫_{−∞}^{z} p(x) dx
13
Machine Learning Srihari

Several Variables
•  If there are several continuous variables x1,…,xD
denoted by vector x then we can define a joint
probability density p(x)=p(x1,..,xD)
•  Multivariate probability density must satisfy
   p(x) ≥ 0
   ∫ p(x) dx = 1

14
Machine Learning Srihari

Sum, Product, Bayes for Continuous


•  Rules apply for continuous, or combinations of
discrete and continuous variables

   p(x) = ∫ p(x, y) dy
   p(x, y) = p(y | x) p(x)
   p(y | x) = p(x | y) p(y) / p(x)

•  Formal justification of sum, product rules for


continuous variables requires measure theory
15
Machine Learning Srihari

Expectation
•  Expectation is average value of some function f(x) under the
probability distribution p(x) denoted E[f]
•  For a discrete distribution
   E[f] = Σ_x p(x) f(x)
•  For a continuous distribution
   E[f] = ∫ p(x) f(x) dx
   Examples of f(x) of use in ML:
   f(x) = x; E[f] is the mean
   f(x) = −ln p(x); E[f] is the entropy
   f(x) = −ln[q(x)/p(x)]; E[f] is the K-L divergence
•  If there are N points drawn from the pdf, then the expectation can be approximated as
   E[f] ≈ (1/N) Σ_{n=1}^{N} f(x_n)
   This approximation is extremely important when we use sampling to determine an expected value
•  Conditional Expectation with respect to a conditional distribution
   E_x[f | y] = Σ_x p(x | y) f(x)
16
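A quick illustration of the sampling approximation above; the choice of p(x) as a standard normal and f(x) = x² is an assumption made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=1.0, size=100_000)   # N samples from p(x)
approx = np.mean(x**2)                             # (1/N) sum_n f(x_n)
print(approx)                                      # close to E[x^2] = 1
```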
Machine Learning Srihari

Variance
•  Measures how much variability there is in f(x)
around its mean value E[f(x)]
•  Variance of f(x) is denoted as
   var[f] = E[(f(x) − E[f(x)])²]
•  Expanding the square
   var[f] = E[f(x)²] − E[f(x)]²
•  Variance of the variable x itself
   var[x] = E[x²] − E[x]²
17
Machine Learning Srihari

Covariance
•  For two random variables x and y their covariance is
   cov[x,y] = E_{x,y}[{x − E[x]}{y − E[y]}]
            = E_{x,y}[xy] − E[x]E[y]

–  Expresses how x and y vary together


•  If x and y are independent then their covariance
vanishes
•  If x and y are two vectors of random variables
covariance is a matrix
•  If we consider covariance of components of vector x
with each other then we denote it as cov[x] =cov [x,x]

18
Machine Learning Srihari

Bayesian Probabilities
•  Classical or Frequentist view of Probabilities
–  Probability is frequency of random, repeatable event
–  Frequency of a tossed coin coming up heads is 1/2
•  Bayesian View
–  Probability is a quantification of uncertainty
–  Degree of belief in propositions that do not involve random
variables
–  Examples of uncertain events as probabilities:
•  Whether Arctic Sea ice cap will disappear
•  Whether moon was once in its own orbit around the sun
•  Whether Thomas Jefferson had a child by one of his slaves
•  Whether a signature on a check is genuine
19
Machine Learning Srihari

Whether Arctic Sea ice cap will disappear

•  We have some idea of how


quickly polar ice is melting

•  Revise it on the basis of


fresh evidence (satellite
observations)

•  Assessment will affect actions we take (to reduce greenhouse gases)
[NASA video]
An uncertain event
Answered by general Bayesian interpretation
20
Machine Learning Srihari

Bayesian Representation of Uncertainty


•  Use of probability to represent uncertainty is not an
ad-hoc choice
•  If numerical values are used to represent degrees of
belief, then simple set of axioms for manipulating
degrees of belief leads to sum and product rules of
probability (Cox’s theorem)
•  Probability theory can be regarded as an extension of
Boolean logic to situations involving uncertainty
(Jaynes)

21
Machine Learning Srihari

Bayesian Approach
•  Quantify uncertainty around choice of parameters w
–  E.g., w is vector of parameters in curve fitting
   y(x, w) = w_0 + w_1 x + w_2 x² + … + w_M x^M = Σ_{j=0}^{M} w_j x^j

•  Uncertainty before observing data expressed by p(w)


•  Given observed data D = {t_1, …, t_N}
–  Uncertainty in w after observing D, by Bayes rule:
   p(w | D) = p(D | w) p(w) / p(D)

–  Quantity p(D|w) is evaluated for observed data


•  It can be viewed as function of w
•  It represents how probable the data set is for different parameters w
•  It is called the Likelihood function
•  Not a probability distribution over w
22
Machine Learning Srihari

Bayes theorem in words


•  Uncertainty in w expressed as
   p(w | D) = p(D | w) p(w) / p(D)

•  Bayes theorem in words:


posterior ∝ likelihood × prior

•  Denominator is normalization factor


•  Involves marginalization over w

p(D) = ∫ p(D | w)p(w)d w by Sum Rule


Machine Learning Srihari

Role of Likelihood Function

•  Likelihood Function plays central role in both


Bayesian and frequentist paradigms
•  Frequentist:
•  w is a fixed parameter determined by an estimator;
•  Error bars on estimate are obtained from possible
data sets D
•  Bayesian:
•  There is a single data set D
•  Uncertainty in parameters expressed as probability
distribution over w
Machine Learning Srihari

Maximum Likelihood Approach


•  In frequentist setting w is a fixed parameter
–  w is set to value that maximizes likelihood function p(D|w)
–  In ML, negative log of likelihood function is called error
function since maximizing likelihood is equivalent to
minimizing error
•  Error Bars
–  Bootstrap approach to creating L data sets
•  From N data points new data sets are created by drawing N points at
random with replacement
•  Repeat L times to generate L data sets
•  Accuracy of parameter estimate can be evaluated by variability of
predictions between different bootstrap sets
25
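A hedged sketch of the bootstrap procedure just described; the data set and the estimator (the sample mean) are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(loc=3.0, scale=2.0, size=50)   # original N data points
L = 1000

estimates = []
for _ in range(L):
    resample = rng.choice(data, size=data.size, replace=True)  # N points with replacement
    estimates.append(resample.mean())            # estimator on each bootstrap set

print(np.mean(estimates), np.std(estimates))     # spread across sets gives an error bar
```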
Machine Learning Srihari

Bayesian: Prior and Posterior


•  Inclusion of prior knowledge arises naturally
•  Coin Toss Example
–  Fair looking coin is tossed three times and lands Head each time
–  Classical m.l.e of the probability of landing heads is 1 implying all
future tosses will land Heads
–  Bayesian approach with reasonable prior will lead to less
extreme conclusion

[Figure: prior p(µ) and posterior p(µ|H) over µ = p(H)]

26
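The slide does not specify the prior; as an illustration only, the sketch below assumes a Beta(2, 2) prior over µ = p(H), whose posterior mean after three heads is far less extreme than the maximum likelihood estimate of 1.

```python
a, b = 2.0, 2.0          # assumed Beta prior parameters (not from the slides)
heads, tails = 3, 0      # observed data: H, H, H

a_post, b_post = a + heads, b + tails        # conjugate Beta-Bernoulli update
mle = heads / (heads + tails)                # classical estimate: 1.0
posterior_mean = a_post / (a_post + b_post)  # 5/7 ≈ 0.71, less extreme
print(mle, posterior_mean)
```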
Machine Learning Srihari

Practicality of Bayesian Approach


•  Marginalization over whole parameter space is
required to make predictions or compare
models
•  Factors making it practical:
•  Sampling Methods such as Markov Chain Monte Carlo
methods
•  Increased speed and memory of computers
•  Deterministic approximation schemes such as
Variational Bayes and Expectation propagation
are alternatives to sampling
27
Machine Learning Srihari

The Gaussian Distribution


•  For a single real-valued variable x
   N(x | µ, σ²) = (1 / (2πσ²)^{1/2}) exp{ −(1/(2σ²)) (x − µ)² }
   (What is an exponential: y = e^x, where e = 2.718; its value for argument 0 is 1; it is its own derivative)
•  It has two parameters:
–  Mean µ, variance σ²
–  Standard deviation σ
•  Precision β = 1/σ²
   (The maximum of a distribution is its mode; for a Gaussian the mode coincides with the mean)
•  Can find expectations of functions of x under the Gaussian
   E[x] = ∫_{−∞}^{∞} N(x | µ, σ²) x dx = µ
   E[x²] = ∫_{−∞}^{∞} N(x | µ, σ²) x² dx = µ² + σ²
   var[x] = E[x²] − E[x]² = σ²
   [Figure: Gaussian pdf with µ = 0, σ = 1]
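A small sketch evaluating the density defined above and checking E[x] and E[x²] by sampling; the parameter values are arbitrary.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma2):
    """N(x | mu, sigma^2) = (2*pi*sigma^2)^(-1/2) * exp(-(x - mu)^2 / (2*sigma2))."""
    return np.exp(-(x - mu)**2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

mu, sigma2 = 0.0, 1.0
print(gaussian_pdf(0.0, mu, sigma2))   # peak value, about 0.3989

rng = np.random.default_rng(0)
x = rng.normal(mu, np.sqrt(sigma2), size=200_000)
print(x.mean(), (x**2).mean())         # approx E[x] = mu and E[x^2] = mu^2 + sigma^2
```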


Machine Learning Srihari

Multivariate Gaussian Distribution


•  For a D-dimensional vector x
   N(x | µ, Σ) = (1 / (2π)^{D/2}) (1 / |Σ|^{1/2}) exp{ −(1/2) (x − µ)^T Σ^{-1} (x − µ) }

•  It has parameters:
–  Mean µ, a D-dimensional vector
–  Covariance matrix Σ
•  Which is a D ×D matrix
Machine Learning Srihari

Likelihood Function for Gaussian


•  Given N scalar observations x = [x_1, …, x_N]^T
–  Which are independent and identically distributed
•  Probability of the data set is given by the likelihood function
   p(x | µ, σ²) = Π_{n=1}^{N} N(x_n | µ, σ²)
   [Figure: data are black points; likelihood = product of the blue values; pick the mean and variance to maximize this product]
•  Log-likelihood function is
   ln p(x | µ, σ²) = −(1/(2σ²)) Σ_{n=1}^{N} (x_n − µ)² − (N/2) ln σ² − (N/2) ln(2π)
•  Maximum likelihood solutions are given by
   µ_ML = (1/N) Σ_{n=1}^{N} x_n        which is the sample mean
   σ²_ML = (1/N) Σ_{n=1}^{N} (x_n − µ_ML)²        which is the sample variance
30
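The maximum likelihood formulas written out directly; the data set below is synthetic and only meant to illustrate the computation.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(loc=5.0, scale=3.0, size=1000)   # observations x_1..x_N

mu_ml = x.mean()                        # (1/N) sum_n x_n
sigma2_ml = np.mean((x - mu_ml)**2)     # (1/N) sum_n (x_n - mu_ML)^2
print(mu_ml, sigma2_ml)                 # near 5 and 9
```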
Machine Learning Srihari

Bias in Maximum Likelihood


•  Maximum likelihood systematically underestimates the variance
–  E[µ_ML] = µ
–  E[σ²_ML] = ((N−1)/N) σ²
–  Not an issue as N increases
•  Problem is related to the over-fitting problem
   [Figure: green curve is the true distribution; averaged across three data sets the mean is correct, but the variance is underestimated because it is estimated relative to the sample mean and not the true mean]
31
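A simulation (not in the slides) illustrating the bias statement E[σ²_ML] = ((N−1)/N)σ²: the ML variance estimate is averaged over many small synthetic data sets.

```python
import numpy as np

rng = np.random.default_rng(3)
N, trials, true_var = 3, 100_000, 1.0

estimates = []
for _ in range(trials):
    x = rng.normal(0.0, np.sqrt(true_var), size=N)
    estimates.append(np.mean((x - x.mean())**2))   # sigma^2_ML for this data set

print(np.mean(estimates))          # close to (N-1)/N * sigma^2
print((N - 1) / N * true_var)      # 0.667 for N = 3
```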
Machine Learning Srihari

Curve Fitting Probabilistically


•  Goal is to predict for target
variable t given a new value
of the input variable x

–  Given N input values


x=(x1,..xN)T and corresponding
target values t=(t1,..,tN)T

–  Assume that given a value of x, the value of t has a Gaussian distribution with mean equal to y(x,w) of the polynomial curve
   p(t | x, w, β) = N(t | y(x, w), β^{-1})
   y(x, w) = w_0 + w_1 x + w_2 x² + … + w_M x^M = Σ_{j=0}^{M} w_j x^j
   [Figure: Gaussian conditional distribution for t given x; mean is given by the polynomial function y(x,w); precision given by β]
32
Machine Learning Srihari

Curve Fitting with Maximum Likelihood


•  Likelihood Function is
   p(t | x, w, β) = Π_{n=1}^{N} N(t_n | y(x_n, w), β^{-1})
•  Logarithm of the Likelihood function is
   ln p(t | x, w, β) = −(β/2) Σ_{n=1}^{N} {y(x_n, w) − t_n}² + (N/2) ln β − (N/2) ln(2π)

•  To find maximum likelihood solution for


polynomial coefficients wML
–  Maximize w.r.t w
–  Can omit last two terms -- don’t depend on w
–  Can replace β/2 with ½ (since it is constant wrt w)
–  Minimize negative log-likelihood
–  Identical to sum-of-squares error function
33
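Since maximizing this likelihood reduces to minimizing the sum-of-squares error, w_ML can be found by ordinary least squares on a polynomial design matrix; the sketch below uses a synthetic sin(2πx) data set as an assumption.

```python
import numpy as np

rng = np.random.default_rng(4)
N, M, beta_true = 20, 3, 25.0
x = np.linspace(0, 1, N)
t = np.sin(2 * np.pi * x) + rng.normal(0, 1/np.sqrt(beta_true), N)  # noisy targets

Phi = np.vander(x, M + 1, increasing=True)          # columns x^0, x^1, ..., x^M
w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)      # minimizes sum {y(x_n, w) - t_n}^2
print(w_ml)
```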
Machine Learning Srihari

Precision parameter with MLE


•  Maximum likelihood can also be used to
determine β of Gaussian conditional distribution
•  Maximizing likelihood wrt β gives
   1/β_ML = (1/N) Σ_{n=1}^{N} {y(x_n, w_ML) − t_n}²

•  First determine parameter vector wML governing


the mean and subsequently use this to find
precision βML

34
Machine Learning Srihari

Predictive Distribution
•  Knowing parameters w and β
•  Predictions for new values of x can be made
using
   p(t | x, w_ML, β_ML) = N(t | y(x, w_ML), β_ML^{-1})
•  Instead of a point estimate we are now giving a
probability distribution over t

35
Machine Learning Srihari

A More Bayesian Treatment


•  Introducing a prior distribution over polynomial
coefficients w
   p(w | α) = N(w | 0, α^{-1} I) = (α / 2π)^{(M+1)/2} exp{ −(α/2) w^T w }

–  where α is the precision of the distribution
–  M+1 is total no. of parameters for an Mth order polynomial
–  α is a hyperparameter
•  it controls the distribution of the model parameters

36
Machine Learning Srihari

Posterior Distribution
•  Using Bayes theorem, posterior distribution for w is
proportional to product of prior distribution and
likelihood function
   p(w | x, t, α, β) ∝ p(t | x, w, β) p(w | α)
•  w can be determined by finding the most probable
value of w given the data, ie. maximizing posterior
distribution
•  This is equivalent (by taking logs) to minimizing
   (β/2) Σ_{n=1}^{N} {y(x_n, w) − t_n}² + (α/2) w^T w
•  Same as sum of squared errors function with a
regularization parameter given by λ=α/β
37
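A sketch (not from the slides) of the corresponding closed-form solution, assuming a polynomial design matrix Φ: setting the gradient of the regularized error to zero gives w = (λI + ΦᵀΦ)⁻¹ Φᵀ t with λ = α/β; the data and parameter values below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)
N, M, alpha, beta = 20, 9, 5e-3, 11.1
x = np.linspace(0, 1, N)
t = np.sin(2 * np.pi * x) + rng.normal(0, 1/np.sqrt(beta), N)

Phi = np.vander(x, M + 1, increasing=True)          # columns x^0, ..., x^M
lam = alpha / beta                                  # regularization parameter
w_map = np.linalg.solve(lam * np.eye(M + 1) + Phi.T @ Phi, Phi.T @ t)
print(w_map)
```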
Machine Learning Srihari

Bayesian Curve Fitting


•  Previous treatment still makes point estimate of w
–  In fully Bayesian approach consistently apply sum and
product rules and integrate over all values of w
•  Given training data x and t and new test point x , goal
is to predict value of t
–  i.e, wish to evaluate predictive distribution p(t|x,x,t)
•  Applying sum and product rules of probability
–  Predictive distribution can be written in the form
   p(t | x, x, t) = ∫ p(t, w | x, x, t) dw        by Sum Rule (marginalizing over w)
                  = ∫ p(t | x, w, x, t) p(w | x, x, t) dw        by Product Rule
                  = ∫ p(t | x, w) p(w | x, t) dw        by eliminating unnecessary variables
   where p(t | x, w) = N(t | y(x, w), β^{-1}) and p(w | x, t) is the posterior distribution over parameters (also a Gaussian)
38
Machine Learning Srihari

Bayesian Curve Fitting


•  Predictive distribution is also Gaussian
   p(t | x, x, t) = N(t | m(x), s²(x))
–  Where the mean and variance are dependent on x
   m(x) = β φ(x)^T S Σ_{n=1}^{N} φ(x_n) t_n
   s²(x) = β^{-1} + φ(x)^T S φ(x)
   S^{-1} = α I + β Σ_{n=1}^{N} φ(x_n) φ(x_n)^T
   φ(x) has elements φ_i(x) = x^i for i = 0, …, M
   (The first term of s²(x) is the uncertainty in the predicted value due to noise in the target; the second term is the uncertainty in the parameters due to the Bayesian treatment)
   [Figure: predictive distribution for an M = 9 polynomial with α = 5×10⁻³, β = 11.1; red curve is the mean, red region is ±1 standard deviation]

39
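A direct, illustrative implementation of m(x), s²(x) and S⁻¹ with the polynomial basis φ_i(x) = x^i; the training data are synthetic and the function names are my own.

```python
import numpy as np

rng = np.random.default_rng(6)
N, M, alpha, beta = 10, 9, 5e-3, 11.1
x = np.linspace(0, 1, N)
t = np.sin(2 * np.pi * x) + rng.normal(0, 1/np.sqrt(beta), N)

def phi(x):                       # basis vector [x^0, ..., x^M]
    return np.array([x**i for i in range(M + 1)])

Phi = np.stack([phi(xn) for xn in x])                 # N x (M+1) design matrix
S_inv = alpha * np.eye(M + 1) + beta * Phi.T @ Phi    # S^{-1}
S = np.linalg.inv(S_inv)

def predictive(x_new):
    ph = phi(x_new)
    mean = beta * ph @ S @ (Phi.T @ t)                # m(x)
    var = 1/beta + ph @ S @ ph                        # s^2(x)
    return mean, var

print(predictive(0.5))
```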
Machine Learning Srihari

Model Selection

40
Machine Learning Srihari

Models in Curve Fitting


•  In polynomial curve fitting:
–  an optimal order of polynomial gives best
generalization
•  Order of the polynomial controls
–  the number of free parameters in the model and
thereby model complexity
•  With regularized least squares, λ also controls model complexity

41
Machine Learning Srihari

Validation Set to Select Model


•  Performance on training set is not a good
indicator of predictive performance
•  If there is plenty of data,
–  use some of the data to train a range of models, or a given model with a range of values for its parameters
–  Compare them on an independent set, called
validation set
–  Select one having best predictive performance
•  If data set is small then some over-fitting can
occur and it is necessary to keep aside a test set
42
Machine Learning Srihari

S-fold Cross Validation


•  Supply of data is limited
•  All available data is partitioned into S groups
•  S−1 groups are used to train a model, which is evaluated on the remaining group
•  Repeat for all S choices of the held-out group
•  Performance scores from the S runs are averaged
   [Figure: partition with S = 4; if S = N this is the leave-one-out method]

43
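A minimal sketch of S-fold cross-validation used to choose the polynomial order M; the data set, S = 4, and the candidate orders are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)
N, S = 40, 4
x = np.linspace(0, 1, N)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, N)
folds = np.array_split(rng.permutation(N), S)        # S groups of indices

def cv_error(M):
    errs = []
    for s in range(S):
        test = folds[s]
        train = np.concatenate([folds[i] for i in range(S) if i != s])
        Phi_tr = np.vander(x[train], M + 1, increasing=True)
        w, *_ = np.linalg.lstsq(Phi_tr, t[train], rcond=None)   # fit on S-1 groups
        Phi_te = np.vander(x[test], M + 1, increasing=True)
        errs.append(np.mean((Phi_te @ w - t[test])**2))         # evaluate on held-out group
    return np.mean(errs)                                        # average over the S runs

for M in (1, 3, 9):
    print(M, cv_error(M))
```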
Machine Learning Srihari

Bayesian Information Criterion


•  Criterion for choosing model
•  Akaike Information Criterion (AIC) chooses the model for which the quantity
   ln p(D | w_ML) − M
   is highest
•  Where M is the number of adjustable parameters
•  BIC (Bayesian Information Criterion) is a variant of this quantity

44
Machine Learning Srihari

The Curse of Dimensionality

Need to deal with spaces with many


variables in machine learning

45
Machine Learning Srihari

Example Classification Problem


•  Three classes
•  12 variables: two shown
•  100 points
•  Learn to classify from data
   [Figure: scatter of two variables; which class should a new point x belong to?]

46
Machine Learning Srihari

Cell-based Classification
•  Naïve approach of
cell based voting
will fail because of
exponential growth
of cells with
dimensionality
•  Hardly any points in
each cell

47
Machine Learning Srihari

Volume of Sphere in High Dimensions


•  Sphere is of radius r = 1 in D dimensions
•  What fraction of the volume lies between radius r = 1−ε and r = 1?
•  V_D(r) = K_D r^D
•  This fraction is given by 1 − (1−ε)^D
•  As D increases, a high proportion of the volume lies near the outer shell
   [Figure: fraction of the volume of the sphere lying in the range r = 1−ε to r = 1, for various dimensions D]

48
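The fraction 1 − (1−ε)^D can be tabulated in a couple of lines; ε = 0.01 and the dimensions shown are arbitrary choices.

```python
eps = 0.01
for D in (1, 2, 20, 100, 500):
    fraction = 1 - (1 - eps)**D    # fraction of volume in the shell r = 1-eps to r = 1
    print(D, round(fraction, 3))   # approaches 1 as D grows
```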
Machine Learning Srihari

Gaussian in High-dimensional Space


•  x-y space converted to r-
space using polar
coordinates
•  Most of the probability
mass is located in a thin
shell at a specific radius

49
