Probability Theory
Sargur N. Srihari
[email protected]
Probabilities of Interest
• Marginal Probability
  – What is the probability of an apple? P(F=a)
  • Note that we have to consider P(B)
• Conditional Probability
  – Given that we have an orange, what is the probability that we chose the blue box? P(B=b|F=o)
• Joint Probability
  – What is the probability of orange AND blue box? P(B=b,F=o)
[Figure: two boxes of fruit, one holding 2 apples and 6 oranges, the other holding 3 apples and 1 orange]
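A small numerical sketch of these three quantities (assuming, since the priors are not stated above, that each box is equally likely to be picked, and reading the figure as a red box with 2 apples and 6 oranges and a blue box with 3 apples and 1 orange):

```python
# Assumed for illustration: each box is picked with prior probability 1/2.
priors = {"red": 0.5, "blue": 0.5}                 # P(B)
fruit_given_box = {                                 # P(F | B)
    "red":  {"apple": 2/8, "orange": 6/8},          # 2 apples, 6 oranges
    "blue": {"apple": 3/4, "orange": 1/4},          # 3 apples, 1 orange
}

# Joint probability: P(B=b, F=o) = P(F=o | B=b) P(B=b)
joint_blue_orange = fruit_given_box["blue"]["orange"] * priors["blue"]

# Marginal probability: P(F=a) = sum_b P(F=a | B=b) P(B=b)
p_apple = sum(fruit_given_box[b]["apple"] * priors[b] for b in priors)

# Conditional probability P(B=b | F=o), via Bayes' theorem
p_orange = sum(fruit_given_box[b]["orange"] * priors[b] for b in priors)
p_blue_given_orange = joint_blue_orange / p_orange

print(p_apple, joint_blue_orange, p_blue_given_orange)
```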
Product Rule
• From a table of counts over N trials, where n_ij is the number of trials with X=x_i and Y=y_j, and c_i is the number of trials with X=x_i:
$$p(Y = y_j \mid X = x_i) = \frac{n_{ij}}{c_i}$$
$$p(X = x_i, Y = y_j) = \frac{n_{ij}}{N} = \frac{n_{ij}}{c_i} \cdot \frac{c_i}{N} = p(Y = y_j \mid X = x_i)\, p(X = x_i)$$
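A quick numerical check of the product rule, using a small hypothetical table of counts n_ij over N trials:

```python
import numpy as np

# Hypothetical counts n_ij: rows index values x_i of X, columns index values y_j of Y.
n = np.array([[12,  8],
              [ 6, 14],
              [10, 10]])
N = n.sum()                      # total number of trials
c = n.sum(axis=1)                # c_i: number of trials with X = x_i

p_joint = n / N                            # p(X=x_i, Y=y_j) = n_ij / N
p_x = c / N                                # p(X=x_i) = c_i / N
p_y_given_x = n / c[:, None]               # p(Y=y_j | X=x_i) = n_ij / c_i

# Product rule: p(X, Y) = p(Y | X) p(X)
assert np.allclose(p_joint, p_y_given_x * p_x[:, None])
```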
Bayes Theorem
• From the product rule, together with the symmetry property p(X,Y) = p(Y,X), we get
$$p(Y \mid X) = \frac{p(X \mid Y)\, p(Y)}{p(X)}$$
• This is called Bayes' theorem
• Using the sum rule, the denominator is expressed as
$$p(X) = \sum_{Y} p(X \mid Y)\, p(Y)$$
  – This is the normalization constant that ensures the conditional probability on the LHS sums to 1 over all values of Y
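A minimal sketch of Bayes' theorem for a binary Y, with made-up numbers for the prior p(Y) and the likelihood p(X|Y); the denominator p(X) comes from the sum rule:

```python
# Hypothetical prior and likelihood for a binary variable Y and an observation X = x.
p_Y = {0: 0.7, 1: 0.3}                    # prior p(Y)
p_x_given_Y = {0: 0.2, 1: 0.9}            # likelihood p(X = x | Y)

# Sum rule: normalization constant p(X = x) = sum_Y p(X = x | Y) p(Y)
p_x = sum(p_x_given_Y[y] * p_Y[y] for y in p_Y)

# Bayes' theorem: p(Y | X = x) = p(X = x | Y) p(Y) / p(X = x)
posterior = {y: p_x_given_Y[y] * p_Y[y] / p_x for y in p_Y}
print(posterior)                           # sums to 1 over the values of Y
```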
Rules of Probability
• Given random variables X and Y
• Sum Rule gives the Marginal Probability
$$p(X = x_i) = \sum_{j=1}^{L} p(X = x_i, Y = y_j) = \frac{c_i}{N}$$
[Figure: N = 60 data points; histogram of Y (fraction of data points having each value of Y), histogram of X, and histogram of X given Y=1]
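A sketch of how such histograms can be produced: draw N = 60 points from an assumed (made-up) joint distribution over X and Y, then form the empirical marginals and conditional:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 60

# Made-up joint distribution over X in {1,...,9} and Y in {1, 2} (rows: X, columns: Y).
p_joint = np.array([[1, 1, 2, 3, 4, 5, 4, 2, 1],
                    [2, 4, 6, 6, 5, 4, 3, 2, 1]], dtype=float).T
p_joint /= p_joint.sum()

# Draw N points from the joint, then count how many fall in each cell.
flat = rng.choice(p_joint.size, size=N, p=p_joint.ravel())
counts = np.bincount(flat, minlength=p_joint.size).reshape(p_joint.shape)

hist_Y = counts.sum(axis=0) / N                       # fraction of points with each value of Y
hist_X = counts.sum(axis=1) / N                       # empirical marginal of X (sum rule over Y)
hist_X_given_Y1 = counts[:, 0] / counts[:, 0].sum()   # histogram of X given Y = 1
print(hist_Y, hist_X, hist_X_given_Y1)
```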
Independent Variables
• If p(X,Y) = p(X)p(Y) then X and Y are said to be independent
• Why? From the product rule:
$$p(Y \mid X) = \frac{p(X, Y)}{p(X)} = p(Y)$$
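A short check of this, using a joint distribution constructed as an outer product so that independence holds by construction:

```python
import numpy as np

# A joint distribution built as an outer product is independent by construction.
p_x = np.array([0.2, 0.5, 0.3])
p_y = np.array([0.6, 0.4])
p_joint = np.outer(p_x, p_y)                       # p(X, Y) = p(X) p(Y)

# Then the conditional p(Y | X) = p(X, Y) / p(X) equals the marginal p(Y) for every X.
p_y_given_x = p_joint / p_joint.sum(axis=1, keepdims=True)
assert np.allclose(p_y_given_x, p_y)               # each row equals p(Y)
```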
Several Variables
• If there are several continuous variables x1,…,xD, denoted by the vector x, then we can define a joint probability density p(x) = p(x1,..,xD)
• A multivariate probability density must satisfy
$$p(\mathbf{x}) \ge 0, \qquad \int_{-\infty}^{\infty} p(\mathbf{x})\, d\mathbf{x} = 1$$
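A numerical sanity check of these two conditions for one specific density (a standard two-dimensional Gaussian, chosen here only as an example), approximating the integral on a grid:

```python
import numpy as np

# Example density: a standard bivariate Gaussian with independent components.
x1, x2 = np.meshgrid(np.linspace(-8, 8, 801), np.linspace(-8, 8, 801))
p = np.exp(-0.5 * (x1**2 + x2**2)) / (2 * np.pi)   # p(x) >= 0 everywhere

dx = 16 / 800
integral = p.sum() * dx * dx                        # grid approximation of the integral
print(integral)                                     # close to 1.0
```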
• The sum rule, product rule, and Bayes' theorem apply to densities as well:
$$p(x) = \int p(x, y)\, dy$$
$$p(x, y) = p(y \mid x)\, p(x)$$
$$p(y \mid x) = \frac{p(x \mid y)\, p(y)}{p(x)}$$
Expectation
• Expectation is the average value of some function f(x) under the probability distribution p(x), denoted E[f]
• For a discrete distribution: E[f] = Σx p(x) f(x)
• For a continuous distribution: E[f] = ∫ p(x) f(x) dx
• Examples of f(x) of use in ML:
  – f(x) = x: E[f] is the mean
  – f(x) = −ln p(x): E[f] is the entropy
  – f(x) = −ln[q(x)/p(x)]: E[f] is the K-L divergence
• If there are N points drawn from the pdf, then the expectation can be approximated as
  E[f] ≈ (1/N) Σn=1..N f(xn)
  – This approximation is extremely important when we use sampling to determine expected values
• Conditional expectation with respect to a conditional distribution:
  Ex[f] = Σx p(x|y) f(x)
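A sketch of the sampling approximation, using the entropy example f(x) = −ln p(x) for a small made-up discrete distribution:

```python
import numpy as np

rng = np.random.default_rng(0)

# Discrete example: E[f] = sum_x p(x) f(x), with f(x) = -ln p(x) giving the entropy.
p = np.array([0.5, 0.25, 0.25])
entropy = np.sum(p * (-np.log(p)))

# Sampling approximation: E[f] ~= (1/N) sum_n f(x_n), with x_n drawn from p(x).
samples = rng.choice(len(p), size=100_000, p=p)
entropy_mc = np.mean(-np.log(p[samples]))

print(entropy, entropy_mc)        # the two values agree closely for large N
```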
Variance
• Measures how much variability there is in f(x) around its mean value E[f(x)]
• The variance of f(x) is denoted
  var[f] = E[(f(x) − E[f(x)])²]
• Expanding the square:
  var[f] = E[f(x)²] − E[f(x)]²
• Variance of the variable x itself:
  var[x] = E[x²] − E[x]²
Covariance
• For two random variables x and y, their covariance is
  cov[x,y] = Ex,y[{x − E[x]}{y − E[y]}] = Ex,y[xy] − E[x]E[y]
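A quick empirical check of the variance and covariance identities above, using synthetic correlated samples:

```python
import numpy as np

rng = np.random.default_rng(0)

# Correlated samples of x and y to check the variance and covariance identities.
x = rng.normal(size=100_000)
y = 0.5 * x + rng.normal(size=100_000)

var_x = np.mean(x**2) - np.mean(x)**2              # var[x]   = E[x^2] - E[x]^2
cov_xy = np.mean(x * y) - np.mean(x) * np.mean(y)  # cov[x,y] = E[xy] - E[x]E[y]

print(var_x, cov_xy)     # close to np.var(x) and np.cov(x, y, bias=True)[0, 1]
```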
Bayesian Probabilities
• Classical or Frequentist view of probabilities
  – Probability is the frequency of a random, repeatable event
  – The frequency of a tossed coin coming up heads is 1/2
• Bayesian view
  – Probability is a quantification of uncertainty
  – Degree of belief in propositions that do not involve random variables
  – Examples of uncertain events treated as probabilities:
    • Whether the Arctic Sea ice cap will disappear
    • Whether the moon was once in its own orbit around the sun
    • Whether Thomas Jefferson had a child by one of his slaves
    • Whether a signature on a check is genuine
Bayesian Approach
• Quantify uncertainty around the choice of parameters w
  – E.g., w is the vector of parameters in curve fitting:
$$y(x, \mathbf{w}) = w_0 + w_1 x + w_2 x^2 + \cdots + w_M x^M = \sum_{j=0}^{M} w_j x^j$$
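A minimal sketch of this polynomial model; the weights below are made up purely for illustration:

```python
import numpy as np

def y(x, w):
    """Polynomial model y(x, w) = sum_j w_j x^j, with w = (w_0, ..., w_M)."""
    powers = np.arange(len(w))                          # 0, 1, ..., M
    return np.sum(w * np.asarray(x)[..., None] ** powers, axis=-1)

# Example: an order-3 polynomial with made-up weights, evaluated on a few inputs.
w = np.array([0.1, -0.5, 2.0, 0.3])
print(y(np.array([0.0, 0.5, 1.0]), w))
```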
[Figure: prior p(µ) and posterior p(µ|H), where µ = p(H)]
– Standard deviation σ
• Precision β = 1/σ²
• Can find expectations of functions of x under the Gaussian:
$$E[x] = \int_{-\infty}^{\infty} N(x \mid \mu, \sigma^2)\, x\, dx = \mu$$
$$E[x^2] = \int_{-\infty}^{\infty} N(x \mid \mu, \sigma^2)\, x^2\, dx = \mu^2 + \sigma^2$$
[Figure: Gaussian density with µ = 0, σ = 1]
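A sampling check of these two expectations, with arbitrary illustrative values of µ and σ:

```python
import numpy as np

# Sampling check of E[x] = mu and E[x^2] = mu^2 + sigma^2 under the Gaussian.
mu, sigma = 1.5, 2.0
rng = np.random.default_rng(0)
x = rng.normal(loc=mu, scale=sigma, size=1_000_000)

print(np.mean(x), mu)                       # E[x]   ~= mu
print(np.mean(x**2), mu**2 + sigma**2)      # E[x^2] ~= mu^2 + sigma^2
```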
• The multivariate Gaussian has parameters:
  – Mean µ, a D-dimensional vector
  – Covariance matrix Σ, which is a D×D matrix
• The log-likelihood of N observations x = (x1,…,xN) under the Gaussian is
$$\ln p(\mathbf{x} \mid \mu, \sigma^2) = -\frac{1}{2\sigma^2} \sum_{n=1}^{N} (x_n - \mu)^2 - \frac{N}{2} \ln \sigma^2 - \frac{N}{2} \ln(2\pi)$$
• Maximum likelihood solutions are given by
$$\mu_{ML} = \frac{1}{N} \sum_{n=1}^{N} x_n \quad \text{(the sample mean)}$$
$$\sigma^2_{ML} = \frac{1}{N} \sum_{n=1}^{N} (x_n - \mu_{ML})^2 \quad \text{(the sample variance)}$$
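A sketch of the maximum likelihood estimates on synthetic data (the true µ and σ used to generate the data are arbitrary):

```python
import numpy as np

# Maximum likelihood estimates for a Gaussian are the sample mean and sample variance.
rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=1.2, size=500)   # data assumed drawn from N(x | mu, sigma^2)

mu_ml = np.mean(x)                    # mu_ML      = (1/N) sum_n x_n
sigma2_ml = np.mean((x - mu_ml)**2)   # sigma^2_ML = (1/N) sum_n (x_n - mu_ML)^2

print(mu_ml, sigma2_ml)
```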
Predictive Distribution
• Knowing the parameters w and β, predictions for new values of x can be made using
  p(t|x,wML,βML) = N(t|y(x,wML), βML⁻¹)
• Instead of a point estimate we are now giving a probability distribution over t
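A sketch of evaluating this predictive density; the helper gaussian_pdf and the fitted values wML and βML below are placeholders rather than the result of an actual fit:

```python
import numpy as np

def gaussian_pdf(t, mean, var):
    """Univariate Gaussian density N(t | mean, var)."""
    return np.exp(-0.5 * (t - mean)**2 / var) / np.sqrt(2 * np.pi * var)

# Placeholder fitted values; in practice w_ml and beta_ml come from curve fitting.
w_ml = np.array([0.2, 1.0, -0.8])              # polynomial coefficients w_0, w_1, w_2
beta_ml = 25.0                                 # precision, i.e., 1 / noise variance

x_new = 0.4
mean = np.polyval(w_ml[::-1], x_new)           # y(x, w_ML) = sum_j w_j x^j
p_t = gaussian_pdf(np.linspace(-1, 2, 5), mean, 1.0 / beta_ml)   # p(t | x, w_ML, beta_ML)
print(mean, p_t)
```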
Posterior Distribution
• Using Bayes' theorem, the posterior distribution for w is proportional to the product of the prior distribution and the likelihood function
  p(w|x,t,α,β) ∝ p(t|x,w,β) p(w|α)
• w can be determined by finding the most probable value of w given the data, i.e., maximizing the posterior distribution
• This is equivalent (by taking logs) to minimizing
$$\frac{\beta}{2} \sum_{n=1}^{N} \{y(x_n, \mathbf{w}) - t_n\}^2 + \frac{\alpha}{2} \mathbf{w}^T \mathbf{w}$$
• This is the same as the sum-of-squared-errors function with a regularization parameter given by λ = α/β
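A sketch of the corresponding MAP solution for a polynomial model: minimizing the penalized error above in closed form, which is ridge regression with λ = α/β (the data, α, and β below are made up):

```python
import numpy as np

# MAP estimate for polynomial curve fitting: minimizing
#   (beta/2) sum_n (y(x_n, w) - t_n)^2 + (alpha/2) w^T w
# is ridge regression with lambda = alpha / beta.
rng = np.random.default_rng(0)
M, alpha, beta = 5, 1e-3, 25.0

x = rng.uniform(0, 1, size=20)
t = np.sin(2 * np.pi * x) + rng.normal(scale=1/np.sqrt(beta), size=x.shape)  # toy targets

Phi = x[:, None] ** np.arange(M + 1)           # design matrix, Phi[n, j] = x_n^j
lam = alpha / beta
w_map = np.linalg.solve(Phi.T @ Phi + lam * np.eye(M + 1), Phi.T @ t)
print(w_map)
```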
Model Selection
• 12 variables: two shown
• 100 points
• Learn to classify from data
• Which class should x belong to?
Cell-based Classification
• A naïve approach of cell-based voting will fail because of the exponential growth of cells with dimensionality
• Hardly any points in each cell
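A tiny illustration of the exponential growth: with 10 cells per variable and 100 points, the average number of points per cell collapses as the dimensionality D grows:

```python
# With 100 data points and each of D variables divided into 10 cells,
# the number of cells grows as 10**D, so most cells are soon empty.
n_points = 100
for D in (1, 2, 3, 6, 12):
    n_cells = 10 ** D
    print(D, n_cells, n_points / n_cells)   # average points per cell shrinks toward 0
```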