Probability Theory - Towards Data Science
Probability Theory
Fundamentals of Machine Learning (Part 1)
William Fleshman
Jan 29 · 18 min read
https://towardsdatascience.com/probability-fundamentals-of-machine-learning-part-1-a156b4703e69 1/19
The frequentist’s position is that estimates come from experiments and experiments
only. If we want to estimate how likely a six-sided die is to roll a 4, we should roll the die
many times and estimate the probability as the fraction of rolls that come up 4.
This method works well when we have a large amount of data, but with fewer examples
we can’t be confident in our estimates. If we haven’t seen a 4 after five rolls, does that
mean a 4 is impossible? The other issue is that we can’t inject any of our prior knowledge
about dice into our estimates. If we knew the die was fair, not seeing a 4 in the first five
rolls is completely understandable.
. . .
Probability is all about the possibility of various outcomes. The set of all possible
outcomes is called the sample space. The sample space for a coin flip is {heads, tails}.
The sample space for the temperature of water is all values between the freezing and
boiling point. Only one outcome in the sample space is possible at a time, and the sample
space must contain all possible values. The sample space is often depicted as Ω (capital
omega) and a specific outcome as ω (lowercase omega). We represent the probability of
an event ω as P(ω).
Axioms of Probability
For a sample space Ω, the axioms require that 0 ≤ P(ω) ≤ 1 for every outcome ω, and
that the probabilities of all outcomes sum to 1: Σ P(ω) = 1.
In plain English, the probability of any event has to be between 0 (impossible) and 1
(certain), and the sum of the probabilities of all events should be 1. This follows from the
fact that the sample space must contain all possible outcomes. Therefore, we are certain
(probability 1) that one of the possible outcomes will occur.
A random variable x is a variable which randomly takes on values from a sample space.
We often indicate a specific value x can take on with italics. For example, if x represented
the outcome of a coin flip, we might discuss a specific outcome as x = heads. Random
variables can either be discrete, like the coin, or continuous (able to take on an
uncountably infinite number of possible values).
Discrete Distributions:
Discrete random variables are described with a probability mass function (PMF). A
PMF maps each value in the variable’s sample space to a probability. One such PMF is the
uniform distribution over n possible outcomes: P(x=x) = 1/n. This reads as “The
probability of x taking on the value x is 1 divided by the number of possible values”. It’s
called the uniform distribution because each outcome is equally likely (the likelihood is
spread uniformly over all possible values). Fair dice rolls are modeled by a uniform
distribution since each face of the die is equally likely. A loaded die is modeled by a
categorical distribution, where each outcome is assigned a different probability.
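As a quick sketch (the loaded-die weights below are hypothetical), both PMFs can be written as plain dictionaries and sampled from:

```python
import random

# Uniform PMF for a fair six-sided die: P(x = x) = 1/n
fair_die = {face: 1 / 6 for face in range(1, 7)}

# Categorical PMF for a (hypothetical) loaded die that favors 6
loaded_die = {1: 0.10, 2: 0.10, 3: 0.10, 4: 0.10, 5: 0.10, 6: 0.50}

# Both PMFs satisfy the axioms: each probability is in [0, 1] and they sum to 1
for pmf in (fair_die, loaded_die):
    assert all(0 <= p <= 1 for p in pmf.values())
    assert abs(sum(pmf.values()) - 1) < 1e-9

# Sampling: random.choices draws faces in proportion to the given weights
roll = random.choices(list(loaded_die), weights=list(loaded_die.values()), k=1)[0]
print(roll in loaded_die)  # True
```

The only difference between the two distributions is whether the probability mass is spread evenly or unevenly over the faces.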
Continuous Distributions:
Continuous random variables are described by probability density functions (PDF)
which can be a bit more difficult to understand. We generally indicate the PDF for a
random variable x as f(x). PDFs map an infinite sample space to relative likelihood
values. To understand this, let’s look at an example with one of the most famous
continuous distributions, the Gaussian (Normal) distribution.
The Gaussian distribution (colloquially called the bell curve) can be used to model
several natural phenomena. For example, heights of each gender are approximately
Gaussian distributed. The Gaussian distribution is parameterized by two values: the
mean μ (mu) and variance σ² (sigma squared). The mean specifies the center of the
distribution, and the variance specifies the width of the distribution. You may have also
heard about the standard deviation σ, which is just the square root of the variance. To
indicate that x is a random variable drawn from a Gaussian with mean μ and variance σ²,
we write:
x ~ N(μ, σ²)
The functional form of the PDF for the Gaussian can be intimidating at first glance. I
promise that after working with Gaussian distributions in applications the fear goes
away! The functional form of the PDF is:
f(x | μ, σ²) = (1 / √(2πσ²)) ⋅ exp(−(x − μ)² / (2σ²))
The left hand side of the equation says “The PDF of x given μ and σ²”. The vertical bar
represents the word “given” and indicates that we already know all of the values after it
in the parentheses. Let’s plot this equation given μ = 0 and σ² = 4:
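The original plot isn’t reproduced here, but a few lines of standard-library Python evaluate the same curve; the peak sits at the mean, with height 1/√(2π·4):

```python
import math

def gaussian_pdf(x, mu=0.0, var=4.0):
    """PDF of a Gaussian with mean mu and variance var."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# The curve is symmetric about the mean and peaks there
peak = gaussian_pdf(0.0)
print(round(peak, 4))                           # 0.1995
print(gaussian_pdf(-2.0) == gaussian_pdf(2.0))  # True (symmetry)
```

Plugging in a grid of x values and plotting gives the familiar bell shape centered at 0 with width set by σ = 2.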
So what do I mean by relative likelihoods? Unlike discrete distributions, the value of the
PDF at x = x is not the actual probability of x. This is a common misconception when
people first start dabbling with probability theory. Since there are infinitely many values
that x could take on, the probability of x taking on any specific value is actually 0! I’m
guessing you don’t believe me, but let’s think about it together while we revisit our
axioms.
Recall that the total probability for every possible value needs to sum to 1. How do we
sum over an infinite number of values? The answer comes from calculus in the form of
the integral. We can rewrite our axiom in terms of the PDF using the integral:
∫ f(x) dx = 1, with the integral taken from −∞ to ∞.
If you aren’t familiar with calculus, the integral is an operator which calculates the area
under the curve f(x). It’s a generalization of summation to an infinite number of
values. So the area under the PDF represents the total probability of the Gaussian! If you
are familiar with calculus (and feel up to it) you can calculate the integral of the
Gaussian yourself to confirm that the area is 1.
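If calculus isn’t your thing, a crude numerical check works too. Here is a Riemann-sum approximation of the area under the same Gaussian (μ = 0, σ² = 4); the tails beyond ±20 (ten standard deviations) contribute essentially nothing:

```python
import math

def gaussian_pdf(x, mu=0.0, var=4.0):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Approximate the integral over [-20, 20] with n thin rectangles
n = 40_000
dx = 40 / n
area = sum(gaussian_pdf(-20 + i * dx) * dx for i in range(n))
print(round(area, 4))  # ≈ 1.0
```

Shrinking dx makes the approximation converge to the exact area of 1 guaranteed by the axiom above.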
Because the area is what we’re interested in, it’s often more useful to work with a
continuous random variable’s cumulative distribution function (CDF). We write the
CDF, F(x), as:
F(x) = ∫ f(t) dt, integrating from −∞ to x.
What in the world does that mean? Well, for a given value x, we’re taking the integral of
the PDF from negative infinity to that value. So F(x) gives us the area under the PDF
over the interval from negative infinity to x.
We just determined that the area corresponds to the probability, so F(x) gives us P(x≤x).
Now, we can use the CDF to determine the probability of any given range [a,b] by
noticing that P(a≤x≤b) = F(b)-F(a). This answers the question, “What is the probability
that x will be between a and b?”.
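For the Gaussian there’s a closed form for F(x) via the error function, so this range probability takes two lines (same μ = 0, σ² = 4 as before):

```python
import math

def gaussian_cdf(x, mu=0.0, var=4.0):
    """F(x) = P(x <= x) for a Gaussian, via the error function."""
    return 0.5 * (1 + math.erf((x - mu) / math.sqrt(2 * var)))

# P(a <= x <= b) = F(b) - F(a); e.g. within one standard deviation (sigma = 2)
p = gaussian_cdf(2.0) - gaussian_cdf(-2.0)
print(round(p, 4))  # 0.6827, the familiar "68%" rule
```

The same subtraction answers “what is the probability x lands in [a, b]?” for any interval.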
Asking for P(x=x) is equivalent to asking P(x≤x≤x) = F(x)-F(x) = 0. So there you have
it, the probability of sampling a specific number from the distribution is 0! (Stronger
arguments could be made by actually taking the limit). Here’s the CDF for the same
Gaussian from before:
We see that the probability of x taking on values less than −2.5 is small (about 0.11 for
this Gaussian), and that values sampled from x will mostly be less than 2.5.
. . .
This toy example highlights a few important things. First, I’m terrible at making toy
examples. Second, notice that the table meets the requirements laid out by our axioms.
We can immediately answer questions of the form P(clothes=T-Shirt, weather=Sunny),
but joint distributions give us so much more!
Sum Rule
P(x) = Σ over y of P(x, y)
The term P(x) is called the marginal probability distribution, since we’ve
“marginalized” away the random variable y. Let’s use the sum rule to calculate the
probability I’ll wear a hoodie. P(Hoodie) = P(Hoodie, Sunny) + P(Hoodie, Cloudy) +
P(Hoodie, Rainy) = 3/9. We could use the same process to find the marginal probability
of any of the clothing items or any weather condition.
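A minimal sketch in Python: the individual joint entries below are hypothetical (the article’s table isn’t reproduced here), chosen only so that the hoodie row sums to the 3/9 computed above:

```python
from fractions import Fraction

# Hypothetical joint distribution P(clothes, weather); each entry is invented,
# but the hoodie row sums to 3/9 to match the text.
joint = {
    ("Hoodie",  "Sunny"):  Fraction(1, 9),
    ("Hoodie",  "Cloudy"): Fraction(1, 9),
    ("Hoodie",  "Rainy"):  Fraction(1, 9),
    ("T-Shirt", "Sunny"):  Fraction(2, 9),
    ("T-Shirt", "Cloudy"): Fraction(1, 9),
    ("T-Shirt", "Rainy"):  Fraction(0, 9),
    ("Jacket",  "Sunny"):  Fraction(0, 9),
    ("Jacket",  "Cloudy"): Fraction(1, 9),
    ("Jacket",  "Rainy"):  Fraction(2, 9),
}
assert sum(joint.values()) == 1  # axiom check

# Sum rule: marginalize out the weather
def marginal_clothes(c):
    return sum(p for (clothes, _), p in joint.items() if clothes == c)

print(marginal_clothes("Hoodie"))  # 1/3
```

The same loop with the roles swapped gives the marginal probability of any weather condition.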
If P(x, y) was a joint distribution over continuous random variables, then to marginalize
out y we turn the summation into an integration over y just like before.
Conditional Probability:
The conditional probability of x given y is P(x|y) = P(x, y) / P(y).
In other words, if I’ve observed that y=y, then the probability that x=x is P(x, y)/P(y).
Why does observing y change the probability of x? Well, imagine you saw me walk inside
with a jacket on. Knowing that I’m wearing a jacket gives you information about the
weather without observing the weather directly.
Note that the conditional probability only exists if P(y) > 0. If it’s impossible for y to
occur, then we couldn’t have observed y to begin with.
By multiplying both sides of the last equation by P(y) we get the chain rule of
probability, P(x, y) = P(x|y) ⋅ P(y). The chain rule can be generalized to joint
distributions over any number of random variables.
Bayes’ Rule:
Notice that we can write the chain rule for two variables in two equivalent ways:
P(x, y) = P(x|y) ⋅ P(y) = P(y|x) ⋅ P(x)
If we set both right sides equal to each other and divide by P(y), we get Bayes’ rule:
P(x|y) = P(y|x) ⋅ P(x) / P(y)
Bayes’ rule is crucially important to much of statistics and machine learning. As alluded
to earlier, it’s the driving force behind Bayesian statistics. This simple rule allows us to
update our beliefs about quantities as we gather more observations from data. I’ll
definitely discuss Bayes’ rule (and Bayesian statistics) some more in a later post.
A similar concept is that of conditional independence. Two variables x and y are called
conditionally independent given another variable z if P(x, y|z) = P(x|z) ⋅ P(y|z). Let’s
do an example to see what this is all about.
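As a quick numerical sketch (all of the probabilities below are invented): build a joint P(x, y, z) out of P(z), P(x|z), and P(y|z), and check that the definition holds:

```python
# Hypothetical binary variables; the joint is built so that x and y are
# conditionally independent given z by construction.
P_z = {0: 0.4, 1: 0.6}
P_x_given_z = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}  # P(x|z)
P_y_given_z = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.9, 1: 0.1}}  # P(y|z)

joint = {(x, y, z): P_x_given_z[z][x] * P_y_given_z[z][y] * P_z[z]
         for x in (0, 1) for y in (0, 1) for z in (0, 1)}

# Verify P(x, y | z) == P(x|z) * P(y|z) for every combination
for z in (0, 1):
    P_of_z = sum(p for (_, _, zz), p in joint.items() if zz == z)
    for x in (0, 1):
        for y in (0, 1):
            lhs = joint[(x, y, z)] / P_of_z
            rhs = P_x_given_z[z][x] * P_y_given_z[z][y]
            assert abs(lhs - rhs) < 1e-12
print("conditionally independent")
```

Once z is observed, x carries no extra information about y; that is exactly what the factored form expresses.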
. . .
Expectation:
The expected value, or expectation, of a function h(x) on a random variable x ~ P(x) is
the average value of h(x) weighted by P(x). For a discrete x, we write this as:
E[h(x)] = Σ over x of h(x) ⋅ P(x)
If x had been continuous, we would replace the summation with an integral (I’ll bet
you’re seeing a pattern by now). So the expectation acts as a weighted average over h(x),
where the weights are the probabilities of each x.
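For example, the expected value of a fair die roll (taking h(x) = x) is the familiar 3.5:

```python
from fractions import Fraction

# Fair die: uniform PMF over the faces 1..6
pmf = {face: Fraction(1, 6) for face in range(1, 7)}

# E[h(x)] = sum over x of h(x) * P(x); here h is the identity
expectation = sum(face * p for face, p in pmf.items())
print(expectation)  # 7/2, i.e. 3.5
```

Swapping the identity for any other h(x), say a payout per face, reweights the same sum.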
What’s the expected value of playing the guessing game at the casino if we assume we
have a 1/10 chance of guessing the correct number?
Another nice property of expectations is that they’re linear. Let’s assume g is another
function of x, and α and β are constants. Then we have:
E[α ⋅ h(x) + β ⋅ g(x)] = α ⋅ E[h(x)] + β ⋅ E[g(x)]
The covariance of two functions of x measures how they vary together:
Cov(h(x), g(x)) = E[(h(x) − E[h(x)]) ⋅ (g(x) − E[g(x)])]
When the absolute value of covariance is high, the two functions tend to vary far from
their means at the same time. When the sign of the covariance is positive, the two
functions map to higher values together. If the sign is negative, one function maps to
higher values, while the other function maps to lower values and vice versa. The
visualization at the beginning of this post shows samples from a joint Gaussian
distribution with positive covariance between the variables. You can see that as the first
variable increases, so does the second.
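As a small deterministic sketch (the data points below are made up), sample covariance shows exactly this sign behavior:

```python
def covariance(xs, ys):
    """Sample covariance of two equal-length sequences."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

xs = [1, 2, 3, 4, 5, 6]
pos = [2.1, 3.9, 6.2, 8.0, 9.8, 12.1]  # rises with xs -> positive covariance
neg = [9.0, 7.5, 6.1, 4.0, 2.2, 1.0]   # falls as xs rises -> negative covariance

print(covariance(xs, pos) > 0)  # True
print(covariance(xs, neg) < 0)  # True
```

The magnitude depends on the scale of the data, which is why the correlation coefficient (covariance normalized by the standard deviations) is often reported instead.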
Moments:
Note that we can calculate the expectation and variance for a random variable by
replacing the function h(x) with x itself. The expectation of a distribution is its mean, or
first moment. The variance of a distribution is its second central moment. Higher order
moments of probability distributions capture other characteristics like skewness and
kurtosis.
. . .
Important Distributions
We’ve covered most of the important aspects of probability theory. These ideas act as
building blocks for developing the underpinnings of the majority of statistics and
machine learning. In order to master probability theory and start bridging the gap
toward statistics, one needs to become somewhat familiar with the more useful
probability distributions.
Bernoulli: models the outcome of coin flips and other binary events
Geometric: models how many flips are necessary before you get a success
Binomial: models the number of successes in a fixed number of flips
Multinomial: a generalization of the Binomial to more than two outcomes (like a die
roll)
For continuous distributions it’s also useful to know the shape. For example, we saw that
the Gaussian distribution is shaped like a bell, with most of its density close to the mean.
The Beta distribution can take on a wide range of shapes over the interval [0,1]. This
makes the Beta distribution a good choice for modeling our beliefs about particular
probabilities.
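A sketch of the Beta density using only the standard library (math.gamma); numerically integrating a few parameter choices confirms each shape is a valid PDF:

```python
import math

def beta_pdf(x, a, b):
    """Beta(a, b) density on (0, 1)."""
    const = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return const * x ** (a - 1) * (1 - x) ** (b - 1)

# Different (a, b) give very different shapes over [0, 1]:
# (2, 5) piles mass near 0, (5, 2) near 1, (2, 2) is a symmetric hump.
n = 10_000
dx = 1.0 / n
for a, b in [(2, 5), (5, 2), (2, 2)]:
    area = sum(beta_pdf((i + 0.5) * dx, a, b) * dx for i in range(n))
    assert abs(area - 1) < 1e-3
print("all areas ~= 1")
```

That flexibility over [0, 1] is what makes the Beta a natural model for a belief about an unknown probability.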
It’s also important to remember that these well formed distributions are more like
templates than anything else. The true distribution of your data is probably not so nice
and may even be changing over time.
. . .
Supervised Learning:
In supervised machine learning, our goal is to learn from labeled data. Data being
labeled means that for some inputs X, we know the desired outputs Y. Some possible
tasks include:
Predict the price of a stock given some features about the company.
How can probability help us in these scenarios? We can learn a mapping from X to Y in
various ways. First, you could learn P(Y|X), that is to say, a probability distribution over
possible values of Y given that you’ve observed a new sample X. Machine learning
algorithms that find this distribution are called discriminative. Imagine I tell you that I
saw an animal that had fur, a long tail, and was two inches tall. Can you discriminate
between possible animals and guess what it was?
Alternatively, we could instead try to learn P(X|Y), the probability distribution over
inputs given labels. Algorithms for doing this are called generative. Given that I want a
mouse, can you describe the possible heights, furriness, and length of tails that mice
have? Enumerating the possible values for the features is sort of like generating all
possible mice.
You may be wondering how knowing a generative model would help us with our task of
classifying animals. Remember Bayes’ rule? From our training data we can learn P(Y),
the probability of any specific animal, and P(X), the probability of any specific
configuration of the features. Using these terms we can answer queries in the form of
P(Y|X) using Bayes’ rule: P(Y|X) = P(X|Y) ⋅ P(Y) / P(X).
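A toy numerical sketch (every probability below is invented for illustration): turn a generative model P(X|Y) plus a prior P(Y) into the posterior P(Y|X) with Bayes’ rule:

```python
# Hypothetical prior over animals and generative likelihoods P(features | animal)
prior = {"mouse": 0.3, "cat": 0.5, "dog": 0.2}         # P(Y)
likelihood = {"mouse": 0.9, "cat": 0.05, "dog": 0.01}  # P(observed X | Y)

# P(X) by the sum rule: sum over animals of P(X|Y) * P(Y)
evidence = sum(likelihood[a] * prior[a] for a in prior)

# Bayes' rule: P(Y|X) = P(X|Y) * P(Y) / P(X)
posterior = {a: likelihood[a] * prior[a] / evidence for a in prior}
best = max(posterior, key=posterior.get)
print(best)  # "mouse" dominates given these made-up numbers
```

The posterior sums to 1 over the animals, so the generative model has been converted into exactly the kind of discriminative answer we wanted.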
It’s possible to learn a mapping from X to Y which isn’t in the form of a probability
distribution. We could fit a deterministic function f to our training data such that f(X) ≈
Y. What makes having a distribution better? Well, imagine an algorithm is diagnosing
your illness, and it tells you that you have a month left to live. The function f can’t
express to you how confident it is in the assessment. Maybe you have features that the
algorithm never saw in the training data, causing it to more or less guess an outcome.
The probabilistic model quantifies uncertainty, the regular function does not.
Unsupervised Learning:
Unsupervised learning is a broad set of techniques for learning from unlabeled data,
where we just have some samples X but no output Y. Common unsupervised tasks
include:
Taking high dimensional data and projecting it into a meaningful lower dimensional
space (dimension reduction, factor analysis, embedding).
Characterizing the distribution of unlabeled data is useful for many tasks. One example
is anomaly detection. If we learn P(X), where X represents normal bank transactions,
then we can use P(X) to measure the likelihood of future transactions. If we observe a
transaction with low probability, we can flag it as suspicious and possibly fraudulent.
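A minimal sketch of that idea, assuming transaction amounts are roughly Gaussian (the amounts below are hypothetical):

```python
import math
import statistics

# Hypothetical "normal" transaction amounts used to fit P(X)
normal_amounts = [12.5, 40.0, 33.2, 25.0, 18.7, 29.9, 45.1, 22.3, 36.4, 27.8]
mu = statistics.mean(normal_amounts)
sigma = statistics.stdev(normal_amounts)

def likelihood(x):
    """Density of x under the fitted Gaussian model of normal behavior."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# A wildly large transfer gets far lower likelihood than a typical one
typical, suspicious = 30.0, 500.0
print(likelihood(suspicious) < likelihood(typical))  # True -> flag it
```

In practice one would pick a likelihood threshold (or a richer model than a single Gaussian) for deciding what counts as anomalous.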
Clustering is one of the canonical problems of unsupervised learning. Given some data
points originating from separate groups, how can we determine which group each point
belongs to? One method is to assume that each group is generated from a different
probability distribution. Solving the problem then becomes finding the most likely
configuration of these distributions.
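As a sketch with known (hypothetical) component distributions, assign each point to the Gaussian under which it is most likely:

```python
import math

def gaussian_pdf(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# Two hypothetical groups: cluster A centered at 0, cluster B at 10
clusters = {"A": (0.0, 1.5), "B": (10.0, 1.5)}  # (mean, std dev)

def assign(x):
    """Pick the cluster whose distribution makes x most likely."""
    return max(clusters, key=lambda c: gaussian_pdf(x, *clusters[c]))

points = [-0.5, 1.2, 9.0, 10.3, 4.0]
print([assign(p) for p in points])  # ['A', 'A', 'B', 'B', 'A']
```

Here the cluster parameters are given; in a real clustering problem they are unknown, and algorithms such as expectation-maximization iterate between assignments like this one and re-estimating the parameters.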
Dimension reduction is the other main area of unsupervised learning. High dimensional
data takes up memory, slows down computations, and is hard to visualize and interpret.
We’d like to have ways of reducing the data to a lower dimension without losing too
much information. One can think of this problem as finding a distribution in a lower
dimensional space with similar characteristics to the distribution of the original data.
Reinforcement Learning:
The field of reinforcement learning is all about training artificial agents to perform well
at specific tasks. The agents learn by taking actions in their environment and observing
reward signals based on their behavior. The goal of the agent is to maximize its expected
long term reward. Probability is used in reinforcement learning for several aspects of the
learning process. You may have picked up on the word “expected” in the goal. The
agent’s learning process often revolves around quantifying the uncertainty of the utility
of taking one specific action over another.
. . .
Conclusion
This has been a gentle overview of the language of probability theory with a brief
discussion on how we will apply these concepts to more advanced machine learning and
statistics moving forward. Be sure to check out part two on Maximum Likelihood
Estimation:
If you’d like to tackle probability theory from another angle, I highly recommend
checking out this amazing visual introduction from Seeing Theory:
Seeing Theory
A visual introduction to probability and statistics.
seeing-theory.brown.edu