Probability Theory - Towards Data Science
Probability Theory
Fundamentals of Machine Learning (Part 1)
William Fleshman
Jan 29 · 18 min read
https://towardsdatascience.com/probability-fundamentals-of-machine-learning-part-1-a156b4703e69 1/19
The frequentist’s position is that estimates come from experiments and experiments
only. If we want to estimate how likely a six-sided die is to roll a 4, we should roll the die
many times and estimate the probability as the fraction of rolls that come up 4.
This method works well when we have a large amount of data, but with fewer examples
we can’t be confident in our estimates. If we haven’t seen a 4 after five rolls, does that
mean a 4 is impossible? The other issue is that we can’t inject any of our prior knowledge
about dice into our estimates. If we knew the die was fair, not seeing a 4 in the first five
rolls is completely understandable.
. . .
Probability is all about the possibility of various outcomes. The set of all possible
outcomes is called the sample space. The sample space for a coin flip is {heads, tails}.
The sample space for the temperature of water is all values between the freezing and
boiling point. Only one outcome in the sample space is possible at a time, and the sample
space must contain all possible values. The sample space is often depicted as Ω (capital
omega) and a specific outcome as ω (lowercase omega). We represent the probability of
an event ω as P(ω).
Axioms of Probability
For a sample space Ω, the axioms require that 0 ≤ P(ω) ≤ 1 for every outcome ω, and
that the probabilities of all outcomes sum to 1: Σ P(ω) = 1.
In plain English, the probability of any event has to be between 0 (impossible) and 1
(certain), and the sum of the probabilities of all events should be 1. This follows from the
fact that the sample space must contain all possible outcomes. Therefore, we are certain
(probability 1) that one of the possible outcomes will occur.
A random variable x is a variable which randomly takes on values from a sample space.
We often indicate a specific value x can take on with italics. For example, if x represented
the outcome of a coin flip, we might discuss a specific outcome as x = heads. Random
variables can either be discrete, like the coin, or continuous (able to take on an
uncountably infinite number of possible values).
Discrete Distributions:
Discrete random variables are described with a probability mass function (PMF). A
PMF maps each value in the variable’s sample space to a probability. One such PMF is the
uniform distribution over n possible outcomes: P(x=x) = 1/n. This reads as “The
probability of x taking on the value x is 1 divided by the number of possible values”. It’s
called the uniform distribution because each outcome is equally likely (the likelihood is
spread uniformly over all possible values). Fair dice rolls are modeled by a uniform
distribution since each face of the die is equally likely. A loaded die is modeled by a
categorical distribution, where each outcome is assigned a different probability.
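As a quick sketch (the loaded-die weights below are hypothetical), both PMFs can be written as plain dictionaries and sampled from:

```python
import random

# Uniform PMF for a fair six-sided die: P(x = x) = 1/n
fair_die = {face: 1 / 6 for face in range(1, 7)}

# Categorical PMF for a (hypothetical) loaded die that favors 6
loaded_die = {1: 0.10, 2: 0.10, 3: 0.10, 4: 0.10, 5: 0.10, 6: 0.50}

# Both PMFs satisfy the axioms: each probability is in [0, 1] and they sum to 1
for pmf in (fair_die, loaded_die):
    assert all(0 <= p <= 1 for p in pmf.values())
    assert abs(sum(pmf.values()) - 1) < 1e-9

# Sampling: random.choices draws faces in proportion to the given weights
roll = random.choices(list(loaded_die), weights=list(loaded_die.values()), k=1)[0]
print(roll in loaded_die)  # True
```

The only difference between the two distributions is whether the probability mass is spread evenly or unevenly over the faces.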
Continuous Distributions:
Continuous random variables are described by probability density functions (PDF)
which can be a bit more difficult to understand. We generally indicate the PDF for a
random variable x as f(x). PDFs map an infinite sample space to relative likelihood
values. To understand this, let’s look at an example with one of the most famous
continuous distributions, the Gaussian (Normal) distribution.
The Gaussian distribution (colloquially called the bell curve) can be used to model
several natural phenomena. For example, heights of each gender are approximately
Gaussian distributed. The Gaussian distribution is parameterized by two values: the
mean μ (mu) and variance σ² (sigma squared). The mean specifies the center of the
distribution, and the variance specifies the width of the distribution. You may have also
heard about the standard deviation σ, which is just the square root of the variance. To
indicate that x is a random variable drawn from a Gaussian with mean μ and variance σ²,
we write:
x ~ N(μ, σ²)
The functional form of the PDF for the Gaussian can be intimidating at first glance. I
promise that after working with Gaussian distributions in applications the fear goes
away! The functional form of the PDF is:
f(x | μ, σ²) = (1 / √(2πσ²)) ⋅ exp(−(x − μ)² / (2σ²))
The left hand side of the equation says “The PDF of x given μ and σ²”. The vertical bar
represents the word “given” and indicates that we already know all of the values after it
in the parentheses. Let’s plot this equation given μ = 0 and σ² = 4:
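The original plot isn’t reproduced here, but a few lines of standard-library Python evaluate the same curve; the peak sits at the mean, with height 1/√(2π·4):

```python
import math

def gaussian_pdf(x, mu=0.0, var=4.0):
    """PDF of a Gaussian with mean mu and variance var."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# The curve is symmetric about the mean and peaks there
peak = gaussian_pdf(0.0)
print(round(peak, 4))                           # 0.1995
print(gaussian_pdf(-2.0) == gaussian_pdf(2.0))  # True (symmetry)
```

Plugging in a grid of x values and plotting gives the familiar bell shape centered at 0 with width set by σ = 2.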
So what do I mean by relative likelihoods? Unlike discrete distributions, the value of the
PDF at x = x is not the actual probability of x. This is a common misconception when
people first start dabbling with probability theory. Since there are infinitely many values
that x could take on, the probability of x taking on any specific value is actually 0! I’m
guessing you don’t believe me, but let’s think about it together while we revisit our
axioms.
Recall that the total probability for every possible value needs to sum to 1. How do we
sum over an infinite number of values? The answer comes from calculus in the form of
the integral. We can rewrite our axiom in terms of the PDF using the integral:
∫ f(x) dx = 1, with the integral taken from −∞ to ∞.
If you aren’t familiar with calculus, the integral is an operator which calculates the area
under the curve f(x). It’s a generalization of summation to an infinite number of
values. So the area under the PDF represents the total probability of the Gaussian! If you
are familiar with calculus (and feel up to it) you can calculate the integral of the
Gaussian yourself to confirm that the area is 1.
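If calculus isn’t your thing, a crude numerical check works too. Here is a Riemann-sum approximation of the area under the same Gaussian (μ = 0, σ² = 4); the tails beyond ±20 (ten standard deviations) contribute essentially nothing:

```python
import math

def gaussian_pdf(x, mu=0.0, var=4.0):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Approximate the integral over [-20, 20] with n thin rectangles
n = 40_000
dx = 40 / n
area = sum(gaussian_pdf(-20 + i * dx) * dx for i in range(n))
print(round(area, 4))  # ≈ 1.0
```

Shrinking dx makes the approximation converge to the exact area of 1 guaranteed by the axiom above.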
Because the area is what we’re interested in, it’s often more useful to work with a
continuous random variable’s cumulative distribution function (CDF). We write the
CDF, F(x), as:
F(x) = ∫ f(t) dt, integrating from −∞ to x.
What in the world does that mean? Well, for a given value x, we’re taking the integral of
the PDF from negative infinity to that value. So F(x) gives us the area under the PDF
over the interval from negative infinity to x.
We just determined that the area corresponds to the probability, so F(x) gives us P(x≤x).
Now, we can use the CDF to determine the probability of any given range [a,b] by
noticing that P(a≤x≤b) = F(b)-F(a). This answers the question, “What is the probability
that x will be between a and b?”.
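For the Gaussian there’s a closed form for F(x) via the error function, so this range probability takes two lines (same μ = 0, σ² = 4 as before):

```python
import math

def gaussian_cdf(x, mu=0.0, var=4.0):
    """F(x) = P(x <= x) for a Gaussian, via the error function."""
    return 0.5 * (1 + math.erf((x - mu) / math.sqrt(2 * var)))

# P(a <= x <= b) = F(b) - F(a); e.g. within one standard deviation (sigma = 2)
p = gaussian_cdf(2.0) - gaussian_cdf(-2.0)
print(round(p, 4))  # 0.6827, the familiar "68%" rule
```

The same subtraction answers “what is the probability x lands in [a, b]?” for any interval.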
Asking for P(x=x) is equivalent to asking P(x≤x≤x) = F(x)-F(x) = 0. So there you have
it, the probability of sampling a specific number from the distribution is 0! (Stronger
arguments could be made by actually taking the limit). Here’s the CDF for the same
Gaussian from before:
We see that the probability of x taking on values less than −2.5 is small (about 0.11 for
this Gaussian), and that values sampled from x will mostly be less than 2.5.
. . .
This toy example highlights a few important things. First, I’m terrible at making toy
examples. Second, notice that the table meets the requirements laid out by our axioms.
We can immediately answer questions of the form P(clothes=T-Shirt, weather=Sunny),
but joint distributions give us so much more!
Sum Rule
P(x) = Σ over y of P(x, y)
The term P(x) is called the marginal probability distribution, since we’ve
“marginalized” away the random variable y. Let’s use the sum rule to calculate the
probability I’ll wear a hoodie. P(Hoodie) = P(Hoodie, Sunny) + P(Hoodie, Cloudy) +
P(Hoodie, Rainy) = 3/9. We could use the same process to find the marginal probability
of any of the clothing items or any weather condition.
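A minimal sketch in Python: the individual joint entries below are hypothetical (the article’s table isn’t reproduced here), chosen only so that the hoodie row sums to the 3/9 computed above:

```python
from fractions import Fraction

# Hypothetical joint distribution P(clothes, weather); each entry is invented,
# but the hoodie row sums to 3/9 to match the text.
joint = {
    ("Hoodie",  "Sunny"):  Fraction(1, 9),
    ("Hoodie",  "Cloudy"): Fraction(1, 9),
    ("Hoodie",  "Rainy"):  Fraction(1, 9),
    ("T-Shirt", "Sunny"):  Fraction(2, 9),
    ("T-Shirt", "Cloudy"): Fraction(1, 9),
    ("T-Shirt", "Rainy"):  Fraction(0, 9),
    ("Jacket",  "Sunny"):  Fraction(0, 9),
    ("Jacket",  "Cloudy"): Fraction(1, 9),
    ("Jacket",  "Rainy"):  Fraction(2, 9),
}
assert sum(joint.values()) == 1  # axiom check

# Sum rule: marginalize out the weather
def marginal_clothes(c):
    return sum(p for (clothes, _), p in joint.items() if clothes == c)

print(marginal_clothes("Hoodie"))  # 1/3
```

The same loop with the roles swapped gives the marginal probability of any weather condition.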
If P(x, y) was a joint distribution over continuous random variables, then to marginalize
out y we turn the summation into an integration over y just like before.
Conditional Probability:
The conditional probability of x given y is P(x|y) = P(x, y) / P(y).
In other words, if I’ve observed that y=y, then the probability that x=x is P(x, y)/P(y).
Why does observing y change the probability of x? Well, imagine you saw me walk inside
with a jacket on. Knowing that I’m wearing a jacket gives you information about the
weather without observing the weather directly.
Note that the conditional probability only exists if P(y) > 0. If it’s impossible for y to
occur, then we couldn’t have observed y to begin with.
By multiplying both sides of the last equation by P(y) we get the chain rule of
probability, P(x, y) = P(x|y) ⋅ P(y). The chain rule can be generalized to joint
distributions over any number of random variables.
Bayes’ Rule:
Notice that we can write the chain rule for two variables in two equivalent ways:
P(x, y) = P(x|y) ⋅ P(y) = P(y|x) ⋅ P(x)
If we set both right sides equal to each other and divide by P(y), we get Bayes’ rule:
P(x|y) = P(y|x) ⋅ P(x) / P(y)
Bayes’ rule is crucially important to much of statistics and machine learning. As alluded
to earlier, it’s the driving force behind Bayesian statistics. This simple rule allows us to
update our beliefs about quantities as we gather more observations from data. I’ll
definitely discuss Bayes’ rule (and Bayesian statistics) some more in a later post.
A similar concept is that of conditional independence. Two variables x and y are called
conditionally independent given another variable z if P(x, y|z) = P(x|z) ⋅ P(y|z). Let’s
do an example to see what this is all about.
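As a quick numerical sketch (all of the probabilities below are invented): build a joint P(x, y, z) out of P(z), P(x|z), and P(y|z), and check that the definition holds:

```python
# Hypothetical binary variables; the joint is built so that x and y are
# conditionally independent given z by construction.
P_z = {0: 0.4, 1: 0.6}
P_x_given_z = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}  # P(x|z)
P_y_given_z = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.9, 1: 0.1}}  # P(y|z)

joint = {(x, y, z): P_x_given_z[z][x] * P_y_given_z[z][y] * P_z[z]
         for x in (0, 1) for y in (0, 1) for z in (0, 1)}

# Verify P(x, y | z) == P(x|z) * P(y|z) for every combination
for z in (0, 1):
    P_of_z = sum(p for (_, _, zz), p in joint.items() if zz == z)
    for x in (0, 1):
        for y in (0, 1):
            lhs = joint[(x, y, z)] / P_of_z
            rhs = P_x_given_z[z][x] * P_y_given_z[z][y]
            assert abs(lhs - rhs) < 1e-12
print("conditionally independent")
```

Once z is observed, x carries no extra information about y; that is exactly what the factored form expresses.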
. . .
Expectation:
The expected value, or expectation, of a function h(x) on a random variable x ~ P(x) is
the average value of h(x) weighted by P(x). For a discrete x, we write this as:
E[h(x)] = Σ over x of h(x) ⋅ P(x)
If x had been continuous, we would replace the summation with an integral (I’ll bet
you’re seeing a pattern by now). So the expectation acts as a weighted average over h(x),
where the weights are the probabilities of each x.
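For example, the expected value of a fair die roll (taking h(x) = x) is the familiar 3.5:

```python
from fractions import Fraction

# Fair die: uniform PMF over the faces 1..6
pmf = {face: Fraction(1, 6) for face in range(1, 7)}

# E[h(x)] = sum over x of h(x) * P(x); here h is the identity
expectation = sum(face * p for face, p in pmf.items())
print(expectation)  # 7/2, i.e. 3.5
```

Swapping the identity for any other h(x), say a payout per face, reweights the same sum.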
What’s the expected value of playing the guessing game at the casino if we assume we
have a 1/10 chance of guessing the correct number?
Another nice property of expectations is that they’re linear. Let’s assume g is another
function of x, and α and β are constants. Then we have:
E[α ⋅ h(x) + β ⋅ g(x)] = α ⋅ E[h(x)] + β ⋅ E[g(x)]
The covariance of two functions of x measures how they vary together:
Cov(h(x), g(x)) = E[(h(x) − E[h(x)]) ⋅ (g(x) − E[g(x)])]
When the absolute value of covariance is high, the two functions tend to vary far from
their means at the same time. When the sign of the covariance is positive, the two
functions map to higher values together. If the sign is negative, one function maps to
higher values, while the other function maps to lower values and vice versa. The
visualization at the beginning of this post shows samples from a joint Gaussian
distribution with positive covariance between the variables. You can see that as the first
variable increases, so does the second.
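As a small deterministic sketch (the data points below are made up), sample covariance shows exactly this sign behavior:

```python
def covariance(xs, ys):
    """Sample covariance of two equal-length sequences."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

xs = [1, 2, 3, 4, 5, 6]
pos = [2.1, 3.9, 6.2, 8.0, 9.8, 12.1]  # rises with xs -> positive covariance
neg = [9.0, 7.5, 6.1, 4.0, 2.2, 1.0]   # falls as xs rises -> negative covariance

print(covariance(xs, pos) > 0)  # True
print(covariance(xs, neg) < 0)  # True
```

The magnitude depends on the scale of the data, which is why the correlation coefficient (covariance normalized by the standard deviations) is often reported instead.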
Moments:
Note that we can calculate the expectation and variance for a random variable by
replacing the function h(x) with x itself. The expectation of a distribution is its mean, or
first moment. The variance of a distribution is its second central moment. Higher order
moments of probability distributions capture other characteristics like skewness and
kurtosis.
. . .
Important Distributions
We’ve covered most of the important aspects of probability theory. These ideas act as
building blocks for developing the underpinnings of the majority of statistics and
machine learning. In order to master probability theory and start bridging the gap
toward statistics, one needs to become somewhat familiar with the more useful
probability distributions.
Bernoulli: models the outcome of coin flips and other binary events
Geometric: models how many flips are necessary before you get a success
Binomial: models the number of successes in a fixed number of flips
Multinomial: a generalization of the Binomial to more than two outcomes (like a die
roll)
For continuous distributions it’s also useful to know the shape. For example, we saw that
the Gaussian distribution is shaped like a bell, with most of its density close to the mean.
The Beta distribution can take on a wide range of shapes over the interval [0,1]. This
makes the Beta distribution a good choice for modeling our beliefs about particular
probabilities.
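A sketch of the Beta density using only the standard library (math.gamma); numerically integrating a few parameter choices confirms each shape is a valid PDF:

```python
import math

def beta_pdf(x, a, b):
    """Beta(a, b) density on (0, 1)."""
    const = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return const * x ** (a - 1) * (1 - x) ** (b - 1)

# Different (a, b) give very different shapes over [0, 1]:
# (2, 5) piles mass near 0, (5, 2) near 1, (2, 2) is a symmetric hump.
n = 10_000
dx = 1.0 / n
for a, b in [(2, 5), (5, 2), (2, 2)]:
    area = sum(beta_pdf((i + 0.5) * dx, a, b) * dx for i in range(n))
    assert abs(area - 1) < 1e-3
print("all areas ~= 1")
```

That flexibility over [0, 1] is what makes the Beta a natural model for a belief about an unknown probability.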
It’s also important to remember that these well formed distributions are more like
templates than anything else. The true distribution of your data is probably not so nice
and may even be changing over time.
. . .
Supervised Learning:
In supervised machine learning, our goal is to learn from labeled data. Data being
labeled means that for some inputs X, we know the desired outputs Y. Some possible
tasks include:
Predict the price of a stock given some features about the company.
How can probability help us in these scenarios? We can learn a mapping from X to Y in
various ways. First, you could learn P(Y|X), that is to say, a probability distribution over
possible values of Y given that you’ve observed a new sample X. Machine learning
algorithms that find this distribution are called discriminative. Imagine I tell you that I
saw an animal that had fur, a long tail, and was two inches tall. Can you discriminate
between possible animals and guess what it was?
Alternatively, we could instead try to learn P(X|Y), the probability distribution over
inputs given labels. Algorithms for doing this are called generative. Given that I want a
mouse, can you describe the possible heights, furriness, and length of tails that mice
have? Enumerating the possible values for the features is sort of like generating all
possible mice.
You may be wondering how knowing a generative model would help us with our task of
classifying animals. Remember Bayes’ rule? From our training data we can learn P(Y),
the probability of any specific animal, and P(X), the probability of any specific
configuration of the features. Using these terms we can answer queries in the form of
P(Y|X) using Bayes’ rule: P(Y|X) = P(X|Y) ⋅ P(Y) / P(X).
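A toy numerical sketch (every probability below is invented for illustration): turn a generative model P(X|Y) plus a prior P(Y) into the posterior P(Y|X) with Bayes’ rule:

```python
# Hypothetical prior over animals and generative likelihoods P(features | animal)
prior = {"mouse": 0.3, "cat": 0.5, "dog": 0.2}         # P(Y)
likelihood = {"mouse": 0.9, "cat": 0.05, "dog": 0.01}  # P(observed X | Y)

# P(X) by the sum rule: sum over animals of P(X|Y) * P(Y)
evidence = sum(likelihood[a] * prior[a] for a in prior)

# Bayes' rule: P(Y|X) = P(X|Y) * P(Y) / P(X)
posterior = {a: likelihood[a] * prior[a] / evidence for a in prior}
best = max(posterior, key=posterior.get)
print(best)  # "mouse" dominates given these made-up numbers
```

The posterior sums to 1 over the animals, so the generative model has been converted into exactly the kind of discriminative answer we wanted.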
It’s possible to learn a mapping from X to Y which isn’t in the form of a probability
distribution. We could fit a deterministic function f to our training data such that f(X) ≈
Y. What makes having a distribution better? Well, imagine an algorithm is diagnosing
your illness, and it tells you that you have a month left to live. The function f can’t
express to you how confident it is in the assessment. Maybe you have features that the
algorithm never saw in the training data, causing it to more or less guess an outcome.
The probabilistic model quantifies uncertainty, the regular function does not.
Unsupervised Learning:
Unsupervised learning is a broad set of techniques for learning from unlabeled data,
where we just have some samples X but no output Y. Common unsupervised tasks
include:
Taking high dimensional data and projecting it into a meaningful lower dimensional
space (dimension reduction, factor analysis, embedding).
Characterizing the distribution of unlabeled data is useful for many tasks. One example
is anomaly detection. If we learn P(X), where X represents normal bank transactions,
then we can use P(X) to measure the likelihood of future transactions. If we observe a
transaction with low probability, we can flag it as suspicious and possibly fraudulent.
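A minimal sketch of that idea, assuming transaction amounts are roughly Gaussian (the amounts below are hypothetical):

```python
import math
import statistics

# Hypothetical "normal" transaction amounts used to fit P(X)
normal_amounts = [12.5, 40.0, 33.2, 25.0, 18.7, 29.9, 45.1, 22.3, 36.4, 27.8]
mu = statistics.mean(normal_amounts)
sigma = statistics.stdev(normal_amounts)

def likelihood(x):
    """Density of x under the fitted Gaussian model of normal behavior."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# A wildly large transfer gets far lower likelihood than a typical one
typical, suspicious = 30.0, 500.0
print(likelihood(suspicious) < likelihood(typical))  # True -> flag it
```

In practice one would pick a likelihood threshold (or a richer model than a single Gaussian) for deciding what counts as anomalous.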
Clustering is one of the canonical problems of unsupervised learning. Given some data
points originating from separate groups, how can we determine which group each point
belongs to? One method is to assume that each group is generated from a different
probability distribution. Solving the problem then becomes finding the most likely
configuration of these distributions.
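As a sketch with known (hypothetical) component distributions, assign each point to the Gaussian under which it is most likely:

```python
import math

def gaussian_pdf(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# Two hypothetical groups: cluster A centered at 0, cluster B at 10
clusters = {"A": (0.0, 1.5), "B": (10.0, 1.5)}  # (mean, std dev)

def assign(x):
    """Pick the cluster whose distribution makes x most likely."""
    return max(clusters, key=lambda c: gaussian_pdf(x, *clusters[c]))

points = [-0.5, 1.2, 9.0, 10.3, 4.0]
print([assign(p) for p in points])  # ['A', 'A', 'B', 'B', 'A']
```

Here the cluster parameters are given; in a real clustering problem they are unknown, and algorithms such as expectation-maximization iterate between assignments like this one and re-estimating the parameters.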
Dimension reduction is the other main area of unsupervised learning. High dimensional
data takes up memory, slows down computations, and is hard to visualize and interpret.
We’d like to have ways of reducing the data to a lower dimension without losing too
much information. One can think of this problem as finding a distribution in a lower
dimensional space with similar characteristics to the distribution of the original data.
Reinforcement Learning:
The field of reinforcement learning is all about training artificial agents to perform well
at specific tasks. The agents learn by taking actions in their environment and observing
reward signals based on their behavior. The goal of the agent is to maximize its expected
long term reward. Probability is used in reinforcement learning for several aspects of the
learning process. You may have picked up on the word “expected” in the goal. The
agent’s learning process often revolves around quantifying the uncertainty of the utility
of taking one specific action over another.
. . .
Conclusion
This has been a gentle overview of the language of probability theory with a brief
discussion on how we will apply these concepts to more advanced machine learning and
statistics moving forward. Be sure to check out part two on Maximum Likelihood
Estimation:
If you’d like to tackle probability theory from another angle, I highly recommend
checking out this amazing visual introduction from Seeing Theory:
Seeing Theory
A visual introduction to probability and statistics.
seeing-theory.brown.edu