Probability I Course
Andrew Wade
Epiphany 2017–2018
Course Details
Lecturer Dr Andrew Wade.
Office CM302.
Email [email protected].
Lectures There are lectures at 10:00 Tuesday, 10:00 Thursday and 15:00 Friday, all taking
place in CLC013. Each week, a brief outline of the topics to be covered in the upcoming week's
lectures will be made available on DUO, to help guide your independent study.
Tutorials take place every two weeks, starting in week 12.
The purpose of tutorials is for students to ask questions about the exercises and/or
to do exercises under supervision. These meetings are as important as the lectures, so take full
advantage of them. A record of attendance is kept.
Problem Classes take place in the Friday lecture slot every two weeks, starting in week 13.
The purpose of problem classes is to go over exercises, on the board, in detail.
DUO will contain all supporting material for the course, including resources such as summary notes,
handouts, problem sheets, and solutions. Also you will find the schedule of problems set for
homework and tutorials, information about tutors and tutor groups, and announcements (which
are made by email as well).
Homework is set each Thursday, to hand in to your tutor at a time agreed with them (generally on
the Friday of the following week). Your tutor will return your marked work to you.
About these lecture notes
These boxes delimit advanced material for students who wish to dig a little deeper; some of the
content of these boxes may show up in an exam, but only as ‘unseen’ material.
Contents
4 Interpretations of probability 25
4.1 Equally likely outcomes interpretation . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
4.2 Relative frequency interpretation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
4.3 Betting interpretation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.4 Interpretation and the axioms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
Bibliographical notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
6 Random variables 30
6.1 Definition and notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
6.2 Discrete random variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
6.3 The binomial distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
6.4 The Poisson distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
6.5 Continuous random variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
6.6 The uniform distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
6.7 The exponential distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
6.8 The normal distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
6.9 Cumulative distribution functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
6.10 Standard normal tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
6.11 Functions of random variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
Bibliographical notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
8 Expectation 50
8.1 Definition and interpretation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
8.2 Expectation of functions of random variables . . . . . . . . . . . . . . . . . . . . . . . 52
8.3 Linearity of expectation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
8.4 Variance and covariance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
8.5 Conditional expectation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
8.6 Independence: multiplication rule for expectation . . . . . . . . . . . . . . . . . . . . . 61
8.7 Expectation and probability inequalities . . . . . . . . . . . . . . . . . . . . . . . . . . 63
Bibliographical notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
9 Limit theorems 65
9.1 The weak law of large numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
9.2 The central limit theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
9.3 Moment generating functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
Bibliographical notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
Chapter 1
1.1 Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2 Sample space and events . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.3 Event calculus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.4 The axioms of probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.5 Consequences of the axioms . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.6 Sigma-algebras . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
Bibliographical notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
Goals: Understand elementary set theory and how to use it to formulate probabilistic scenarios
and to describe the calculus of events. Be familiar with the axioms of probability and their
consequences, and how these properties may be deduced from the axioms.
In this chapter, we lay the foundations of probability calculus, and establish the main techniques for
practical calculations with probabilities. The mathematical theory of probability is based on axioms,
like Euclidean geometry. In classical geometry, the fundamental objects posited by the axioms are
points and lines; in probability, they are events and their probabilities. When real world applications
demand interpretations of these mathematical objects, there are several different approaches, which
we mention later in this course; for now, we simply make the important point that the mathematical
theory does not depend on any real-world interpretation that we may have in mind.
It is very important to get a good familiarity with the axioms and properties of probability, as we
will rely on them very extensively for the remainder of this course. An excellent understanding of the
foundations laid in this chapter is essential to obtain an understanding of all further chapters.
1.1 Sets
Also see [BH15, Appendix A.1], [DS02, Section 1.4], [Sti03, Appendix to Chapter 0], or [SHE07, Section 1.2
(Sets)].
In probability, we make extensive use of sets, therefore we first spend some time on them. In essence,
a set is an unordered collection of distinguishable objects; these objects can be numbers, functions,
other sets, and so on—any mathematical object can belong to a set.
The formal notation for a set is an opening curly bracket, followed by a list of elements that belong
to the set, followed by a closing curly bracket.
Example 1.1 The set containing the elements 2, 4, and 5 is denoted by:
{2, 4, 5}.
Example 1.2 As mentioned already, the ordering of the elements is irrelevant. For example, {2, 4, 5}
and {4, 5, 2} denote the same set.
Definition 1.3 (Empty Set) The set with no elements is called the empty set, and is denoted by ∅:
∅ := {}.
Variables that represent sets are typically denoted by capital letters such as A, B, C, and so on.
Definition 1.4 (Subset) For two sets A and B, we say that A is a subset of B, and we write A ⊆ B
(or B ⊇ A), whenever every element that belongs to A also belongs to B.
Definition 1.7 (Power Set) The set consisting of all subsets of a set A is called the power set of A,
and is denoted by 2^A:
2^A := {B : B ⊆ A}.
For example, if A = {1, 2, 3}, then
2^A = {∅, {1}, {2}, {3}, {1, 2}, {1, 3}, {2, 3}, {1, 2, 3}}.
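As a quick illustration, the power set of a small finite set can be enumerated mechanically; a minimal Python sketch using itertools:

```python
from itertools import combinations

def power_set(a):
    """Return the power set of a finite set, as a set of frozensets."""
    elems = list(a)
    # Collect all subsets of every size 0, 1, ..., |a|.
    return {frozenset(c) for r in range(len(elems) + 1)
            for c in combinations(elems, r)}

subsets = power_set({1, 2, 3})
print(len(subsets))  # 8, matching |2^A| = 2^|A| = 2^3
```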
1.2 Sample space and events
We consider a scenario in which there are various possible outcomes, but we are unsure about which
outcome will occur. A sample space is a set of outcomes for this scenario such that one and only one
will occur. In this course, we will usually denote the sample space by Ω, and a generic outcome by
ω ∈ Ω.
Example 1.9 Consider rolling a standard six-sided die. The most obvious sample space is Ω =
{1, 2, 3, 4, 5, 6}, but if one were interested only in whether the score was odd or even, or a six or not,
one could use Ω = {odd, even}, or Ω = {not a 6, 6}.
Often, we may enumerate the elements of the sample space Ω in a finite or infinite list Ω =
{ω1, ω2, . . .}, in which case we say the set Ω is countable or discrete.
A set is said to be countable when its elements can be enumerated in a (possibly infinite) sequence.
Every finite set is countable, and so is the set of natural numbers N := {1, 2, 3, . . .}. The set of integers
Z is countable as well. The set of real numbers R is not countable, and neither is any interval [a, b]
when a < b.
More formally:
Definition 1.10 A set A is countable if either:
(i) A is finite, or
(ii) there is a bijection (one-to-one and onto mapping) between A and the set of natural numbers N.
One can prove that the set of rational numbers Q is countable.
Definition 1.11 (Event) An event A associated with the sample space Ω is simply a subset of Ω:
event A Ď Ω,
and we say that an event A occurs when the outcome that occurs at the end of the scenario is in the
set A.
The empty set ∅ represents the impossible event, i.e., it will never occur. The sample space Ω represents
the certain event, i.e., it will always occur. Most interesting events are somewhere in between.
The representation of an event as a set obviously depends on the choice of sample space Ω for the
specific scenario under study, as shown by Examples 1.12 and 1.13.
Example 1.12 For rolling a standard cubic die, with Ω “ {1, 2, 3, 4, 5, 6}, the event ‘throw odd number’
is the subset {1, 3, 5} consisting of three outcomes.
Example 1.13 For the same scenario, but with Ω “ {odd, even}, the event ‘throw odd number’ is the
subset {odd} consisting of just one outcome.
Definition 1.15 (Disjoint) We say that events A and B are disjoint, mutually exclusive, or
incompatible if A ∩ B = ∅, i.e., it is impossible for A and B both to occur.
Example 1.16 Consider the sample space Ω := {1, 2, 3, 4, 5, 6}, and the events
A := {2, 4, 6} (even),
B := {1, 3, 5} (odd),
C := {1, 2, 3} (at most 3).
Then
A ∪ B = Ω (even or odd),
A ∩ B = ∅ (even and odd),
A^c = B (not even),
C \ A = {1, 3} (at most 3 but not even),
A ∪ C = {1, 2, 3, 4, 6} (even or at most 3),
A ∩ C = {2} (even and at most 3).
The events A and B are disjoint as A ∩ B = ∅.
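Set-based reasoning like this can be mirrored directly with Python's built-in set operations; the following sketch simply re-checks the identities in Example 1.16:

```python
# Events from Example 1.16 as Python sets over one roll of a die.
omega = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}   # even
B = {1, 3, 5}   # odd
C = {1, 2, 3}   # at most 3

assert A | B == omega            # A ∪ B = Ω (even or odd)
assert A & B == set()            # A ∩ B = ∅ (disjoint)
assert omega - A == B            # A^c = B (not even)
assert C - A == {1, 3}           # C \ A (at most 3 but not even)
assert A | C == {1, 2, 3, 4, 6}  # even or at most 3
assert A & C == {2}              # even and at most 3
```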
Exercise 10 (*) See problem sheet.
As with sums (Σ) and products (Π) of multiple numbers, for unions and intersections of multiple
sets we also write:
⋃_{i=1}^{n} A_i := A_1 ∪ A_2 ∪ ··· ∪ A_n,
⋂_{i=1}^{n} A_i := A_1 ∩ A_2 ∩ ··· ∩ A_n.
Occasionally, we will also need to take infinite unions and intersections over sequences of sets:
⋃_{i=1}^{∞} A_i := A_1 ∪ A_2 ∪ A_3 ∪ ··· = {ω : ω ∈ A_i for at least one i ∈ N};
⋂_{i=1}^{∞} A_i := A_1 ∩ A_2 ∩ A_3 ∩ ··· = {ω : ω ∈ A_i for every i ∈ N}.
We will also sometimes need De Morgan's Laws: for a (possibly infinite) collection of events A_i,
(a) (⋃_i A_i)^c = ⋂_i A_i^c, and
(b) (⋂_i A_i)^c = ⋃_i A_i^c.
It is often useful to visualize the sample space in a Venn diagram. Then events such as A are subsets
of the sample space. It is a helpful analogy to imagine the probability of an event as the area in the
Venn diagram.
[Venn diagram: two overlapping events A and B inside a rectangle representing Ω.]
This analogy is more apt than it first appears, since the mathematical foundations of rigor-
ous probability theory are built on measure theory, which is the same theory that gives rigorous
foundation to the concepts of length, area, and volume.
A3 If A and B are disjoint events (i.e. if A ∩ B = ∅) then
P(A ∪ B) = P(A) + P(B).
A4 Countable additivity. For any infinite sequence A_1, A_2, . . . of pairwise disjoint events (so
A_i ∩ A_j = ∅ for all i ≠ j),
P(⋃_{i=1}^{∞} A_i) = Σ_{i=1}^{∞} P(A_i).
For this course, we will usually assume that the probability distribution is given (and satisfies the
axioms), without worrying too much about how the important practical task of finding the probabilities
was carried out.
P(B \ A) = P(B) − P(A ∩ B).
P(A ∪ B) = P(A) + P(B) − P(A ∩ B).
C8 Boole's inequality. For any events A_1, A_2, . . . (these need not be pairwise disjoint),
P(⋃_{i=1}^{∞} A_i) ≤ Σ_{i=1}^{∞} P(A_i).
If A_1 ⊇ A_2 ⊇ ··· is a decreasing sequence of events, then
P(⋂_{n=1}^{∞} A_n) = lim_{n→∞} P(A_n).
Definition 1.18 (Partition) We say that the events E_1, E_2, . . . , E_k (all subsets of the same
sample space Ω) form a partition when:
(i) they all have positive probability, i.e., P(E_i) > 0 for all i;
(ii) they are pairwise disjoint, i.e., E_i ∩ E_j = ∅ whenever i ≠ j; and
(iii) their union is the whole sample space: ⋃_{i=1}^{k} E_i = Ω.
Example 1.19 Consider a sample space Ω = {1, 2, 3, 4, 5, 6}. Possible partitions are {1, 3, 5}, {2, 4, 6};
or {1, 2}, {3, 4}, {5, 6}; and so on.
In particular, for any partition E_1, . . . , E_k,
Σ_{i=1}^{k} P(E_i) = 1.
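For equally likely outcomes these conditions are easy to verify computationally; an illustrative Python sketch for the odd/even partition of a die roll:

```python
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}
partition = [{1, 3, 5}, {2, 4, 6}]   # odd / even

# (ii) pairwise disjoint, and (iii) union is the whole sample space.
assert all(E & F == set() for i, E in enumerate(partition)
           for F in partition[i + 1:])
assert set().union(*partition) == omega

# With equally likely outcomes, the probabilities of the parts sum to 1.
P = lambda E: Fraction(len(E), len(omega))
print(sum(P(E) for E in partition))  # 1
```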
Exercise 21 (**) See problem sheet.
These consequences have an enormous effect on the way we work with probability. In particular, it
turns out that we can solve most problems without ever having to explicitly write down the outcomes
in our sample space, as in the next example. In fact, some people do probability without even defining
a sample space.
Example 1.20 Jimmy’s die has the numbers 2,2,2,2,5,5. Your die has numbers 1,1,4,4,4,4. You both
throw and the highest number wins. Assuming all outcomes are equally likely, what is the probability
that Jimmy wins?
The event J that Jimmy wins happens if either Jimmy throws a 5 (call this event F), or if you
throw a 1 (call this event A). Therefore J = A ∪ F and by C6, P(J) = P(A) + P(F) − P(A ∩ F).
As P(F) = 1/3, P(A) = 1/3 and P(A ∩ F) = 4/36 = 1/9 (by counting equally likely outcomes), we
have P(J) = 1/3 + 1/3 − 1/9 = 5/9.
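The answer can also be confirmed by brute-force enumeration of the 36 equally likely face pairs; a short Python sketch:

```python
from fractions import Fraction
from itertools import product

jimmy = [2, 2, 2, 2, 5, 5]   # faces of Jimmy's die
you = [1, 1, 4, 4, 4, 4]     # faces of your die

# Enumerate the 36 equally likely pairs of faces and count Jimmy's wins.
wins = sum(1 for j, y in product(jimmy, you) if j > y)
print(Fraction(wins, 36))  # 5/9
```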
1.6 Sigma-algebras
Also see [Sti03, Section 1.2].
We have described the axioms that a probability should satisfy, formulated in terms of events, but
we have hidden one important issue. In general, it is too much to demand that a probability P satisfying
A1–4 should be defined on all subsets of Ω. The reason why this is a problem goes beyond the scope of this
course (see the Bibliographical notes at the end of this chapter for references), but the essence is that
for uncountable sample spaces, such as Ω = [0, 1], there exist subsets of Ω that cannot be assigned a
probability in a way that is consistent with the axioms. The construction of such non-measurable sets
is also the basis of the famous Banach–Tarski paradox.
The upshot of all this is that we can, in general, only demand that P satisfies A1–4 for all events
in some collection F of subsets of Ω (i.e., for some F ⊆ 2^Ω). What properties should the collection F
of events possess? Consideration of A4 and C2 suggests the following definition.
Definition 1.21 (σ-algebra) A collection F of subsets of Ω is called a σ-algebra over Ω if it satisfies
the following properties.
S1 Ω ∈ F;
S2 A ∈ F implies that A^c ∈ F;
S3 if A_1, A_2, . . . ∈ F then ⋃_{n=1}^{∞} A_n ∈ F.
Property S2 says that F is closed under complementation, while S3 says that F is closed under
countable unions. Note that S1 and S2 combined imply that ∅ ∈ F.
Example 1.22 The power set 2^Ω is a σ-algebra over Ω, and it is easy to see that it is the biggest
possible σ-algebra over Ω. As described above, for uncountable Ω the set 2^Ω may be too unwieldy, in
which case we would consider a smaller σ-algebra.
The trivial σ-algebra {∅, Ω} is the smallest possible σ-algebra over Ω, but it carries no information
about the outcome of the experiment.
Definition 1.23 (Probability Space) If Ω is a set and F is a σ-algebra of subsets of Ω, and if P
satisfies A1–4 for events in F, then the triple (Ω, F, P) is called a probability space.
Example 1.24 Consider the sample space Ω = {1, 2, 3, 4, 5, 6} for the experiment of rolling a fair die.
The choice of σ-algebra may depend on exactly what we are interested in:
• F1 = {∅, {1, 3, 5}, {2, 4, 6}, Ω} (if we only care whether the score is odd or even);
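The defining properties S1–S3 can be checked mechanically for a small collection like F1; a minimal Python sketch (for a finite collection, closure under pairwise unions suffices for S3):

```python
from itertools import combinations

omega = frozenset({1, 2, 3, 4, 5, 6})
F1 = {frozenset(), frozenset({1, 3, 5}), frozenset({2, 4, 6}), omega}

# S1: contains Ω.
assert omega in F1
# S2: closed under complementation.
assert all(omega - A in F1 for A in F1)
# S3 (finite case): closed under unions of members.
assert all(A | B in F1 for A, B in combinations(F1, 2))
```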
Bibliographical notes
Sets are important not only for probability theory, but for all of mathematics. In fact, all of standard
mathematics can be formulated in terms of set theory, under the assumption that sets satisfy the ZFC
axioms; see for instance http://en.wikipedia.org/wiki/Zermelo%E2%80%93Fraenkel_set_theory.
The foundations of probability have a long and interesting history [Hac75]. The classical theory owes
much to Pierre-Simon Laplace (1749–1827): see [Lap25]. However, a rigorous mathematical foundation
for the theory was lacking, and providing one was posed as part of David Hilbert's (1862–1943) famous list of
problems of 1900 (the 6th problem). After important work in measure theory by Henri Lebesgue (1875–
1941) and Émile Borel (1871–1956), it was Andrey Kolmogorov who succeeded in 1933 in providing
the axioms that we use today (see the 1950 edition of his book [Kol50]).
George Boole (1815–1864) and John Venn (1834–1923) both wrote books concerned with probability
theory [Boo54, Ven88]; both were working before the formulation of Kolmogorov’s axioms.
As mentioned in Section 1.6, it is necessary in the general theory of probability to restrict events
to some σ-algebra. The reason for this is that in standard ZFC set theory, when Ω is uncountable (such
as Ω = [0, 1], the unit interval), it follows from an argument by Vitali (1905) that many natural prob-
ability assessments, such as the continuous uniform distribution, cannot be modelled by a probability
defined on all subsets of Ω satisfying A1–4: see for instance Chapter 1 of [Ros07]. In the case where Ω is
countable, one can always define P on the whole of 2^Ω. In the case where Ω is uncountable, we usually
do not explicitly mention Ω at all (when we work with continuous random variables, for example).
An alternative approach to probability theory is to do away with axiom A4, in which case some of
these technical issues can be avoided, at the expense of certain pathologies; however, in the standard
approach to modern probability, based on measure theory, A4 is a central part of the theory.
Chapter 2
Goals: Understand the equally likely outcomes model of classical probability. Know counting
principles, and when and how to apply them to specific problems.
In Chapter 1 we have seen the abstract formulation of probability theory; next we turn to the
question of how the probabilities themselves may be assigned. The most basic scenario occurs when
our experiment has a finite number of possible outcomes which we deem to be equally likely. Such
situations rarely—not to say never—occur in practice, but serve as good models in extremely controlled
environments such as in gambling or games of chance. However, this situation (which will essentially
come down to counting) gives us a good initial setting in which to obtain some very useful insight into
the nature and calculus of probability.
Suppose that we have a finite sample space Ω, which we may thus list as a finite collection of
m = |Ω| possible outcomes Ω = {ω1, . . . , ωm}. In the equally likely outcomes model (also sometimes
known as classical probability) we suppose that each outcome has the same probability:
P(ω) = 1/|Ω| for each ω ∈ Ω,
or, in the notation above, P(ωi) = 1/m for each i.
The axioms of probability then allow us to determine the probability of any event A ⊆ Ω: by C7,
P(A) = Σ_{ω∈A} P(ω) = |A|/|Ω| for any event A ⊆ Ω.
(This is a particular case of the discrete sample space discussed in Chapter 1.)
Definition 2.1 (Equally likely outcomes) Consider a scenario with m equally likely outcomes {ω1, . . . , ωm} =
Ω. In the equally likely outcomes model, the probability of an event A ⊆ Ω is then
P(A) := m_A/m,
where m_A is the number of outcomes in the event A (i.e. m_A = |A|).
Note that the rules for counting show that axioms A1–4 are satisfied; in the case of a finite sample
space, we may always take the σ-algebra F to be the whole of 2^Ω.
Example 2.2 Draw a card at random from a well-shuffled pack, so that each of the 52 cards is equally
likely to be chosen. Typical events are that the card is a spade (a set of 13 outcomes), the card is a
queen (a set of 4 outcomes), the card is the queen of spades (a set with a single outcome). In the equally
likely outcomes model, the probability of drawing the queen of spades (or any other specified card) is
1/52, the probability of drawing a spade is 13/52, and the probability of drawing a queen is 4/52.
Example 2.3 Flip a coin and see whether it falls heads or tails, each assumed equally likely; then
‘heads’ or ‘tails’ each has probability 1/2.
Example 2.4 Roll a fair cubic die to get a number from 1 to 6. Here the word ‘fair’ is used to mean
each outcome is equally likely.
Example 2.5 If we roll a pair of fair dice then outcomes are pairs (i, j), so there are 36 possible
outcomes. If we assume that the outcomes are equally likely, then the probability of getting a pair of 6's
is 1/36, for example.
The classical interpretation of probability is related to general probability theory in the same way
that counting is to mathematics. It is a good place to start and there are many important situations
where intuitively it seems reasonable to say that each outcome of a particular collection is equally
likely.
To extend the theory or apply it in practice, we have to address situations where there are no
candidates for equally likely outcomes, or where there are infinitely many possible outcomes, and work
out how to find probabilities to put into calculations that give useful predictions. We will come back
to some of these issues later; but bear in mind that however we come up with our probability model,
the same system of axioms that we saw in Chapter 1 applies.
Given a finite sample space and assuming that outcomes are equally likely, determining the probabilities
of events comes down to counting. For example, in drawing a poker hand of five cards from
a well-shuffled deck of 52 cards, the probability of having a ‘full house’ (meaning two cards of one
denomination and three of another, e.g., two Kings and three 7s) is given by the number of hands that
are full houses divided by the total number of hands (each hand being equally likely).
These counting problems need careful thought, and we will describe some counting principles for
some of the most common situations. Those of you who have done Discrete Maths will be familiar
with most of this material; we have a slightly different emphasis.
Counting Principle 2.6 (Multiplication Principle) Suppose that we must make k choices in suc-
cession where there are:
• m1 possibilities for the first choice;
• m2 possibilities for the second choice;
• . . .
• mk possibilities for the kth choice,
and all combinations of the possibilities are allowed. The total number of distinct possible selections is
m1 × m2 × m3 × ··· × mk = ∏_{i=1}^{k} m_i.
Example 2.7 In a standard deck of playing cards, each card has a denomination and a suit. There
are 13 possible denominations: 2, 3, . . . , 10, J(ack), Q(ueen), K(ing), and A(ce). There are 4 possible
suits: ♥ (heart), ♦ (diamond), ♣ (club), ♠ (spade). Because all combinations of denomination and
suit are allowed, the multiplication principle applies: there are 13 × 4 = 52 cards in a standard deck.
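The multiplication principle corresponds exactly to a Cartesian product; for instance, building the deck in Python:

```python
from itertools import product

denominations = ['2', '3', '4', '5', '6', '7', '8', '9', '10',
                 'J', 'Q', 'K', 'A']
suits = ['hearts', 'diamonds', 'clubs', 'spades']

# Every (denomination, suit) combination is allowed, so the
# multiplication principle gives 13 * 4 distinct cards.
deck = list(product(denominations, suits))
print(len(deck))  # 52
```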
Example 2.8 A hotel serves 3 choices of breakfast, 4 choices of lunch and 5 choices of dinner, so a
guest selects from 3 × 4 × 5 different combinations of the three meals (assuming we opt to have all
three).
All of the following counting principles are effectively consequences of the multiplication principle.
Example 2.9 A coffee bar has 5 different papers to choose from, 19 types of coffee and 7 different
snacks. Counting 'none' as an option for each, there are 6 × 20 × 8 = 960 distinct selections of coffee,
snack and paper. Of these, 5 involve no coffee or snack (which the staff may object to), plus one has
no coffee, snack or paper!
First, we look at ordered choices of distinct objects. In this case, we distinguish between selection
with replacement, where the same object can be selected multiple times, and selection without replacement,
where each object can be selected at most once.
Counting Principle 2.10 (Ordered Choices of Distinct Objects With Replacement) Suppose
that we have a collection of m distinct objects and we select r ≤ m of them with replacement. The num-
ber of different ordered lists (ordered r-tuples) is
m^r.
Counting Principle 2.11 (Ordered Choices of Distinct Objects Without Replacement) Suppose
that we have a collection of m distinct objects and we select r ≤ m of them without replacement. The
number of different ordered lists (ordered r-tuples) is
(m)_r := m × (m − 1) × (m − 2) × ··· × (m − r + 1) = m!/(m − r)!.
So, the falling factorial notation (m)_r is simply a convenient way to write m!/(m − r)!. In the
special case where r = m we set 0! = 1, and then (m)_m = m! is the number of permutations of the m
objects. If m is large, and r is much smaller than m, then (m)_r ≈ m^r.
Example 2.12 The number of ways we can deal out four cards in order from a pack of cards is (52)_4,
and the number of ways we can arrange the four aces in order is 4!, so the probability of finding the
four aces on top of a well-shuffled deck is
4!/(52)_4 = (4 × 3 × 2 × 1)/(52 × 51 × 50 × 49).
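The falling factorial and this probability are easy to compute exactly; a short Python sketch using exact rational arithmetic:

```python
from fractions import Fraction
import math

def falling_factorial(m, r):
    """(m)_r = m * (m-1) * ... * (m-r+1) = m! / (m-r)!"""
    return math.prod(range(m - r + 1, m + 1))

assert falling_factorial(52, 4) == 52 * 51 * 50 * 49
# Probability that the four aces sit on top of a well-shuffled deck:
p = Fraction(math.factorial(4), falling_factorial(52, 4))
print(p)  # 1/270725
```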
Exercise 39 (**) See problem sheet.
Counting Principle 2.13 (Unordered Choices of Distinct Objects Without Replacement) Suppose
that we have a collection of m distinct objects and we select r ≤ m of them without replacement, where
the order of selection is not important. The number of different (unordered) subsets of r objects is
C(m, r) := (m)_r/r! = m!/(r!(m − r)!).
To see this, first count the number of distinct ordered lists of r objects—this is (m)_r. Each unordered
subset has been counted (r)_r = r! times, as this is the number of distinct ways of arranging r dif-
ferent objects. Therefore the (m)_r ordered selections can be grouped into collections of size r!, each
representing a particular subset, and the result follows by dividing.
The expression C(m, r) is the binomial coefficient for choosing r objects from m and is often called
'm-choose-r'. Note that
C(m, r) = C(m, m − r),
as we can choose to take r objects from m in exactly the same number of ways that we can choose to
leave behind r objects, i.e., take m − r objects.
Example 2.14 Suppose that in a hand of cards their order is not important. Then there are C(52, 4) =
(52 × 51 × 50 × 49)/(4 × 3 × 2 × 1) = 270 725 distinct hands of four cards. The number of these with
no aces is C(48, 4) = (48 × 47 × 46 × 45)/(4 × 3 × 2 × 1) = 194 580, and so the probability of finding
no aces in a four-card hand is
C(48, 4)/C(52, 4) = (48 × 47 × 46 × 45)/(52 × 51 × 50 × 49) ≈ 0.7187.
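These binomial-coefficient computations can be checked with Python's math.comb:

```python
from math import comb

# Hands are unordered, so the counts are binomial coefficients.
total_hands = comb(52, 4)
no_ace_hands = comb(48, 4)
assert total_hands == 270_725
assert no_ace_hands == 194_580
print(round(no_ace_hands / total_hands, 4))  # 0.7187
```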
The same counting arguments can be used when we need to divide m objects into k > 2 groups:
arranging m distinguishable objects into k groups with sizes r1, . . . , rk, where r1 + ··· + rk = m,
can be done in
C(m; r1, . . . , rk) := m!/(r1! ··· rk!)
ways. The expression C(m; r1, . . . , rk) is called the multinomial coefficient [DS02, Section 1.9].
Counting Principle 2.15 (Ordered Choices of 2 Types of Objects) Suppose that we have m
objects, r of type 1 and m − r of type 2, where objects are indistinguishable from others of their type.
The number of distinct, ordered choices of the m objects is
C(m, r).
To see why, note that each distinct order for laying out the m objects in a row is precisely the same as
choosing r of the m positions for type 1 objects, i.e. it is an unordered choice of r positions from the
m distinct positions.
Example 2.16 With four red tokens and three black (none distinguishable from others of the same
colour) there are 7!/(4! 3!) = 35 different ways to lay them out in a row. The probability that they will
be alternately red and black is 1/35, as there is only one such ordering.
More generally, using the positions argument again, the multinomial coefficient is the number of
ordered choices of objects with k types, ri of type i, which are indistinguishable within each type.
Counting Principle 2.17 (Ordered Grouping of Indistinguishable Objects) The number of ways
to divide m indistinguishable objects into k distinct groups is
C(m + k − 1, m) = C(m + k − 1, k − 1).
When we have m indistinguishable objects to put into k distinct groups, e.g. identical balls into num-
bered bags, we can count the number of choices with the 'sheep-and-fences' method. Along a line the
k groups have k − 1 boundaries so, representing each of these with a 'fence', the number of groupings
of the objects is the same as the number of ordered arrangements of the objects and fences in a line,
e.g. ∗∗∗|∗||∗∗ for 6 objects in 4 groups; but this number is C(m + k − 1, m) = C(m + k − 1, k − 1)
by Counting Principle 2.15.
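The sheep-and-fences count can be verified by brute force for small m and k; an illustrative Python sketch:

```python
from math import comb
from itertools import product

def brute_force_groupings(m, k):
    """Count ways to put m indistinguishable objects into k distinct
    groups, by enumerating all (r1, ..., rk) with r1 + ... + rk = m."""
    return sum(1 for sizes in product(range(m + 1), repeat=k)
               if sum(sizes) == m)

m, k = 6, 4
assert brute_force_groupings(m, k) == comb(m + k - 1, k - 1)
print(comb(m + k - 1, k - 1))  # 84
```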
Bibliographical notes
Classical probability theory originated in calculation of odds for games of chance; as well as contribu-
tions by Pierre de Fermat (1601–1665) and Blaise Pascal (1623–1662), comprehensive approaches were
given by Abraham de Moivre (1667–1754) [dM56], Laplace (1749–1827) [Lap25], and Siméon Pois-
son (1781–1840). A collection of these classical methods made just before the advent of the modern
axiomatic theory can be found in [Whi01].
Chapter 3
Goals: Know the definition of conditional probability and its properties. Have a solid knowledge
of the partition theorem and Bayes's theorem, recognizing situations where one can apply them.
Understand the concept of independence.
Definition 3.1 (Conditional Probability) For any events A and B, the conditional probability of A
given B is
P(A | B) := P(A ∩ B)/P(B), whenever P(B) > 0.
Example 3.2 Throw three fair coins. What is the conditional probability of at least one head (event
A) given at least one tail (event B)? Let H be the event 'all heads', T the event 'all tails'. Then
P(B) = 1 − P(H) = 7/8 and P(A ∩ B) = 1 − P(H) − P(T) = 6/8, so that
P(A | B) = P(A ∩ B)/P(B) = (6/8)/(7/8) = 6/7.
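With equally likely outcomes, conditional probabilities reduce to counting, which is easy to automate; a Python sketch re-deriving the 6/7:

```python
from fractions import Fraction
from itertools import product

outcomes = list(product('HT', repeat=3))   # 8 equally likely outcomes

A = [w for w in outcomes if 'H' in w]      # at least one head
B = [w for w in outcomes if 'T' in w]      # at least one tail
AB = [w for w in outcomes if w in A and w in B]

# P(A | B) = P(A ∩ B) / P(B); the 1/8 factors cancel.
p = Fraction(len(AB), len(B))
print(p)  # 6/7
```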
3.2 Properties of conditional probability
Also see [BH15, Sections 2.3 and 2.4], [DS02, Sections 2.1 and 2.3], or [Sti03, Section 2.1].
P1 For any event B ⊆ Ω for which P(B) > 0, P(· | B) satisfies axioms A1–4 (i.e., is a
probability on Ω) and therefore also satisfies C1–10.
P2 The multiplication rule for probabilities (simple version): for any events A and B with
P(A) > 0 and P(B) > 0,
P(A ∩ B) = P(B) P(A | B) = P(A) P(B | A).
More generally, conditioning on a further event C,
P(A ∩ B | C) = P(B | C) P(A | B ∩ C), if P(B ∩ C) > 0. (3.1)
These properties are immediate from the definition; for instance (3.1) follows from the fact that
P(B | C) P(A | B ∩ C) = [P(B ∩ C)/P(C)] · [P(A ∩ B ∩ C)/P(B ∩ C)] = P(A ∩ B ∩ C)/P(C) = P(A ∩ B | C).
Exercise 56 (*) See problem sheet.
Exercise 57 (**) See problem sheet.
Exercise 58 (*) See problem sheet.
Exercise 59 (**) See problem sheet.
P3 The multiplication rule for probabilities (general version): for any events A_0, A_1, . . . , A_k
with P(⋂_{i=0}^{k−1} A_i) > 0,
P(⋂_{i=1}^{k} A_i | A_0) = P(A_1 | A_0) × P(A_2 | A_1 ∩ A_0) × ···
··· × P(A_{k−2} | ⋂_{i=0}^{k−3} A_i) × P(A_{k−1} | ⋂_{i=0}^{k−2} A_i) × P(A_k | ⋂_{i=0}^{k−1} A_i).
P4 The partition theorem or law of total probability. If E_1, E_2, . . . , E_k form a partition then,
for any event A, we have
P(A) = Σ_{i=1}^{k} P(A | E_i) P(E_i). (3.2)
To see this, note that P(A | E_i) P(E_i) = P(A ∩ E_i) by P2. But since the E_i are a partition, they
are pairwise disjoint, and hence so are the A ∩ E_i, so by C7
Σ_{i=1}^{k} P(A ∩ E_i) = P(⋃_{i=1}^{k} (A ∩ E_i)) = P(A ∩ (⋃_{i=1}^{k} E_i)) = P(A ∩ Ω) = P(A).
P5 Bayes's theorem (first version): for any events A and B with P(A) > 0 and P(B) > 0,
P(A | B) = P(A) P(B | A)/P(B).
More generally, if P(B | C) > 0,
P(A | B ∩ C) = P(A | C) P(B | A ∩ C)/P(B | C).
P6 Bayes's theorem (second version): for any partition A_1, . . . , A_k and any B with P(B) > 0,
P(A_i | B) = P(A_i) P(B | A_i) / Σ_{j=1}^{k} P(A_j) P(B | A_j).
More generally, if P(B | C) > 0,
P(A_i | B ∩ C) = P(A_i | C) P(B | A_i ∩ C) / Σ_{j=1}^{k} P(A_j | C) P(B | A_j ∩ C).
Example 3.4 In the previous example, a component is drawn from the packet and found to be faulty.
What is the probability that it was made by machine A?
P(M_A | F) = P(M_A) P(F | M_A) / [P(M_A) P(F | M_A) + P(M_B) P(F | M_B) + P(M_C) P(F | M_C)]
= (0.1 × 1/3) / (0.1 × 1/3 + 0.2 × 1/3 + 0.3 × 1/3) = 1/6.
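Assuming, as the displayed computation suggests, that the three factors 1/3 are the machine probabilities and 0.1, 0.2, 0.3 are the fault probabilities (these inputs come from the earlier example, not reproduced here), Bayes's theorem can be evaluated exactly in Python:

```python
from fractions import Fraction

# Assumed inputs from the preceding example: each machine equally
# likely, with the stated fault probabilities.
prior = {'A': Fraction(1, 3), 'B': Fraction(1, 3), 'C': Fraction(1, 3)}
fault = {'A': Fraction(1, 10), 'B': Fraction(2, 10), 'C': Fraction(3, 10)}

# Bayes's theorem (second version) over the partition {M_A, M_B, M_C}.
evidence = sum(prior[m] * fault[m] for m in prior)
posterior_A = prior['A'] * fault['A'] / evidence
print(posterior_A)  # 1/6
```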
Definition 3.5 (Independence of Two Events) We say that two events A and B are independent
whenever
P(A ∩ B) = P(A) P(B).
We say that two events A and B are conditionally independent given a third event C whenever
P(A ∩ B | C) = P(A | C) P(B | C).
Example 3.6 Roll two standard dice. Let E be the event that we have an even outcome on the first
die. Let F be the event that we have a 4 or 5 on the second die. Are E and F independent? Let us
verify, using a counting argument.
There are 36 equally likely outcomes, namely the pairs (i, j) with i, j ∈ {1, 2, …, 6}. Of those, 3 × 6 are in E, and 6 × 2 are in F, so P(E) = 18/36 = 1/2 and P(F) = 12/36 = 1/3. Moreover, 3 × 2 of these outcomes belong to both E and F, so P(E ∩ F) = 6/36 = 1/6. Indeed, P(E) P(F) = 1/2 × 1/3 = 1/6 = P(E ∩ F), so E and F are independent.
Theorem 3.7 Consider any two events A and B with P(A) > 0 and P(B) > 0. The following statements are equivalent.
(i) P(A ∩ B) = P(A) P(B).
(ii) P(A | B) = P(A).
(iii) P(B | A) = P(B).
In other words, learning about B will not tell us anything new about A, and similarly, learning about
A will not tell us anything new about B.
For conditional independence, we have a similar result.
Theorem 3.8 Consider any three events A, B, and C, with P(A ∩ B ∩ C) > 0. The following statements are equivalent.
(i) P(A ∩ B | C) = P(A | C) P(B | C).
(ii) P(A | B ∩ C) = P(A | C).
(iii) P(B | A ∩ C) = P(B | C).
In other words, if we know C then learning about B will not tell us anything new about A, and
similarly, if we know C then learning about A will not tell us anything new about B.
It is possible for two events to be conditionally independent given a particular event, but not to be (unconditionally) independent. We will see an example of this when we discuss genetics, in Section 5.2.
It can be extremely useful to recognize situations where (conditional) independence can be applied.
Of course, it is equally important not to assume (conditional) independence where there really are
dependencies.
A collection of events A_i, i ∈ I, is mutually conditionally independent given another event C if for every finite non-empty subset J ⊆ I,

P(⋂_{i∈J} A_i | C) = ∏_{i∈J} P(A_i | C).
Example 3.10 Three events A, B, and C are mutually independent if all of the following equalities are satisfied:

P(A ∩ B) = P(A) P(B),  P(A ∩ C) = P(A) P(C),  P(B ∩ C) = P(B) P(C),
P(A ∩ B ∩ C) = P(A) P(B) P(C).
Example 3.11 Suppose we roll 4 dice and their values are independent. To find the probability of no sixes, let A_i be the event 'throw i is not a 6'. By assumption A_1, …, A_4 are independent, so

P(no sixes on 4 dice) = P(⋂_{i=1}^{4} A_i) = ∏_{i=1}^{4} P(A_i) = (5/6)^4.
The same result is obtained from the classical model, by selection with replacement.
Example 3.12 Toss two fair coins. The sample space is Ω = {HH, HT, TH, TT} and each outcome has probability 1/4. Let A = {HH, HT} be the event that the first coin comes up 'heads', B = {HH, TH} the event that the second coin comes up 'heads', and C = {HH, TT} the event that the coins come up the same. Then since P(A) = P(B) = P(C) = 1/2 and each pairwise intersection has probability 1/4, it is easy to see that the events are pairwise independent. However, P(A ∩ B ∩ C) = P({HH}) = 1/4, which is not the same as P(A) P(B) P(C) = 1/8, so the three events are not mutually independent.

Never confuse disjoint events with independent events! For independent events, we have that P(A ∩ B) = P(A) P(B), but for disjoint events, P(A ∩ B) = 0 because A ∩ B = ∅.
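Example 3.12 is small enough to check by brute force. The following Python sketch (our own illustration, not from the notes) enumerates the four equally likely outcomes and verifies pairwise but not mutual independence, using exact fractions to avoid rounding issues.

```python
from itertools import product
from fractions import Fraction

# Sample space for two fair coin tosses; each outcome has probability 1/4,
# so P(E) is just |E| / |Omega| for any event E.
omega = [a + b for a, b in product("HT", repeat=2)]   # HH, HT, TH, TT
P = lambda E: Fraction(len(E), len(omega))

A = {w for w in omega if w[0] == "H"}    # first coin heads
B = {w for w in omega if w[1] == "H"}    # second coin heads
C = {w for w in omega if w[0] == w[1]}   # both coins show the same face

# Pairwise independent:
assert P(A & B) == P(A) * P(B)
assert P(A & C) == P(A) * P(C)
assert P(B & C) == P(B) * P(C)
# ...but not mutually independent:
assert P(A & B & C) != P(A) * P(B) * P(C)
```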
Bibliographical notes
Bayes’s theorem is named after the Reverend Thomas Bayes (1701–1761). In our modern approach to
probability, the theorem is a very simple consequence of our definitions; however, the result may be
interpreted more widely, and is one of the most important results regarding statistical reasoning.
Chapter 4
Interpretations of probability
Goals: Understand that there are different ways to interpret probability values.
Example 4.1 Suppose we toss a coin 1000 times and observe 490 heads; then the relative frequency of heads is 490/1000.
For some experiments, it may be reasonable to suppose that relative frequencies are stable for very
large n.
Example 4.2 If we toss a fair coin one billion times we might expect that the relative frequency of
heads after the first few thousand throws would remain very close to 1/2.
As a mathematical idealization, we suppose that there is a unique, empirical limiting value for nA {n,
as n tends to infinity, which we call the relative frequency probability of A.
Example 4.3 For our coin, the statement P (heads) “ 1{2 means ‘if we tossed the coin an extremely
large number of times, then the proportion of heads would be arbitrarily close to 1/2’.
This interpretation is widely used, especially in physics, where experiments are designed for repeata-
bility and we can expect future trials to behave like those in the past. In this view, probability is a
property of the experimental setup and may be “objectively” discovered by sufficient repetitions of the
experiment.
Amongst the problems with this interpretation are: it is often impossible to decide what “essentially
unchanged conditions” are; we often have no way of knowing when limiting frequencies become stable
(how many trials should we do to test this?); we can only use it in situations that are repeatable.
Bibliographical notes
For further reading on the different interpretations of probability, see [Háj12].
Chapter 5
Goals: Understand the meaning of a reliability network. Know how to evaluate the reliability
of networks. Understand the probabilistic nature of genetics. Know how to derive the probability
of genotypes of children given the genotypes of parents. Know how and when to exploit the general
rules of (conditional) probability to solve complex reliability networks and problems in genetics.
5.1 Reliability networks

Definition 5.1 (Reliability network) A reliability network is a diagram of nodes and arcs. The nodes represent components of a multi-component system, where each node is either working or broken, and where the entire system works if it is possible to get from the left end to the right end of the diagram through working components only.
Suppose the ith component functions with probability p_i, and different components are independent. The probability that the system works is then clearly a function of the probabilities p_1, …, p_k. We denote this function by r(p_1, p_2, …, p_k), and call it the reliability function. It is determined by the layout of the reliability network.
Example 5.2 The figure below shows (a) two components in series, (b) three in parallel, (c) a five-component bridge system. In each case assume the system works if it is possible to get from the left end to the right through functioning components.

[Figure: (a) components 1 and 2 in series; (b) components 1, 2, 3 in parallel; (c) a five-component bridge system.]
By independence of failures, in case (a) we have r(p_1, p_2) = p_1 p_2, and in case (b), setting q_i = 1 − p_i, we have r(p_1, p_2, p_3) = 1 − q_1 q_2 q_3. Case (c) is the subject of Exercise 82.
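For small networks, the reliability function can be computed by brute force: sum the probability of every combination of component states in which the system works. The sketch below (ours, not from the notes; the function name `reliability` and the encoding of a layout as a `works` predicate are our own choices) reproduces the series and parallel formulas from cases (a) and (b).

```python
from itertools import product

def reliability(works, p):
    """Reliability r(p_1, ..., p_k) of a system whose layout is encoded by
    `works`: a function taking a tuple of component states (True = working)
    and saying whether the whole system works.  Components are assumed to
    fail independently, so each state's probability is a product."""
    total = 0.0
    for state in product([True, False], repeat=len(p)):
        prob = 1.0
        for s, pi in zip(state, p):
            prob *= pi if s else 1 - pi
        if works(state):
            total += prob
    return total

# (a) two components in series: the system works iff all components work.
assert abs(reliability(all, [0.9, 0.8]) - 0.9 * 0.8) < 1e-12
# (b) three components in parallel: it works iff any component works.
q = lambda p: 1 - p
assert abs(reliability(any, [0.9, 0.8, 0.7])
           - (1 - q(0.9) * q(0.8) * q(0.7))) < 1e-12
```

A more complicated layout, such as the bridge in case (c), only needs a different `works` predicate describing which paths connect left to right.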
5.2 Genetics
Also see https://en.wikipedia.org/wiki/Mendelian_inheritance.
Inherited characteristics are determined by genes. The mechanism governing inheritance is random
and so the laws of probability are crucial to understanding genetics.
Your cells contain 23 pairs of chromosomes, each containing many genes (while 23 pairs is specific
to humans the idea is similar for all animals and plants). The genes take different forms called alleles
and this is one reason why people differ (there are also environmental factors). Of the 23 pairs of
chromosomes, 22 pairs are homologous (each of the pair has an allele for any gene located on this
pair). People with different alleles are grouped by visible characteristics into phenotypes; often one
allele, A say, is dominant and another, a, is recessive in which case AA and Aa are of the same
phenotype while aa is distinct. Sometimes, the recessive gene is rare and the corresponding phenotype
is harmful, for example haemophilia or sickle-cell anaemia.
Example 5.3 In certain strains of mice the gene for coat colour (a phenotype) has alleles B (black)
or b (brown). B is dominant, so BB or Bb mice are black, while bb mice are brown with no difference
between Bb and bB.
With sickle-cell anaemia, allele A produces normal red blood cells but a produces deformed cells.
Genotype aa is fatal but Aa provides protection against malarial infection (which is often fatal) and so
allele a is common in some areas of high malaria risk.
The final chromosome pair are the sex chromosomes. Each may be X, a long chromosome, or Y, a
short chromosome. Most of the genes on X do not occur on Y. Your sex is determined as XX (female)
or XY (male) but YY is not possible.1
In our simple model, each individual receives an allele for each gene on its homologous chromosomes
from each parent, chosen independently and at random from each parent’s two alleles for that gene. It
is extremely important to note that genotypes of siblings are dependent unless conditioned on parental
genotypes.
Example 5.4 For example, in a certain type of flowering pea, flower colour is determined by a gene
with alleles R and W , with phenotypes RR (red), RW (pink) and W W (white). The table of offspring
genotype probabilities given parental genotypes is
Parental genotypes   RR×RR   RR×RW   RR×WW   RW×RW   RW×WW   WW×WW
offspring   RR        1       1/2     0       1/4     0       0
genotype    RW        0       1/2     1       1/2     1/2     0
            WW        0       0       0       1/4     1/2     1
e.g. parents RR and RW produce RR offspring with chance 1/2 (the RR parent must supply an R, but the RW parent supplies either R or W, each with probability 1/2). When we cross red and white peas, all the offspring will be pink, but when we cross red and pink peas, about half of the peas will be red and half pink. (Mendel carried out experiments like these to establish the genetic basis of inheritance.)
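The whole table can be generated mechanically from the model in the text: each parent contributes one of its two alleles uniformly at random, independently of the other parent. A small Python sketch (our own illustration; the function name is hypothetical):

```python
from fractions import Fraction
from itertools import product

def offspring_distribution(parent1, parent2):
    """Offspring genotype distribution given two parental genotypes,
    each parent contributing one of its two alleles uniformly at random
    and independently (the simple model in the text).  Genotypes are
    unordered, so e.g. 'WR' is recorded as 'RW'."""
    dist = {}
    for a, b in product(parent1, parent2):   # 4 equally likely allele pairs
        g = "".join(sorted(a + b))
        dist[g] = dist.get(g, Fraction(0)) + Fraction(1, 4)
    return dist

# Reproduce one column of the table: parents RR and RW give
# RR and RW offspring, each with probability 1/2.
print(offspring_distribution("RR", "RW"))
```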
Exercise 83 (**) See problem sheet.
¹ X, XXX, XXY, and XYY can occur, but are very rare.
Similar but larger tables are relevant when there are more than two alleles.
Example 5.5 A gene has alleles A, a, but a is recessive and harmful, so genotype aa does not reproduce, while AA and Aa are indistinguishable. With proportions 1 − λ, λ of AA, Aa in the healthy population, the probability of an aa offspring is λ²/4. To show this, use the partition F_AA, F_Aa (father AA, Aa respectively) to calculate

P(F_a) = P(F_a | F_AA) P(F_AA) + P(F_a | F_Aa) P(F_Aa) = 0 × (1 − λ) + (1/2) × λ = λ/2

for the event F_a that the father provides allele a. By symmetry, the mother also supplies allele a with probability λ/2, and by independence (random mating) the probability that they both supply allele a is λ²/4; e.g. when λ ≈ 1/2, about 6% of offspring will be aa. Over time the proportion of allele a will decrease unless Aa has a reproductive advantage over AA.
Things are slightly different for genes on the X or Y chromosomes (sex-linked genes).
Consider a gene (on a homologous chromosome) with two alleles A and a. Suppose the proportions of genotypes AA, Aa and aa in the population at generation 0 are u_0, 2v_0 and w_0 respectively (u_0 + 2v_0 + w_0 = 1), both for males and females. Are these proportions stable over successive generations in a large population where (i) mating in each generation is random with respect to this gene; (ii) different genotypes have equal reproductive success?
The proportion of A in the gene pool at generation 0 is p = u_0 + v_0. In generation 1, by random mating, each child has probability p², 2p(1 − p), (1 − p)² of being AA, Aa, aa, respectively. As long as the population is large enough (see the law of large numbers later in the course) these will also be the generation-1 proportions of AA, Aa, aa, i.e.

u_1 = p²,  v_1 = p(1 − p),  w_1 = (1 − p)².

Now let p_1 = u_1 + v_1 be the proportion of A in the gene pool at generation 1. Substituting the values of u_1, v_1 we find that

p_1 = p² + p(1 − p) = p,

i.e. the proportions of A and a in the gene pool are constant. This repeats in later generations, so the proportions of the three genotypes remain constant from generation 1 onward (though perhaps different from generation 0). This is called the Hardy–Weinberg equilibrium.
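The one-generation update above is easy to trace through numerically. A sketch with exact fractions (the starting proportions 1/5, 1/2, 3/10 are an arbitrary choice of ours):

```python
from fractions import Fraction

# Arbitrary generation-0 genotype proportions u0, 2*v0, w0 of AA, Aa, aa.
u0, v0, w0 = Fraction(1, 5), Fraction(1, 4), Fraction(3, 10)
assert u0 + 2 * v0 + w0 == 1

p = u0 + v0                                   # proportion of allele A

# Generation 1 under random mating: proportions p^2, 2p(1-p), (1-p)^2.
u1, v1, w1 = p**2, p * (1 - p), (1 - p)**2
p1 = u1 + v1
assert p1 == p                                # allele proportion unchanged

# From generation 1 on, the genotype proportions themselves are fixed:
u2, v2, w2 = p1**2, p1 * (1 - p1), (1 - p1)**2
assert (u2, v2, w2) == (u1, v1, w1)           # Hardy-Weinberg equilibrium
```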
Chapter 6
Random variables
Goals: To understand the definition of a random variable as a function on the sample space.
Master the notation for events and probabilities of events relating to random variables.
Know how to recognize a discrete random variable, how to identify its probability mass function,
and how to derive probabilities of associated events. Know how to recognize a binomial scenario,
the assumptions behind it, and how to identify the parameters of the corresponding binomial
distribution. The same for the Poisson distribution. Know when and how the binomial distribution
can be approximated by a Poisson distribution.
Know how to recognize a continuous real-valued random variable, how to identify its probability
density function, and how to derive probabilities of associated events. Know the uniform distribu-
tion. Know the exponential distribution and how it arises from the Poisson distribution. Know the
normal distribution, and how we can derive probabilities of events using its cumulative distribution
function and standard normal tables. Work with functions of random variables.
In many experiments we are often interested in a numerical value rather than the elementary event
ω P Ω per se: for example, in the financial industry we may not care about the behaviour of the stock
price throughout a given period, only whether it reached a certain level or not; in weather forecasting
we may not be interested in the detailed variation of atmospheric pressure and temperature, only in
how much rain is going to fall, and so on. These uncertain quantities associated with random scenarios
have as their mathematical idealization the concept of random variable.
Put simply, a random variable is a function or mapping of the sample space: for each ω P Ω,
the random variable X gives the output Xpωq. In this chapter, we study both discrete and continu-
ous univariate random variables, and discuss some important examples: binomial, Poisson, uniform,
exponential, and normal distributions. To enable practical calculations, we also discuss cumulative
distribution functions, standard tables, and how probabilities behave under transformations.
6.1 Definition and notation
Also see [BH15, Section 3.1], [DS02, Section 3.1], or [Sti03, Section 4.1].
• X(Ω) ⊆ R, in which case we say that X is a real-valued random variable or a univariate random variable; or
• X(Ω) ⊆ Rⁿ for some n ≥ 2, in which case we say that X is a multivariate random variable.
Example 6.2 Consider throwing two standard dice, with sample space

Ω = {(i, j) : i, j ∈ {1, 2, …, 6}},

so elements of Ω correspond to pairs (i, j). The sum of the numbers that show on the dice corresponds to a real-valued random variable X defined by X((i, j)) := i + j.
For any B ⊆ X(Ω), we write 'X ∈ B' to denote the event {ω ∈ Ω : X(ω) ∈ B}. For any x ∈ X(Ω), we write 'X = x' to mean the event X ∈ {x}, that is, {ω ∈ Ω : X(ω) = x}. We sometimes also write {X = x} and {X ∈ B} to emphasize that these are sets.
Example 6.3 Continuing the example where X denotes the sum of the outcomes of two dice, the notation X = 10 denotes the event {(4, 6), (5, 5), (6, 4)}, and X ∈ [0, 4] denotes the event {(1, 1), (1, 2), (2, 1), (1, 3), (2, 2), (3, 1)}.
Example 6.4 A simple but important class of random variables is formed by the indicator random variables, denoted for an event A ⊆ Ω by 1_A and given by the mapping

1_A(ω) = 1 if ω ∈ A, and 1_A(ω) = 0 otherwise.
Recall Definition 1.17. Given a probability distribution P on the sample space Ω, a random variable X : Ω → X(Ω) induces a probability distribution P_X on the sample space X(Ω) as follows.

Theorem 6.5 The function P_X, mapping sets A ⊆ X(Ω) to a real number P_X(A), defined by

P_X(A) := P(X ∈ A) = P({ω ∈ Ω : X(ω) ∈ A}),

is a probability distribution on X(Ω).
As mentioned in Chapter 1, in general sample spaces Ω one cannot expect to be able to assign probabilities to every A ∈ 2^Ω, and one must restrict to some F ⊂ 2^Ω of 'nice events'. This has consequences for the definition of random variable we gave above, as can be gleaned by examination of Theorem 6.5: we must have that {ω ∈ Ω : X(ω) ∈ A} is a genuine event in F, at least for a large enough class of A ⊆ X(Ω) to tell us everything we would like to know about the random variable X. Those of you who do Probability II (and later courses) will see that the relevant concept for a more complete definition of random variable is that of a measurable function.
Definition 6.6 (Discrete Random Variable & Probability Mass Function) A random variable X : Ω → X(Ω) is said to be discrete when there is a function p : X(Ω) → R such that

P(X ∈ B) = Σ_{x∈B} p(x),

for all B ⊆ X(Ω). In this case, p is called the probability mass function of X.

The probability mass function of a discrete random variable summarizes all information we have about X. Specifically, it allows us to calculate the probability of every event of the form {X ∈ B}.
Clearly, if X is discrete, it must hold that p(x) = P(X = x). If the random variable under consideration is not clear from the context, we may write p_X for the probability mass function of X.
Recall that a set is countable if there exists a bijection between that set and a subset of the natural numbers N.
Theorem 6.7 A random variable X : Ω → X(Ω) is discrete whenever X(Ω) is countable.
Exercise 89 (***) See problem sheet.
Example 6.8 Toss three fair coins and let X be the total number of heads obtained. This example is discrete since the possible values are 0, 1, 2, 3. We have P(X = 1) = P(X ∈ {1}) = P({HTT, THT, TTH}) = 3/8, since each of the eight sequences of 3 tosses has probability 1/8. The other values are found similarly.

Example 6.9 For the three coins example the possible values of X are 0, 1, 2, 3. Grouping the 8 outcomes, as above, gives P(X = x) = (3 choose x) × 1/8 for x ∈ X(Ω) = {0, 1, 2, 3}, and so the probability mass function of X is

p(0) = 1/8,  p(1) = 3/8,  p(2) = 3/8,  p(3) = 1/8.
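The grouping argument of Example 6.9 can be carried out by direct enumeration. A Python sketch of ours, using exact fractions:

```python
from itertools import product
from fractions import Fraction

# pmf of X = number of heads in three fair coin tosses, by enumerating
# the 8 equally likely outcomes and grouping them by their head count.
outcomes = list(product("HT", repeat=3))
pmf = {x: Fraction(sum(1 for w in outcomes if w.count("H") == x), 8)
       for x in range(4)}
# p(0) = 1/8, p(1) = 3/8, p(2) = 3/8, p(3) = 1/8, matching Example 6.9.
print(pmf)
```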
Exercise 90 (*) See problem sheet.
Exercise 91 (**) See problem sheet.
Corollary 6.10 Let X be a discrete random variable. Then the values of its probability mass function p are non-negative and sum to one:

Σ_{x∈X(Ω)} p(x) = 1.
• a sequence of n trials will be carried out, where n is known in advance of the experiment;
• each trial has only two outcomes, usually denoted 'success' or 'failure'; and finally,
• because trials are independent (see Definition 3.9), every sequence ω ∈ Ω with exactly x successes and n − x failures has probability P({ω}) = p^x (1 − p)^{n−x}.

Writing X for the total number of successes, we can sum over all sequences with exactly x successes to obtain

p(x) = P(X = x) = Σ_{ω∈Ω : Σ_i ω_i = x} P({ω}) = (n choose x) p^x (1 − p)^{n−x},

since there are (n choose x) such sequences.
In case of just a single trial (n = 1), the binomial scenario is often referred to as a Bernoulli trial, and X ∼ Bin(1, p) is often referred to as a Bernoulli random variable.
Example 6.12 If we roll 4 fair cubic dice and let X be the number of 6s, then P(X = x) = (4 choose x) (1/6)^x (5/6)^{4−x} for x ∈ {0, 1, 2, 3, 4}.

Definition 6.13 (Poisson Distribution) Let λ > 0. We say a discrete random variable X is Poisson distributed with parameter λ, and we write X ∼ Po(λ), when its probability mass function is

p(x) = e^{−λ} λ^x / x!  for x ∈ Z_+.
The Poisson distribution is used to model counts of events which occur randomly in time at rate r per unit time (specifically, P(event in (x, x + h)) ≈ rh for small enough h). If X is the count over a period of length t then it has distribution Po(rt). Typical applications are in models of numbers of calls to telephone exchanges, numbers of accidents at busy traffic intersections, earthquakes at a tectonic boundary, etc.
The Poisson distribution can be used to approximate the binomial distribution for small p and large n.
Theorem 6.14 Consider any λ > 0. Let X_n ∼ Bin(n, λ/n), and Y ∼ Po(λ). Then for all x ∈ Z_+,

lim_{n→∞} P(X_n = x) = P(Y = x).
Proof For fixed x we have that

P(X_n = x) = (n choose x) (λ/n)^x (1 − λ/n)^{n−x} = [n! / ((n − x)! x! n^x)] λ^x (1 − λ/n)^{n−x},

where

lim_{n→∞} n! / ((n − x)! n^x) = 1  and  lim_{n→∞} (1 − λ/n)^{n−x} = e^{−λ}.

Hence P(X_n = x) → λ^x e^{−λ}/x! = P(Y = x).
This means that if X ∼ Bin(n, p), where n is large and p is small, then approximately X ∼ Po(np). As a rule of thumb, with p ≤ 0.05, we find n = 20 gives a reasonable approximation, while n = 100 gives a good approximation. The approximation is useful when it allows us to sidestep calculating (n choose x) for large values of n and x.
Example 6.15 A typist produces a page of 1000 symbols but has probability 0.001 of mistyping any
single symbol and such errors are independent (note: neither assumption is particularly realistic). The
probability that a page contains more than two mistakes is P (X ą 2) where X is the number of mistakes
on the page.
Clearly, X ∼ Bin(1000, 0.001). As n is large and p is small, we approximately have that X ∼ Po(1000 × 0.001) = Po(1). Therefore,

P(X > 2) = 1 − P(X ≤ 2) ≈ 1 − e^{−1} (1/0! + 1/1! + 1/2!) ≈ 0.0803.
Exercise 107 (***) See problem sheet.
Definition 6.16 (Continuous Random Variable & Probability Density Function) Consider a real-valued random variable X : Ω → R. We say that X is a continuous random variable, or that X has a continuous probability distribution, or that X is continuously distributed, when there is a non-negative piecewise continuous function f : R → R such that

P(X ∈ [a, b]) = ∫_a^b f(t) dt,     (6.1)

for all [a, b] ⊆ R. In this case, f is called the probability density function of X.
If the random variable under consideration is not clear from the context, we may write fX for the
probability density function of X.
The probability density function of a random variable determines its probability distribution for a
very wide range of events—but not all, unlike the discrete case (Definition 6.6):
Theorem 6.17 If X is continuously distributed with probability density function f, then for any A ⊆ R that is a finite union of intervals:

P(X ∈ A) = ∫_A f(x) dx.

The probability density function of a continuous random variable summarizes practically all information we have about X. Specifically, it allows us to calculate the probability of every event of the form {X ∈ A} where A is a finite union of intervals.
As before, again by A2, and Theorem 6.17, we recover a version of C10 for probability density
functions:
Corollary 6.18 Let X be a continuous random variable. Then its probability density function f integrates to one:

∫_{−∞}^{∞} f(x) dx = 1.
Example 6.19 Suppose that X can take values between 0 and 2, with f(x) = kx for some constant k when x ∈ [0, 2], and f(x) = 0 for x ∉ [0, 2]. To find k, note that 1 = ∫_{−∞}^{∞} f(x) dx = ∫_0^2 kx dx = 2k, and therefore k = 1/2. With A = [0, 1], we have P(X ∈ A) = ∫_0^1 x/2 dx = 1/4.
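Both integrals in Example 6.19 can be sanity-checked numerically. A minimal sketch of ours, using a midpoint Riemann sum (which is exact on a linear density, and accurate enough here in any case):

```python
# Density from Example 6.19: f(x) = x/2 on [0, 2], zero elsewhere.
f = lambda x: x / 2 if 0 <= x <= 2 else 0.0

def integrate(f, a, b, n=100_000):
    """Midpoint Riemann sum approximation to the integral of f over [a, b]."""
    h = (b - a) / n
    return sum(f(a + (i + 0.5) * h) for i in range(n)) * h

assert abs(integrate(f, 0, 2) - 1.0) < 1e-6    # total probability is 1
assert abs(integrate(f, 0, 1) - 0.25) < 1e-6   # P(X in [0, 1]) = 1/4
```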
Definition 6.20 (Uniform Distribution) Let a and b be real numbers with a < b. We say a continuous random variable X is uniformly distributed on [a, b], and we write X ∼ U(a, b), when

f(x) = 1/(b − a) for all x ∈ [a, b], and f(x) = 0 elsewhere.

When X ∼ U(a, b), X can take any value in the continuous range of values from a to b, and the probability of finding X in any interval [x, x + h] ⊆ [a, b] does not depend on x.
Exercise 109 (*) See problem sheet.
Suppose a bell chimes randomly in time at rate β. Let T > 0 denote the time of the first chime (a random variable). The events {no chimes in [0, τ]} and {T > τ} are the same. We know that the number of chimes in the interval [0, τ] is Po(βτ), and so the probability of no chimes is e^{−βτ}. Hence P(T > τ) = e^{−βτ}, so P(T ≤ τ) = 1 − e^{−βτ}, and differentiating with respect to τ gives the density βe^{−βτ}.
Definition 6.21 (Exponential Distribution) Let β be a strictly positive constant. We say a continuous random variable X is exponentially distributed with parameter β, and we write X ∼ Exp(β), when

f(x) = βe^{−βx} for all x ≥ 0, and f(x) = 0 elsewhere.
This is one of the most important probability distributions, for several reasons, some of which we
will see later in this course.
Definition 6.22 (Normal Distribution) Let µ ∈ R and σ > 0. We say a continuous random variable X is normally distributed with parameters µ and σ², and we write X ∼ N(µ, σ²), when

f(x) = (1/(σ√(2π))) e^{−(x−µ)²/(2σ²)} for all x ∈ R.

The probability density function of the normal distribution is bell shaped. The parameter µ determines the location of the bell, and σ determines the spread of the bell; e.g. for smaller σ, the bell is narrower. They are related to the expectation and standard deviation, which will be introduced later on.
Definition 6.23 (Cumulative Distribution Function) For any real-valued random variable X, the function F : R → [0, 1] defined by

F(x) := P(X ≤ x) for x ∈ R

is called the cumulative distribution function of X.

If the random variable under consideration is not clear from the context, we may write F_X for the cumulative distribution function of X.
Exercise 113 (*) See problem sheet.
Exercise 114 (**) See problem sheet.
Exercise 115 (*) See problem sheet.
Exercise 116 (**) See problem sheet.
By C1, it follows immediately that:

P(X ∈ (a, b]) = P(X ≤ b) − P(X ≤ a) = F(b) − F(a).

This, together with C7, implies that knowledge of the cumulative distribution function is sufficient to calculate the probability of virtually any event of practical interest, that is, any finite union of intervals.
If X is a discrete real-valued random variable, then F is piecewise constant, and

F(x) = Σ_{t : t≤x} p(t),   p(x) = F(x) − lim_{n→∞} F(x − n^{−1}).

In the statement p(x) = F(x) − lim_{n→∞} F(x − n^{−1}) we have used C9 (continuity along monotone limits), because the events A_n = {x − n^{−1} < X ≤ x} are decreasing in n and have as their limit ⋂_{n=1}^{∞} A_n = {X = x}.
As already mentioned, Φ has no closed analytical form; however, it can be tabulated. Such a tabulation is called a standard normal table. Because φ(z) = φ(−z), it follows that Φ(z) = 1 − Φ(−z), and so we only need to tabulate Φ for non-negative values of z.¹ Some values that are often useful are:

z      0     1.28   1.64   1.96   2.58
Φ(z)   0.5   0.9    0.95   0.975  0.995

¹ In the programming language R, φ is the function dnorm, and Φ is the function pnorm.
The following bound is sometimes useful:

1 − Φ(a) ≤ φ(a)/a for all a > 0.

Interestingly, we can use normal tables also to calculate P(X ∈ [a, b]) for any X ∼ N(µ, σ²), using functions of random variables, introduced next.
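The tabulated values and the tail bound can both be spot-checked in a few lines, since Φ is expressible through the error function (Φ(z) = (1 + erf(z/√2))/2, a standard identity; the names `phi` and `Phi` below are ours):

```python
from math import erf, exp, pi, sqrt

# Standard normal density and cumulative distribution function,
# via the error function (no tables needed).
phi = lambda z: exp(-z * z / 2) / sqrt(2 * pi)
Phi = lambda z: 0.5 * (1 + erf(z / sqrt(2)))

# Spot-check an entry of the standard normal table...
assert abs(Phi(1.96) - 0.975) < 5e-4
# ...and the tail bound 1 - Phi(a) <= phi(a)/a for a few values of a > 0.
for a in [0.5, 1, 2, 3]:
    assert 1 - Phi(a) <= phi(a) / a
```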
Suppose X : Ω → X(Ω) is a random variable, and g : X(Ω) → S is some function. Then g(X) is also a random variable, namely the outcome of a 'new experiment' obtained by running the 'old experiment' to produce a value x for X, and then evaluating g(x). Formally, as a function of ω ∈ Ω, g(X) := g ∘ X, i.e.,

g(X)(ω) := g(X(ω)) for all ω ∈ Ω.

For example,

P(g(X) ∈ A) = P({ω ∈ Ω : g(X(ω)) ∈ A}) for all A ⊆ S.
Example 6.26 For any random variable X, we can consider sinpXq, e3X , X 3 , and so on, which are
all again random variables.
Example 6.27 Let X be the score when you roll a fair die and let Y = (X − 3)². Then Y is a discrete random variable with probability mass function

y      0     1     4     9
p(y)   1/6   1/3   1/3   1/6

and zero elsewhere. To see this, note that {Y = 4} = {X ∈ {1, 5}}, and so on.
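The grouping step in Example 6.27 is mechanical: push each die outcome through g(x) = (x − 3)² and accumulate probabilities. A short sketch of ours:

```python
from fractions import Fraction

# pmf of Y = (X - 3)^2 where X is a fair die score (Example 6.27):
# group the six equally likely die outcomes by the value of g(x).
pmf_Y = {}
for x in range(1, 7):
    y = (x - 3) ** 2
    pmf_Y[y] = pmf_Y.get(y, Fraction(0)) + Fraction(1, 6)

assert pmf_Y == {0: Fraction(1, 6), 1: Fraction(1, 3),
                 4: Fraction(1, 3), 9: Fraction(1, 6)}
```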
Example 6.28 If X is Bin(n, p), then n − X is Bin(n, 1 − p), as shown in Exercise 97.
Example 6.29 Let X ∼ U(0, 1). For any constants a and b > 0, define Y := a + bX. Then Y ∼ U(a, a + b), because for any x ∈ [0, 1],

{Y ≤ a + bx} = {a + bX ≤ a + bx} = {X ≤ x},

so that P(Y ≤ a + bx) = P(X ≤ x) = x, and consequently F(y) = P(Y ≤ y) = (y − a)/b whenever y ∈ [a, a + b]. Therefore, by Equation (6.2), indeed, f(y) = 1/b for y ∈ [a, a + b] (and zero elsewhere), so Y is uniformly distributed on [a, a + b] by Definition 6.20.

A similar result holds when b < 0.
Exercise 117 (*) See problem sheet.
Example 6.30 If U ∼ U(0, 1) and X is a continuous random variable with a cumulative distribution function F_X which is strictly increasing on [a, b], with F_X(a) = 0 and F_X(b) = 1, then F_X^{−1} exists and F_X^{−1}(U) has the same distribution as X, because for any x ∈ [a, b],

P(F_X^{−1}(U) ≤ x) = P(U ≤ F_X(x)) = F_X(x) = P(X ≤ x).

This is a special case of the probability integral transform and is very useful for generating random samples of X with computer-generated 'uniform random numbers'. For example, −(1/β) log(1 − U) is Exp(β) and so is (1/β) log(1/U) (as U and 1 − U are both U(0, 1)).
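The probability integral transform is exactly how simple simulators generate exponential samples. A sketch of ours, assuming β = 2 (an arbitrary choice) and comparing the empirical distribution against F(1) = 1 − e^{−β}:

```python
import random
from math import exp, log

# Probability integral transform (Example 6.30): if U ~ U(0,1), then
# -log(1 - U)/beta has the Exp(beta) distribution.
random.seed(0)   # fixed seed so the check below is reproducible
beta = 2.0
samples = [-log(1 - random.random()) / beta for _ in range(100_000)]

# Compare the empirical proportion of samples <= 1 with F(1) = 1 - e^{-beta}.
empirical = sum(s <= 1 for s in samples) / len(samples)
theoretical = 1 - exp(-beta)
assert abs(empirical - theoretical) < 0.01
```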
There is one particularly important function which enables us to get the cumulative distribution function of any normally distributed random variable, using just the standard normal tables (i.e. Φ, the cumulative distribution function of the standard normal).
Theorem 6.31 (Standardizing the Normal Distribution) Suppose µ ∈ R and σ > 0. If X ∼ N(µ, σ²) and Z ∼ N(0, 1), then

(X − µ)/σ ∼ N(0, 1), and σZ + µ ∼ N(µ, σ²).

We can prove this via a change of variable in the integral for the cumulative distribution function. A much shorter proof goes via the moment generating function, which will be introduced later.
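In practice, Theorem 6.31 is applied as P(a ≤ X ≤ b) = Φ((b − µ)/σ) − Φ((a − µ)/σ). A sketch of ours (the function name `normal_interval` and the example parameters are our own; Φ is computed through the error function rather than a table):

```python
from math import erf, sqrt

Phi = lambda z: 0.5 * (1 + erf(z / sqrt(2)))   # standard normal cdf

def normal_interval(a, b, mu, sigma):
    """P(a <= X <= b) for X ~ N(mu, sigma^2), by standardizing:
    P(a <= X <= b) = Phi((b - mu)/sigma) - Phi((a - mu)/sigma)."""
    return Phi((b - mu) / sigma) - Phi((a - mu) / sigma)

# e.g. for X ~ N(10, 4): P(8 <= X <= 12) = Phi(1) - Phi(-1) ≈ 0.6827
print(round(normal_interval(8, 12, mu=10, sigma=2), 4))
```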
Bibliographical notes
The normal distribution appeared already in work of de Moivre, and is sometimes known as the
Gaussian distribution after Carl Friedrich Gauss (1777–1855). The name ‘normal distribution’ was
applied by biometrician Sir Francis Galton (1822–1911) and statistician Karl Pearson (1857–1936) to
mark the distribution’s ubiquity in biometric data.
There is a great deal of subtle and interesting mathematics on the subject of what functions are
integrable over what sets. You may see some of this in the third year probability course. The Riemann
integral that we use here is sufficient for integrating piecewise continuous functions over finite unions of
intervals. Here, we will only consider continuous random variables which have a piecewise continuous
probability density function. Other approaches to integration are required to deal with more general
functions.
For instance, for infinite countable unions of intervals, we would need the Lebesgue integral (see
for instance [Ros07]). More precisely, f still determines the value of P (X P A) when A is an infinite
countable union of intervals, but that value is not necessarily given by the Riemann integral.
Many random variables are neither discrete nor continuous: a simple example is to take the sum
X ` Y of a discrete random variable X and a continuous random variable Y . To deal with these (and
other) general random variables we need the mathematical framework of measure theory; you will see
some of this if you take later probability courses.
Chapter 7

Multivariate random variables

Goals: Understand a multivariate random variable as a function from the sample space to a higher dimensional space.
Know and recognize jointly distributed discrete random variables, their joint probability mass
function, marginal probability mass functions, and conditional probability mass functions. Know
the partition theorem for discrete random variables, as well as independence, and how to apply
these. Understand the properties of and links between these different probability mass functions,
and how those properties and links arise from the axioms of probability distributions.
Know and recognize continuously distributed random variables, their joint probability density func-
tion, marginal probability density functions, and conditional probability density functions. Know
the partition theorem for jointly continuously distributed random variables, as well as independence,
and how to apply these. Understand the properties of and links among these different probability
density functions.
Understand how to work with functions of multiple random variables.
It is essential for most useful applications of probability to have a theory which can handle many
random variables simultaneously. To start, we consider having two random variables. The theory for
more than two random variables is an obvious extension of the bivariate case covered below.
Remember that, formally, random variables are simply mappings from Ω into some set. A bivariate
random variable is a mapping from Ω into a Cartesian product of two sets. Of course, a bivariate
random variable is a random variable according to our original definition, just with a special kind of
set of possible values. However, the concept of bivariate random variable is a useful one if the individual
components of the bivariate variable have their own meaning or interest.
Definition 7.1 (Bivariate Random Variable) Consider two random variables, X : Ω → X(Ω) and Y : Ω → Y(Ω). The mapping (X, Y) : Ω → X(Ω) × Y(Ω) defined by

(X, Y)(ω) := (X(ω), Y(ω))

is then a bivariate random variable.
[Figure: the bivariate random variable (X, Y) maps each ω ∈ Ω to the point (X(ω), Y(ω)) ∈ X(Ω) × Y(Ω).]
Example 7.2 If we throw two dice, then X and Y can be the values from each die, and the joint
outcome pX, Y q forms a bivariate random variable.
Similarly to before, for any A ⊆ X(Ω) × Y(Ω), we write '(X, Y) ∈ A' to denote the event {ω ∈ Ω : (X(ω), Y(ω)) ∈ A}. For any x ∈ X(Ω) and y ∈ Y(Ω), we write 'X = x, Y = y' to mean the event (X, Y) ∈ {(x, y)}. We sometimes also write {(X, Y) = (x, y)} and {(X, Y) ∈ A} to emphasize that these are sets; for instance, {X = x, Y = y} = {ω ∈ Ω : X(ω) = x and Y(ω) = y}, and so on.
Definition 7.3 (Joint Probability Mass Function) The joint probability mass function p of two discrete random variables X : Ω → X(Ω) and Y : Ω → Y(Ω) is defined by

p(x, y) := P(X = x, Y = y) for all x ∈ X(Ω) and y ∈ Y(Ω).

To avoid ambiguity, we sometimes write p_{X,Y} if it is not clear from the context which random variables the two arguments refer to; note that p_{X,Y} is not the same as p_{Y,X}.
Again, by C7, the joint probability mass function determines the joint probability distribution of X and Y, and the values of the joint probability mass function sum to one:

Theorem 7.4 Let X and Y be discrete random variables. Then we have that

P((X, Y) ∈ A) = Σ_{(x,y)∈A} p(x, y) for all A ⊆ X(Ω) × Y(Ω).     (7.1)
42
In other words, the bivariate random variable pX, Y q is a discrete random variable with p (x, y) as
its probability mass function: a bivariate random variable whose two components are discrete is itself
discrete.
Corollary 7.5 Let X and Y be discrete random variables. Then the values of the joint probability
mass function are non-negative and sum to one:
Σ_{(x,y)∈X(Ω)×Y(Ω)} p (x, y) = 1.
Example 7.6 In a card game played with a standard 52 card deck, hearts are worth 1, the queen of
spades is worth 13 and all other cards worth 0. Let X, Y denote the values of the first and second cards
dealt (without replacement, as usual). The possible outcomes of the bivariate random variable pX, Y q
are
{p0, 0q, p1, 0q, p0, 1q, p1, 1q, p0, 13q, p13, 0q, p1, 13q, p13, 1q}
(not p13, 13q). With a well shuffled deck, the event A that the pair does not include the queen of spades,
is
{X ď 1, Y ď 1} “ {pX, Y q P A}
where
A “ {p0, 0q, p1, 0q, p0, 1q, p1, 1q}.
So, by Equation (7.1), and a few simple counting arguments,
P ((X, Y) ∈ A) = (38/52)(37/51) + (1/4)(38/51) + (38/52)(13/51) + (1/4)(12/51) = (51 × 50)/(52 × 51) = 50/52.
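As a quick sanity check (a Python sketch, not part of the original notes; the numeric encoding of the deck is ours), the value 50/52 can be confirmed both exactly and by simulation:

```python
import random
from fractions import Fraction

# Exact: the pair avoids the queen of spades iff neither of the first two
# cards is that single card: (51/52) * (50/51) = 50/52.
exact = Fraction(51, 52) * Fraction(50, 51)

# Monte Carlo: encode hearts as 1, the queen of spades as 13, others as 0.
random.seed(0)
deck = [1] * 13 + [13] + [0] * 38          # 52 cards in total
trials = 100_000
hits = sum(1 for _ in range(trials)
           if max(random.sample(deck, 2)) <= 1)   # event A: both values <= 1
estimate = hits / trials

assert exact == Fraction(50, 52)
assert abs(estimate - float(exact)) < 0.01
```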
In a multivariate context, the probability mass functions p (x) and p (y) are also called marginal
probability mass functions. They are connected to the joint probability mass function p (x, y) as follows.
Corollary 7.7 (Marginal Probability Mass Function) Let X and Y be jointly distributed discrete
random variables. Then X and Y are (each separately) discretely distributed, with marginal
probability mass functions
pX (x) = Σ_{y∈Y(Ω)} p (x, y) for all x ∈ X(Ω),   pY (y) = Σ_{x∈X(Ω)} p (x, y) for all y ∈ Y(Ω).
To prove for instance the first equality, simply note that {X = x} = ∪_{y∈Y(Ω)} {X = x, Y = y}, and
apply A4.
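To illustrate the marginalisation formula in code (a Python sketch, not part of the notes; the joint pmf used here is the one from Example 8.12 later in the chapter):

```python
from collections import defaultdict
from fractions import Fraction

# A joint pmf, stored as a dictionary from (x, y) pairs to probabilities.
p = {(-1, 0): Fraction(1, 4), (1, 0): Fraction(1, 4),
     (0, 1): Fraction(1, 4), (1, 1): Fraction(1, 4)}

# Marginals: p_X(x) = sum over y of p(x, y), and symmetrically for p_Y.
pX, pY = defaultdict(Fraction), defaultdict(Fraction)
for (x, y), prob in p.items():
    pX[x] += prob
    pY[y] += prob

# Each marginal is itself a pmf: non-negative and summing to one.
assert sum(pX.values()) == 1 and sum(pY.values()) == 1
assert pX[1] == Fraction(1, 2)
```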
Definition 7.8 (Conditional Probability Mass Function) Let X and Y be discrete random variables.
For y ∈ Y(Ω), the conditional probability mass function of X given Y = y is defined by
pX|Y (x | y) := P (X = x | Y = y) = pX,Y (x, y) / pY (y)
for all x ∈ X(Ω) and y ∈ Y(Ω) such that pY (y) > 0, and the conditional probability mass function
of Y given X = x by
pY|X (y | x) := P (Y = y | X = x) = pX,Y (x, y) / pX (x)
for all y ∈ Y(Ω) and x ∈ X(Ω) such that pX (x) > 0.
There’s nothing new here, apart from the notation: this is just a particular case of the usual definition
of conditional probability of one event given another, e.g.,
P (X = x | Y = y) = P (X = x, Y = y) / P (Y = y) = pX,Y (x, y) / pY (y).
Exercise 125 (*) See problem sheet.
There is also a version of P4 (partition theorem or, law of total probability) for discrete random
variables; again, only the notation is new here.
Theorem 7.9 (Partition Theorem for Discrete Random Variables) Let X and Y be discrete
random variables. Then,
pX (x) = Σ_{y∈Y(Ω)} pX|Y (x | y) pY (y).
Moreover, if X is real-valued, then for any functions g : Y(Ω) → R and h : Y(Ω) → R with g ≤ h, we
can write
P (g(Y) ≤ X ≤ h(Y)) = Σ_{y∈Y(Ω)} P (g(y) ≤ X ≤ h(y) | Y = y) P (Y = y)
= Σ_{y∈Y(Ω)} Σ_{x∈[g(y),h(y)]∩X(Ω)} pX|Y (x | y) pY (y).
(Recall that two discrete random variables X and Y are independent when their joint probability mass
function factorises: pX,Y (x, y) = pX (x) pY (y) for all x and y.) It follows that for independent discrete
random variables X and Y, pX|Y (x | y) = pX (x) whenever pY (y) > 0, and pY|X (y | x) = pY (y)
whenever pX (x) > 0.
Corollary 7.11 If two discrete random variables X and Y are independent, then
P (X ∈ A, Y ∈ B) = P (X ∈ A) P (Y ∈ B) for all A ⊆ X(Ω) and B ⊆ Y(Ω).
Later (when we talk about limit theorems such as the law of large numbers) we will need to talk
about infinite sequences of independent random variables. A (possibly infinite) collection of random
variables Xi , i P I, is independent if every finite nonempty subcollection A Ď I is independent,
which in the discrete case means
P (∩_{i∈A} {Xi = xi}) = ∏_{i∈A} pXi (xi) for all possible values xi, i ∈ A.
A similar definition applies in the continuous case, which is the subject of the next section.
7.3 Jointly continuously distributed random variables
Also see [BH15, Section 7.1], [DS02, Sections 3.4, 3.5, and 3.6], or [Sti03, Sections 8.1 and 8.3].
Definition 7.12 (Joint Probability Density Function) Consider two real-valued random variables
X : Ω → R and Y : Ω → R. We say that X and Y are jointly continuously distributed when there is
a non-negative piecewise continuous function f : R² → R, called the joint probability density function,
such that
P (X ∈ [a, b], Y ∈ [c, d]) = ∫_a^b ∫_c^d f (x, y) dy dx for all a ≤ b and c ≤ d.
To avoid ambiguity we sometimes write fX,Y (x, y) for the joint probability density of X and Y.
The interpretation of f (x, y) is that, for infinitesimal dx and dy,
P (X ∈ [x, x + dx], Y ∈ [y, y + dy]) ≈ f (x, y) dx dy.
As before, the joint probability density function f (x, y) determines the joint probability P (pX, Y q P A)
for most events A. More precisely, remember the definition of type I and II regions in the theory of
multiple integration: a type I region is of the form
{(x, y) : a0 ≤ x ≤ a1, φ1(x) ≤ y ≤ φ2(x)}
for some continuous functions φ1 and φ2, and a type II region is of the form
{(x, y) : ψ1(y) ≤ x ≤ ψ2(y), b0 ≤ y ≤ b1}
for some continuous functions ψ1 and ψ2.
[Figure: a type I region lies between the graphs of φ1 and φ2 over [a0, a1]; a type II region lies between the graphs of ψ1 and ψ2 over [b0, b1].]
As we know from calculus, unions of regions of these types are precisely the regions over which we can
integrate. So, we have:
Theorem 7.13 If X and Y are jointly continuously distributed, then for any A Ď R2 that is a finite
union of type I and type II regions:
P ((X, Y) ∈ A) = ∬_A f (x, y) dx dy. (7.2)
Corollary 7.14 Let X and Y be jointly continuously distributed random variables. Then their joint
probability density function integrates to one:
∬_{R²} f (x, y) dx dy = 1.
Example 7.15 Suppose that X and Y have joint probability density function f (x, y) = c(x² + y)
for −1 ≤ x ≤ 1 and 0 ≤ y ≤ 1 − x², with f (x, y) = 0 otherwise. What is the value of c? As
∫_{−1}^{1} ∫_0^{1−x²} (x² + y) dy dx = (1/2) ∫_{−1}^{1} (1 − x⁴) dx = 4/5,
we find c = 5/4.
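A numerical check of the normalising constant (a Python sketch with our own discretisation, not part of the notes): a midpoint Riemann sum of (5/4)(x² + y) over the region should come out close to 1.

```python
# Midpoint Riemann sum of f(x, y) = c (x^2 + y) over the region
# -1 <= x <= 1, 0 <= y <= 1 - x^2, checking that c = 5/4 normalises it.
c = 5 / 4
n = 400
total = 0.0
for i in range(n):
    x = -1 + (i + 0.5) * 2 / n           # midpoint in x over [-1, 1]
    top = 1 - x * x                      # upper limit of y for this x
    for j in range(n):
        y = (j + 0.5) * top / n          # midpoint in y over [0, 1 - x^2]
        total += c * (x * x + y) * (2 / n) * (top / n)
assert abs(total - 1.0) < 1e-3
```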
So, if X and Y are jointly continuously distributed, then both X and Y are also continuously
distributed separately. Unlike the discrete case, the converse, however, is not true in general, as the
following example shows.
Example 7.18 Let X be a continuously distributed random variable, and let Y :“ 2X. Then both
X and Y are continuously distributed separately, however X and Y are not jointly continuously
distributed. Indeed, suppose that X and Y were jointly continuously distributed with density function
f (x, y). Then, with A = {(x, y) : 2x = y},
∬_A f (x, y) dx dy = ∫_{−∞}^{+∞} ∫_{2x}^{2x} f (x, y) dy dx = 0,
because the inner integral runs over an interval of zero length. This shows that P (2X = Y) = 0 for
any two jointly continuously distributed random variables X and Y. But for our choice of X and Y,
obviously P (2X = Y) = 1; so, by contradiction, X and Y cannot be jointly continuously distributed.
Definition 7.19 (Conditional Probability Density Function) Let X and Y be jointly continuously
distributed random variables. The conditional probability density function of X at Y = y is defined by
fX|Y (x | y) := fX,Y (x, y) / fY (y) for all x, y ∈ R such that fY (y) > 0,
and similarly, the conditional probability density function of Y at X = x is
fY|X (y | x) := fX,Y (x, y) / fX (x) for all x, y ∈ R such that fX (x) > 0.
Roughly speaking, fX|Y (x | y) should be thought of as the probability density function of X
conditional on Y “ y. Because the event Y “ y has probability 0, this interpretation needs a bit of
work to realize rigorously. A formal manipulation, that can be made rigorous using some analysis,
goes as follows:
Theorem 7.20 (Partition Theorem for Jointly Continuously Distributed Random Variables)
Let X and Y be jointly continuously distributed random variables. Then,
fX (x) = ∫_{−∞}^{+∞} fX|Y (x | y) fY (y) dy.
Moreover, for any piecewise continuous functions g : R → R and h : R → R with g ≤ h, we can write,
P (g(Y) ≤ X ≤ h(Y)) = ∫_{−∞}^{+∞} ( ∫_{g(y)}^{h(y)} f (x | y) dx ) f (y) dy. (7.3)
We can formulate the above partition theorem in a way that is more similar to Theorem 7.9
which covered the discrete case. However, to do this, we need to introduce some slightly contorted
notation, namely, for any event A, and any continuously distributed random variable Y, we define
P (A | Y = y) := lim_{h↓0} P (A | Y ∈ [y, y + h])
whenever P (A | Y ∈ [y, y + h]) > 0 for all h sufficiently small—this happens when f (y) is continuous
at y and f (y) > 0. Note that our usual definition of conditional probability does not apply,
because P (Y “ y) “ 0 for continuously distributed Y , so in the above, P (A | Y “ y) is not really
a conditional probability.
One should be forewarned that the notation P (A | Y “ y) for continuously distributed Y can lead
to extremely confusing issues, such as for instance Borel’s paradox (see http://en.wikipedia.
org/wiki/Borel%E2%80%93Kolmogorov_paradox).
Anyway, for now, proceeding in ignorance about any pitfalls, with this notation we have that
P (g(y) ≤ X ≤ h(y) | Y = y) = ∫_{g(y)}^{h(y)} f (x | y) dx.
Definition 7.21 (Independence of Jointly Continuously Distributed Random Variables)
Two jointly continuously distributed random variables X and Y are independent whenever
fX,Y (x, y) = fX (x) fY (y) for all x, y ∈ R.
Clearly, for independent jointly continuously distributed X and Y , fX|Y (x | y) “ fX (x) whenever
fY (y) ą 0, and fY |X (y | x) “ fY (y) whenever fX (x) ą 0.
Corollary 7.22 If two jointly continuously distributed random variables X and Y are independent,
then
P (X P A, Y P B) “ P (X P A) P (Y P B)
for all A and B Ď R that are finite unions of intervals.
In other words, if X and Y are independent, then the events {X P A} and {Y P B} are independent
for most A and B Ď R of practical interest.
Because we can write f (x, y) as e^{−x} · 3e^{−3y} (for x ≥ 0 and y ≥ 0), it follows that X ~ Exp(1) and
Y ~ Exp(3), and they are independent. We can calculate things like, for a > 0,
P (aY < X) = ∫_0^∞ ( ∫_{ay}^∞ f (x | y) dx ) f (y) dy
= ∫_0^∞ ( ∫_{ay}^∞ e^{−x} dx ) 3e^{−3y} dy = ∫_0^∞ e^{−ay} · 3e^{−3y} dy = 3/(3 + a).
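The closing computation can be checked by simulation (a Python sketch, not part of the notes; the value a = 2 is our own choice):

```python
import random

# Monte Carlo check of P(aY < X) = 3/(3 + a) for independent
# X ~ Exp(1) and Y ~ Exp(3).  random.expovariate takes the rate parameter.
random.seed(1)
a = 2.0
trials = 200_000
hits = sum(1 for _ in range(trials)
           if a * random.expovariate(3.0) < random.expovariate(1.0))
estimate = hits / trials
assert abs(estimate - 3 / (3 + a)) < 0.01   # 3/(3+2) = 0.6
```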
Suppose X : Ω → X(Ω) and Y : Ω → Y(Ω) are (discrete or continuous) random variables, and
g : X(Ω) × Y(Ω) → S is some function assigning a value g(x, y) ∈ S to each point (x, y). Then g(X, Y)
is also a random variable, namely the outcome of a 'new experiment' obtained by running the 'old
experiments' to produce values x for X and y for Y, and then evaluating g(x, y).
[Figure: composing the bivariate random variable (X, Y) : Ω → X(Ω) × Y(Ω) with g gives g ∘ (X, Y), sending ω to g(X(ω), Y(ω)).]
Formally, g(X, Y) := g ∘ (X, Y), or in more specific terms, the random variable g(X, Y) : Ω → S is
defined by:
g(X, Y)(ω) := g(X(ω), Y(ω)) for all ω ∈ Ω.
For example:
P (g(X, Y) ∈ A) = P ({ω ∈ Ω : g(X(ω), Y(ω)) ∈ A}) for all A ⊆ S.
Example 7.24 For any random variables X and Y, the quantities X + Y, X − Y, XY, min(X, Y),
e^{t(X+Y)}, and so on, are all random variables as well.
Chapter 8
Expectation
In a relative frequency interpretation (discussed earlier in Section 4.2), suppose that we run n trials
on an experiment where we observe the outcome of some real-valued random variable X : Ω Ñ R in
each trial. Let xi denote the observed value of X in the ith trial; the sequence of observations x1, x2,
. . . , xn is called a sample. The sample mean is then simply (1/n) Σ_{i=1}^{n} xi. As a mathematical idealization,
we may suppose that there is a unique, empirical limiting value for the sample mean, as n tends to
infinity, which we call the expectation of X.
In a betting interpretation (discussed earlier in Section 4.3), you can simply consider your ‘fair
price’ for a bet which pays X; that price, we call your expectation of X.
The idea of expectation is very interesting mathematically, and also provides ways to use probability
in a host of practical applications. Regardless of interpretation, the expectation of X can be connected
to the probability mass function p (x) (if X is discrete) or the probability density function f (x) (if X
is continuously distributed) in the following way:
Definition 8.1 (Expectation) For any real-valued random variable X, the expectation (also
called expected value or mean) of X, denoted as E (X), is defined as:
E (X) := Σ_{x∈X(Ω)} x p (x) if X is discrete, and (8.1)
E (X) := ∫_{−∞}^{∞} x f (x) dx if X is continuously distributed. (8.2)
Example 8.2 Consider the following ‘game’. You pay Jimmy a pound and then you both throw a fair
die. If you get the higher number you get back the difference in pounds, otherwise you lose your pound.
Call the return from a game X, with possible values 0, 1, 2, 3, 4 and 5. By counting outcomes,
x     | 0     | 1    | 2    | 3    | 4    | 5
p (x) | 21/36 | 5/36 | 4/36 | 3/36 | 2/36 | 1/36
so that
E (X) = Σ_{x=0}^{5} x p (x) = (0 × 21 + 1 × 5 + 2 × 4 + 3 × 3 + 4 × 2 + 5 × 1)/36 = 35/36. (8.3)
We can interpret this value as meaning that over a long series of games you will get back £35 for every
£36 paid out.
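Since the 36 outcomes are equally likely, the value 35/36 can be confirmed by brute-force enumeration (a Python sketch, not part of the notes):

```python
from fractions import Fraction

# Enumerate all 36 equally likely outcomes of the two dice and average
# the return X: the difference if your roll is higher, otherwise 0.
total = Fraction(0)
for you in range(1, 7):
    for jimmy in range(1, 7):
        x = you - jimmy if you > jimmy else 0
        total += Fraction(x, 36)
assert total == Fraction(35, 36)
```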
Example 8.3 You wrapped 200 prizes for the Lucky Dip at the village fete – 4 prizes worth £10, 16
prizes worth £4, 30 worth 80p and 150 worth 20p. Some other people also wrapped prizes. You didn't
see their prizes but think they were similar to yours, so your fair price for a ticket is (4 × 10 + 16 × 4 +
30 × 0.8 + 150 × 0.2)/200 = 158/200, i.e. 79p.
Exercise 136 (*) See problem sheet.
If the range of possible values for a random variable X is unbounded, then the sum or integral
in Definition 8.1 may fail to exist. In this case, the preceding formulas may still be used to assign
a meaningful expectation in some cases, provided we interpret them with care.
For example, if X is discrete with probability mass function p (x), consider
E (X) = Σ_{x∈X(Ω)} x p (x) = Σ_{x∈X(Ω): x≥0} x p (x) + Σ_{x∈X(Ω): x≤0} x p (x),
where we abbreviate the first sum by S+ and the second by S−; the individual sums S+ and S−
always exist, but may be infinite. In fact, if we write X+ = max(X, 0) and X− = max(−X, 0), then
X = X+ − X−, E (X+) = S+ and E (X−) = −S−. To see this, note for example that X+ is a random
variable with pX+ (x) = pX (x) for x > 0 and pX+ (0) = P (X ≤ 0), but only the positive terms
contribute to E (X+). It makes sense to say that E (X) = E (X+) − E (X−) (being possibly −∞ or
+∞) as long as at most one of S+ and S− is infinite, using the rules ∞ − x = ∞ and x − ∞ = −∞
for finite x. (There is no sensible interpretation of ∞ − ∞.) A similar argument applies in the
continuous case, with integrals instead of sums. This is summarized in the following table.
is summarized in the following table.
E (X)        | E (X+) < ∞      | E (X+) = ∞
E (X−) < ∞   | E (X+) − E (X−) | +∞
E (X−) = ∞   | −∞              | undefined
Example 8.6 Suppose that X is continuously distributed with probability density function given by
f (x) = αx^{−1−α} if x ≥ 1, and f (x) = 0 otherwise,
where α > 0. Here S− = 0, while S+ = ∫_1^∞ x · αx^{−1−α} dx, which equals α/(α − 1) if α > 1 but is
infinite if α ≤ 1; so E (X) = α/(α − 1) for α > 1, and E (X) = +∞ for α ≤ 1.
Let X be a discrete random variable, and let g : XpΩq Ñ R be a real-valued function. As seen in
Section 6.11, gpXq :“ g ˝ X is again a random variable; in fact, gpXq is a real-valued random variable.
To find the expectation of g(X) according to the definition, Equation (8.1), we need to find the
probability mass function p_{g(X)} first. It turns out however that we can express E (g(X)) directly in
terms of pX, saving us the effort of having to calculate p_{g(X)} from pX.
Define S := {g(x) : x ∈ X(Ω)}. Then, for any y ∈ S,
p_{g(X)} (y) = P (g(X) = y) = Σ_{x∈X(Ω)} P (g(X) = y | X = x) P (X = x) = Σ_{x∈X(Ω): g(x)=y} p (x), (8.6)
so,
E (g(X)) = Σ_{y∈S} y p_{g(X)} (y) = Σ_{y∈S} y Σ_{x∈X(Ω): g(x)=y} p (x) = Σ_{y∈S} Σ_{x∈X(Ω): g(x)=y} y p (x)
= Σ_{x∈X(Ω)} Σ_{y∈S: y=g(x)} y p (x) = Σ_{x∈X(Ω)} g(x) p (x),
where we applied the definition of expectation, Equation (8.6), distributivity, change of order of sum-
mation, and distributivity again. A similar result can be proven when X is continuously distributed.
Concluding, we have the following result, which is sometimes known as the Law of the Unconscious
Statistician:
Theorem 8.7 (Expectation of a Function of a Random Variable) For any random variable
X : Ω → X(Ω), and any function g : X(Ω) → R,
E (g(X)) = Σ_{x∈X(Ω)} g(x) p (x) if X is discrete, and
E (g(X)) = ∫_{−∞}^{∞} g(x) f (x) dx if X is continuously distributed.
Example 8.8 Suppose X takes values −2, −1, 0, 1, 2, 3 each with probability 1/6. Then
E (X²) = ((−2)² + (−1)² + 0 + 1 + 2² + 3²)/6 = 19/6;
E (sin(πX/4)) = (−1 − 1/√2 + 0 + 1/√2 + 1 + 1/√2)/6 = 1/(6√2);
and so on.
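The point of the theorem is that no intermediate probability mass function is needed; a direct Python transcription of E(g(X)) = Σ g(x) p(x) (our own sketch, not part of the notes) reproduces both values:

```python
import math
from fractions import Fraction

# LOTUS: E(g(X)) = sum over x of g(x) p(x), without first finding p_{g(X)}.
support = [-2, -1, 0, 1, 2, 3]          # each value has probability 1/6

e_x2 = sum(Fraction(x * x, 6) for x in support)
assert e_x2 == Fraction(19, 6)

e_sin = sum(math.sin(math.pi * x / 4) / 6 for x in support)
assert abs(e_sin - 1 / (6 * math.sqrt(2))) < 1e-12
```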
Example 8.9 If X ~ U(−1, 1) then f (x) = 1/2 for x ∈ [−1, 1], and zero elsewhere, so
E (X²) = ∫_{−1}^{1} x² · (1/2) dx = 1/3.
Exercise 148 (**) See problem sheet.
Similar comments apply here about extensions of E (gpXq) to include `8 or ´8 as at the end
of the previous section.
Example 8.10 Suppose X ~ U(−1, 1), i.e., X is uniformly distributed on the interval (−1, 1), and
we set g(x) = 1/x for x ≠ 0 and g(0) = 0. Then E (g(X)) is not defined because
E (g(X)+) = ∫_{−1}^{0} 0 dx + ∫_{0}^{1} 1/(2x) dx = ∞ and similarly E (g(X)−) = ∞.
For multiple random variables, the Law of the Unconscious Statistician reads as follows:
Theorem 8.11 For any random variables X and Y, and any function g : X(Ω) × Y(Ω) → R,
E (g(X, Y)) = Σ_{x∈X(Ω)} Σ_{y∈Y(Ω)} g(x, y) p (x, y) if X and Y are discrete, and
E (g(X, Y)) = ∬_{R²} g(x, y) f (x, y) dx dy if X and Y are jointly continuously distributed.
Example 8.12 Consider discrete random variables X and Y with joint probability mass function:
p (x, y) | x = −1 | x = 0 | x = 1
y = 0    | 1/4    | 0     | 1/4
y = 1    | 0      | 1/4   | 1/4
Example 8.13 Let X and Y be jointly continuously distributed random variables, with
f (x, y) = 1 if (x, y) ∈ [0, 1]², and f (x, y) = 0 otherwise.
Then E (XY) = ∫_0^1 ∫_0^1 xy dx dy = (∫_0^1 x dx)(∫_0^1 y dy) = (1/2)² = 1/4.
8.3 Linearity of expectation
Also see [BH15, Section 4.2], [DS02, Sections 4.2 and 4.6], or [Sti03, Section 5.3].
Theorem 8.14 (Linearity of Expectation I) For any random variable X, and any constants
α and β ∈ R,
E (αX ` β) “ αE (X) ` β.
Theorem 8.15 (Linearity of Expectation II) For any two random variables X and Y,
E (X ` Y ) “ E (X) ` E (Y ) .
Proof Consider the multiple random variable pX, Y q and the function gpx, yq “ x ` y. By the Law
of the Unconscious Statistician (Theorem 8.11) we get in the discrete case
E (g(X, Y)) = Σ_{x∈X(Ω)} Σ_{y∈Y(Ω)} (x + y) p (x, y) = Σ_{x∈X(Ω)} x Σ_{y∈Y(Ω)} p (x, y) + Σ_{y∈Y(Ω)} y Σ_{x∈X(Ω)} p (x, y)
= Σ_{x∈X(Ω)} x pX (x) + Σ_{y∈Y(Ω)} y pY (y) = E (X) + E (Y),
as claimed. A similar calculation applies in the jointly continuous case, and the extension to more than
two random variables follows by induction.
Example 8.16 Suppose that Y1, Y2, . . . , Yn are random variables such that each Yi ~ Bin(1, p), i.e.
they take value 1 with probability p, and value 0 otherwise. Then E (Yi) = p and so E (Y1 + · · · + Yn) =
np. If the Yi are independent then X = Σ_{i=1}^{n} Yi ~ Bin(n, p). So, we have shown that E (X) = np for
any X ~ Bin(n, p).
This is a much quicker proof than computing using the probability mass function; and note that the
independence is not necessary here.
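A simulation sketch (ours, not from the notes; n = 10 and p = 0.3 are illustrative) confirms E(X) = np for a sum of Bernoulli variables:

```python
import random

# Simulate X ~ Bin(n, p) as a sum of n Bernoulli(p) variables and
# compare the sample mean with np (here np = 3).
random.seed(2)
n, p, trials = 10, 0.3, 100_000
mean = sum(sum(random.random() < p for _ in range(n))
           for _ in range(trials)) / trials
assert abs(mean - n * p) < 0.05
```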
We have proved these statements for discrete and continuous random variables (because we have
defined expectation only in those two cases!), but linearity of expectation is true for all random
variables, and follows from the general measure-theoretic approach to probability theory, in which
expectation is defined as a Lebesgue integral with respect to a probability measure; this includes
the discrete and continuous settings we study as special cases. Students who take later probability
courses will see some of this general approach. The fact that in this course we cannot even talk
properly about, say, a sum of a discrete plus a continuous random variable, is a serious limitation
that has no elegant fix in any approach that does not use measure theory.
8.4 Variance and covariance
Also see [BH15, Sections 4.6 and 7.3], [DS02, Sections 4.3 and 4.6], or [Sti03, Sections 4.3, 5.3 and 8.5].
As mentioned earlier, we can interpret the expectation of X as a long-run average of a sample from
distribution X. A popular and mathematically convenient way to measure the variability of X—i.e.
to measure how much X varies from E (X) in the long run—goes via the expectation of the random
variable pX ´ E (X) q2 :
Definition 8.17 Let X be any real-valued random variable. The variance of X is defined as
Var (X) := E ((X − E (X))²),
and the standard deviation of X as SD (X) := √Var (X).
Note that both Var (X) and SD (X) are non-negative numbers.
From Theorem 8.7, we immediately derive the following expressions for the variance:
Var (X) = Σ_{x∈X(Ω)} (x − E (X))² p (x) if X is discrete, and
Var (X) = ∫_{−∞}^{∞} (x − E (X))² f (x) dx if X is continuously distributed.
Example 8.18 If X takes values 0, 10, 20 each with probability 1/3, then
E (X) = 0 × (1/3) + 10 × (1/3) + 20 × (1/3) = 10.
Consequently, using this value,
Var (X) = (1/3) × ((0 − 10)² + (10 − 10)² + (20 − 10)²) = 200/3,
and so SD (X) = √(200/3) ≈ 8.16.
Example 8.19 Let Z ~ N(0, 1). By Equation (8.5), E (Z) = 0 and by Exercise 147, E (Z²) = 1.
Consequently,
Var (Z) = E ((Z − E (Z))²) = E (Z²) = 1, (8.7)
and SD (Z) = √Var (Z) = 1.
For two real-valued random variables, we can ask ourselves how they vary jointly:
Definition 8.20 (Covariance) Let X and Y be any real-valued random variables. The covariance
of X and Y is defined as
Cov (X, Y) := E ((X − E (X))(Y − E (Y))).
We say that X and Y are positively correlated, uncorrelated or negatively correlated according to
whether their covariance is positive, zero, or negative, respectively.
From Theorem 8.11, we immediately derive the following expressions for the covariance:
Cov (X, Y) = Σ_{x∈X(Ω)} Σ_{y∈Y(Ω)} (x − E (X))(y − E (Y)) p (x, y)
if X and Y are discrete, and
Cov (X, Y) = ∬_{R²} (x − E (X))(y − E (Y)) f (x, y) dx dy
if X and Y are jointly continuously distributed, provided that the double sum or double integral exists.
Property 8.22 (Symmetry of Covariance) For any real-valued random variables X and Y,
Cov (X, Y) = Cov (Y, X).
As immediate consequences of linearity of expectation (Theorems 8.14 and 8.15) we obtain the
formulæ:
Corollary 8.23 For any real-valued random variable X, and any constants α and β ∈ R,
Var (αX + β) = α² Var (X).
For any real-valued random variables X, Y, and Z, and any constants α, β, γ, and δ ∈ R,
Cov (αX + β, γY + δ) = αγ Cov (X, Y), (8.8a)
Cov (X + Y, Z) = Cov (X, Z) + Cov (Y, Z), (8.8b)
Cov (X, Y + Z) = Cov (X, Y) + Cov (X, Z). (8.8c)
Equations (8.8a), (8.8b), and (8.8c) mean that Cov is a bilinear operator.
Example 8.24 Let X ~ N(µ, σ²). Then, Z := (X − µ)/σ ~ N(0, 1) by Theorem 6.31, E (Z) = 0 by
Equation (8.5), and Var (Z) = 1 by Equation (8.7). Consequently, as X = σZ + µ,
E (X) = σE (Z) + µ = µ and Var (X) = σ² Var (Z) = σ².
In other words, the parameters µ and σ² of a normal distribution correspond to the expectation and
variance, respectively.
Example 8.26 If X takes values 0, 10, 20 each with probability 1/3, then
E (X) = 0 × (1/3) + 10 × (1/3) + 20 × (1/3) = 10,
E (X²) = 0² × (1/3) + 10² × (1/3) + 20² × (1/3) = 500/3.
Consequently,
Var (X) = E (X²) − (E (X))² = 500/3 − 100 = 200/3.
Example 8.28 Suppose X takes values 0, 1, 2 with probabilities 1/4, 1/2, 1/4. Let Y := X². What
is Var (X), Var (Y), and Cov (X, Y)?
Clearly, E (X) = 1. Next, Y = X² takes values 0, 1, 4 with probabilities 1/4, 1/2, 1/4, so E (Y) = E (X²) =
0 × (1/4) + 1 × (1/2) + 4 × (1/4) = 3/2, and hence Var (X) = E (X²) − (E (X))² = 3/2 − 1 = 1/2.
Similarly, E (Y²) = E (X⁴) = 0 + 1/2 + 16/4 = 9/2, so Var (Y) = 9/2 − (3/2)² = 9/4; and
E (XY) = E (X³) = 0 + 1/2 + 8/4 = 5/2, so Cov (X, Y) = E (XY) − E (X) E (Y) = 5/2 − 3/2 = 1.
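These moments are easy to check by enumeration (a Python sketch, not part of the notes; it uses the shortcut formulas Var(X) = E(X²) − (E X)² and Cov(X, Y) = E(XY) − E(X)E(Y)):

```python
from fractions import Fraction

# X takes 0, 1, 2 with probabilities 1/4, 1/2, 1/4, and Y = X^2.
pmf = {0: Fraction(1, 4), 1: Fraction(1, 2), 2: Fraction(1, 4)}
E = lambda g: sum(g(x) * p for x, p in pmf.items())   # E(g(X)) via LOTUS

ex, ey = E(lambda x: x), E(lambda x: x * x)           # E(X), E(Y)
var_x = E(lambda x: x * x) - ex ** 2
var_y = E(lambda x: x ** 4) - ey ** 2
cov = E(lambda x: x ** 3) - ex * ey                   # E(XY) - E(X)E(Y)
assert (var_x, var_y, cov) == (Fraction(1, 2), Fraction(9, 4), Fraction(1))
```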
Finally, we can now also say something about the variance of sums of random variables:
Corollary 8.29 (Variance of a Sum) For any real-valued random variables X and Y,
Var (X + Y) = Var (X) + Var (Y) + 2 Cov (X, Y).
So, in general, the variance of a sum is not equal to the sum of the variances, unless all covariances are
zero (zero covariance occurs under independence, covered later).
Example 8.30 For three real-valued random variables X, Y, and Z,
Var (X + Y + Z) = Var (X) + Var (Y) + Var (Z) + 2 (Cov (X, Y) + Cov (X, Z) + Cov (Y, Z)).
Example 8.31 Suppose X takes values 0, 1, 2 with probabilities 1/4, 1/2, 1/4. Let Y :“ X 2 . We
already found earlier that Var (X) “ 1{2, Var (Y ) “ 9{4, and Cov (X, Y ) “ 1. Now also find
Var (X ` Y ) and Var (X ´ Y ) .
By the above, Var (X ` Y ) “ Var (X) ` Var (Y ) ` 2Cov (X, Y ) “ 1{2 ` 9{4 ` 2 ˆ 1 “ 19{4.
(Note that in this very simple example we can easily calculate Var (X ` Y ) directly, as X ` Y takes
possible values 0, 2, 6 with probabilities 1/4, 1/2, 1/4 so we can confirm by direct calculation that
Var (X ` Y ) “ 19{4.)
Next, note that Var (X − Y) = Var (X + Z), where Z = −Y. As Var (Z) = (−1)² Var (Y) = Var (Y)
and Cov (X, Z) = Cov (X, −Y) = −Cov (X, Y), we have
Var (X − Y) = Var (X) + Var (Y) − 2 Cov (X, Y) = 1/2 + 9/4 − 2 × 1 = 3/4.
E (X | A) := E (X 1A) / P (A) whenever P (A) > 0.
Theorem 8.34 For any discrete random variable X and any event A ⊆ Ω,
E (X | A) = Σ_{x∈X(Ω)} x P (X = x | A).
Proof This is an exercise in tracking definitions. Indeed, if Y = X 1A, then for any x ∈ X(Ω) with
x ≠ 0,
P (Y = x) = P ({X = x} ∩ A) = P (X = x | A) P (A).
Hence
E (Y) = Σ_{x∈X(Ω)} x P (Y = x) = Σ_{x∈X(Ω)} x P (X = x | A) P (A),
and dividing by P (A) gives the result.
Example 8.35 Suppose that X and Y are discrete random variables, g : X(Ω) → R, and B ⊆ Y(Ω).
Then choosing the event A = {Y ∈ B}, we see that whenever P (Y ∈ B) = Σ_{y∈B} pY (y) > 0,
E (g(X) | Y ∈ B) = (Σ_{x∈X(Ω)} g(x) Σ_{y∈B} pX,Y (x, y)) / (Σ_{y∈B} pY (y)).
Theorem 8.36 (Partition Theorem for Expectation) Let X be any real-valued random
variable. Let E1, E2, . . . , Ek be any events that form a partition. Then,
E (X) = Σ_{i=1}^{k} E (X | Ei) P (Ei).
Proof Because E1, . . . , Ek constitute a partition, ∪_{i=1}^{k} Ei = Ω and the Ei are pairwise disjoint, so
that
1 = 1Ω = 1_{∪_{i=1}^{k} Ei} = Σ_{i=1}^{k} 1Ei,
and hence, by linearity of expectation,
E (X) = E (X Σ_{i=1}^{k} 1Ei) = Σ_{i=1}^{k} E (X 1Ei) = Σ_{i=1}^{k} E (X | Ei) P (Ei),
where the last step uses the definition of E (X | Ei).
Example 8.37 Ann and Bob play a sequence of independent games, each with outcomes A “ {Ann wins},
B “ {Bob wins}, D “ {game drawn} which have probabilities P (A) “ p, P (B) “ q and P (D) “ r ą 0
with p ` q ` r “ 1. Let N denote the number of games played until the first win by either Ann
or Bob. The results of the first game A, B, D form a partition and E (N | A) “ E (N | B) “ 1 (as
P (N “ 1 | A) “ P (N “ 1 | B) “ 1) while E (N | D) “ 1`E (N ) (the future after a drawn game looks
the same as at the start). Hence E (N ) “ 1 ˆ p ` 1 ˆ q ` p1 ` E (N ) q ˆ r or p1 ´ rqE (N ) “ p ` q ` r “ 1
i.e. E (N ) “ 1{p1 ´ rq.
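A simulation sketch (ours, not from the notes; the probabilities p = 0.3, q = 0.2, r = 0.5 are illustrative) agrees with E(N) = 1/(1 − r):

```python
import random

# Simulate Ann and Bob's games: each game is drawn with probability r,
# and N counts games until the first win by either player.
random.seed(3)
p, q, r = 0.3, 0.2, 0.5
trials = 100_000
total = 0
for _ in range(trials):
    n = 1
    while random.random() < r:   # drawn game: play again
        n += 1
    total += n
assert abs(total / trials - 1 / (1 - r)) < 0.05   # E(N) = 1/(1 - r) = 2
```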
In other words, in the case where Y is discrete, E (X | Y ) is the random variable that takes values
E (X | Y “ y) with probabilities P (Y “ y) .
Corollary 8.39 (Partition Theorem for Expectation II) For any real-valued random
variable X and any discrete random variable Y,
E (X) “ E (E (X | Y ) ) .
This is sometimes called the law of iterated expectation. We can use this result to calculate E (X) in
some tricky cases.
Example 8.40 You flip three fair coins and obtain H heads (so H ~ Bin(3, 1/2), and E (H) = 3/2).
Then you roll H fair dice to get total score T. The events {H = 0}, {H = 1}, {H = 2}, and {H = 3}
form a partition, and E (T | H = h) = 7h/2, since with h rolls of a fair die the expected total score is
(7/2)h. Hence E (T | H) = 7H/2 and so
E (T) = E (E (T | H)) = (7/2) E (H) = (7/2) × (3/2) = 21/4.
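Simulating the two-stage experiment (a Python sketch, not part of the notes) reproduces E(T) = 21/4:

```python
import random

# Flip 3 fair coins to get H heads, roll H fair dice, record total T;
# the law of iterated expectation gives E(T) = (7/2)(3/2) = 21/4.
random.seed(4)
trials = 200_000
total = 0
for _ in range(trials):
    h = sum(random.random() < 0.5 for _ in range(3))
    total += sum(random.randint(1, 6) for _ in range(h))
assert abs(total / trials - 21 / 4) < 0.05
```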
Exercise 178 (*) See problem sheet.
Remember that two discrete random variables are independent when their joint probability mass
function factorises, i.e. when p (x, y) “ p (x) p (y) . Similarly, two jointly continuously distributed
random variables are independent when their joint probability density function factorizes, i.e. when
f (x, y) “ f (x) f (y) . It turns out that in these cases, expectation factorizes as well:
Theorem 8.41 (Independence Means Multiply) If X and Y are independent real-valued
random variables then
E (XY) = E (X) E (Y).
More generally, for any mutually independent real-valued random variables X1, X2, . . . , Xn,
E (∏_{i=1}^{n} Xi) = ∏_{i=1}^{n} E (Xi).
Proof Suppose that X and Y are independent discrete random variables with joint probability mass
function p (x, y) = pX (x) pY (y). Then by the Law of the Unconscious Statistician,
E (XY) = Σ_{x∈X(Ω)} Σ_{y∈Y(Ω)} xy p (x, y) = (Σ_{x∈X(Ω)} x pX (x))(Σ_{y∈Y(Ω)} y pY (y)) = E (X) E (Y).
Now, if X and Y are independent, then so are any functions of X and Y . Consequently:
Corollary 8.42 If X and Y are independent random variables then for any functions g : X(Ω) → R
and h : Y(Ω) → R we have that
E (g(X) h(Y)) = E (g(X)) E (h(Y)).
Corollary 8.43 If X and Y are independent random variables, then Cov (X, Y ) “ 0.
The converse of the last property is not true—X and Y may be dependent but uncorrelated. All
that we can say is that X and Y will have a large positive correlation if, as X increases, Y also tends to
increase in a roughly linear way, and a large negative correlation if, as X increases, Y tends to decrease
in a roughly linear way.
Example 8.44 Suppose pX, Y q are jointly distributed taking values p´1, 0q, p`1, 0q, p0, ´1q, and
p0, `1q with probability 1{4 of each. Then X and Y are not independent, because P (X “ 1, Y “ 1) “ 0
is not the same as P (X “ 1) P (Y “ 1) “ 1{16, for instance. However, X and Y are uncorrelated,
because E (XY ) “ 0 and E (X) “ E (Y ) “ 0 so Cov (X, Y ) “ 0. (In fact, in this case XY “ 0 with
probability 1.)
An important consequence of Corollary 8.43 is a simplification of the formula for the variance of a
sum (Corollary 8.29) for pairwise independent random variables:
Corollary 8.45 Consider random variables X1 , X2 , . . . , Xn . If these random variables are pairwise
independent, then
Var (Σ_{i=1}^{n} Xi) = Σ_{i=1}^{n} Var (Xi).
Example 8.46 Remember from an earlier example that we can write any X ~ Bin(n, p) as X =
Σ_{i=1}^{n} Xi where X1, . . . , Xn are independent and each Xi ~ Bin(1, p). In Exercise 158, you showed that
Var (Xi) = p(1 − p). Consequently, as the Xi are independent,
Var (X) = Σ_{i=1}^{n} Var (Xi) = np(1 − p).
Example 8.47 Let S be the total score and T be the product of the scores from throwing four dice.
To find E (S), Var (S) and E (T), let the individual scores be Xk, k = 1, 2, 3, 4, so that S = Σ_{k=1}^{4} Xk
and T = ∏_{k=1}^{4} Xk. We readily calculate E (Xk) = Σ_{i=1}^{6} i × (1/6) = 7/2 and Var (Xk) = Σ_{i=1}^{6} i² × (1/6) −
(7/2)² = 35/12. Therefore
E (S) = E (Σ_{k=1}^{4} Xk) = Σ_{k=1}^{4} E (Xk) = 4 × 7/2 = 14,
and, as the Xk are independent, Var (S) = Σ_{k=1}^{4} Var (Xk) = 4 × 35/12 = 35/3 and E (T) =
∏_{k=1}^{4} E (Xk) = (7/2)⁴ = 2401/16.
Note that all of these calculations are possible without having to deal with the joint probability distri-
bution of the Xk (which is uniform on the 1296 possible outcomes).
Theorem 8.48 (Monotonicity of Expectation) For any random variable X, and any a ∈ R, if
X ≥ a then E (X) ≥ a.
From this one obtains Markov's inequality: for any non-negative random variable X and any a > 0,
P (X ≥ a) ≤ E (X) / a.
Example 8.51 If X equals s with chance p but otherwise equals 0, then E (X) = sp. For a > s we
have 0 = P (X ≥ a) ≤ ps/a, while for a ≤ s Markov's inequality says p = P (X ≥ a) ≤ p × s/a, which
is exact at a = s; so this bound is as strong as possible.
Corollary 8.52 (Chebyshev's Inequality) For any random variable X and any a > 0, we have
P (|X − E (X)| ≥ a) ≤ Var (X) / a².
Proof Apply Markov's inequality to the non-negative random variable (X − E (X))² at the value a².
Now observe that {(X − E (X))² ≥ a²} = {|X − E (X)| ≥ a} and of course
E ((X − E (X))²) = Var (X).
Chebyshev's bound is often too generous when more detailed distributional information is available,
as seen in the next example. Nevertheless, its generality and simplicity make the inequality very
valuable for complex probability calculations.
Example 8.53 If Z „ Np0, 1q then P (|Z ´ E (Z) | ě 2) “ P (|Z| ě 2) “ 1´P (´2 ă Z ă 2) “ 0.046,
while the Chebyshev bound on this probability is Var (Z) {22 “ 0.25.
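The comparison can be reproduced with the standard normal cdf written via the error function (a Python sketch, not part of the notes):

```python
import math

# Exact two-sided normal tail P(|Z| >= 2), compared with the
# Chebyshev bound Var(Z)/2^2 = 0.25.
phi = lambda x: 0.5 * (1 + math.erf(x / math.sqrt(2)))  # standard normal cdf
tail = 2 * (1 - phi(2))
assert abs(tail - 0.0455) < 0.001
assert tail < 0.25   # Chebyshev is valid here, but far from tight
```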
Bibliographical notes
There are approaches to probability theory that start out from expectation of random variables directly,
rather than starting out from probability of events as we have done here; see e.g. [Whi92].
Pafnuty Chebyshev (1821–1894) and his student Andrei Markov (1856–1922) made several impor-
tant contributions to early probability theory. What we call Markov’s inequality was actually published
by Chebyshev, as was what we call Chebyshev’s inequality; our nomenclature is standard, and at least
has the benefit of distinguishing the two. A version of the inequality was first formulated by Irénée-Jules
Bienaymé (1796–1878).
Chapter 9
Limit theorems
Goals: Understand and know how to prove the weak law of large numbers, for proportions
as well as for general random variables, and know under what conditions the weak law applies.
Understand and know how to prove (by means of moment generating functions) the central limit
theorem, and know under what conditions it applies. Know how to exploit the central limit
theorem to approximate the binomial distribution, and under what circumstances. Know the
definition and properties of moment generating functions, and know how to derive the moment
generating function of a given distribution.
The results of this section describe limiting properties of distributions of sums of random variables
using only some assumptions about means and variances.
Toss a coin n times, where the probability of heads is p, independently on each toss. Let X be
the number of heads and let Bn :“ X{n be the proportion of heads in the n tosses. Then, because
X „ Binpn, pq has expectation np and variance npp1 ´ pq,
so that E (Bn) = p and Var (Bn) = p(1 − p)/n, Chebyshev's inequality gives, for any ε > 0,
P (|Bn − p| ≥ ε) ≤ p(1 − p) / (nε²).
Whence,
P (|Bn − p| ≥ ε) → 0 as n → ∞,
no matter how small ε is. In other words, as n → ∞, the sample proportion is with very high probability
within any tiny interval centred on p.
The same argument applies more generally.
Theorem 9.1 (Weak Law of Large Numbers) Suppose we have an infinite sequence X1,
X2, . . . of independent random variables with the same mean and variance: E (Xi) = µ and
Var (Xi) = σ² < ∞ for all i. Let X̄n := (1/n) Σ_{i=1}^{n} Xi denote the sample average of the first n
variables. Then, for every ε > 0,
P (|X̄n − µ| ≥ ε) → 0 as n → ∞. (9.1)
In other words, the sample average has a very high probability of being very near the expected value
µ when n is large.
The convergence in (9.1) is called convergence in probability; the law of large numbers says that
'X̄n → µ in probability'.
Example 9.2 Measure the heights $H_i$ of $n$ randomly selected people from a very large population, where $\mu, \sigma^2$ are the average and the variance of heights over the whole population. Then $E(H_i) = \mu$ and $\mathrm{Var}(H_i) = \sigma^2$, and so, as long as the collection of heights is not too asymmetric, the chance of the average height $\bar{X}_n$ being more than a small amount $\varepsilon$ from $\mu$ is very small; i.e., almost all large samples have average near $\mu$.
If the law of large numbers is a 'first order' result, a 'second order' result is the famous central limit theorem, which describes fluctuations around the law of large numbers. A sequence of random variables $X_1, X_2, \dots$ is independent and identically distributed (i.i.d. for short) if the variables are mutually independent and all have the same (marginal) distribution.
Theorem 9.3 (Central Limit Theorem) Suppose we have a sequence $X_1, X_2, \dots$ of i.i.d. random variables with common mean $\mu$ and common variance $\sigma^2 \in (0, \infty)$. Let $S_n := X_1 + \cdots + X_n$ and
\[ Z_n := \frac{S_n - n\mu}{\sigma\sqrt{n}} = \frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}}. \]
Then, for every $z \in \mathbb{R}$,
\[ \lim_{n \to \infty} F_{Z_n}(z) = \lim_{n \to \infty} P(Z_n \le z) = \Phi(z). \]
Consequently, if we also invoke Theorem 6.31, we approximately have that $S_n \sim \mathrm{N}(n\mu, n\sigma^2)$ and $\bar{X}_n \sim \mathrm{N}(\mu, \sigma^2/n)$; i.e., typical values of $S_n$ are of order $\sigma\sqrt{n}$ from $n\mu$, while typical values of $\bar{X}_n$ are of order $\sigma/\sqrt{n}$ from $\mu$. In other words, for large enough $n$, by Corollary 6.32,
\[ P(S_n \le x) \approx \Phi\Big(\frac{x - n\mu}{\sigma\sqrt{n}}\Big), \qquad P(\bar{X}_n \le x) \approx \Phi\Big(\frac{x - \mu}{\sigma/\sqrt{n}}\Big). \]
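To see the theorem in action, the following sketch (an illustrative simulation; the function names are my own) standardizes sums of i.i.d. $\mathrm{U}(0,1)$ variables and compares the empirical distribution of $Z_n$ with $\Phi$.

```python
import math
import random

def phi(x):
    """Standard normal CDF, via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

rng = random.Random(1)
n, trials = 30, 20000
mu, sigma = 0.5, math.sqrt(1.0 / 12.0)   # mean and sd of U(0, 1)

def standardized_sum():
    """One realization of Z_n = (S_n - n*mu) / (sigma * sqrt(n))."""
    s = sum(rng.random() for _ in range(n))
    return (s - n * mu) / (sigma * math.sqrt(n))

zs = [standardized_sum() for _ in range(trials)]
for z in (-1.0, 0.0, 1.0):
    empirical = sum(v <= z for v in zs) / trials
    print(z, round(empirical, 3), round(phi(z), 3))
```

Even at $n = 30$ the empirical CDF of $Z_n$ is very close to $\Phi$.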
The conditions of the central limit theorem can be substantially weakened. In fact, the random variables $X_1, X_2, \dots$ need be neither identically distributed nor independent: it is sufficient that we can turn them into a normalized martingale. Without going into too much detail, it suffices to assume that
\[ E(X_{n+1} \mid X_1 = x_1, \dots, X_n = x_n) = E(X_{n+1}), \]
\[ \mathrm{Var}(X_{n+1} \mid X_1 = x_1, \dots, X_n = x_n) = \mathrm{Var}(X_{n+1}) > 0, \]
for all $n$ and all possible values of $x_1, \dots, x_n$. In this case, the sequence of random variables
\[ Z_n := \frac{\sum_{k=1}^{n} \big(X_k - E(X_k)\big)}{\sqrt{\sum_{k=1}^{n} \mathrm{Var}(X_k)}} \]
converges in distribution to a $\mathrm{N}(0, 1)$ random variable whenever the Lindeberg condition is satisfied. The Lindeberg condition is a bit complicated to write down, but it is satisfied, for example, when there is a constant $\alpha$ such that $|X_n| \le \alpha$ for all $n$.
For further details, see for instance [Nel87, Chapters 14 & 18].
Example 9.4 Potatoes with an average weight of 100g and standard deviation of 40g are packed into bags to contain at least 2500g. What is the chance that more than 30 potatoes will be needed to fill a given bag? Let $N$ be the number needed to exceed 2500g and $W$ be the total weight of 30 potatoes. Then $\{N > 30\} = \{W < 2500\}$, and so
\[ P(N > 30) = P(W < 2500) \approx \Phi\Big(\frac{2500 - 3000}{40\sqrt{30}}\Big) = \Phi(-2.282) = 0.011. \]
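The numbers in Example 9.4 can be checked directly; in this sketch $\Phi$ is implemented via the standard error function, and the variable names are illustrative.

```python
import math

def phi(x):
    """Standard normal CDF Φ."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

mu, sd, n_potatoes, target = 100.0, 40.0, 30, 2500.0
z = (target - n_potatoes * mu) / (sd * math.sqrt(n_potatoes))
prob = phi(z)                       # P(W < 2500) ≈ P(N > 30)
print(round(z, 3), round(prob, 3))  # z ≈ -2.282, prob ≈ 0.011
```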
Remember that we can write any binomially distributed random variable as a sum of independent
Bernoulli random variables. Consequently:
Corollary 9.5 (Normal Approximation to the Binomial Distribution) Let $X \sim \mathrm{Bin}(n, p)$ with $0 < p < 1$, $B_n := X/n$, and
\[ Z_n := \frac{X - np}{\sqrt{np(1-p)}} = \frac{B_n - p}{\sqrt{p(1-p)/n}}. \]
Then
\[ \lim_{n \to \infty} F_{Z_n}(z) = \lim_{n \to \infty} P(Z_n \le z) = \Phi(z). \]
This means that, when $X \sim \mathrm{Bin}(n, p)$ with $0 < p < 1$ and $n$ is large enough, then for any $x \in \mathbb{R}$ we approximately have
\[ P(X \le x) \approx \Phi\Big(\frac{x - np}{\sqrt{np(1-p)}}\Big). \]
For moderate $n$, a continuity correction improves the approximation; e.g., for $k \in \mathbb{N}$:
\[ P(X \le k) = P(X \le k + 0.5) \approx \Phi\Big(\frac{k + 0.5 - \mu}{\sigma}\Big), \]
\[ P(k \le X) = P(k - 0.5 \le X) \approx 1 - \Phi\Big(\frac{k - 0.5 - \mu}{\sigma}\Big), \]
where $\mu = np$ and $\sigma = \sqrt{np(1-p)}$.
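The effect of the correction can be checked against the exact binomial CDF; here is a small sketch (the parameters are chosen arbitrarily for illustration).

```python
import math

def phi(x):
    """Standard normal CDF Φ."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def binom_cdf(k, n, p):
    """Exact P(X <= k) for X ~ Bin(n, p)."""
    return sum(math.comb(n, j) * p**j * (1 - p)**(n - j)
               for j in range(k + 1))

n, p, k = 50, 0.4, 22
mu, sigma = n * p, math.sqrt(n * p * (1 - p))
exact = binom_cdf(k, n, p)
plain = phi((k - mu) / sigma)             # no correction
corrected = phi((k + 0.5 - mu) / sigma)   # with continuity correction
print(round(exact, 4), round(plain, 4), round(corrected, 4))
```

The corrected value lies noticeably closer to the exact probability than the uncorrected one.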
Example 9.6 A plane has 110 seats and $n$ business people book seats. They show up independently for their flight with chance $q = 0.85$. Let $X_n$ be the number that arrive to take the flight (the others just take a different flight); find $P(X_n > 110)$ for $n = 110, 120, 130, 140$. Using the CLT for binomials, $E(X_n) = 0.85n$ and $\mathrm{SD}(X_n) = \sqrt{nq(1-q)} = 0.3571\sqrt{n}$, and so
\[ P(X_n > 110) \approx 1 - \Phi\Big(\frac{110.5 - 0.85n}{0.3571\sqrt{n}}\Big), \]
which takes the values 0.000, 0.015, 0.500 and 0.978 for $n = 110, 120, 130$ and $140$.
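Example 9.6 can be reproduced as follows (a direct transcription of the calculation; the function names are mine).

```python
import math

def phi(x):
    """Standard normal CDF Φ."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def overbook_prob(n, q=0.85, seats=110):
    """Normal approximation (with continuity correction) to P(X_n > seats)."""
    mu = n * q
    sd = math.sqrt(n * q * (1 - q))
    return 1.0 - phi((seats + 0.5 - mu) / sd)

for n in (110, 120, 130, 140):
    print(n, round(overbook_prob(n), 3))  # 0.0, 0.015, 0.5, 0.978
```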
Definition 9.7 (Moment Generating Function) For any real-valued random variable $X$, the function $M_X : \mathbb{R} \to \mathbb{R} \cup \{+\infty\}$ given by
\[ M_X(t) := E\big(e^{tX}\big) \]
is called the moment generating function of $X$.
Because $e^{tX} \ge 0$, we have that $M_X(t) \ge 0$. From Theorem 8.7, we immediately derive the following expressions for $M_X(t)$:
\[ M_X(t) = \sum_{x \in X(\Omega)} e^{tx}\, p(x) \quad \text{if $X$ is discrete, and} \]
\[ M_X(t) = \int_{-\infty}^{\infty} e^{tx} f(x)\, dx \quad \text{if $X$ is continuously distributed.} \]
The above sum and integral always exist, but can be $+\infty$.
Example 9.8 If $X \sim \mathrm{Bin}(1, p)$ then $M_X(t) = pe^t + (1 - p)$.
Example 9.10 If $U \sim \mathrm{U}(a, b)$ then $M_U(t) = \dfrac{e^{bt} - e^{at}}{(b - a)t}$ for $t \ne 0$, and $M_U(0) = 1$.
Example 9.11 If $Z \sim \mathrm{N}(0, 1)$ then $M_Z(t) = e^{t^2/2}$.
The moment generating function has several useful properties. The property that gives it its name is revealed by considering the Taylor series for $e^{tX}$:
\[ M_X(t) = E\big(e^{tX}\big) = E\Big(1 + tX + \frac{t^2 X^2}{2!} + \frac{t^3 X^3}{3!} + \cdots\Big) = 1 + t\,E(X) + \frac{t^2}{2!} E\big(X^2\big) + \frac{t^3}{3!} E\big(X^3\big) + \cdots. \]
(Some work is needed to justify this last step, which we omit.) This gives the first of our properties.
M1 (Generates moments.) For every $k \in \mathbb{N}$,
\[ E\big(X^k\big) = \frac{d^k M_X}{dt^k}(0). \]
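Property M1 can be checked numerically for a case where the moments are known, e.g. $X \sim \mathrm{Bin}(1, p)$ with $M_X(t) = pe^t + (1-p)$ from Example 9.8. The finite-difference helpers below are illustrative, not part of the notes.

```python
import math

def mgf_bernoulli(t, p):
    """M_X(t) = E(e^{tX}) = p e^t + (1 - p) for X ~ Bin(1, p)."""
    return p * math.exp(t) + (1.0 - p)

def d1(f, x, h=1e-5):
    """Central-difference approximation to f'(x)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

def d2(f, x, h=1e-4):
    """Central-difference approximation to f''(x)."""
    return (f(x + h) - 2.0 * f(x) + f(x - h)) / (h * h)

p = 0.3
m = lambda t: mgf_bernoulli(t, p)
ex = d1(m, 0.0)    # ≈ E(X) = p
ex2 = d2(m, 0.0)   # ≈ E(X^2) = p, since X^2 = X for a Bernoulli variable
print(round(ex, 6), round(ex2, 6), round(ex2 - ex**2, 6))  # last ≈ Var(X) = p(1-p)
```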
M2 (Moment generating function determines distribution.) Consider any two random variables $X$ and $Y$. If there is an $h > 0$ such that
\[ M_X(t) = M_Y(t) < +\infty \quad \text{for all } t \in (-h, h), \]
then
\[ F_X(x) = F_Y(x) \quad \text{for all } x \in \mathbb{R}. \]
Conversely, if $F_X(x) = F_Y(x)$ for all $x \in \mathbb{R}$ then $M_X(t) = M_Y(t)$ for all $t \in \mathbb{R}$.
M3 (Linear transformations.) For any constants $a, b \in \mathbb{R}$,
\[ M_{aX + b}(t) = e^{bt} M_X(at). \]
M4 (Moment generating function of an independent sum.) If $X_1, \dots, X_k$ are independent, then
\[ M_{X_1 + \cdots + X_k}(t) = \prod_{i=1}^{k} M_{X_i}(t). \]
M5 (Convergence.) If $M_{X_n}(t) \to M_X(t)$ for all $t$ in an open interval around zero on which $M_X$ is finite, then $F_{X_n}(x) \to F_X(x)$ at every $x$ where $F_X$ is continuous.
Regarding M5 (convergence), one can show that the cumulative distribution function $F_X$ of any random variable $X$ has at most a countable number of points where $F_X$ is not continuous. So, provided that $M_X$ is finite in an open interval around zero, $F_{X_n}$ converges to $F_X$ on all but an at most countable number of points. If, additionally, $X$ is continuously distributed, then $F_X$ is continuous everywhere, so in that case $F_{X_n}$ converges to $F_X$ everywhere.
Example 9.13 Suppose $X \sim \mathrm{Po}(\lambda)$, so that $M_X(t) = \exp\big(\lambda(e^t - 1)\big)$. Then, by M1,
\[ E(X) = M_X'(0) = \lambda e^0 \exp\big(\lambda(e^0 - 1)\big) = \lambda. \]
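Both the formula $M_X(t) = \exp(\lambda(e^t - 1))$ and the conclusion $E(X) = \lambda$ can be checked by simulation; the sampler below uses Knuth's classical product-of-uniforms method (an illustrative sketch, not part of the notes).

```python
import math
import random

def sample_poisson(lam, rng):
    """Knuth's method: count uniforms until their product drops below e^{-lam}."""
    limit = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

rng = random.Random(42)
lam, t, trials = 2.0, 0.5, 50_000
xs = [sample_poisson(lam, rng) for _ in range(trials)]
mean_est = sum(xs) / trials                          # should be ≈ lambda
mgf_est = sum(math.exp(t * x) for x in xs) / trials  # empirical E(e^{tX})
mgf_formula = math.exp(lam * (math.exp(t) - 1.0))
print(round(mean_est, 3), round(mgf_est, 3), round(mgf_formula, 3))
```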
Example 9.14 For $Z \sim \mathrm{N}(0, 1)$ and constants $\mu \in \mathbb{R}$, $\sigma > 0$, by M3 and Example 9.11,
\[ M_{\mu + \sigma Z}(t) = e^{\mu t} M_Z(\sigma t) = \exp\big(\mu t + \tfrac{1}{2}\sigma^2 t^2\big). \]
We already know that $\mu + \sigma Z \sim \mathrm{N}(\mu, \sigma^2)$ (see Theorem 6.31), so the above expression gives us the moment generating function of the normal distribution with mean $\mu$ and variance $\sigma^2$. Additionally, by M2, it follows that $X \sim \mathrm{N}(\mu, \sigma^2)$ if and only if $M_X(t) = \exp\big(\mu t + \tfrac{1}{2}\sigma^2 t^2\big)$.
Example 9.15 Suppose that $X_1, \dots, X_k$ are independent random variables with $X_i \sim \mathrm{N}(\mu_i, \sigma_i^2)$. If $Y = \sum_{i=1}^{k} X_i$, then, by M4, the moment generating function of $Y$ is
\[ M_Y(t) = \prod_{i=1}^{k} \exp\big(\mu_i t + \tfrac{1}{2}\sigma_i^2 t^2\big) = \exp\big(\mu t + \tfrac{1}{2}\sigma^2 t^2\big), \]
where $\mu = \sum_{i=1}^{k} \mu_i$ and $\sigma^2 = \sum_{i=1}^{k} \sigma_i^2$. Thus, by M2, it must be that $Y \sim \mathrm{N}(\mu, \sigma^2)$.
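The conclusion of Example 9.15 can be checked empirically: sample the sum and compare its mean and variance with $\sum_i \mu_i$ and $\sum_i \sigma_i^2$. A small illustrative simulation (parameters chosen arbitrarily):

```python
import random

rng = random.Random(7)
mus = [1.0, -2.0, 3.0]   # component means; sum = 2.0
sds = [1.0, 0.5, 2.0]    # component standard deviations; sum of variances = 5.25

def sample_sum():
    """One draw of Y = X_1 + X_2 + X_3 with X_i ~ N(mu_i, sd_i^2)."""
    return sum(rng.gauss(m, s) for m, s in zip(mus, sds))

n = 50_000
ys = [sample_sum() for _ in range(n)]
mean = sum(ys) / n
var = sum((y - mean) ** 2 for y in ys) / n
print(round(mean, 3), round(var, 3))   # theory: mean 2.0, variance 5.25
```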
Bibliographical notes
The law of large numbers and the central limit theorem have long and interesting histories. The weak
law of large numbers for binomial (i.e. sums of Bernoulli) variables was first established by Jacob
Bernoulli (1654–1705) and published in 1713 [Ber13]. The name ‘law of large numbers’ was given by
Poisson. The modern version is due to Aleksandr Khinchin (1894–1959), and our Theorem 9.1 is only
a special case—the assumption on variances is unnecessary.
It was apparent to mathematicians in the mid 1700s that a more refined result than Bernoulli’s law
of large numbers could be obtained. A special case of the central limit theorem for binomial (i.e. sums
of Bernoulli) variables was first established by de Moivre in 1733, and extended by Laplace; hence the
normal approximation to the binomial (Corollary 9.5) is sometimes known as the de Moivre–Laplace
theorem. The name ‘central limit theorem’ was given by George Pólya (1887–1985) in 1920.
The first modern proof of the central limit theorem was given by Aleksandr Lyapunov (1857–1918)
around 1901 [Lya]. Lyapunov’s assumptions were relaxed by Jarl Waldemar Lindeberg (1876–1932)
in 1922 [Lin22]. Many different versions of the central limit theorem were subsequently proved. The
subject of Alan Turing’s (1912–1954) Cambridge University Fellowship Dissertation of 1934 was a
version of the central limit theorem similar to Lindeberg’s; Turing was unaware of the latter’s work.
Bibliography
[BA92] Roy Billinton and Ronald N. Allan. Reliability Evaluation of Engineering Systems: Concepts
and Techniques. Plenum Press, 2nd edition, 1992.
[Ber13] Jacob Bernoulli. Ars Conjectandi. Thurnisiorum, Basileæ, 1713.
[BH15] Joseph K. Blitzstein and Jessica Hwang. Introduction to Probability. Texts in Statistical Science Series. CRC Press, Boca Raton, FL, 2015.
[Boo54] George Boole. An Investigation of the Laws of Thought: on which are Founded the
Mathematical Theories of Logic and Probabilities. Walton and Maberly, London, 1854. Avail-
able at the internet text archive.
[dM56] Abraham de Moivre. The Doctrine of Chances: or, a Method for Calculating the Probabilities
of Events in Play. A. Millar, London, third edition, 1756. Available at the internet text archive.
[DS02] Morris H. DeGroot and Mark J. Schervish. Probability and Statistics. Addison Wesley, New
York, 3rd edition, 2002.
[Hac75] Ian Hacking. The Emergence of Probability: A Philosophical Study of Early Ideas about
Probability, Induction and Statistical Inference. Cambridge University Press, 1975.
[Háj12] Alan Hájek. Interpretations of probability. In Edward N. Zalta, editor, The Stanford
Encyclopedia of Philosophy. Winter 2012 edition, 2012. URL: http://plato.stanford.
edu/archives/win2012/entries/probability-interpret/.
[Kol50] Andrei N. Kolmogorov. Foundations of the Theory of Probability. Chelsea Publishing Com-
pany, New York, 1950.
[Lap25] Pierre Simon Laplace. Essai Philosophique sur les Probabilitiés. Bachelier, Paris, 1825.
Cinquième édition.
[Lin22] J. W. Lindeberg. Eine neue Herleitung des Exponentialgesetzes in der Wahrscheinlichkeit-
srechnung. Mathematische Zeitschrift, 15(1):211–225, 1922. doi:10.1007/BF01494395.
[Lya] Lyapunov theorem. Encyclopedia of Mathematics. URL: http://www.encyclopediaofmath.
org/index.php?title=Lyapunov_theorem&oldid=25958.
[Nel87] Edward Nelson. Radically Elementary Probability Theory. Annals of Mathematical Studies.
Princeton University Press, New Jersey, 1987. URL: https://web.math.princeton.edu/
~nelson/books/rept.pdf.
[Ros07] Jeff Rosenthal. A First Look at Rigorous Probability Theory. World Scientific, New York,
second edition, 2007.
[SHE07] Saturnino Salas, Einar Hille, and Garret Etgen. Calculus: One and Several Variables. Wiley,
10th edition, 2007.
[Sti03] David Stirzaker. Elementary Probability. Cambridge University Press, Cambridge, second edition, 2003. doi:10.1017/CBO9780511755309.
[Ven88] John Venn. The Logic of Chance: An Essay on the Foundations and Province of the Theory of
Probability, with Especial Reference to Its Application to Moral and Social Science. Macmil-
lan, London, third edition, 1888. Available at the internet text archive.
[Whi01] William A. Whitworth. Choice and Chance. Deighton Bell, Cambridge, third edition, 1901. Available at the internet text archive.
[Whi92] Peter Whittle. Probability via Expectation. Springer, New York, third edition, 1992.