Probability I Course
Andrew Wade
Epiphany 2017–2018
Course Details
Lecturer Dr Andrew Wade.
Office CM302.
Email [email protected].
Lectures There are lectures at 10:00 Tuesday, 10:00 Thursday and 15:00 Friday, all taking
place in CLC013. Each week, a brief outline of the topics to be covered in the upcoming week's
lectures will be made available on DUO, to help guide your independent study.
Tutorials take place every two weeks, starting in week 12.
The purpose of tutorials is for students to ask questions about the exercises and/or
to do exercises under supervision. These meetings are as important as the lectures, so take full
advantage of them. A record of attendance is kept.
Problem Classes take place in the Friday lecture slot every two weeks, starting in week 13.
The purpose of problem classes is to go over exercises, on the board, in detail.
DUO will contain all supporting material for the course, including resources such as summary notes,
handouts, problem sheets, and solutions. Also you will find the schedule of problems set for
homework and tutorials, information about tutors and tutor groups, and announcements (which
are made by email as well).
Homework is set each Thursday, to hand in to your tutor at a time agreed with them (generally on
the Friday of the following week). Your tutor will return your marked work to you.
About these lecture notes
These boxes delimit advanced material for students who wish to dig a little deeper; some of the
content of these boxes may show up in an exam, but only as ‘unseen’ material.
Contents
4 Interpretations of probability 25
4.1 Equally likely outcomes interpretation . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
4.2 Relative frequency interpretation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
4.3 Betting interpretation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.4 Interpretation and the axioms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
Bibliographical notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
6 Random variables 30
6.1 Definition and notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
6.2 Discrete random variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
6.3 The binomial distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
6.4 The Poisson distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
6.5 Continuous random variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
6.6 The uniform distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
6.7 The exponential distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
6.8 The normal distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
6.9 Cumulative distribution functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
6.10 Standard normal tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
6.11 Functions of random variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
Bibliographical notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
8 Expectation 50
8.1 Definition and interpretation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
8.2 Expectation of functions of random variables . . . . . . . . . . . . . . . . . . . . . . . 52
8.3 Linearity of expectation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
8.4 Variance and covariance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
8.5 Conditional expectation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
8.6 Independence: multiplication rule for expectation . . . . . . . . . . . . . . . . . . . . . 61
8.7 Expectation and probability inequalities . . . . . . . . . . . . . . . . . . . . . . . . . . 63
Bibliographical notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
9 Limit theorems 65
9.1 The weak law of large numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
9.2 The central limit theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
9.3 Moment generating functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
Bibliographical notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
Chapter 1
1.1 Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2 Sample space and events . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.3 Event calculus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.4 The axioms of probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.5 Consequences of the axioms . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.6 Sigma-algebras . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
Bibliographical notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
Goals: Understand elementary set theory and how to use it to formulate probabilistic scenarios
and to describe the calculus of events. Be familiar with the axioms of probability and their
consequences, and how these properties may be deduced from the axioms.
In this chapter, we lay the foundations of probability calculus, and establish the main techniques for
practical calculations with probabilities. The mathematical theory of probability is based on axioms,
like Euclidean geometry. In classical geometry, the fundamental objects posited by the axioms are
points and lines; in probability, they are events and their probabilities. When real world applications
demand interpretations of these mathematical objects, there are several different approaches, which
we mention later in this course; for now, we simply make the important point that the mathematical
theory does not depend on any real-world interpretation that we may have in mind.
It is very important to get a good familiarity with the axioms and properties of probability, as we
will rely on them very extensively for the remainder of this course. An excellent understanding of the
foundations laid in this chapter is essential to obtain an understanding of all further chapters.
1.1 Sets
Also see [BH15, Appendix A.1], [DS02, Section 1.4], [Sti03, Appendix to Chapter 0], or [SHE07, Section 1.2
(Sets)].
In probability, we make extensive use of sets, therefore we first spend some time on them. In essence,
a set is an unordered collection of distinguishable objects; these objects can be numbers, functions,
other sets, and so on—any mathematical object can belong to a set.
The formal notation for a set is an opening curly bracket, followed by a list of elements that belong
to the set, followed by a closing curly bracket.
Example 1.1 The set containing the elements 2, 4, and 5 is denoted by:
{2, 4, 5}.
Example 1.2 As mentioned already, the ordering of the elements is irrelevant. For example, {2, 4, 5}
and {4, 5, 2} denote the same set.
Definition 1.3 (Empty Set) The set with no elements is called the empty set, and is denoted by ∅:
∅ := {}.
Variables that represent sets are typically denoted by capital letters such as A, B, C, and so on.
Definition 1.4 (Subset) For two sets A and B, we say that A is a subset of B, and we write A ⊆ B
(or B ⊇ A), whenever every element that belongs to A also belongs to B.
Definition 1.7 (Power Set) The set consisting of all subsets of a set A is called the power set of A,
and is denoted by 2^A:
2^A := {B : B ⊆ A}.
For example, if A = {1, 2, 3}, then
2^A = {∅, {1}, {2}, {3}, {1, 2}, {1, 3}, {2, 3}, {1, 2, 3}}.
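As a quick illustration, the power set of a small finite set can be enumerated mechanically; a minimal Python sketch using itertools:

```python
from itertools import combinations

def power_set(a):
    """Return the power set of a finite set, as a set of frozensets."""
    elems = list(a)
    # Collect all subsets of every size 0, 1, ..., |a|.
    return {frozenset(c) for r in range(len(elems) + 1)
            for c in combinations(elems, r)}

subsets = power_set({1, 2, 3})
print(len(subsets))  # 8, matching |2^A| = 2^|A| = 2^3
```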
1.2 Sample space and events
We consider a scenario in which there are various possible outcomes, but we are unsure about which
outcome will occur. A sample space is a set of outcomes for this scenario such that one and only one
will occur. In this course, we will usually denote the sample space by Ω, and a generic outcome by
ω ∈ Ω.
Example 1.9 Consider rolling a standard six-sided die. The most obvious sample space is Ω =
{1, 2, 3, 4, 5, 6}, but if one were interested only in whether the score was odd or even, or a six or not,
one could use Ω = {odd, even}, or Ω = {not a 6, 6}.
Often, we may enumerate the elements of the sample space Ω in a finite or infinite list Ω =
{ω1, ω2, . . .}, in which case we say the set Ω is countable or discrete.
A set is said to be countable when its elements can be enumerated in a (possibly infinite) sequence.
Every finite set is countable, and so is the set of natural numbers N := {1, 2, 3, . . .}. The set of integers
Z is countable as well. The set of real numbers R is not countable, and neither is any interval [a, b]
when a < b.
More formally:
Definition 1.10 A set A is countable if either:
(i) A is finite, or
(ii) there is a bijection (one-to-one and onto mapping) between A and the set of natural numbers N.
One can prove that the set of rational numbers Q is countable.
Definition 1.11 (Event) An event A associated with the sample space Ω is simply a subset of Ω:
event A Ď Ω,
and we say that an event A occurs when the outcome that occurs at the end of the scenario is in the
set A.
The empty set ∅ represents the impossible event, i.e., it will never occur. The sample space Ω represents
the certain event, i.e., it will always occur. Most interesting events are somewhere in between.
The representation of an event as a set obviously depends on the choice of sample space Ω for the
specific scenario under study, as shown by Examples 1.12 and 1.13.
Example 1.12 For rolling a standard cubic die, with Ω “ {1, 2, 3, 4, 5, 6}, the event ‘throw odd number’
is the subset {1, 3, 5} consisting of three outcomes.
Example 1.13 For the same scenario, but with Ω “ {odd, even}, the event ‘throw odd number’ is the
subset {odd} consisting of just one outcome.
Definition 1.15 (Disjoint) We say that events A and B are disjoint, mutually exclusive, or
incompatible if A ∩ B = ∅, i.e., it is impossible for A and B both to occur.
Example 1.16 Consider the sample space Ω := {1, 2, 3, 4, 5, 6}, and the events
A := {2, 4, 6} (even),
B := {1, 3, 5} (odd),
C := {1, 2, 3} (at most 3).
Then
A ∪ B = Ω (even or odd),
A ∩ B = ∅ (even and odd),
A^c = B (not even),
C \ A = {1, 3} (at most 3 but not even),
A ∪ C = {1, 2, 3, 4, 6} (even or at most 3),
A ∩ C = {2} (even and at most 3).
The events A and B are disjoint as A ∩ B = ∅.
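Set-based reasoning like this can be mirrored directly with Python's built-in set operations; the following sketch simply re-checks the identities in Example 1.16:

```python
# Events from Example 1.16 as Python sets over one roll of a die.
omega = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}   # even
B = {1, 3, 5}   # odd
C = {1, 2, 3}   # at most 3

assert A | B == omega            # A ∪ B = Ω (even or odd)
assert A & B == set()            # A ∩ B = ∅ (disjoint)
assert omega - A == B            # A^c = B (not even)
assert C - A == {1, 3}           # C \ A (at most 3 but not even)
assert A | C == {1, 2, 3, 4, 6}  # even or at most 3
assert A & C == {2}              # even and at most 3
```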
Exercise 10 (*) See problem sheet.
As with sums (Σ) and products (Π) of multiple numbers, for unions and intersections of multiple
sets we also write:
⋃_{i=1}^{n} A_i := A_1 ∪ A_2 ∪ ··· ∪ A_n,
⋂_{i=1}^{n} A_i := A_1 ∩ A_2 ∩ ··· ∩ A_n.
Occasionally, we will also need to take infinite unions and intersections over sequences of sets:
⋃_{i=1}^{∞} A_i := A_1 ∪ A_2 ∪ A_3 ∪ ··· = {ω : ω ∈ A_i for at least one i ∈ N};
⋂_{i=1}^{∞} A_i := A_1 ∩ A_2 ∩ A_3 ∩ ··· = {ω : ω ∈ A_i for every i ∈ N}.
We will also sometimes need De Morgan's Laws: for a (possibly infinite) collection of events A_i,
(a) (⋃_i A_i)^c = ⋂_i A_i^c, and
(b) (⋂_i A_i)^c = ⋃_i A_i^c.
It is often useful to visualize the sample space in a Venn diagram. Then events such as A are subsets
of the sample space. It is a helpful analogy to imagine the probability of an event as the area in the
Venn diagram.
[Venn diagram: two overlapping events A and B inside a rectangle representing Ω.]
This analogy is more apt than it first appears, since the mathematical foundations of rigor-
ous probability theory are built on measure theory, which is the same theory that gives rigorous
foundation to the concepts of length, area, and volume.
A3 If A and B are disjoint events (i.e. if A ∩ B = ∅) then
P(A ∪ B) = P(A) + P(B).
A4 Countable additivity. For any infinite sequence A_1, A_2, . . . of pairwise disjoint events (so
A_i ∩ A_j = ∅ for all i ≠ j),
P(⋃_{i=1}^{∞} A_i) = Σ_{i=1}^{∞} P(A_i).
For this course, we will usually assume that the probability distribution is given (and satisfies the
axioms), without worrying too much about how the important practical task of finding the probabilities
was carried out.
P(B \ A) = P(B) − P(A ∩ B).
P(A ∪ B) = P(A) + P(B) − P(A ∩ B).
C8 Boole's inequality. For any events A_1, A_2, . . . (these need not be pairwise disjoint),
P(⋃_{i=1}^{∞} A_i) ≤ Σ_{i=1}^{∞} P(A_i).
If A_1 ⊇ A_2 ⊇ ··· is a decreasing sequence of events, then
P(⋂_{n=1}^{∞} A_n) = lim_{n→∞} P(A_n).
Definition 1.18 (Partition) We say that the events E_1, E_2, . . . , E_k (all subsets of the same
sample space Ω) form a partition when:
(i) they all have positive probability, i.e., P(E_i) > 0 for all i;
(ii) they are pairwise disjoint, i.e., E_i ∩ E_j = ∅ whenever i ≠ j; and
(iii) their union is the whole sample space: ⋃_{i=1}^{k} E_i = Ω.
Example 1.19 Consider a sample space Ω = {1, 2, 3, 4, 5, 6}. Possible partitions are {1, 3, 5}, {2, 4, 6};
or {1, 2}, {3, 4}, {5, 6}; and so on.
In particular, for any partition E_1, . . . , E_k,
Σ_{i=1}^{k} P(E_i) = 1.
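For equally likely outcomes these conditions are easy to verify computationally; an illustrative Python sketch for the odd/even partition of a die roll:

```python
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}
partition = [{1, 3, 5}, {2, 4, 6}]   # odd / even

# (ii) pairwise disjoint, and (iii) union is the whole sample space.
assert all(E & F == set() for i, E in enumerate(partition)
           for F in partition[i + 1:])
assert set().union(*partition) == omega

# With equally likely outcomes, the probabilities of the parts sum to 1.
P = lambda E: Fraction(len(E), len(omega))
print(sum(P(E) for E in partition))  # 1
```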
Exercise 21 (**) See problem sheet.
These consequences have an enormous effect on the way we work with probability. In particular, it
turns out that we can solve most problems without ever having to explicitly write down the outcomes
in our sample space, as in the next example. In fact, some people do probability without even defining
a sample space.
Example 1.20 Jimmy’s die has the numbers 2,2,2,2,5,5. Your die has numbers 1,1,4,4,4,4. You both
throw and the highest number wins. Assuming all outcomes are equally likely, what is the probability
that Jimmy wins?
The event J that Jimmy wins happens if either Jimmy throws a 5 (call this event F), or if you
throw a 1 (call this event A). Therefore J = A ∪ F and by C6, P(J) = P(A) + P(F) − P(A ∩ F).
As P(F) = 1/3, P(A) = 1/3 and P(A ∩ F) = 4/36 = 1/9 (by counting equally likely outcomes), we
have P(J) = 1/3 + 1/3 − 1/9 = 5/9.
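The answer can also be confirmed by brute-force enumeration of the 36 equally likely face pairs; a short Python sketch:

```python
from fractions import Fraction
from itertools import product

jimmy = [2, 2, 2, 2, 5, 5]   # faces of Jimmy's die
you = [1, 1, 4, 4, 4, 4]     # faces of your die

# Enumerate the 36 equally likely pairs of faces and count Jimmy's wins.
wins = sum(1 for j, y in product(jimmy, you) if j > y)
print(Fraction(wins, 36))  # 5/9
```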
1.6 Sigma-algebras
Also see [Sti03, Section 1.2].
We have described the axioms that a probability should satisfy, formulated in terms of events, but
we have hidden one important issue. In general, it is too much to demand that a probability P satisfying
A1–4 should be defined on all subsets of Ω. The reason why this is a problem goes beyond the scope of this
course (see the Bibliographical notes at the end of this chapter for references), but the essence is that
for uncountable sample spaces, such as Ω = [0, 1], there exist subsets of Ω that cannot be assigned a
probability in a way that is consistent with the axioms. The construction of such non-measurable sets
is also the basis of the famous Banach–Tarski paradox.
The upshot of all this is that we can, in general, only demand that P satisfies A1–4 for all events
in some collection F of subsets of Ω (i.e., for some F ⊆ 2^Ω). What properties should the collection F
of events possess? Consideration of A4 and C2 suggests the following definition.
Definition 1.21 (σ-algebra) A collection F of subsets of Ω is called a σ-algebra over Ω if it satisfies
the following properties.
S1 Ω ∈ F;
S2 A ∈ F implies that A^c ∈ F;
S3 if A_1, A_2, . . . ∈ F then ⋃_{n=1}^{∞} A_n ∈ F.
Property S2 says that F is closed under complementation, while S3 says that F is closed under
countable unions. Note that S1 and S2 combined imply that ∅ ∈ F.
Example 1.22 The power set 2^Ω is a σ-algebra over Ω, and it is easy to see that it is the biggest
possible σ-algebra over Ω. As described above, for uncountable Ω the set 2^Ω may be too unwieldy, in
which case we would consider a smaller σ-algebra.
The trivial σ-algebra {∅, Ω} is the smallest possible σ-algebra over Ω, but it carries no information
about the outcome of the experiment.
Definition 1.23 (Probability Space) If Ω is a set and F is a σ-algebra of subsets of Ω, and if P
satisfies A1–4 for events in F, then the triple (Ω, F, P) is called a probability space.
Example 1.24 Consider the sample space Ω = {1, 2, 3, 4, 5, 6} for the experiment of rolling a fair die.
The choice of σ-algebra may depend on exactly what we are interested in:
• F1 = {∅, {1, 3, 5}, {2, 4, 6}, Ω} (if we only care whether the score is odd or even);
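The defining properties S1–S3 can be checked mechanically for a small collection like F1; a minimal Python sketch (for a finite collection, closure under pairwise unions suffices for S3):

```python
from itertools import combinations

omega = frozenset({1, 2, 3, 4, 5, 6})
F1 = {frozenset(), frozenset({1, 3, 5}), frozenset({2, 4, 6}), omega}

# S1: contains Ω.
assert omega in F1
# S2: closed under complementation.
assert all(omega - A in F1 for A in F1)
# S3 (finite case): closed under unions of members.
assert all(A | B in F1 for A, B in combinations(F1, 2))
```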
Bibliographical notes
Sets are important not only for probability theory, but for all of mathematics. In fact, all of standard
mathematics can be formulated in terms of set theory, under the assumption that sets satisfy the ZFC
axioms; see for instance http://en.wikipedia.org/wiki/Zermelo%E2%80%93Fraenkel_set_theory.
The foundations of probability have a long and interesting history [Hac75]. The classical theory owes
much to Pierre-Simon Laplace (1749–1827): see [Lap25]. However, a rigorous mathematical foundation
for the theory was lacking, and providing one was posed as part of David Hilbert's (1862–1943) famous list of
problems of 1900 (the 6th problem). After important work in measure theory by Henri Lebesgue (1875–
1941) and Émile Borel (1871–1956), it was Andrey Kolmogorov who succeeded in 1933 in providing
the axioms that we use today (see the 1950 edition of his book [Kol50]).
George Boole (1815–1864) and John Venn (1834–1923) both wrote books concerned with probability
theory [Boo54, Ven88]; both were working before the formulation of Kolmogorov’s axioms.
As mentioned in Section 1.6, it is necessary in the general theory of probability to restrict events
to some σ-algebra. The reason for this is that in standard ZFC set theory, when Ω is uncountable (such
as Ω = [0, 1], the unit interval), it follows from an argument by Vitali (1905) that many natural prob-
ability assessments, such as the continuous uniform distribution, cannot be modelled by a probability
defined on all subsets of Ω satisfying A1–4: see for instance Chapter 1 of [Ros07]. In the case where Ω is
countable, one can always define P on the whole of 2^Ω. In the case where Ω is uncountable, we usually
do not explicitly mention Ω at all (when we work with continuous random variables, for example).
An alternative approach to probability theory is to do away with axiom A4, in which case some of
these technical issues can be avoided, at the expense of certain pathologies; however, in the standard
approach to modern probability, based on measure theory, A4 is a central part of the theory.
Chapter 2
Goals: Understand the equally likely outcomes model of classical probability. Know counting
principles, and when and how to apply them to specific problems.
In Chapter 1 we have seen the abstract formulation of probability theory; next we turn to the
question of how the probabilities themselves may be assigned. The most basic scenario occurs when
our experiment has a finite number of possible outcomes which we deem to be equally likely. Such
situations rarely—not to say never—occur in practice, but serve as good models in extremely controlled
environments such as in gambling or games of chance. However, this situation (which will essentially
come down to counting) gives us a good initial setting in which to obtain some very useful insight into
the nature and calculus of probability.
Suppose that we have a finite sample space Ω, which we may thus list as a finite collection of
m = |Ω| possible outcomes Ω = {ω1, . . . , ωm}. In the equally likely outcomes model (also sometimes
known as classical probability) we suppose that each outcome has the same probability:
P(ω) = 1/|Ω| for each ω ∈ Ω,
or, in the notation above, P(ωi) = 1/m for each i.
The axioms of probability then allow us to determine the probability of any event A ⊆ Ω: by C7,
P(A) = Σ_{ω∈A} P(ω) = |A|/|Ω| for any event A ⊆ Ω.
(This is a particular case of the discrete sample space discussed in Chapter 1.)
Definition 2.1 (Equally likely outcomes) Consider a scenario with m equally likely outcomes {ω1, . . . , ωm} =
Ω. In the equally likely outcomes model, the probability of an event A ⊆ Ω is then
P(A) := m_A/m,
where m_A is the number of outcomes in the event A (i.e. m_A = |A|).
Note that the rules for counting show that axioms A1–4 are satisfied; in the case of a finite sample
space, we may always take the σ-algebra F to be the whole of 2^Ω.
Example 2.2 Draw a card at random from a well-shuffled pack, so that each of the 52 cards is equally
likely to be chosen. Typical events are that the card is a spade (a set of 13 outcomes), the card is a
queen (a set of 4 outcomes), the card is the queen of spades (a set with a single outcome). In the equally
likely outcomes model, the probability of drawing the queen of spades (or any other specified card) is
1/52, the probability of drawing a spade is 13/52, and the probability of drawing a queen is 4/52.
Example 2.3 Flip a coin and see whether it falls heads or tails, each assumed equally likely; then
‘heads’ or ‘tails’ each has probability 1/2.
Example 2.4 Roll a fair cubic die to get a number from 1 to 6. Here the word ‘fair’ is used to mean
each outcome is equally likely.
Example 2.5 If we roll a pair of fair dice then outcomes are pairs (i, j), so there are 36 possible
outcomes. If we assume that the outcomes are equally likely, then the probability of getting a pair of 6's
is 1/36, for example.
The classical interpretation of probability is related to general probability theory in the same way
that counting is to mathematics. It is a good place to start and there are many important situations
where intuitively it seems reasonable to say that each outcome of a particular collection is equally
likely.
To extend the theory or apply it in practice, we have to address situations where there are no
candidates for equally likely outcomes, or where there are infinitely many possible outcomes, and work
out how to find probabilities to put into calculations that give useful predictions. We will come back
to some of these issues later; but bear in mind that however we come up with our probability model,
the same system of axioms that we saw in Chapter 1 applies.
Given a finite sample space and assuming that outcomes are equally likely, determining the probabilities
of events comes down to counting. For example, in drawing a poker hand of five cards from
a well-shuffled deck of 52 cards, the probability of having a ‘full house’ (meaning two cards of one
denomination and three of another, e.g., two Kings and three 7s) is given by the number of hands that
are full houses divided by the total number of hands (each hand being equally likely).
These counting problems need careful thought, and we will describe some counting principles for
some of the most common situations. Those of you who have done Discrete Maths will be familiar
with most of this material; we have a slightly different emphasis.
Counting Principle 2.6 (Multiplication Principle) Suppose that we must make k choices in suc-
cession where there are:
• m1 possibilities for the first choice;
• m2 possibilities for the second choice;
• . . .
• mk possibilities for the kth choice,
and all combinations of the possibilities are allowed. The total number of distinct possible selections is
m1 × m2 × m3 × ··· × mk = ∏_{i=1}^{k} m_i.
Example 2.7 In a standard deck of playing cards, each card has a denomination and a suit. There
are 13 possible denominations: 2, 3, . . . , 10, J(ack), Q(ueen), K(ing), and A(ce). There are 4 possible
suits: ♥ (heart), ♦ (diamond), ♣ (club), ♠ (spade). Because all combinations of denomination and
suit are allowed, the multiplication principle applies: there are 13 × 4 = 52 cards in a standard deck.
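The multiplication principle corresponds exactly to a Cartesian product; for instance, building the deck in Python:

```python
from itertools import product

denominations = ['2', '3', '4', '5', '6', '7', '8', '9', '10',
                 'J', 'Q', 'K', 'A']
suits = ['hearts', 'diamonds', 'clubs', 'spades']

# Every (denomination, suit) combination is allowed, so the
# multiplication principle gives 13 * 4 distinct cards.
deck = list(product(denominations, suits))
print(len(deck))  # 52
```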
Example 2.8 A hotel serves 3 choices of breakfast, 4 choices of lunch and 5 choices of dinner, so a
guest selects from 3 × 4 × 5 different combinations of the three meals (assuming we opt to have all
three).
All of the following counting principles are effectively consequences of the multiplication principle.
Example 2.9 A coffee bar has 5 different papers to choose from, 19 types of coffee and 7 different
snacks. Counting 'none' as an option for each, there are 6 × 20 × 8 = 960 distinct selections of coffee,
snack and paper. Of these, 5 involve no coffee or snack (which the staff may object to), plus one has
no coffee, snack or paper!
First, we look at ordered choices of distinct objects. In this case, we distinguish between selection
with replacement, where the same object can be selected multiple times, and selection without replacement,
where each object can be selected at most once.
Counting Principle 2.10 (Ordered Choices of Distinct Objects With Replacement) Suppose
that we have a collection of m distinct objects and we select r ≤ m of them with replacement. The num-
ber of different ordered lists (ordered r-tuples) is
m^r.
Counting Principle 2.11 (Ordered Choices of Distinct Objects Without Replacement) Suppose
that we have a collection of m distinct objects and we select r ≤ m of them without replacement. The
number of different ordered lists (ordered r-tuples) is
(m)_r := m × (m − 1) × (m − 2) × ··· × (m − r + 1) = m!/(m − r)!.
So, the falling factorial notation (m)_r is simply a convenient way to write m!/(m − r)!. In the
special case where r = m we set 0! = 1, and then (m)_m = m! is the number of permutations of the m
objects. If m is large, and r is much smaller than m, then (m)_r ≈ m^r.
Example 2.12 The number of ways we can deal out four cards in order from a pack of cards is (52)_4,
and the number of ways we can arrange the four aces in order is 4!, so the probability of finding the
four aces on top of a well-shuffled deck is
4!/(52)_4 = (4 × 3 × 2 × 1)/(52 × 51 × 50 × 49).
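The falling factorial and this probability are easy to compute exactly; a short Python sketch using exact rational arithmetic:

```python
from fractions import Fraction
import math

def falling_factorial(m, r):
    """(m)_r = m * (m-1) * ... * (m-r+1) = m! / (m-r)!"""
    return math.prod(range(m - r + 1, m + 1))

assert falling_factorial(52, 4) == 52 * 51 * 50 * 49
# Probability that the four aces sit on top of a well-shuffled deck:
p = Fraction(math.factorial(4), falling_factorial(52, 4))
print(p)  # 1/270725
```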
Exercise 39 (**) See problem sheet.
Counting Principle 2.13 (Unordered Choices of Distinct Objects Without Replacement) Suppose
that we have a collection of m distinct objects and we select r ≤ m of them without replacement, where
the order of selection is not important. The number of different (unordered) subsets of r objects is
C(m, r) := (m)_r/r! = m!/(r!(m − r)!).
To see this, first count the number of distinct ordered lists of r objects—this is (m)_r. Each unordered
subset has been counted (r)_r = r! times, as this is the number of distinct ways of arranging r dif-
ferent objects. Therefore the (m)_r ordered selections can be grouped into collections of size r!, each
representing a particular subset, and the result follows by dividing.
The expression C(m, r) is the binomial coefficient for choosing r objects from m and is often called
'm-choose-r'. Note that
C(m, r) = C(m, m − r),
as we can choose to take r objects from m in exactly the same number of ways that we can choose to
leave behind r objects, i.e., take m − r objects.
Example 2.14 Suppose that in a hand of cards their order is not important. Then there are C(52, 4) =
(52 × 51 × 50 × 49)/(4 × 3 × 2 × 1) = 270 725 distinct hands of four cards. The number of these with
no aces is C(48, 4) = (48 × 47 × 46 × 45)/(4 × 3 × 2 × 1) = 194 580, and so the probability of finding
no aces in a four-card hand is
C(48, 4)/C(52, 4) = (48 × 47 × 46 × 45)/(52 × 51 × 50 × 49) ≈ 0.7187.
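These binomial-coefficient computations can be checked with Python's math.comb:

```python
from math import comb

# Hands are unordered, so the counts are binomial coefficients.
total_hands = comb(52, 4)
no_ace_hands = comb(48, 4)
assert total_hands == 270_725
assert no_ace_hands == 194_580
print(round(no_ace_hands / total_hands, 4))  # 0.7187
```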
The same counting arguments can be used when we need to divide m objects into k > 2 groups:
arranging m distinguishable objects into k groups with sizes r1, . . . , rk, where r1 + ··· + rk = m,
can be done in
C(m; r1, . . . , rk) := m!/(r1! ··· rk!)
ways. The expression C(m; r1, . . . , rk) is called the multinomial coefficient [DS02, Section 1.9].
Counting Principle 2.15 (Ordered Choices of 2 Types of Objects) Suppose that we have m
objects, r of type 1 and m − r of type 2, where objects are indistinguishable from others of their type.
The number of distinct, ordered choices of the m objects is
C(m, r).
To see why, note that each distinct order for laying out the m objects in a row is precisely the same as
choosing r of the m positions for type 1 objects, i.e. it is an unordered choice of r positions from the
m distinct positions.
Example 2.16 With four red tokens and three black (none distinguishable from others of the same
colour) there are 7!/(4! 3!) = 35 different ways to lay them out in a row. The probability that they will
be alternately red and black is 1/35, as there is only one such ordering.
More generally, using the positions argument again, the multinomial coefficient is the number of
ordered choices of objects with k types, ri of type i, which are indistinguishable within each type.
Counting Principle 2.17 (Ordered Grouping of Indistinguishable Objects) The number of ways
to divide m indistinguishable objects into k distinct groups is
C(m + k − 1, m) = C(m + k − 1, k − 1).
When we have m indistinguishable objects to put into k distinct groups, e.g. identical balls into num-
bered bags, we can count the number of choices with the 'sheep-and-fences' method. Along a line the
k groups have k − 1 boundaries so, representing each of these with a 'fence', the number of groupings
of the objects is the same as the number of ordered arrangements of the objects and fences in a line,
e.g. ∗∗∗|∗||∗∗ for 6 objects in 4 groups; but this number is C(m + k − 1, m) = C(m + k − 1, k − 1)
by Counting Principle 2.15.
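The sheep-and-fences count can be verified by brute force for small m and k; an illustrative Python sketch:

```python
from math import comb
from itertools import product

def brute_force_groupings(m, k):
    """Count ways to put m indistinguishable objects into k distinct
    groups, by enumerating all (r1, ..., rk) with r1 + ... + rk = m."""
    return sum(1 for sizes in product(range(m + 1), repeat=k)
               if sum(sizes) == m)

m, k = 6, 4
assert brute_force_groupings(m, k) == comb(m + k - 1, k - 1)
print(comb(m + k - 1, k - 1))  # 84
```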
Bibliographical notes
Classical probability theory originated in calculation of odds for games of chance; as well as contribu-
tions by Pierre de Fermat (1601–1665) and Blaise Pascal (1623–1662), comprehensive approaches were
given by Abraham de Moivre (1667–1754) [dM56], Laplace (1749–1827) [Lap25], and Siméon Pois-
son (1781–1840). A collection of these classical methods made just before the advent of the modern
axiomatic theory can be found in [Whi01].
Chapter 3
Goals: Know the definition of conditional probability and its properties. Have a solid knowledge
of the partition theorem and Bayes's theorem, recognizing situations where one can apply them.
Understand the concept of independence.
Definition 3.1 (Conditional Probability) For any events A and B, the conditional probability of A
given B is
P(A | B) := P(A ∩ B)/P(B), whenever P(B) > 0.
Example 3.2 Throw three fair coins. What is the conditional probability of at least one head (event
A) given at least one tail (event B)? Let H be the event 'all heads', T the event 'all tails'. Then
P(B) = 1 − P(H) = 7/8 and P(A ∩ B) = 1 − P(H) − P(T) = 6/8, so that
P(A | B) = P(A ∩ B)/P(B) = (6/8)/(7/8) = 6/7.
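With equally likely outcomes, conditional probabilities reduce to counting, which is easy to automate; a Python sketch re-deriving the 6/7:

```python
from fractions import Fraction
from itertools import product

outcomes = list(product('HT', repeat=3))   # 8 equally likely outcomes

A = [w for w in outcomes if 'H' in w]      # at least one head
B = [w for w in outcomes if 'T' in w]      # at least one tail
AB = [w for w in outcomes if w in A and w in B]

# P(A | B) = P(A ∩ B) / P(B); the 1/8 factors cancel.
p = Fraction(len(AB), len(B))
print(p)  # 6/7
```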
3.2 Properties of conditional probability
Also see [BH15, Sections 2.3 and 2.4], [DS02, Sections 2.1 and 2.3], or [Sti03, Section 2.1].
P1 For any event B ⊆ Ω for which P(B) > 0, P(· | B) satisfies axioms A1–4 (i.e., is a
probability on Ω) and therefore also satisfies C1–10.
P2 The multiplication rule for probabilities (simple version): for any events A and B with
P(A) > 0 and P(B) > 0,
P(A ∩ B) = P(B) P(A | B) = P(A) P(B | A).
More generally, conditioning on a further event C,
P(A ∩ B | C) = P(B | C) P(A | B ∩ C), if P(B ∩ C) > 0. (3.1)
These properties are immediate from the definition; for instance (3.1) follows from the fact that
P(B | C) P(A | B ∩ C) = [P(B ∩ C)/P(C)] · [P(A ∩ B ∩ C)/P(B ∩ C)] = P(A ∩ B ∩ C)/P(C) = P(A ∩ B | C).
Exercise 56 (*) See problem sheet.
Exercise 57 (**) See problem sheet.
Exercise 58 (*) See problem sheet.
Exercise 59 (**) See problem sheet.
P3 The multiplication rule for probabilities (general version): for any events A_0, A_1, . . . , A_k
with P(⋂_{i=0}^{k−1} A_i) > 0,
P(⋂_{i=1}^{k} A_i | A_0) = P(A_1 | A_0) × P(A_2 | A_1 ∩ A_0) × ···
··· × P(A_{k−2} | ⋂_{i=0}^{k−3} A_i) × P(A_{k−1} | ⋂_{i=0}^{k−2} A_i) × P(A_k | ⋂_{i=0}^{k−1} A_i).
P4 The partition theorem or law of total probability. If E_1, E_2, . . . , E_k form a partition then,
for any event A, we have
P(A) = Σ_{i=1}^{k} P(A | E_i) P(E_i). (3.2)
To see this, note that P(A | E_i) P(E_i) = P(A ∩ E_i) by P2. But since the E_i are a partition, they
are pairwise disjoint, and hence so are the A ∩ E_i, so by C7
Σ_{i=1}^{k} P(A ∩ E_i) = P(⋃_{i=1}^{k} (A ∩ E_i)) = P(A ∩ (⋃_{i=1}^{k} E_i)) = P(A ∩ Ω) = P(A).
P5 Bayes's theorem (first version): for any events A and B with P(A) > 0 and P(B) > 0,
P(A | B) = P(A) P(B | A)/P(B).
More generally, if P(B | C) > 0,
P(A | B ∩ C) = P(A | C) P(B | A ∩ C)/P(B | C).
P6 Bayes's theorem (second version): for any partition A_1, . . . , A_k and any B with P(B) > 0,
P(A_i | B) = P(A_i) P(B | A_i) / Σ_{j=1}^{k} P(A_j) P(B | A_j).
More generally, if P(B | C) > 0,
P(A_i | B ∩ C) = P(A_i | C) P(B | A_i ∩ C) / Σ_{j=1}^{k} P(A_j | C) P(B | A_j ∩ C).
Example 3.4 In the previous example, a component is drawn from the packet and found to be faulty.
What is the probability that it was made by machine A?
P(M_A | F) = P(M_A) P(F | M_A) / [P(M_A) P(F | M_A) + P(M_B) P(F | M_B) + P(M_C) P(F | M_C)]
= (0.1 × 1/3) / (0.1 × 1/3 + 0.2 × 1/3 + 0.3 × 1/3) = 1/6.
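Assuming, as the displayed computation suggests, that the three factors 1/3 are the machine probabilities and 0.1, 0.2, 0.3 are the fault probabilities (these inputs come from the earlier example, not reproduced here), Bayes's theorem can be evaluated exactly in Python:

```python
from fractions import Fraction

# Assumed inputs from the preceding example: each machine equally
# likely, with the stated fault probabilities.
prior = {'A': Fraction(1, 3), 'B': Fraction(1, 3), 'C': Fraction(1, 3)}
fault = {'A': Fraction(1, 10), 'B': Fraction(2, 10), 'C': Fraction(3, 10)}

# Bayes's theorem (second version) over the partition {M_A, M_B, M_C}.
evidence = sum(prior[m] * fault[m] for m in prior)
posterior_A = prior['A'] * fault['A'] / evidence
print(posterior_A)  # 1/6
```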
Definition 3.5 (Independence of Two Events) We say that two events A and B are independent
whenever
P(A ∩ B) = P(A) P(B).
We say that two events A and B are conditionally independent given a third event C whenever
P(A ∩ B | C) = P(A | C) P(B | C).
Example 3.6 Roll two standard dice. Let E be the event that we have an even outcome on the first
die. Let F be the event that we have a 4 or 5 on the second die. Are E and F independent? Let us
verify, using a counting argument.
There are 36 equally likely outcomes, namely the pairs (i, j) with i, j ∈ {1, 2, …, 6}. Of those, 3 × 6 are in E, and 6 × 2 are in F, so P(E) = 18/36 = 1/2 and P(F) = 12/36 = 1/3. Moreover, 3 × 2 of these outcomes belong to both E and F, so P(E ∩ F) = 6/36 = 1/6. Indeed, P(E) P(F) = 1/2 × 1/3 = 1/6 = P(E ∩ F), so E and F are independent.
Theorem 3.7 Consider any two events A and B with P(A) > 0 and P(B) > 0. The following statements are equivalent.
(i) P(A ∩ B) = P(A) P(B).
(ii) P(A | B) = P(A).
(iii) P(B | A) = P(B).
In other words, learning about B will not tell us anything new about A, and similarly, learning about
A will not tell us anything new about B.
For conditional independence, we have a similar result.
Theorem 3.8 Consider any three events A, B, and C, with P(A ∩ B ∩ C) > 0. The following statements are equivalent.
(i) P(A ∩ B | C) = P(A | C) P(B | C).
(ii) P(A | B ∩ C) = P(A | C).
(iii) P(B | A ∩ C) = P(B | C).
In other words, if we know C then learning about B will not tell us anything new about A, and
similarly, if we know C then learning about A will not tell us anything new about B.
It is possible for two events to be conditionally independent given a particular event, but not to be (unconditionally) independent. We will see an example of this when we discuss genetics, in Section 5.2.
It can be extremely useful to recognize situations where (conditional) independence can be applied.
Of course, it is equally important not to assume (conditional) independence where there really are
dependencies.
A collection of events A_i, i ∈ I, is mutually conditionally independent given another event C if for every finite non-empty subset J ⊆ I,

P(⋂_{i∈J} A_i | C) = ∏_{i∈J} P(A_i | C).
Example 3.10 Three events A, B, and C are mutually independent if all of the following equalities are satisfied:

P(A ∩ B) = P(A) P(B),  P(A ∩ C) = P(A) P(C),  P(B ∩ C) = P(B) P(C),
P(A ∩ B ∩ C) = P(A) P(B) P(C).
Example 3.11 Suppose we roll 4 dice and their values are independent. To find the probability of no sixes, let A_i be the event 'throw i is not a 6'. By assumption A_1, …, A_4 are independent, so

P(no sixes on 4 dice) = P(⋂_{i=1}^{4} A_i) = ∏_{i=1}^{4} P(A_i) = (5/6)^4.
The same result is obtained from the classical model, by selection with replacement.
Example 3.12 Toss two fair coins. The sample space is Ω = {HH, HT, TH, TT} and each outcome has probability 1/4. Let A = {HH, HT} be the event that the first coin comes up 'heads', B = {HH, TH} the event that the second coin comes up 'heads', and C = {HH, TT} the event that the coins come up the same. Then since P(A) = P(B) = P(C) = 1/2 and each pairwise intersection has probability 1/4, it is easy to see that the events are pairwise independent. However, P(A ∩ B ∩ C) = P({HH}) = 1/4, which is not the same as P(A) P(B) P(C) = 1/8, so the three events are not mutually independent.

Never confuse disjoint events with independent events! For independent events, we have that P(A ∩ B) = P(A) P(B), but for disjoint events, P(A ∩ B) = 0 because A ∩ B = ∅.
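Example 3.12 is small enough to check by brute force. The following Python sketch (our own illustration, not from the notes) enumerates the four equally likely outcomes and verifies pairwise but not mutual independence, using exact fractions to avoid rounding issues.

```python
from itertools import product
from fractions import Fraction

# Sample space for two fair coin tosses; each outcome has probability 1/4,
# so P(E) is just |E| / |Omega| for any event E.
omega = [a + b for a, b in product("HT", repeat=2)]   # HH, HT, TH, TT
P = lambda E: Fraction(len(E), len(omega))

A = {w for w in omega if w[0] == "H"}    # first coin heads
B = {w for w in omega if w[1] == "H"}    # second coin heads
C = {w for w in omega if w[0] == w[1]}   # both coins show the same face

# Pairwise independent:
assert P(A & B) == P(A) * P(B)
assert P(A & C) == P(A) * P(C)
assert P(B & C) == P(B) * P(C)
# ...but not mutually independent:
assert P(A & B & C) != P(A) * P(B) * P(C)
```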
Bibliographical notes
Bayes’s theorem is named after the Reverend Thomas Bayes (1701–1761). In our modern approach to
probability, the theorem is a very simple consequence of our definitions; however, the result may be
interpreted more widely, and is one of the most important results regarding statistical reasoning.
Chapter 4
Interpretations of probability
Goals: Understand that there are different ways to interpret probability values.
Example 4.1 Suppose we toss a coin 1000 times and observe 490 heads; then the relative frequency of heads is 490/1000.
For some experiments, it may be reasonable to suppose that relative frequencies are stable for very
large n.
Example 4.2 If we toss a fair coin one billion times we might expect that the relative frequency of
heads after the first few thousand throws would remain very close to 1/2.
As a mathematical idealization, we suppose that there is a unique, empirical limiting value for nA {n,
as n tends to infinity, which we call the relative frequency probability of A.
Example 4.3 For our coin, the statement P (heads) “ 1{2 means ‘if we tossed the coin an extremely
large number of times, then the proportion of heads would be arbitrarily close to 1/2’.
This interpretation is widely used, especially in physics, where experiments are designed for repeata-
bility and we can expect future trials to behave like those in the past. In this view, probability is a
property of the experimental setup and may be “objectively” discovered by sufficient repetitions of the
experiment.
Amongst the problems with this interpretation are: it is often impossible to decide what “essentially
unchanged conditions” are; we often have no way of knowing when limiting frequencies become stable
(how many trials should we do to test this?); we can only use it in situations that are repeatable.
Bibliographical notes
For further reading on the different interpretations of probability, see [Háj12].
Chapter 5
Goals: Understand the meaning of a reliability network. Know how to evaluate the reliability
of networks. Understand the probabilistic nature of genetics. Know how to derive the probability
of genotypes of children given the genotypes of parents. Know how and when to exploit the general
rules of (conditional) probability to solve complex reliability networks and problems in genetics.
5.1 Reliability networks

Definition 5.1 (Reliability network) A reliability network is a diagram of nodes and arcs. The nodes represent components of a multi-component system, where each node is either working or broken, and where the entire system works if it is possible to get from the left end to the right end of the diagram through working components only.
Suppose the ith component functions with probability p_i, and different components are independent. The probability that the system works is then clearly a function of the probabilities p_1, …, p_k. We denote this function by r(p_1, p_2, …, p_k), and call it the reliability function. It is determined by the layout of the reliability network.
Example 5.2 The figure below shows (a) two components in series, (b) three in parallel, (c) a five-component bridge system. In each case assume the system works if it is possible to get from the left end to the right through functioning components.

[Figure: (a) components 1 and 2 in series; (b) components 1, 2, 3 in parallel; (c) a five-component bridge system.]
By independence of failures, in case (a) we have r(p_1, p_2) = p_1 p_2, and in case (b), setting q_i = 1 − p_i, we have r(p_1, p_2, p_3) = 1 − q_1 q_2 q_3. Case (c) is the subject of Exercise 82.
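For small networks, the reliability function can be computed by brute force: sum the probability of every combination of component states in which the system works. The sketch below (ours, not from the notes; the function name `reliability` and the encoding of a layout as a `works` predicate are our own choices) reproduces the series and parallel formulas from cases (a) and (b).

```python
from itertools import product

def reliability(works, p):
    """Reliability r(p_1, ..., p_k) of a system whose layout is encoded by
    `works`: a function taking a tuple of component states (True = working)
    and saying whether the whole system works.  Components are assumed to
    fail independently, so each state's probability is a product."""
    total = 0.0
    for state in product([True, False], repeat=len(p)):
        prob = 1.0
        for s, pi in zip(state, p):
            prob *= pi if s else 1 - pi
        if works(state):
            total += prob
    return total

# (a) two components in series: the system works iff all components work.
assert abs(reliability(all, [0.9, 0.8]) - 0.9 * 0.8) < 1e-12
# (b) three components in parallel: it works iff any component works.
q = lambda p: 1 - p
assert abs(reliability(any, [0.9, 0.8, 0.7])
           - (1 - q(0.9) * q(0.8) * q(0.7))) < 1e-12
```

A more complicated layout, such as the bridge in case (c), only needs a different `works` predicate describing which paths connect left to right.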
5.2 Genetics
Also see https://en.wikipedia.org/wiki/Mendelian_inheritance.
Inherited characteristics are determined by genes. The mechanism governing inheritance is random
and so the laws of probability are crucial to understanding genetics.
Your cells contain 23 pairs of chromosomes, each containing many genes (while 23 pairs is specific
to humans the idea is similar for all animals and plants). The genes take different forms called alleles
and this is one reason why people differ (there are also environmental factors). Of the 23 pairs of
chromosomes, 22 pairs are homologous (each of the pair has an allele for any gene located on this
pair). People with different alleles are grouped by visible characteristics into phenotypes; often one
allele, A say, is dominant and another, a, is recessive in which case AA and Aa are of the same
phenotype while aa is distinct. Sometimes, the recessive gene is rare and the corresponding phenotype
is harmful, for example haemophilia or sickle-cell anaemia.
Example 5.3 In certain strains of mice the gene for coat colour (a phenotype) has alleles B (black)
or b (brown). B is dominant, so BB or Bb mice are black, while bb mice are brown with no difference
between Bb and bB.
With sickle-cell anaemia, allele A produces normal red blood cells but a produces deformed cells.
Genotype aa is fatal but Aa provides protection against malarial infection (which is often fatal) and so
allele a is common in some areas of high malaria risk.
The final chromosome pair are the sex chromosomes. Each may be X, a long chromosome, or Y, a
short chromosome. Most of the genes on X do not occur on Y. Your sex is determined as XX (female)
or XY (male) but YY is not possible.1
In our simple model, each individual receives an allele for each gene on its homologous chromosomes
from each parent, chosen independently and at random from each parent’s two alleles for that gene. It
is extremely important to note that genotypes of siblings are dependent unless conditioned on parental
genotypes.
Example 5.4 For example, in a certain type of flowering pea, flower colour is determined by a gene
with alleles R and W , with phenotypes RR (red), RW (pink) and W W (white). The table of offspring
genotype probabilities given parental genotypes is
Parental genotypes   RR×RR   RR×RW   RR×WW   RW×RW   RW×WW   WW×WW
offspring   RR        1       1/2     0       1/4     0       0
genotype    RW        0       1/2     1       1/2     1/2     0
            WW        0       0       0       1/4     1/2     1
e.g. parents RR and RW produce RR offspring with chance 1/2 (the RR parent must supply an R, but the RW parent supplies either R or W, each with probability 1/2). When we cross red and white peas, all the offspring will be pink, but when we cross red and pink peas, about half of the peas will be red and half pink. (Mendel carried out experiments like these to establish the genetic basis of inheritance.)
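The whole table can be generated mechanically from the model in the text: each parent contributes one of its two alleles uniformly at random, independently of the other parent. A small Python sketch (our own illustration; the function name is hypothetical):

```python
from fractions import Fraction
from itertools import product

def offspring_distribution(parent1, parent2):
    """Offspring genotype distribution given two parental genotypes,
    each parent contributing one of its two alleles uniformly at random
    and independently (the simple model in the text).  Genotypes are
    unordered, so e.g. 'WR' is recorded as 'RW'."""
    dist = {}
    for a, b in product(parent1, parent2):   # 4 equally likely allele pairs
        g = "".join(sorted(a + b))
        dist[g] = dist.get(g, Fraction(0)) + Fraction(1, 4)
    return dist

# Reproduce one column of the table: parents RR and RW give
# RR and RW offspring, each with probability 1/2.
print(offspring_distribution("RR", "RW"))
```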
Exercise 83 (**) See problem sheet.
¹ X, XXX, XXY, and XYY can occur, but are very rare.
Similar but larger tables are relevant when there are more than two alleles.
Example 5.5 A gene has alleles A, a, but a is recessive and harmful, so genotype aa does not reproduce, while AA and Aa are indistinguishable. With proportions 1 − λ, λ of AA, Aa in the healthy population, the probability of an aa offspring is λ²/4. To show this, use the partition F_AA, F_Aa (father AA, Aa respectively) to calculate

P(F_a) = P(F_a | F_AA) P(F_AA) + P(F_a | F_Aa) P(F_Aa) = 0 × (1 − λ) + (1/2) × λ = λ/2

for the event F_a that the father provides allele a. By symmetry, the mother also supplies allele a with probability λ/2, and by independence (random mating) the probability that they both supply allele a is λ²/4; e.g. when λ ≈ 1/2, about 6% of offspring will be aa. Over time the proportion of allele a will decrease unless Aa has a reproductive advantage over AA.
Things are slightly different for genes on the X or Y chromosomes (sex-linked genes).
Consider a gene (on a homologous chromosome) with two alleles A and a. Suppose the proportions of genotypes AA, Aa and aa in the population at generation 0 are u_0, 2v_0 and w_0 respectively (u_0 + 2v_0 + w_0 = 1), both for males and females. Are these proportions stable over successive generations in a large population where (i) mating in each generation is random with respect to this gene; (ii) different genotypes have equal reproductive success?
The proportion of A in the gene pool at generation 0 is p = u_0 + v_0. In generation 1, by random mating, each child has probability p², 2p(1 − p), (1 − p)² of being AA, Aa, aa, respectively. As long as the population is large enough (see the law of large numbers later in the course) these will also be the generation-1 proportions of AA, Aa, aa, i.e.

u_1 = p²,  v_1 = p(1 − p),  w_1 = (1 − p)².

Now let p_1 = u_1 + v_1 be the proportion of A in the gene pool at generation 1. Substituting the values of u_1, v_1 we find that

p_1 = p² + p(1 − p) = p,

i.e. the proportions of A and a in the gene pool are constant. This repeats in later generations, so the proportions of the three genotypes remain constant from generation 1 onward (though perhaps different from generation 0). This is called the Hardy–Weinberg equilibrium.
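The one-generation update above is easy to trace through numerically. A sketch with exact fractions (the starting proportions 1/5, 1/2, 3/10 are an arbitrary choice of ours):

```python
from fractions import Fraction

# Arbitrary generation-0 genotype proportions u0, 2*v0, w0 of AA, Aa, aa.
u0, v0, w0 = Fraction(1, 5), Fraction(1, 4), Fraction(3, 10)
assert u0 + 2 * v0 + w0 == 1

p = u0 + v0                                   # proportion of allele A

# Generation 1 under random mating: proportions p^2, 2p(1-p), (1-p)^2.
u1, v1, w1 = p**2, p * (1 - p), (1 - p)**2
p1 = u1 + v1
assert p1 == p                                # allele proportion unchanged

# From generation 1 on, the genotype proportions themselves are fixed:
u2, v2, w2 = p1**2, p1 * (1 - p1), (1 - p1)**2
assert (u2, v2, w2) == (u1, v1, w1)           # Hardy-Weinberg equilibrium
```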
Chapter 6
Random variables
Goals: To understand the definition of a random variable as a function on the sample space.
Master the notation for events and probabilities of events relating to random variables.
Know how to recognize a discrete random variable, how to identify its probability mass function,
and how to derive probabilities of associated events. Know how to recognize a binomial scenario,
the assumptions behind it, and how to identify the parameters of the corresponding binomial
distribution. The same for the Poisson distribution. Know when and how the binomial distribution
can be approximated by a Poisson distribution.
Know how to recognize a continuous real-valued random variable, how to identify its probability
density function, and how to derive probabilities of associated events. Know the uniform distribu-
tion. Know the exponential distribution and how it arises from the Poisson distribution. Know the
normal distribution, and how we can derive probabilities of events using its cumulative distribution
function and standard normal tables. Work with functions of random variables.
In many experiments we are often interested in a numerical value rather than the elementary event
ω P Ω per se: for example, in the financial industry we may not care about the behaviour of the stock
price throughout a given period, only whether it reached a certain level or not; in weather forecasting
we may not be interested in the detailed variation of atmospheric pressure and temperature, only in
how much rain is going to fall, and so on. These uncertain quantities associated with random scenarios
have as their mathematical idealization the concept of random variable.
Put simply, a random variable is a function or mapping of the sample space: for each ω P Ω,
the random variable X gives the output Xpωq. In this chapter, we study both discrete and continu-
ous univariate random variables, and discuss some important examples: binomial, Poisson, uniform,
exponential, and normal distributions. To enable practical calculations, we also discuss cumulative
distribution functions, standard tables, and how probabilities behave under transformations.
6.1 Definition and notation
Also see [BH15, Section 3.1], [DS02, Section 3.1], or [Sti03, Section 4.1].
• X(Ω) ⊆ R, in which case we say that X is a real-valued random variable or a univariate random variable; or
• X(Ω) ⊆ Rⁿ for some n ≥ 2, in which case we say that X is a multivariate random variable.
Example 6.2 Consider throwing two standard dice, with sample space

Ω = {(i, j) : i, j ∈ {1, 2, …, 6}},

so elements of Ω correspond to pairs (i, j). The sum of the numbers that show on the dice corresponds to a real-valued random variable X defined by X((i, j)) := i + j.
For any B ⊆ X(Ω), we write 'X ∈ B' to denote the event {ω ∈ Ω : X(ω) ∈ B}. For any x ∈ X(Ω), we write 'X = x' to mean the event X ∈ {x}, that is, {ω ∈ Ω : X(ω) = x}. We sometimes also write {X = x} and {X ∈ B} to emphasize that these are sets.
Example 6.3 Continuing the example where X denotes the sum of the outcomes of two dice, the notation X = 10 denotes the event {(4, 6), (5, 5), (6, 4)}, and X ∈ [0, 4] denotes the event {(1, 1), (1, 2), (2, 1), (1, 3), (2, 2), (3, 1)}.
Example 6.4 A simple but important class of random variables is formed by the indicator random variables, denoted for an event A ⊆ Ω by 1_A and given by the mapping

1_A(ω) = 1 if ω ∈ A, and 1_A(ω) = 0 otherwise.
Recall Definition 1.17. Given a probability distribution P on the sample space Ω, a random variable X : Ω → X(Ω) induces a probability distribution P_X on the sample space X(Ω) as follows.

Theorem 6.5 The function P_X, mapping sets A ⊆ X(Ω) to a real number P_X(A), defined by

P_X(A) := P(X ∈ A) = P({ω ∈ Ω : X(ω) ∈ A}),

is a probability distribution on X(Ω).
As mentioned in Chapter 1, in general sample spaces Ω one cannot expect to be able to assign probabilities to every A ∈ 2^Ω, and one must restrict to some F ⊂ 2^Ω of 'nice events'. This has consequences for the definition of random variable we gave above, as can be gleaned by examination of Theorem 6.5: we must have that {ω ∈ Ω : X(ω) ∈ A} is a genuine event in F, at least for a large enough class of A ⊆ X(Ω) to tell us everything we would like to know about the random variable X. Those of you who do Probability II (and later courses) will see that the relevant concept for a more complete definition of random variable is that of a measurable function.
Definition 6.6 (Discrete Random Variable & Probability Mass Function) A random variable X : Ω → X(Ω) is said to be discrete when there is a function p : X(Ω) → R such that

P(X ∈ B) = Σ_{x∈B} p(x),

for all B ⊆ X(Ω). In this case, p is called the probability mass function of X.

The probability mass function of a discrete random variable summarizes all information we have about X. Specifically, it allows us to calculate the probability of every event of the form {X ∈ B}.
Clearly, if X is discrete, it must hold that p(x) = P(X = x). If the random variable under consideration is not clear from the context, we may write p_X for the probability mass function of X.
Recall that a set is countable if there exists a bijection between that set and a subset of the natural numbers N.
Theorem 6.7 A random variable X : Ω → X(Ω) is discrete whenever X(Ω) is countable.
Exercise 89 (***) See problem sheet.
Example 6.8 Toss three fair coins and let X be the total number of heads obtained. This example is discrete since the possible values are 0, 1, 2, 3. We have P(X = 1) = P(X ∈ {1}) = P({HTT, THT, TTH}) = 3/8, since each of the eight sequences of 3 tosses has probability 1/8. The other values are found similarly.

Example 6.9 For the three coins example the possible values of X are 0, 1, 2, 3. Grouping the 8 outcomes, as above, gives P(X = x) = (3 choose x) × 1/8 for x ∈ X(Ω) = {0, 1, 2, 3}, and so the probability mass function of X is

p(0) = 1/8,  p(1) = 3/8,  p(2) = 3/8,  p(3) = 1/8.
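The grouping argument of Example 6.9 can be carried out by direct enumeration. A Python sketch of ours, using exact fractions:

```python
from itertools import product
from fractions import Fraction

# pmf of X = number of heads in three fair coin tosses, by enumerating
# the 8 equally likely outcomes and grouping them by their head count.
outcomes = list(product("HT", repeat=3))
pmf = {x: Fraction(sum(1 for w in outcomes if w.count("H") == x), 8)
       for x in range(4)}
# p(0) = 1/8, p(1) = 3/8, p(2) = 3/8, p(3) = 1/8, matching Example 6.9.
print(pmf)
```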
Exercise 90 (*) See problem sheet.
Exercise 91 (**) See problem sheet.
Corollary 6.10 Let X be a discrete random variable. Then the values of its probability mass function p are non-negative and sum to one:

Σ_{x∈X(Ω)} p(x) = 1.
• a sequence of n trials will be carried out, where n is known in advance of the experiment;
• each trial has only two outcomes, usually denoted 'success' or 'failure'; and finally,
• because trials are independent (see Definition 3.9), every sequence ω ∈ Ω with exactly x successes and n − x failures has probability P({ω}) = p^x (1 − p)^{n−x}.

Writing X for the total number of successes, we can sum over all sequences with exactly x successes to obtain

p(x) = P(X = x) = Σ_{ω∈Ω : Σ_i ω_i = x} P({ω}) = (n choose x) p^x (1 − p)^{n−x},

since there are (n choose x) such sequences.
In case of just a single trial (n = 1), the binomial scenario is often referred to as a Bernoulli trial, and X ∼ Bin(1, p) is often referred to as a Bernoulli random variable.
Example 6.12 If we roll 4 fair cubic dice and let X be the number of 6s, then P(X = x) = (4 choose x) (1/6)^x (5/6)^{4−x} for x ∈ {0, 1, 2, 3, 4}.

Definition 6.13 (Poisson Distribution) Let λ > 0. We say a discrete random variable X is Poisson distributed with parameter λ, and we write X ∼ Po(λ), when its probability mass function is

p(x) = e^{−λ} λ^x / x!  for x ∈ Z_+.
The Poisson distribution is used to model counts of events which occur randomly in time at rate r per unit time (specifically, P(event in (x, x + h)) ≈ rh for small enough h). If X is the count over a period of length t then it has distribution Po(rt). Typical applications are in models of numbers of calls to telephone exchanges, numbers of accidents at busy traffic intersections, earthquakes at a tectonic boundary, etc.
The Poisson distribution can be used to approximate the binomial distribution for small p and large n.
Theorem 6.14 Consider any λ > 0. Let X_n ∼ Bin(n, λ/n), and Y ∼ Po(λ). Then for all x ∈ Z_+,

lim_{n→∞} P(X_n = x) = P(Y = x).
Proof For fixed x we have that

P(X_n = x) = (n choose x) (λ/n)^x (1 − λ/n)^{n−x} = [n! / ((n − x)! x! n^x)] λ^x (1 − λ/n)^{n−x},

where

lim_{n→∞} n! / ((n − x)! n^x) = 1  and  lim_{n→∞} (1 − λ/n)^{n−x} = e^{−λ}.

Hence P(X_n = x) → λ^x e^{−λ}/x! = P(Y = x).
This means that if X ∼ Bin(n, p), where n is large and p is small, then approximately X ∼ Po(np). As a rule of thumb, with p ≤ 0.05, we find n = 20 gives a reasonable approximation, while n = 100 gives a good approximation. The approximation is useful when it allows us to sidestep calculating (n choose x) for large values of n and x.
Example 6.15 A typist produces a page of 1000 symbols but has probability 0.001 of mistyping any
single symbol and such errors are independent (note: neither assumption is particularly realistic). The
probability that a page contains more than two mistakes is P (X ą 2) where X is the number of mistakes
on the page.
Clearly, X ∼ Bin(1000, 0.001). As n is large and p is small, we approximately have that X ∼ Po(1000 × 0.001) = Po(1). Therefore,

P(X > 2) = 1 − P(X ≤ 2) ≈ 1 − e^{−1} (1/0! + 1/1! + 1/2!) ≈ 0.0803.
Exercise 107 (***) See problem sheet.
Definition 6.16 (Continuous Random Variable & Probability Density Function) Consider a real-valued random variable X : Ω → R. We say that X is a continuous random variable, or that X has a continuous probability distribution, or that X is continuously distributed, when there is a non-negative piecewise continuous function f : R → R such that

P(X ∈ [a, b]) = ∫_a^b f(t) dt,     (6.1)

for all [a, b] ⊆ R. In this case, f is called the probability density function of X.
If the random variable under consideration is not clear from the context, we may write fX for the
probability density function of X.
The probability density function of a random variable determines its probability distribution for a
very wide range of events—but not all, unlike the discrete case (Definition 6.6):
Theorem 6.17 If X is continuously distributed with probability density function f, then for any A ⊆ R that is a finite union of intervals:

P(X ∈ A) = ∫_A f(x) dx.

The probability density function of a continuous random variable summarizes practically all information we have about X. Specifically, it allows us to calculate the probability of every event of the form {X ∈ A} where A is a finite union of intervals.
As before, again by A2, and Theorem 6.17, we recover a version of C10 for probability density
functions:
Corollary 6.18 Let X be a continuous random variable. Then its probability density function f integrates to one:

∫_{−∞}^{∞} f(x) dx = 1.
Example 6.19 Suppose that X can take values between 0 and 2, with f(x) = kx for some constant k when x ∈ [0, 2], and f(x) = 0 for x ∉ [0, 2]. To find k, note that 1 = ∫_{−∞}^{∞} f(x) dx = ∫_0^2 kx dx = 2k, and therefore k = 1/2. With A = [0, 1], we have P(X ∈ A) = ∫_0^1 x/2 dx = 1/4.
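Both integrals in Example 6.19 can be sanity-checked numerically. A minimal sketch of ours, using a midpoint Riemann sum (which is exact on a linear density, and accurate enough here in any case):

```python
# Density from Example 6.19: f(x) = x/2 on [0, 2], zero elsewhere.
f = lambda x: x / 2 if 0 <= x <= 2 else 0.0

def integrate(f, a, b, n=100_000):
    """Midpoint Riemann sum approximation to the integral of f over [a, b]."""
    h = (b - a) / n
    return sum(f(a + (i + 0.5) * h) for i in range(n)) * h

assert abs(integrate(f, 0, 2) - 1.0) < 1e-6    # total probability is 1
assert abs(integrate(f, 0, 1) - 0.25) < 1e-6   # P(X in [0, 1]) = 1/4
```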
Definition 6.20 (Uniform Distribution) Let a and b be real numbers with a < b. We say a continuous random variable X is uniformly distributed on [a, b], and we write X ∼ U(a, b), when

f(x) = 1/(b − a) for all x ∈ [a, b], and f(x) = 0 elsewhere.

When X ∼ U(a, b), X can take any value in the continuous range of values from a to b, and the probability of finding X in any interval [x, x + h] ⊆ [a, b] does not depend on x.
Exercise 109 (*) See problem sheet.
Suppose a bell chimes randomly in time at rate β. Let T > 0 denote the time of the first chime (a random variable). The events {no chimes in [0, τ]} and {T > τ} are the same. We know that the number of chimes in the interval [0, τ] is Po(βτ), and so the probability of no chimes is e^{−βτ}. Hence P(T > τ) = e^{−βτ}, so P(T ≤ τ) = 1 − e^{−βτ}, and differentiating with respect to τ gives the density βe^{−βτ}.
Definition 6.21 (Exponential Distribution) Let β be a strictly positive constant. We say a continuous random variable X is exponentially distributed with parameter β, and we write X ∼ Exp(β), when

f(x) = βe^{−βx} for all x ≥ 0, and f(x) = 0 elsewhere.
This is one of the most important probability distributions, for several reasons, some of which we
will see later in this course.
Definition 6.22 (Normal Distribution) Let µ ∈ R and σ > 0. We say a continuous random variable X is normally distributed with parameters µ and σ², and we write X ∼ N(µ, σ²), when

f(x) = (1/(σ√(2π))) e^{−(x−µ)²/(2σ²)} for all x ∈ R.

The probability density function of the normal distribution is bell shaped. The parameter µ determines the location of the bell, and σ determines the spread of the bell; e.g. for smaller σ, the bell is narrower. They are related to the expectation and standard deviation, which will be introduced later on.
Definition 6.23 (Cumulative Distribution Function) For any real-valued random variable X, the function F : R → [0, 1] defined by

F(x) := P(X ≤ x) for x ∈ R

is called the cumulative distribution function of X.

If the random variable under consideration is not clear from the context, we may write F_X for the cumulative distribution function of X.
Exercise 113 (*) See problem sheet.
Exercise 114 (**) See problem sheet.
Exercise 115 (*) See problem sheet.
Exercise 116 (**) See problem sheet.
By C1, it follows immediately that:

P(X ∈ (a, b]) = P(X ≤ b) − P(X ≤ a) = F(b) − F(a).

This, together with C7, implies that knowledge of the cumulative distribution function is sufficient to calculate the probability of virtually any event of practical interest, that is, any finite union of intervals.
If X is a discrete real-valued random variable, then F is piecewise constant, and

F(x) = Σ_{t : t≤x} p(t),   p(x) = F(x) − lim_{n→∞} F(x − n^{−1}).

In the statement p(x) = F(x) − lim_{n→∞} F(x − n^{−1}) we have used C9 (continuity along monotone limits), because the events A_n = {x − n^{−1} < X ≤ x} are decreasing in n and have as their limit ⋂_{n=1}^{∞} A_n = {X = x}.
As already mentioned, Φ has no closed analytical form; however, it can be tabulated. Such a tabulation is called a standard normal table. Because φ(z) = φ(−z), it follows that Φ(z) = 1 − Φ(−z), and so we only need to tabulate Φ for non-negative values of z.¹ Some values that are often useful are:

z      0     1.28   1.64   1.96   2.58
Φ(z)   0.5   0.9    0.95   0.975  0.995

¹ In the programming language R, φ is the function dnorm, and Φ is the function pnorm.
The following bound is sometimes useful:

1 − Φ(a) ≤ φ(a)/a for all a > 0.

Interestingly, we can use normal tables also to calculate P(X ∈ [a, b]) for any X ∼ N(µ, σ²), using functions of random variables, introduced next.
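The tabulated values and the tail bound can both be spot-checked in a few lines, since Φ is expressible through the error function (Φ(z) = (1 + erf(z/√2))/2, a standard identity; the names `phi` and `Phi` below are ours):

```python
from math import erf, exp, pi, sqrt

# Standard normal density and cumulative distribution function,
# via the error function (no tables needed).
phi = lambda z: exp(-z * z / 2) / sqrt(2 * pi)
Phi = lambda z: 0.5 * (1 + erf(z / sqrt(2)))

# Spot-check an entry of the standard normal table...
assert abs(Phi(1.96) - 0.975) < 5e-4
# ...and the tail bound 1 - Phi(a) <= phi(a)/a for a few values of a > 0.
for a in [0.5, 1, 2, 3]:
    assert 1 - Phi(a) <= phi(a) / a
```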
Suppose X : Ω → X(Ω) is a random variable, and g : X(Ω) → S is some function. Then g(X) is also a random variable, namely the outcome of a 'new experiment' obtained by running the 'old experiment' to produce a value x for X, and then evaluating g(x). Formally, as a function of ω ∈ Ω, g(X) := g ∘ X, i.e.,

g(X)(ω) := g(X(ω)) for all ω ∈ Ω.

For example,

P(g(X) ∈ A) = P({ω ∈ Ω : g(X(ω)) ∈ A}) for all A ⊆ S.
Example 6.26 For any random variable X, we can consider sinpXq, e3X , X 3 , and so on, which are
all again random variables.
Example 6.27 Let X be the score when you roll a fair die and let Y = (X − 3)². Then Y is a discrete random variable with probability mass function

y      0     1     4     9
p(y)   1/6   1/3   1/3   1/6

and zero elsewhere. To see this, note that {Y = 4} = {X ∈ {1, 5}}, and so on.
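The grouping step in Example 6.27 is mechanical: push each die outcome through g(x) = (x − 3)² and accumulate probabilities. A short sketch of ours:

```python
from fractions import Fraction

# pmf of Y = (X - 3)^2 where X is a fair die score (Example 6.27):
# group the six equally likely die outcomes by the value of g(x).
pmf_Y = {}
for x in range(1, 7):
    y = (x - 3) ** 2
    pmf_Y[y] = pmf_Y.get(y, Fraction(0)) + Fraction(1, 6)

assert pmf_Y == {0: Fraction(1, 6), 1: Fraction(1, 3),
                 4: Fraction(1, 3), 9: Fraction(1, 6)}
```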
Example 6.28 If X is Bin(n, p), then n − X is Bin(n, 1 − p), as shown in Exercise 97.
Example 6.29 Let X ∼ U(0, 1). For any constants a and b > 0, define Y := a + bX. Then Y ∼ U(a, a + b), because for any x ∈ [0, 1],

{Y ≤ a + bx} = {a + bX ≤ a + bx} = {X ≤ x},

so that P(Y ≤ a + bx) = P(X ≤ x) = x, and consequently F(y) = P(Y ≤ y) = (y − a)/b whenever y ∈ [a, a + b]. Therefore, by Equation (6.2), indeed, f(y) = 1/b for y ∈ [a, a + b] (and zero elsewhere), so Y is uniformly distributed on [a, a + b] by Definition 6.20.

A similar result holds when b < 0.
Exercise 117 (*) See problem sheet.
Example 6.30 If U ∼ U(0, 1) and X is a continuous random variable with a cumulative distribution function F_X which is strictly increasing on [a, b], with F_X(a) = 0 and F_X(b) = 1, then F_X^{−1} exists and F_X^{−1}(U) has the same distribution as X, because for any x ∈ [a, b],

P(F_X^{−1}(U) ≤ x) = P(U ≤ F_X(x)) = F_X(x) = P(X ≤ x).

This is a special case of the probability integral transform and is very useful for generating random samples of X with computer-generated 'uniform random numbers'. For example, −(1/β) log(1 − U) is Exp(β) and so is (1/β) log(1/U) (as U and 1 − U are both U(0, 1)).
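The probability integral transform is exactly how simple simulators generate exponential samples. A sketch of ours, assuming β = 2 (an arbitrary choice) and comparing the empirical distribution against F(1) = 1 − e^{−β}:

```python
import random
from math import exp, log

# Probability integral transform (Example 6.30): if U ~ U(0,1), then
# -log(1 - U)/beta has the Exp(beta) distribution.
random.seed(0)   # fixed seed so the check below is reproducible
beta = 2.0
samples = [-log(1 - random.random()) / beta for _ in range(100_000)]

# Compare the empirical proportion of samples <= 1 with F(1) = 1 - e^{-beta}.
empirical = sum(s <= 1 for s in samples) / len(samples)
theoretical = 1 - exp(-beta)
assert abs(empirical - theoretical) < 0.01
```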
There is one particularly important function which enables us to get the cumulative distribution function of any normally distributed random variable, using just the standard normal tables (i.e. Φ, the cumulative distribution function of the standard normal).
Theorem 6.31 (Standardizing the Normal Distribution) Suppose µ ∈ R and σ > 0. If X ∼ N(µ, σ²) and Z ∼ N(0, 1), then

(X − µ)/σ ∼ N(0, 1), and σZ + µ ∼ N(µ, σ²).

We can prove this via a change of variable in the integral for the cumulative distribution function. A much shorter proof goes via the moment generating function, which will be introduced later.
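In practice, Theorem 6.31 is applied as P(a ≤ X ≤ b) = Φ((b − µ)/σ) − Φ((a − µ)/σ). A sketch of ours (the function name `normal_interval` and the example parameters are our own; Φ is computed through the error function rather than a table):

```python
from math import erf, sqrt

Phi = lambda z: 0.5 * (1 + erf(z / sqrt(2)))   # standard normal cdf

def normal_interval(a, b, mu, sigma):
    """P(a <= X <= b) for X ~ N(mu, sigma^2), by standardizing:
    P(a <= X <= b) = Phi((b - mu)/sigma) - Phi((a - mu)/sigma)."""
    return Phi((b - mu) / sigma) - Phi((a - mu) / sigma)

# e.g. for X ~ N(10, 4): P(8 <= X <= 12) = Phi(1) - Phi(-1) ≈ 0.6827
print(round(normal_interval(8, 12, mu=10, sigma=2), 4))
```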
Bibliographical notes
The normal distribution appeared already in work of de Moivre, and is sometimes known as the
Gaussian distribution after Carl Friedrich Gauss (1777–1855). The name ‘normal distribution’ was
applied by biometrician Sir Francis Galton (1822–1911) and statistician Karl Pearson (1857–1936) to
mark the distribution’s ubiquity in biometric data.
There is a great deal of subtle and interesting mathematics on the subject of what functions are
integrable over what sets. You may see some of this in the third year probability course. The Riemann
integral that we use here is sufficient for integrating piecewise continuous functions over finite unions of
intervals. Here, we will only consider continuous random variables which have a piecewise continuous
probability density function. Other approaches to integration are required to deal with more general
functions.
For instance, for infinite countable unions of intervals, we would need the Lebesgue integral (see
for instance [Ros07]). More precisely, f still determines the value of P (X P A) when A is an infinite
countable union of intervals, but that value is not necessarily given by the Riemann integral.
Many random variables are neither discrete nor continuous: a simple example is to take the sum
X ` Y of a discrete random variable X and a continuous random variable Y . To deal with these (and
other) general random variables we need the mathematical framework of measure theory; you will see
some of this if you take later probability courses.
Chapter 7

Multivariate random variables

Goals: Understand a multivariate random variable as a function from the sample space to a higher dimensional space.
Know and recognize jointly distributed discrete random variables, their joint probability mass
function, marginal probability mass functions, and conditional probability mass functions. Know
the partition theorem for discrete random variables, as well as independence, and how to apply
these. Understand the properties of and links between these different probability mass functions,
and how those properties and links arise from the axioms of probability distributions.
Know and recognize continuously distributed random variables, their joint probability density func-
tion, marginal probability density functions, and conditional probability density functions. Know
the partition theorem for jointly continuously distributed random variables, as well as independence,
and how to apply these. Understand the properties of and links among these different probability
density functions.
Understand how to work with functions of multiple random variables.
It is essential for most useful applications of probability to have a theory which can handle many
random variables simultaneously. To start, we consider having two random variables. The theory for
more than two random variables is an obvious extension of the bivariate case covered below.
Remember that, formally, random variables are simply mappings from Ω into some set. A bivariate
random variable is a mapping from Ω into a Cartesian product of two sets. Of course, a bivariate
random variable is a random variable according to our original definition, just with a special kind of
set of possible values. However, the concept of bivariate random variable is a useful one if the individual
components of the bivariate variable have their own meaning or interest.
Definition 7.1 (Bivariate Random Variable) Consider two random variables, X : Ω → X(Ω) and Y : Ω → Y(Ω). The mapping (X, Y) : Ω → X(Ω) × Y(Ω) defined by

(X, Y)(ω) := (X(ω), Y(ω))

is then a bivariate random variable.
[Figure: the bivariate random variable (X, Y) maps each ω ∈ Ω to the point (X(ω), Y(ω)) ∈ X(Ω) × Y(Ω).]
Example 7.2 If we throw two dice, then X and Y can be the values from each die, and the joint
outcome pX, Y q forms a bivariate random variable.
Similarly to before, for any A ⊆ X(Ω) × Y(Ω), we write '(X, Y) ∈ A' to denote the event {ω ∈ Ω : (X(ω), Y(ω)) ∈ A}. For any x ∈ X(Ω) and y ∈ Y(Ω), we write 'X = x, Y = y' to mean the event (X, Y) ∈ {(x, y)}. We sometimes also write {(X, Y) = (x, y)} and {(X, Y) ∈ A} to emphasize that these are sets; for instance, {X = x, Y = y} = {ω ∈ Ω : X(ω) = x and Y(ω) = y}, and so on.
Definition 7.3 (Joint Probability Mass Function) The joint probability mass function p of two discrete random variables X : Ω → X(Ω) and Y : Ω → Y(Ω) is defined by

p(x, y) := P(X = x, Y = y) for all x ∈ X(Ω) and y ∈ Y(Ω).

To avoid ambiguity, we sometimes write p_{X,Y} if it is not clear from the context which random variables the two arguments refer to; note that p_{X,Y} is not the same as p_{Y,X}.
Again, by C7, the joint probability mass function determines the joint probability distribution of X and Y, and the values of the joint probability mass function sum to one:

Theorem 7.4 Let X and Y be discrete random variables. Then we have that

P((X, Y) ∈ A) = Σ_{(x,y)∈A} p(x, y) for all A ⊆ X(Ω) × Y(Ω).     (7.1)
42
In other words, the bivariate random variable pX, Y q is a discrete random variable with p (x, y) as
its probability mass function: a bivariate random variable whose two components are discrete is itself
discrete.
Corollary 7.5 Let X and Y be discrete random variables. Then the values of the joint probability
mass function are non-negative and sum to one:
Σ_{(x,y)∈X(Ω)×Y(Ω)} p (x, y) = 1.
Example 7.6 In a card game played with a standard 52 card deck, hearts are worth 1, the queen of
spades is worth 13 and all other cards worth 0. Let X, Y denote the values of the first and second cards
dealt (without replacement, as usual). The possible outcomes of the bivariate random variable pX, Y q
are
{p0, 0q, p1, 0q, p0, 1q, p1, 1q, p0, 13q, p13, 0q, p1, 13q, p13, 1q}
(not p13, 13q). With a well shuffled deck, the event A that the pair does not include the queen of spades,
is
{X ď 1, Y ď 1} “ {pX, Y q P A}
where
A “ {p0, 0q, p1, 0q, p0, 1q, p1, 1q}.
So, by Equation (7.1), and a few simple counting arguments,
P ((X, Y) ∈ A) = (38/52)(37/51) + (1/4)(38/51) + (38/52)(13/51) + (1/4)(12/51) = (51 × 50)/(52 × 51) = 50/52.
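As a quick sanity check (a Python sketch, not part of the original notes; the numeric encoding of the deck is ours), the value 50/52 can be confirmed both exactly and by simulation:

```python
import random
from fractions import Fraction

# Exact: the pair avoids the queen of spades iff neither of the first two
# cards is that single card: (51/52) * (50/51) = 50/52.
exact = Fraction(51, 52) * Fraction(50, 51)

# Monte Carlo: encode hearts as 1, the queen of spades as 13, others as 0.
random.seed(0)
deck = [1] * 13 + [13] + [0] * 38          # 52 cards in total
trials = 100_000
hits = sum(1 for _ in range(trials)
           if max(random.sample(deck, 2)) <= 1)   # event A: both values <= 1
estimate = hits / trials

assert exact == Fraction(50, 52)
assert abs(estimate - float(exact)) < 0.01
```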
In a multivariate context, the probability mass functions p (x) and p (y) are also called marginal
probability mass functions. They are connected to the joint probability mass function p (x, y) as follows.
Corollary 7.7 (Marginal Probability Mass Function) Let X and Y be jointly distributed discrete
random variables. Then X and Y are (each separately) discretely distributed, with marginal
probability mass functions
pX (x) = Σ_{y∈Y(Ω)} p (x, y) for all x ∈ X(Ω),   pY (y) = Σ_{x∈X(Ω)} p (x, y) for all y ∈ Y(Ω).
To prove for instance the first equality, simply note that {X = x} = ∪_{y∈Y(Ω)} {X = x, Y = y}, and
apply A4.
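To illustrate the marginalisation formula in code (a Python sketch, not part of the notes; the joint pmf used here is the one from Example 8.12 later in the chapter):

```python
from collections import defaultdict
from fractions import Fraction

# A joint pmf, stored as a dictionary from (x, y) pairs to probabilities.
p = {(-1, 0): Fraction(1, 4), (1, 0): Fraction(1, 4),
     (0, 1): Fraction(1, 4), (1, 1): Fraction(1, 4)}

# Marginals: p_X(x) = sum over y of p(x, y), and symmetrically for p_Y.
pX, pY = defaultdict(Fraction), defaultdict(Fraction)
for (x, y), prob in p.items():
    pX[x] += prob
    pY[y] += prob

# Each marginal is itself a pmf: non-negative and summing to one.
assert sum(pX.values()) == 1 and sum(pY.values()) == 1
assert pX[1] == Fraction(1, 2)
```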
Definition 7.8 (Conditional Probability Mass Function) Let X and Y be discrete random variables.
For y ∈ Y(Ω), the conditional probability mass function of X given Y = y is defined by
pX|Y (x | y) := P (X = x | Y = y) = pX,Y (x, y) / pY (y)
for all x ∈ X(Ω) and y ∈ Y(Ω) such that pY (y) > 0, and the conditional probability mass function
of Y given X = x by
pY|X (y | x) := P (Y = y | X = x) = pX,Y (x, y) / pX (x)
for all y ∈ Y(Ω) and x ∈ X(Ω) such that pX (x) > 0.
There’s nothing new here, apart from the notation: this is just a particular case of the usual definition
of conditional probability of one event given another, e.g.,
P (X = x | Y = y) = P (X = x, Y = y) / P (Y = y) = pX,Y (x, y) / pY (y).
Exercise 125 (*) See problem sheet.
There is also a version of P4 (partition theorem or, law of total probability) for discrete random
variables; again, only the notation is new here.
Theorem 7.9 (Partition Theorem for Discrete Random Variables) Let X and Y be discrete
random variables. Then,
pX (x) = Σ_{y∈Y(Ω)} pX|Y (x | y) pY (y).
Moreover, if X is real-valued, then for any functions g : Y(Ω) → R and h : Y(Ω) → R with g ≤ h, we
can write
P (g(Y) ≤ X ≤ h(Y)) = Σ_{y∈Y(Ω)} P (g(y) ≤ X ≤ h(y) | Y = y) P (Y = y)
= Σ_{y∈Y(Ω)} Σ_{x∈[g(y),h(y)]∩X(Ω)} pX|Y (x | y) pY (y).
(Recall that two discrete random variables X and Y are independent when their joint probability mass
function factorises: pX,Y (x, y) = pX (x) pY (y) for all x and y.) It follows that for independent discrete
random variables X and Y, pX|Y (x | y) = pX (x) whenever pY (y) > 0, and pY|X (y | x) = pY (y)
whenever pX (x) > 0.
Corollary 7.11 If two discrete random variables X and Y are independent, then
P (X ∈ A, Y ∈ B) = P (X ∈ A) P (Y ∈ B) for all A ⊆ X(Ω) and B ⊆ Y(Ω).
Later (when we talk about limit theorems such as the law of large numbers) we will need to talk
about infinite sequences of independent random variables. A (possibly infinite) collection of random
variables Xi , i P I, is independent if every finite nonempty subcollection A Ď I is independent,
which in the discrete case means
P (∩_{i∈A} {Xi = xi}) = ∏_{i∈A} pXi (xi) for all possible values xi, i ∈ A.
A similar definition applies in the continuous case, which is the subject of the next section.
7.3 Jointly continuously distributed random variables
Also see [BH15, Section 7.1], [DS02, Sections 3.4, 3.5, and 3.6], or [Sti03, Sections 8.1 and 8.3].
Definition 7.12 (Joint Probability Density Function) Consider two real-valued random variables
X : Ω → R and Y : Ω → R. We say that X and Y are jointly continuously distributed when there is
a non-negative piecewise continuous function f : R² → R, called the joint probability density function,
such that
P (X ∈ [a, b], Y ∈ [c, d]) = ∫_a^b ∫_c^d f (x, y) dy dx for all a ≤ b and c ≤ d.
To avoid ambiguity we sometimes write fX,Y (x, y) for the joint probability density of X and Y.
The interpretation of f (x, y) is that, for infinitesimal dx and dy,
P (X ∈ [x, x + dx], Y ∈ [y, y + dy]) ≈ f (x, y) dx dy.
As before, the joint probability density function f (x, y) determines the joint probability P (pX, Y q P A)
for most events A. More precisely, remember the definition of type I and II regions in the theory of
multiple integration: a type I region is of the form
{(x, y) : a0 ≤ x ≤ a1, φ1(x) ≤ y ≤ φ2(x)}
for some continuous functions φ1 and φ2, and a type II region is of the form
{(x, y) : ψ1(y) ≤ x ≤ ψ2(y), b0 ≤ y ≤ b1}
for some continuous functions ψ1 and ψ2.
[Figure: a type I region lies between the graphs of φ1 and φ2 over [a0, a1]; a type II region lies between the graphs of ψ1 and ψ2 over [b0, b1].]
As we know from calculus, unions of regions of these types are precisely the regions over which we can
integrate. So, we have:
Theorem 7.13 If X and Y are jointly continuously distributed, then for any A Ď R2 that is a finite
union of type I and type II regions:
P ((X, Y) ∈ A) = ∬_A f (x, y) dx dy. (7.2)
Corollary 7.14 Let X and Y be jointly continuously distributed random variables. Then their joint
probability density function integrates to one:
∬_{R²} f (x, y) dx dy = 1.
Example 7.15 Suppose that X and Y have joint probability density function f (x, y) = c(x² + y)
for −1 ≤ x ≤ 1 and 0 ≤ y ≤ 1 − x², with f (x, y) = 0 otherwise. What is the value of c? As
∫_{−1}^{1} ∫_0^{1−x²} (x² + y) dy dx = (1/2) ∫_{−1}^{1} (1 − x⁴) dx = 4/5,
we find c = 5/4.
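A numerical check of the normalising constant (a Python sketch with our own discretisation, not part of the notes): a midpoint Riemann sum of (5/4)(x² + y) over the region should come out close to 1.

```python
# Midpoint Riemann sum of f(x, y) = c (x^2 + y) over the region
# -1 <= x <= 1, 0 <= y <= 1 - x^2, checking that c = 5/4 normalises it.
c = 5 / 4
n = 400
total = 0.0
for i in range(n):
    x = -1 + (i + 0.5) * 2 / n           # midpoint in x over [-1, 1]
    top = 1 - x * x                      # upper limit of y for this x
    for j in range(n):
        y = (j + 0.5) * top / n          # midpoint in y over [0, 1 - x^2]
        total += c * (x * x + y) * (2 / n) * (top / n)
assert abs(total - 1.0) < 1e-3
```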
So, if X and Y are jointly continuously distributed, then both X and Y are also continuously
distributed separately. Unlike the discrete case, the converse, however, is not true in general, as the
following example shows.
Example 7.18 Let X be a continuously distributed random variable, and let Y :“ 2X. Then both
X and Y are continuously distributed separately, however X and Y are not jointly continuously
distributed. Indeed, suppose that X and Y were jointly continuously distributed with density function
f (x, y). Then, with A = {(x, y) : 2x = y},
∬_A f (x, y) dx dy = ∫_{−∞}^{+∞} ∫_{2x}^{2x} f (x, y) dy dx = 0,
because the inner integral runs over an interval of zero length. This shows that P (2X = Y) = 0 for
any two jointly continuously distributed random variables X and Y. But for our choice of X and Y,
obviously P (2X = Y) = 1; so, by contradiction, X and Y cannot be jointly continuously distributed.
Definition 7.19 (Conditional Probability Density Function) Let X and Y be jointly continuously
distributed random variables. The conditional probability density function of X at Y = y is defined by
fX|Y (x | y) := fX,Y (x, y) / fY (y) for all x, y ∈ R such that fY (y) > 0,
and similarly, the conditional probability density function of Y at X = x is
fY|X (y | x) := fX,Y (x, y) / fX (x) for all x, y ∈ R such that fX (x) > 0.
Roughly speaking, fX|Y (x | y) should be thought of as the probability density function of X
conditional on Y “ y. Because the event Y “ y has probability 0, this interpretation needs a bit of
work to realize rigorously. A formal manipulation, that can be made rigorous using some analysis,
goes as follows:
Theorem 7.20 (Partition Theorem for Jointly Continuously Distributed Random Variables)
Let X and Y be jointly continuously distributed random variables. Then,
fX (x) = ∫_{−∞}^{+∞} fX|Y (x | y) fY (y) dy.
Moreover, for any piecewise continuous functions g : R → R and h : R → R with g ≤ h, we can write,
P (g(Y) ≤ X ≤ h(Y)) = ∫_{−∞}^{+∞} ( ∫_{g(y)}^{h(y)} f (x | y) dx ) f (y) dy. (7.3)
We can formulate the above partition theorem in a way that is more similar to Theorem 7.9
which covered the discrete case. However, to do this, we need to introduce some slightly contorted
notation, namely, for any event A, and any continuously distributed random variable Y, we define
P (A | Y = y) := lim_{h↓0} P (A | Y ∈ [y, y + h])
whenever P (A | Y ∈ [y, y + h]) > 0 for all h sufficiently small—this happens when f (y) is continuous
at y and f (y) > 0. Note that our usual definition of conditional probability does not apply,
because P (Y “ y) “ 0 for continuously distributed Y , so in the above, P (A | Y “ y) is not really
a conditional probability.
One should be forewarned that the notation P (A | Y “ y) for continuously distributed Y can lead
to extremely confusing issues, such as for instance Borel’s paradox (see http://en.wikipedia.
org/wiki/Borel%E2%80%93Kolmogorov_paradox).
Anyway, for now, proceeding in ignorance about any pitfalls, with this notation we have that
P (g(y) ≤ X ≤ h(y) | Y = y) = ∫_{g(y)}^{h(y)} f (x | y) dx.
Definition 7.21 (Independence of Jointly Continuously Distributed Random Variables)
Two jointly continuously distributed random variables X and Y are independent whenever
fX,Y (x, y) = fX (x) fY (y) for all x, y ∈ R.
Clearly, for independent jointly continuously distributed X and Y , fX|Y (x | y) “ fX (x) whenever
fY (y) ą 0, and fY |X (y | x) “ fY (y) whenever fX (x) ą 0.
Corollary 7.22 If two jointly continuously distributed random variables X and Y are independent,
then
P (X P A, Y P B) “ P (X P A) P (Y P B)
for all A and B Ď R that are finite unions of intervals.
In other words, if X and Y are independent, then the events {X P A} and {Y P B} are independent
for most A and B Ď R of practical interest.
Because we can write f (x, y) as e^{−x} · 3e^{−3y} (for x ≥ 0 and y ≥ 0), it follows that X ~ Exp(1) and
Y ~ Exp(3), and they are independent. We can calculate things like, for a > 0,
P (aY < X) = ∫_0^∞ ( ∫_{ay}^∞ f (x | y) dx ) f (y) dy
= ∫_0^∞ ( ∫_{ay}^∞ e^{−x} dx ) 3e^{−3y} dy = ∫_0^∞ e^{−ay} · 3e^{−3y} dy = 3/(3 + a).
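The closing computation can be checked by simulation (a Python sketch, not part of the notes; the value a = 2 is our own choice):

```python
import random

# Monte Carlo check of P(aY < X) = 3/(3 + a) for independent
# X ~ Exp(1) and Y ~ Exp(3).  random.expovariate takes the rate parameter.
random.seed(1)
a = 2.0
trials = 200_000
hits = sum(1 for _ in range(trials)
           if a * random.expovariate(3.0) < random.expovariate(1.0))
estimate = hits / trials
assert abs(estimate - 3 / (3 + a)) < 0.01   # 3/(3+2) = 0.6
```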
Suppose X : Ω → X(Ω) and Y : Ω → Y(Ω) are (discrete or continuous) random variables, and
g : X(Ω) × Y(Ω) → S is some function assigning a value g(x, y) ∈ S to each point (x, y). Then g(X, Y)
is also a random variable, namely the outcome of a 'new experiment' obtained by running the 'old
experiments' to produce values x for X and y for Y, and then evaluating g(x, y).
[Figure: composing the bivariate random variable (X, Y) : Ω → X(Ω) × Y(Ω) with g gives g ∘ (X, Y), sending ω to g(X(ω), Y(ω)).]
Formally, g(X, Y) := g ∘ (X, Y), or in more specific terms, the random variable g(X, Y) : Ω → S is
defined by:
g(X, Y)(ω) := g(X(ω), Y(ω)) for all ω ∈ Ω.
For example:
P (g(X, Y) ∈ A) = P ({ω ∈ Ω : g(X(ω), Y(ω)) ∈ A}) for all A ⊆ S.
Example 7.24 For any random variables X and Y, the quantities X + Y, X − Y, XY, min(X, Y),
e^{t(X+Y)}, and so on, are all random variables as well.
Chapter 8
Expectation
In a relative frequency interpretation (discussed earlier in Section 4.2), suppose that we run n trials
on an experiment where we observe the outcome of some real-valued random variable X : Ω Ñ R in
each trial. Let xi denote the observed value of X in the ith trial; the sequence of observations x1, x2,
. . . , xn is called a sample. The sample mean is then simply (1/n) Σ_{i=1}^{n} xi. As a mathematical idealization,
we may suppose that there is a unique, empirical limiting value for the sample mean, as n tends to
infinity, which we call the expectation of X.
In a betting interpretation (discussed earlier in Section 4.3), you can simply consider your ‘fair
price’ for a bet which pays X; that price, we call your expectation of X.
The idea of expectation is very interesting mathematically, and also provides ways to use probability
in a host of practical applications. Regardless of interpretation, the expectation of X can be connected
to the probability mass function p (x) (if X is discrete) or the probability density function f (x) (if X
is continuously distributed) in the following way:
Definition 8.1 (Expectation) For any real-valued random variable X, the expectation (also
called expected value or mean) of X, denoted as E (X), is defined as:
E (X) := Σ_{x∈X(Ω)} x p (x) if X is discrete, and (8.1)
E (X) := ∫_{−∞}^{∞} x f (x) dx if X is continuously distributed. (8.2)
Example 8.2 Consider the following ‘game’. You pay Jimmy a pound and then you both throw a fair
die. If you get the higher number you get back the difference in pounds, otherwise you lose your pound.
Call the return from a game X, with possible values 0, 1, 2, 3, 4 and 5. By counting outcomes,
x     | 0     | 1    | 2    | 3    | 4    | 5
p (x) | 21/36 | 5/36 | 4/36 | 3/36 | 2/36 | 1/36
so that
E (X) = Σ_{x=0}^{5} x p (x) = (0 × 21 + 1 × 5 + 2 × 4 + 3 × 3 + 4 × 2 + 5 × 1)/36 = 35/36. (8.3)
We can interpret this value as meaning that over a long series of games you will get back £35 for every
£36 paid out.
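Since the 36 outcomes are equally likely, the value 35/36 can be confirmed by brute-force enumeration (a Python sketch, not part of the notes):

```python
from fractions import Fraction

# Enumerate all 36 equally likely outcomes of the two dice and average
# the return X: the difference if your roll is higher, otherwise 0.
total = Fraction(0)
for you in range(1, 7):
    for jimmy in range(1, 7):
        x = you - jimmy if you > jimmy else 0
        total += Fraction(x, 36)
assert total == Fraction(35, 36)
```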
Example 8.3 You wrapped 200 prizes for the Lucky Dip at the village fete – 4 prizes worth £10, 16
prizes worth £4, 30 worth 80p and 150 worth 20p. Some other people also wrapped prizes. You didn't
see their prizes but think they were similar to yours, so your fair price for a ticket is (4 × 10 + 16 × 4 +
30 × 0.8 + 150 × 0.2)/200 = 158/200, i.e. 79p.
Exercise 136 (*) See problem sheet.
If the range of possible values for a random variable X is unbounded, then the sum or integral
in Definition 8.1 may fail to exist. In this case, the preceding formulas may still be used to assign
a meaningful expectation in some cases, provided we interpret them with care.
For example, if X is discrete with probability mass function p (x), consider
E (X) = Σ_{x∈X(Ω)} x p (x) = Σ_{x∈X(Ω): x≥0} x p (x) + Σ_{x∈X(Ω): x≤0} x p (x),
where we abbreviate the first sum by S+ and the second by S−; the individual sums S+ and S−
always exist, but may be infinite. In fact, if we write X+ = max(X, 0) and X− = max(−X, 0), then
X = X+ − X−, E (X+) = S+ and E (X−) = −S−. To see this, note for example that X+ is a random
variable with pX+ (x) = pX (x) for x > 0 and pX+ (0) = P (X ≤ 0), but only the positive terms
contribute to E (X+). It makes sense to say that E (X) = E (X+) − E (X−) (being possibly −∞ or
+∞) as long as at most one of S+ and S− is infinite, using the rules ∞ − x = ∞ and x − ∞ = −∞
for finite x. (There is no sensible interpretation of ∞ − ∞.) A similar argument applies in the
continuous case, with integrals instead of sums. This is summarized in the following table.
is summarized in the following table.
E (X)        | E (X+) < ∞      | E (X+) = ∞
E (X−) < ∞   | E (X+) − E (X−) | +∞
E (X−) = ∞   | −∞              | undefined
Example 8.6 Suppose that X is continuously distributed with probability density function given by
f (x) = αx^{−1−α} if x ≥ 1, and f (x) = 0 otherwise,
where α > 0. Here S− = 0, while S+ = ∫_1^∞ x · αx^{−1−α} dx, which equals α/(α − 1) if α > 1 but is
infinite if α ≤ 1; so E (X) = α/(α − 1) for α > 1, and E (X) = +∞ for α ≤ 1.
Let X be a discrete random variable, and let g : XpΩq Ñ R be a real-valued function. As seen in
Section 6.11, gpXq :“ g ˝ X is again a random variable; in fact, gpXq is a real-valued random variable.
To find the expectation of g(X) according to the definition, Equation (8.1), we need to find the
probability mass function p_{g(X)} first. It turns out however that we can express E (g(X)) directly in
terms of pX, saving us the effort of having to calculate p_{g(X)} from pX.
Define S := {g(x) : x ∈ X(Ω)}. Then, for any y ∈ S,
p_{g(X)} (y) = P (g(X) = y) = Σ_{x∈X(Ω)} P (g(X) = y | X = x) P (X = x) = Σ_{x∈X(Ω): g(x)=y} p (x), (8.6)
so,
E (g(X)) = Σ_{y∈S} y p_{g(X)} (y) = Σ_{y∈S} y Σ_{x∈X(Ω): g(x)=y} p (x) = Σ_{y∈S} Σ_{x∈X(Ω): g(x)=y} y p (x)
= Σ_{x∈X(Ω)} Σ_{y∈S: y=g(x)} y p (x) = Σ_{x∈X(Ω)} g(x) p (x),
where we applied the definition of expectation, Equation (8.6), distributivity, change of order of sum-
mation, and distributivity again. A similar result can be proven when X is continuously distributed.
Concluding, we have the following result, which is sometimes known as the Law of the Unconscious
Statistician:
Theorem 8.7 (Expectation of a Function of a Random Variable) For any random variable
X : Ω → X(Ω), and any function g : X(Ω) → R,
E (g(X)) = Σ_{x∈X(Ω)} g(x) p (x) if X is discrete, and
E (g(X)) = ∫_{−∞}^{∞} g(x) f (x) dx if X is continuously distributed.
Example 8.8 Suppose X takes values −2, −1, 0, 1, 2, 3 each with probability 1/6. Then
E (X²) = ((−2)² + (−1)² + 0 + 1 + 2² + 3²)/6 = 19/6;
E (sin(πX/4)) = (−1 − 1/√2 + 0 + 1/√2 + 1 + 1/√2)/6 = 1/(6√2);
and so on.
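The point of the theorem is that no intermediate probability mass function is needed; a direct Python transcription of E(g(X)) = Σ g(x) p(x) (our own sketch, not part of the notes) reproduces both values:

```python
import math
from fractions import Fraction

# LOTUS: E(g(X)) = sum over x of g(x) p(x), without first finding p_{g(X)}.
support = [-2, -1, 0, 1, 2, 3]          # each value has probability 1/6

e_x2 = sum(Fraction(x * x, 6) for x in support)
assert e_x2 == Fraction(19, 6)

e_sin = sum(math.sin(math.pi * x / 4) / 6 for x in support)
assert abs(e_sin - 1 / (6 * math.sqrt(2))) < 1e-12
```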
Example 8.9 If X ~ U(−1, 1) then f (x) = 1/2 for x ∈ [−1, 1], and zero elsewhere, so
E (X²) = ∫_{−1}^{1} x² · (1/2) dx = 1/3.
Exercise 148 (**) See problem sheet.
Similar comments apply here about extensions of E (gpXq) to include `8 or ´8 as at the end
of the previous section.
Example 8.10 Suppose X ~ U(−1, 1), i.e., X is uniformly distributed on the interval (−1, 1), and
we set g(x) = 1/x for x ≠ 0 and g(0) = 0. Then E (g(X)) is not defined because
E (g(X)+) = ∫_{−1}^{0} 0 dx + ∫_{0}^{1} 1/(2x) dx = ∞ and similarly E (g(X)−) = ∞.
For multiple random variables, the Law of the Unconscious Statistician reads as follows:
Theorem 8.11 For any random variables X and Y, and any function g : X(Ω) × Y(Ω) → R,
E (g(X, Y)) = Σ_{x∈X(Ω)} Σ_{y∈Y(Ω)} g(x, y) p (x, y) if X and Y are discrete, and
E (g(X, Y)) = ∬_{R²} g(x, y) f (x, y) dx dy if X and Y are jointly continuously distributed.
Example 8.12 Consider discrete random variables X and Y with joint probability mass function:
p (x, y) | x = −1 | x = 0 | x = 1
y = 0    | 1/4    | 0     | 1/4
y = 1    | 0      | 1/4   | 1/4
Example 8.13 Let X and Y be jointly continuously distributed random variables, with
f (x, y) = 1 if (x, y) ∈ [0, 1]², and f (x, y) = 0 otherwise.
Then E (XY) = ∫_0^1 ∫_0^1 xy dx dy = (∫_0^1 x dx)(∫_0^1 y dy) = (1/2)² = 1/4.
8.3 Linearity of expectation
Also see [BH15, Section 4.2], [DS02, Sections 4.2 and 4.6], or [Sti03, Section 5.3].
Theorem 8.14 (Linearity of Expectation I) For any random variable X, and any constants
α and β ∈ R,
E (αX ` β) “ αE (X) ` β.
Theorem 8.15 (Linearity of Expectation II) For any two random variables X and Y,
E (X ` Y ) “ E (X) ` E (Y ) .
Proof Consider the multiple random variable pX, Y q and the function gpx, yq “ x ` y. By the Law
of the Unconscious Statistician (Theorem 8.11) we get in the discrete case
E (g(X, Y)) = Σ_{x∈X(Ω)} Σ_{y∈Y(Ω)} (x + y) p (x, y) = Σ_{x∈X(Ω)} x Σ_{y∈Y(Ω)} p (x, y) + Σ_{y∈Y(Ω)} y Σ_{x∈X(Ω)} p (x, y)
= Σ_{x∈X(Ω)} x pX (x) + Σ_{y∈Y(Ω)} y pY (y) = E (X) + E (Y),
as claimed. A similar calculation applies in the jointly continuous case, and the extension to more than
two random variables follows by induction.
Example 8.16 Suppose that Y1, Y2, . . . , Yn are random variables such that each Yi ~ Bin(1, p), i.e.
they take value 1 with probability p, and value 0 otherwise. Then E (Yi) = p and so E (Y1 + · · · + Yn) =
np. If the Yi are independent then X = Σ_{i=1}^{n} Yi ~ Bin(n, p). So, we have shown that E (X) = np for
any X ~ Bin(n, p).
This is a much quicker proof than computing using the probability mass function; and note that the
independence is not necessary here.
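A simulation sketch (ours, not from the notes; n = 10 and p = 0.3 are illustrative) confirms E(X) = np for a sum of Bernoulli variables:

```python
import random

# Simulate X ~ Bin(n, p) as a sum of n Bernoulli(p) variables and
# compare the sample mean with np (here np = 3).
random.seed(2)
n, p, trials = 10, 0.3, 100_000
mean = sum(sum(random.random() < p for _ in range(n))
           for _ in range(trials)) / trials
assert abs(mean - n * p) < 0.05
```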
We have proved these statements for discrete and continuous random variables (because we have
defined expectation only in those two cases!), but linearity of expectation is true for all random
variables, and follows from the general measure-theoretic approach to probability theory, in which
expectation is defined as a Lebesgue integral with respect to a probability measure; this includes
the discrete and continuous settings we study as special cases. Students who take later probability
courses will see some of this general approach. The fact that in this course we cannot even talk
properly about, say, a sum of a discrete plus a continuous random variable, is a serious limitation
that has no elegant fix in any approach that does not use measure theory.
8.4 Variance and covariance
Also see [BH15, Sections 4.6 and 7.3], [DS02, Sections 4.3 and 4.6], or [Sti03, Sections 4.3, 5.3 and 8.5].
As mentioned earlier, we can interpret the expectation of X as a long-run average of a sample from
distribution X. A popular and mathematically convenient way to measure the variability of X—i.e.
to measure how much X varies from E (X) in the long run—goes via the expectation of the random
variable pX ´ E (X) q2 :
Definition 8.17 Let X be any real-valued random variable. The variance of X is defined as
Var (X) := E ((X − E (X))²),
and the standard deviation of X as SD (X) := √Var (X).
Note that both Var (X) and SD (X) are non-negative numbers.
From Theorem 8.7, we immediately derive the following expressions for the variance:
Var (X) = Σ_{x∈X(Ω)} (x − E (X))² p (x) if X is discrete, and
Var (X) = ∫_{−∞}^{∞} (x − E (X))² f (x) dx if X is continuously distributed.
Example 8.18 If X takes values 0, 10, 20 each with probability 1/3, then
E (X) = 0 × (1/3) + 10 × (1/3) + 20 × (1/3) = 10.
Consequently, using this value,
Var (X) = (1/3) × ((0 − 10)² + (10 − 10)² + (20 − 10)²) = 200/3,
and so SD (X) = √(200/3) ≈ 8.16.
Example 8.19 Let Z ~ N(0, 1). By Equation (8.5), E (Z) = 0 and by Exercise 147, E (Z²) = 1.
Consequently,
Var (Z) = E ((Z − E (Z))²) = E (Z²) = 1, (8.7)
and SD (Z) = √Var (Z) = 1.
For two real-valued random variables, we can ask ourselves how they vary jointly:
Definition 8.20 (Covariance) Let X and Y be any real-valued random variables. The covariance
of X and Y is defined as
Cov (X, Y) := E ((X − E (X))(Y − E (Y))).
We say that X and Y are positively correlated, uncorrelated or negatively correlated according to
whether their covariance is positive, zero, or negative, respectively.
From Theorem 8.11, we immediately derive the following expressions for the covariance:
Cov (X, Y) = Σ_{x∈X(Ω)} Σ_{y∈Y(Ω)} (x − E (X))(y − E (Y)) p (x, y)
if X and Y are discrete, and
Cov (X, Y) = ∬_{R²} (x − E (X))(y − E (Y)) f (x, y) dx dy
if X and Y are jointly continuously distributed, provided that the double sum or double integral exists.
Property 8.22 (Symmetry of Covariance) For any real-valued random variables X and Y,
Cov (X, Y) = Cov (Y, X).
As immediate consequences of linearity of expectation (Theorems 8.14 and 8.15) we obtain the
formulæ:
Corollary 8.23 For any real-valued random variable X, and any constants α and β ∈ R,
Var (αX + β) = α² Var (X).
For any real-valued random variables X, Y, and Z, and any constants α, β, γ, and δ ∈ R,
Cov (αX + β, γY + δ) = αγ Cov (X, Y), (8.8a)
Cov (X + Y, Z) = Cov (X, Z) + Cov (Y, Z), (8.8b)
Cov (X, Y + Z) = Cov (X, Y) + Cov (X, Z). (8.8c)
Equations (8.8a), (8.8b), and (8.8c) mean that Cov is a bilinear operator.
Example 8.24 Let X ~ N(µ, σ²). Then, Z := (X − µ)/σ ~ N(0, 1) by Theorem 6.31, E (Z) = 0 by
Equation (8.5), and Var (Z) = 1 by Equation (8.7). Consequently, as X = σZ + µ,
E (X) = σE (Z) + µ = µ and Var (X) = σ² Var (Z) = σ².
In other words, the parameters µ and σ² of a normal distribution correspond to the expectation and
variance, respectively.
Example 8.26 If X takes values 0, 10, 20 each with probability 1/3, then
E (X) = 0 × (1/3) + 10 × (1/3) + 20 × (1/3) = 10,
E (X²) = 0² × (1/3) + 10² × (1/3) + 20² × (1/3) = 500/3.
Consequently,
Var (X) = E (X²) − (E (X))² = 500/3 − 100 = 200/3.
Example 8.28 Suppose X takes values 0, 1, 2 with probabilities 1/4, 1/2, 1/4. Let Y := X². What
is Var (X), Var (Y), and Cov (X, Y)?
Clearly, E (X) = 1. Next, Y = X² takes values 0, 1, 4 with probabilities 1/4, 1/2, 1/4, so E (Y) = E (X²) =
0 × (1/4) + 1 × (1/2) + 4 × (1/4) = 3/2, and hence Var (X) = E (X²) − (E (X))² = 3/2 − 1 = 1/2.
Similarly, E (Y²) = E (X⁴) = 0 + 1/2 + 16/4 = 9/2, so Var (Y) = 9/2 − (3/2)² = 9/4; and
E (XY) = E (X³) = 0 + 1/2 + 8/4 = 5/2, so Cov (X, Y) = E (XY) − E (X) E (Y) = 5/2 − 3/2 = 1.
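These moments are easy to check by enumeration (a Python sketch, not part of the notes; it uses the shortcut formulas Var(X) = E(X²) − (E X)² and Cov(X, Y) = E(XY) − E(X)E(Y)):

```python
from fractions import Fraction

# X takes 0, 1, 2 with probabilities 1/4, 1/2, 1/4, and Y = X^2.
pmf = {0: Fraction(1, 4), 1: Fraction(1, 2), 2: Fraction(1, 4)}
E = lambda g: sum(g(x) * p for x, p in pmf.items())   # E(g(X)) via LOTUS

ex, ey = E(lambda x: x), E(lambda x: x * x)           # E(X), E(Y)
var_x = E(lambda x: x * x) - ex ** 2
var_y = E(lambda x: x ** 4) - ey ** 2
cov = E(lambda x: x ** 3) - ex * ey                   # E(XY) - E(X)E(Y)
assert (var_x, var_y, cov) == (Fraction(1, 2), Fraction(9, 4), Fraction(1))
```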
Finally, we can now also say something about the variance of sums of random variables:
Corollary 8.29 (Variance of a Sum) For any real-valued random variables X and Y,
Var (X + Y) = Var (X) + Var (Y) + 2 Cov (X, Y).
So, in general, the variance of a sum is not equal to the sum of the variances, unless all covariances are
zero (zero covariance occurs under independence, covered later).
Example 8.30 For three real-valued random variables X, Y, and Z,
Var (X + Y + Z) = Var (X) + Var (Y) + Var (Z) + 2 (Cov (X, Y) + Cov (X, Z) + Cov (Y, Z)).
Example 8.31 Suppose X takes values 0, 1, 2 with probabilities 1/4, 1/2, 1/4. Let Y :“ X 2 . We
already found earlier that Var (X) “ 1{2, Var (Y ) “ 9{4, and Cov (X, Y ) “ 1. Now also find
Var (X ` Y ) and Var (X ´ Y ) .
By the above, Var (X ` Y ) “ Var (X) ` Var (Y ) ` 2Cov (X, Y ) “ 1{2 ` 9{4 ` 2 ˆ 1 “ 19{4.
(Note that in this very simple example we can easily calculate Var (X ` Y ) directly, as X ` Y takes
possible values 0, 2, 6 with probabilities 1/4, 1/2, 1/4 so we can confirm by direct calculation that
Var (X ` Y ) “ 19{4.)
Next, note that Var (X − Y) = Var (X + Z), where Z = −Y. As Var (Z) = (−1)² Var (Y) = Var (Y)
and Cov (X, Z) = Cov (X, −Y) = −Cov (X, Y), we have
Var (X − Y) = Var (X) + Var (Y) − 2 Cov (X, Y) = 1/2 + 9/4 − 2 × 1 = 3/4.
E (X | A) := E (X 1A) / P (A) whenever P (A) > 0.
Theorem 8.34 For any discrete random variable X and any event A ⊆ Ω,
E (X | A) = Σ_{x∈X(Ω)} x P (X = x | A).
Proof This is an exercise in tracking definitions. Indeed, if Y = X 1A, then for any x ∈ X(Ω) with
x ≠ 0,
P (Y = x) = P ({X = x} ∩ A) = P (X = x | A) P (A).
Hence
E (Y) = Σ_{x∈X(Ω)} x P (Y = x) = Σ_{x∈X(Ω)} x P (X = x | A) P (A),
and dividing by P (A) gives the result.
Example 8.35 Suppose that X and Y are discrete random variables, g : X(Ω) → R, and B ⊆ Y(Ω).
Then choosing the event A = {Y ∈ B}, we see that whenever P (Y ∈ B) = Σ_{y∈B} pY (y) > 0,
E (g(X) | Y ∈ B) = (Σ_{x∈X(Ω)} g(x) Σ_{y∈B} pX,Y (x, y)) / (Σ_{y∈B} pY (y)).
Theorem 8.36 (Partition Theorem for Expectation) Let X be any real-valued random
variable. Let E1, E2, . . . , Ek be any events that form a partition. Then,
E (X) = Σ_{i=1}^{k} E (X | Ei) P (Ei).
Proof Because E1, . . . , Ek constitute a partition, ∪_{i=1}^{k} Ei = Ω and the Ei are pairwise disjoint, so
that
1 = 1Ω = 1_{∪_{i=1}^{k} Ei} = Σ_{i=1}^{k} 1Ei,
and hence, by linearity of expectation,
E (X) = E (X Σ_{i=1}^{k} 1Ei) = Σ_{i=1}^{k} E (X 1Ei) = Σ_{i=1}^{k} E (X | Ei) P (Ei),
where the last step uses the definition of E (X | Ei).
Example 8.37 Ann and Bob play a sequence of independent games, each with outcomes A “ {Ann wins},
B “ {Bob wins}, D “ {game drawn} which have probabilities P (A) “ p, P (B) “ q and P (D) “ r ą 0
with p ` q ` r “ 1. Let N denote the number of games played until the first win by either Ann
or Bob. The results of the first game A, B, D form a partition and E (N | A) “ E (N | B) “ 1 (as
P (N “ 1 | A) “ P (N “ 1 | B) “ 1) while E (N | D) “ 1`E (N ) (the future after a drawn game looks
the same as at the start). Hence E (N ) “ 1 ˆ p ` 1 ˆ q ` p1 ` E (N ) q ˆ r or p1 ´ rqE (N ) “ p ` q ` r “ 1
i.e. E (N ) “ 1{p1 ´ rq.
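A simulation sketch (ours, not from the notes; the probabilities p = 0.3, q = 0.2, r = 0.5 are illustrative) agrees with E(N) = 1/(1 − r):

```python
import random

# Simulate Ann and Bob's games: each game is drawn with probability r,
# and N counts games until the first win by either player.
random.seed(3)
p, q, r = 0.3, 0.2, 0.5
trials = 100_000
total = 0
for _ in range(trials):
    n = 1
    while random.random() < r:   # drawn game: play again
        n += 1
    total += n
assert abs(total / trials - 1 / (1 - r)) < 0.05   # E(N) = 1/(1 - r) = 2
```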
In other words, in the case where Y is discrete, E (X | Y ) is the random variable that takes values
E (X | Y “ y) with probabilities P (Y “ y) .
Corollary 8.39 (Partition Theorem for Expectation II) For any real-valued random
variable X and any discrete random variable Y,
E (X) “ E (E (X | Y ) ) .
This is sometimes called the law of iterated expectation. We can use this result to calculate E (X) in
some tricky cases.
Example 8.40 You flip three fair coins and obtain H heads (so H ~ Bin(3, 1/2), and E (H) = 3/2).
Then you roll H fair dice to get total score T. The events {H = 0}, {H = 1}, {H = 2}, and {H = 3}
form a partition, and E (T | H = h) = 7h/2, since with h rolls of a fair die the expected total score is
(7/2)h. Hence E (T | H) = 7H/2 and so
E (T) = E (E (T | H)) = (7/2) E (H) = (7/2) × (3/2) = 21/4.
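Simulating the two-stage experiment (a Python sketch, not part of the notes) reproduces E(T) = 21/4:

```python
import random

# Flip 3 fair coins to get H heads, roll H fair dice, record total T;
# the law of iterated expectation gives E(T) = (7/2)(3/2) = 21/4.
random.seed(4)
trials = 200_000
total = 0
for _ in range(trials):
    h = sum(random.random() < 0.5 for _ in range(3))
    total += sum(random.randint(1, 6) for _ in range(h))
assert abs(total / trials - 21 / 4) < 0.05
```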
Exercise 178 (*) See problem sheet.
Remember that two discrete random variables are independent when their joint probability mass
function factorises, i.e. when p (x, y) “ p (x) p (y) . Similarly, two jointly continuously distributed
random variables are independent when their joint probability density function factorizes, i.e. when
f (x, y) “ f (x) f (y) . It turns out that in these cases, expectation factorizes as well:
Theorem 8.41 (Independence Means Multiply) If X and Y are independent real-valued
random variables then
E (XY) = E (X) E (Y).
More generally, for any mutually independent real-valued random variables X1, X2, . . . , Xn,
E (∏_{i=1}^{n} Xi) = ∏_{i=1}^{n} E (Xi).
Proof Suppose that X and Y are independent discrete random variables with joint probability mass
function p (x, y) = pX (x) pY (y). Then by the Law of the Unconscious Statistician,
E (XY) = Σ_{x∈X(Ω)} Σ_{y∈Y(Ω)} xy p (x, y) = (Σ_{x∈X(Ω)} x pX (x))(Σ_{y∈Y(Ω)} y pY (y)) = E (X) E (Y).
Now, if X and Y are independent, then so are any functions of X and Y . Consequently:
Corollary 8.42 If X and Y are independent random variables then for any functions g : X(Ω) → R
and h : Y(Ω) → R we have that
E (g(X) h(Y)) = E (g(X)) E (h(Y)).
Corollary 8.43 If X and Y are independent random variables, then Cov (X, Y ) “ 0.
The converse of the last property is not true—X and Y may be dependent but uncorrelated. All
that we can say is that X and Y will have a large positive correlation if, as X increases, Y also tends to
increase in a roughly linear way, and a large negative correlation if, as X increases, Y tends to decrease
in a roughly linear way.
Example 8.44 Suppose pX, Y q are jointly distributed taking values p´1, 0q, p`1, 0q, p0, ´1q, and
p0, `1q with probability 1{4 of each. Then X and Y are not independent, because P (X “ 1, Y “ 1) “ 0
is not the same as P (X “ 1) P (Y “ 1) “ 1{16, for instance. However, X and Y are uncorrelated,
because E (XY ) “ 0 and E (X) “ E (Y ) “ 0 so Cov (X, Y ) “ 0. (In fact, in this case XY “ 0 with
probability 1.)
An important consequence of Corollary 8.43 is a simplification of the formula for the variance of a
sum (Corollary 8.29) for pairwise independent random variables:
Corollary 8.45 Consider random variables X1 , X2 , . . . , Xn . If these random variables are pairwise
independent, then
Var (Σ_{i=1}^{n} Xi) = Σ_{i=1}^{n} Var (Xi).
Example 8.46 Remember from an earlier example that we can write any X ~ Bin(n, p) as X =
Σ_{i=1}^{n} Xi where X1, . . . , Xn are independent and each Xi ~ Bin(1, p). In Exercise 158, you showed that
Var (Xi) = p(1 − p). Consequently, as the Xi are independent,
Var (X) = Σ_{i=1}^{n} Var (Xi) = np(1 − p).
Example 8.47 Let S be the total score and T be the product of the scores from throwing four dice.
To find E (S), Var (S) and E (T), let the individual scores be Xk, k = 1, 2, 3, 4, so that S = Σ_{k=1}^{4} Xk
and T = ∏_{k=1}^{4} Xk. We readily calculate E (Xk) = Σ_{i=1}^{6} i × (1/6) = 7/2 and Var (Xk) = Σ_{i=1}^{6} i² × (1/6) −
(7/2)² = 35/12. Therefore
E (S) = E (Σ_{k=1}^{4} Xk) = Σ_{k=1}^{4} E (Xk) = 4 × 7/2 = 14,
and, as the Xk are independent, Var (S) = Σ_{k=1}^{4} Var (Xk) = 4 × 35/12 = 35/3 and E (T) =
∏_{k=1}^{4} E (Xk) = (7/2)⁴ = 2401/16.
Note that all of these calculations are possible without having to deal with the joint probability distri-
bution of the Xk (which is uniform on the 1296 possible outcomes).
Theorem 8.48 (Monotonicity of Expectation) For any random variable X, and any a ∈ R, if
X ≥ a then E (X) ≥ a.
From this one obtains Markov's inequality: for any non-negative random variable X and any a > 0,
P (X ≥ a) ≤ E (X) / a.
Example 8.51 If X equals s with chance p but otherwise equals 0, then E (X) = sp. For a > s we
have 0 = P (X ≥ a) ≤ ps/a, while for a ≤ s Markov's inequality says p = P (X ≥ a) ≤ p × s/a, which
is exact at a = s; so this bound is as strong as possible.
Corollary 8.52 (Chebyshev's Inequality) For any random variable X and any a > 0, we have
P (|X − E (X)| ≥ a) ≤ Var (X) / a².
Proof Apply Markov's inequality to the non-negative random variable (X − E (X))² at the value a².
Now observe that {(X − E (X))² ≥ a²} = {|X − E (X)| ≥ a} and of course
E ((X − E (X))²) = Var (X).
Chebyshev's bound is often too generous when more detailed distributional information is available,
as seen in the next example. Nevertheless, its generality and simplicity make the inequality very
valuable for complex probability calculations.
Example 8.53 If Z „ Np0, 1q then P (|Z ´ E (Z) | ě 2) “ P (|Z| ě 2) “ 1´P (´2 ă Z ă 2) “ 0.046,
while the Chebyshev bound on this probability is Var (Z) {22 “ 0.25.
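The comparison can be reproduced with the standard normal cdf written via the error function (a Python sketch, not part of the notes):

```python
import math

# Exact two-sided normal tail P(|Z| >= 2), compared with the
# Chebyshev bound Var(Z)/2^2 = 0.25.
phi = lambda x: 0.5 * (1 + math.erf(x / math.sqrt(2)))  # standard normal cdf
tail = 2 * (1 - phi(2))
assert abs(tail - 0.0455) < 0.001
assert tail < 0.25   # Chebyshev is valid here, but far from tight
```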
Bibliographical notes
There are approaches to probability theory that start out from expectation of random variables directly,
rather than starting out from probability of events as we have done here; see e.g. [Whi92].
Pafnuty Chebyshev (1821–1894) and his student Andrei Markov (1856–1922) made several impor-
tant contributions to early probability theory. What we call Markov’s inequality was actually published
by Chebyshev, as was what we call Chebyshev’s inequality; our nomenclature is standard, and at least
has the benefit of distinguishing the two. A version of the inequality was first formulated by Irénée-Jules
Bienaymé (1796–1878).
Chapter 9
Limit theorems
Goals: Understand and know how to prove the weak law of large numbers, for proportions
as well as for general random variables, and know under what conditions the weak law applies.
Understand and know how to prove (by means of moment generating functions) the central limit
theorem, and know under what conditions it applies. Know how to exploit the central limit
theorem to approximate the binomial distribution, and under what circumstances. Know the
definition and properties of moment generating functions, and know how to derive the moment
generating function of a given distribution.
The results of this section describe limiting properties of distributions of sums of random variables
using only some assumptions about means and variances.
Toss a coin n times, where the probability of heads is p, independently on each toss. Let X be
the number of heads and let Bn :“ X{n be the proportion of heads in the n tosses. Then, because
X „ Binpn, pq has expectation np and variance npp1 ´ pq,
so that E (Bn) = p and Var (Bn) = p(1 − p)/n, Chebyshev's inequality gives, for any ε > 0,
P (|Bn − p| ≥ ε) ≤ p(1 − p) / (nε²).
Whence,
P (|Bn − p| ≥ ε) → 0 as n → ∞,
no matter how small ε is. In other words, as n → ∞, the sample proportion is with very high probability
within any tiny interval centred on p.
The same argument applies more generally.
Theorem 9.1 (Weak Law of Large Numbers) Suppose we have an infinite sequence X1,
X2, . . . of independent random variables with the same mean and variance: E (Xi) = µ and
Var (Xi) = σ² < ∞ for all i. Let X̄n := (1/n) Σ_{i=1}^{n} Xi denote the sample average of the first n
variables. Then, for every ε > 0,
P (|X̄n − µ| ≥ ε) → 0 as n → ∞. (9.1)
In other words, the sample average has a very high probability of being very near the expected value
µ when n is large.
The convergence in (9.1) is called convergence in probability; the law of large numbers says that
'X̄n → µ in probability'.
Example 9.2 Measure the heights $H_i$ of $n$ randomly selected people from a very large population, where $\mu, \sigma^2$ are the average and the variance of heights over the whole population. Then $E(H_i) = \mu$ and $\mathrm{Var}(H_i) = \sigma^2$, and so, as long as the collection of heights is not too asymmetric, the chance of the average height $\bar{X}_n$ being more than a small amount $\varepsilon$ from $\mu$ is very small; i.e., almost all large samples have average near $\mu$.
If the law of large numbers is a 'first order' result, a 'second order' result is the famous central limit theorem, which describes fluctuations around the law of large numbers. A sequence of random variables $X_1, X_2, \dots$ is independent and identically distributed (i.i.d. for short) if the variables are mutually independent and all have the same (marginal) distribution.
Theorem 9.3 (Central Limit Theorem) Suppose we have a sequence $X_1, X_2, \dots$ of i.i.d. random variables with common mean $\mu$ and common variance $\sigma^2 \in (0, \infty)$. Let $S_n := X_1 + \cdots + X_n$ and
\[ Z_n := \frac{S_n - n\mu}{\sigma\sqrt{n}} = \frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}}. \]
Then, for every $z \in \mathbb{R}$,
\[ \lim_{n \to \infty} F_{Z_n}(z) = \lim_{n \to \infty} P(Z_n \le z) = \Phi(z). \]
Consequently, if we also invoke Theorem 6.31, we approximately have that $S_n \sim \mathrm{N}(n\mu, n\sigma^2)$ and $\bar{X}_n \sim \mathrm{N}(\mu, \sigma^2/n)$; i.e., typical values of $S_n$ are of order $\sigma\sqrt{n}$ from $n\mu$, while typical values of $\bar{X}_n$ are of order $\sigma/\sqrt{n}$ from $\mu$. In other words, for large enough $n$, by Corollary 6.32,
\[ P(S_n \le x) \approx \Phi\Big(\frac{x - n\mu}{\sigma\sqrt{n}}\Big), \qquad P(\bar{X}_n \le x) \approx \Phi\Big(\frac{x - \mu}{\sigma/\sqrt{n}}\Big). \]
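To see the theorem in action, the following sketch (an illustrative simulation; the function names are my own) standardizes sums of i.i.d. $\mathrm{U}(0,1)$ variables and compares the empirical distribution of $Z_n$ with $\Phi$.

```python
import math
import random

def phi(x):
    """Standard normal CDF, via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

rng = random.Random(1)
n, trials = 30, 20000
mu, sigma = 0.5, math.sqrt(1.0 / 12.0)   # mean and sd of U(0, 1)

def standardized_sum():
    """One realization of Z_n = (S_n - n*mu) / (sigma * sqrt(n))."""
    s = sum(rng.random() for _ in range(n))
    return (s - n * mu) / (sigma * math.sqrt(n))

zs = [standardized_sum() for _ in range(trials)]
for z in (-1.0, 0.0, 1.0):
    empirical = sum(v <= z for v in zs) / trials
    print(z, round(empirical, 3), round(phi(z), 3))
```

Even at $n = 30$ the empirical CDF of $Z_n$ is very close to $\Phi$.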
The conditions of the central limit theorem can be substantially weakened. In fact, the random variables $X_1, X_2, \dots$ need be neither identically distributed nor independent: it is sufficient that we can turn them into a normalized martingale. Without going into too much detail, it suffices to assume that
\[ E(X_{n+1} \mid X_1 = x_1, \dots, X_n = x_n) = E(X_{n+1}), \]
\[ \mathrm{Var}(X_{n+1} \mid X_1 = x_1, \dots, X_n = x_n) = \mathrm{Var}(X_{n+1}) > 0, \]
for all $n$ and all possible values of $x_1, \dots, x_n$. In this case, the sequence of random variables
\[ Z_n := \frac{\sum_{k=1}^{n} \big(X_k - E(X_k)\big)}{\sqrt{\sum_{k=1}^{n} \mathrm{Var}(X_k)}} \]
converges in distribution to a $\mathrm{N}(0, 1)$ random variable whenever the Lindeberg condition is satisfied. The Lindeberg condition is a bit complicated to write down, but it is satisfied, for example, when there is a constant $\alpha$ such that $|X_n| \le \alpha$ for all $n$.
For further details, see for instance [Nel87, Chapters 14 & 18].
Example 9.4 Potatoes with an average weight of 100g and standard deviation of 40g are packed into bags to contain at least 2500g. What is the chance that more than 30 potatoes will be needed to fill a given bag? Let $N$ be the number needed to exceed 2500g and $W$ be the total weight of 30 potatoes. Then $\{N > 30\} = \{W < 2500\}$, and so
\[ P(N > 30) = P(W < 2500) \approx \Phi\Big(\frac{2500 - 3000}{40\sqrt{30}}\Big) = \Phi(-2.282) = 0.011. \]
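The numbers in Example 9.4 can be checked directly; in this sketch $\Phi$ is implemented via the standard error function, and the variable names are illustrative.

```python
import math

def phi(x):
    """Standard normal CDF Φ."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

mu, sd, n_potatoes, target = 100.0, 40.0, 30, 2500.0
z = (target - n_potatoes * mu) / (sd * math.sqrt(n_potatoes))
prob = phi(z)                       # P(W < 2500) ≈ P(N > 30)
print(round(z, 3), round(prob, 3))  # z ≈ -2.282, prob ≈ 0.011
```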
Remember that we can write any binomially distributed random variable as a sum of independent
Bernoulli random variables. Consequently:
Corollary 9.5 (Normal Approximation to the Binomial Distribution) Let $X \sim \mathrm{Bin}(n, p)$ with $0 < p < 1$, $B_n := X/n$, and
\[ Z_n := \frac{X - np}{\sqrt{np(1-p)}} = \frac{B_n - p}{\sqrt{p(1-p)/n}}. \]
Then
\[ \lim_{n \to \infty} F_{Z_n}(z) = \lim_{n \to \infty} P(Z_n \le z) = \Phi(z). \]
This means that, when $X \sim \mathrm{Bin}(n, p)$ with $0 < p < 1$ and $n$ is large enough, then for any $x \in \mathbb{R}$ we approximately have
\[ P(X \le x) \approx \Phi\Big(\frac{x - np}{\sqrt{np(1-p)}}\Big). \]
For moderate $n$, a continuity correction improves the approximation; e.g., for $k \in \mathbb{N}$:
\[ P(X \le k) = P(X \le k + 0.5) \approx \Phi\Big(\frac{k + 0.5 - \mu}{\sigma}\Big), \]
\[ P(k \le X) = P(k - 0.5 \le X) \approx 1 - \Phi\Big(\frac{k - 0.5 - \mu}{\sigma}\Big), \]
where $\mu = np$ and $\sigma = \sqrt{np(1-p)}$.
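The effect of the correction can be checked against the exact binomial CDF; here is a small sketch (the parameters are chosen arbitrarily for illustration).

```python
import math

def phi(x):
    """Standard normal CDF Φ."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def binom_cdf(k, n, p):
    """Exact P(X <= k) for X ~ Bin(n, p)."""
    return sum(math.comb(n, j) * p**j * (1 - p)**(n - j)
               for j in range(k + 1))

n, p, k = 50, 0.4, 22
mu, sigma = n * p, math.sqrt(n * p * (1 - p))
exact = binom_cdf(k, n, p)
plain = phi((k - mu) / sigma)             # no correction
corrected = phi((k + 0.5 - mu) / sigma)   # with continuity correction
print(round(exact, 4), round(plain, 4), round(corrected, 4))
```

The corrected value lies noticeably closer to the exact probability than the uncorrected one.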
Example 9.6 A plane has 110 seats and $n$ business people book seats. They show up independently for their flight with chance $q = 0.85$. Let $X_n$ be the number that arrive to take the flight (the others just take a different flight); find $P(X_n > 110)$ for $n = 110, 120, 130, 140$. Using the CLT for binomials, $E(X_n) = 0.85n$ and $\mathrm{SD}(X_n) = \sqrt{nq(1-q)} = 0.3571\sqrt{n}$, and so
\[ P(X_n > 110) \approx 1 - \Phi\Big(\frac{110.5 - 0.85n}{0.3571\sqrt{n}}\Big), \]
which takes the values 0.000, 0.015, 0.500 and 0.978 for $n = 110, 120, 130$ and $140$.
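Example 9.6 can be reproduced as follows (a direct transcription of the calculation; the function names are mine).

```python
import math

def phi(x):
    """Standard normal CDF Φ."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def overbook_prob(n, q=0.85, seats=110):
    """Normal approximation (with continuity correction) to P(X_n > seats)."""
    mu = n * q
    sd = math.sqrt(n * q * (1 - q))
    return 1.0 - phi((seats + 0.5 - mu) / sd)

for n in (110, 120, 130, 140):
    print(n, round(overbook_prob(n), 3))  # 0.0, 0.015, 0.5, 0.978
```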
Definition 9.7 (Moment Generating Function) For any real-valued random variable $X$, the function $M_X : \mathbb{R} \to \mathbb{R} \cup \{+\infty\}$ given by
\[ M_X(t) := E\big(e^{tX}\big) \]
is called the moment generating function of $X$.
Because $e^{tX} \ge 0$, we have that $M_X(t) \ge 0$. From Theorem 8.7, we immediately derive the following expressions for $M_X(t)$:
\[ M_X(t) = \sum_{x \in X(\Omega)} e^{tx}\, p(x) \quad \text{if $X$ is discrete, and} \]
\[ M_X(t) = \int_{-\infty}^{\infty} e^{tx} f(x)\, dx \quad \text{if $X$ is continuously distributed.} \]
The above sum and integral always exist, but can be $+\infty$.
Example 9.8 If $X \sim \mathrm{Bin}(1, p)$ then $M_X(t) = pe^t + (1 - p)$.
Example 9.10 If $U \sim \mathrm{U}(a, b)$ then $M_U(t) = \dfrac{e^{bt} - e^{at}}{(b - a)t}$ for $t \ne 0$, and $M_U(0) = 1$.
Example 9.11 If $Z \sim \mathrm{N}(0, 1)$ then $M_Z(t) = e^{t^2/2}$.
The moment generating function has several useful properties. The property that gives it its name is revealed by considering the Taylor series for $e^{tX}$:
\[ M_X(t) = E\big(e^{tX}\big) = E\Big(1 + tX + \frac{t^2 X^2}{2!} + \frac{t^3 X^3}{3!} + \cdots\Big) = 1 + t\,E(X) + \frac{t^2}{2!} E\big(X^2\big) + \frac{t^3}{3!} E\big(X^3\big) + \cdots. \]
(Some work is needed to justify this last step, which we omit.) This gives the first of our properties.
M1 (Generates moments.) For every $k \in \mathbb{N}$,
\[ E\big(X^k\big) = \frac{d^k M_X}{dt^k}(0). \]
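Property M1 can be checked numerically for a case where the moments are known, e.g. $X \sim \mathrm{Bin}(1, p)$ with $M_X(t) = pe^t + (1-p)$ from Example 9.8. The finite-difference helpers below are illustrative, not part of the notes.

```python
import math

def mgf_bernoulli(t, p):
    """M_X(t) = E(e^{tX}) = p e^t + (1 - p) for X ~ Bin(1, p)."""
    return p * math.exp(t) + (1.0 - p)

def d1(f, x, h=1e-5):
    """Central-difference approximation to f'(x)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

def d2(f, x, h=1e-4):
    """Central-difference approximation to f''(x)."""
    return (f(x + h) - 2.0 * f(x) + f(x - h)) / (h * h)

p = 0.3
m = lambda t: mgf_bernoulli(t, p)
ex = d1(m, 0.0)    # ≈ E(X) = p
ex2 = d2(m, 0.0)   # ≈ E(X^2) = p, since X^2 = X for a Bernoulli variable
print(round(ex, 6), round(ex2, 6), round(ex2 - ex**2, 6))  # last ≈ Var(X) = p(1-p)
```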
M2 (Moment generating function determines distribution.) Consider any two random variables $X$ and $Y$. If there is an $h > 0$ such that
\[ M_X(t) = M_Y(t) < +\infty \quad \text{for all } t \in (-h, h), \]
then
\[ F_X(x) = F_Y(x) \quad \text{for all } x \in \mathbb{R}. \]
Conversely, if $F_X(x) = F_Y(x)$ for all $x \in \mathbb{R}$ then $M_X(t) = M_Y(t)$ for all $t \in \mathbb{R}$.
M3 (Linear transformations.) For any constants $a, b \in \mathbb{R}$,
\[ M_{aX + b}(t) = e^{bt} M_X(at). \]
M4 (Moment generating function of an independent sum.) If $X_1, \dots, X_k$ are independent, then
\[ M_{X_1 + \cdots + X_k}(t) = \prod_{i=1}^{k} M_{X_i}(t). \]
M5 (Convergence.) If $M_{X_n}(t) \to M_X(t)$ for all $t$ in an open interval around zero on which $M_X$ is finite, then $F_{X_n}(x) \to F_X(x)$ at every $x$ where $F_X$ is continuous.
Regarding M5 (convergence), one can show that the cumulative distribution function $F_X$ of any random variable $X$ has at most a countable number of points where $F_X$ is not continuous. So, provided that $M_X$ is finite in an open interval around zero, $F_{X_n}$ converges to $F_X$ on all but an at most countable number of points. If, additionally, $X$ is continuously distributed, then $F_X$ is continuous everywhere, so in that case $F_{X_n}$ converges to $F_X$ everywhere.
Example 9.13 Suppose $X \sim \mathrm{Po}(\lambda)$, so that $M_X(t) = \exp\big(\lambda(e^t - 1)\big)$. Then, by M1,
\[ E(X) = M_X'(0) = \lambda e^0 \exp\big(\lambda(e^0 - 1)\big) = \lambda. \]
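Both the formula $M_X(t) = \exp(\lambda(e^t - 1))$ and the conclusion $E(X) = \lambda$ can be checked by simulation; the sampler below uses Knuth's classical product-of-uniforms method (an illustrative sketch, not part of the notes).

```python
import math
import random

def sample_poisson(lam, rng):
    """Knuth's method: count uniforms until their product drops below e^{-lam}."""
    limit = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

rng = random.Random(42)
lam, t, trials = 2.0, 0.5, 50_000
xs = [sample_poisson(lam, rng) for _ in range(trials)]
mean_est = sum(xs) / trials                          # should be ≈ lambda
mgf_est = sum(math.exp(t * x) for x in xs) / trials  # empirical E(e^{tX})
mgf_formula = math.exp(lam * (math.exp(t) - 1.0))
print(round(mean_est, 3), round(mgf_est, 3), round(mgf_formula, 3))
```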
Example 9.14 For $Z \sim \mathrm{N}(0, 1)$ and constants $\mu \in \mathbb{R}$, $\sigma > 0$, by M3 and Example 9.11,
\[ M_{\mu + \sigma Z}(t) = e^{\mu t} M_Z(\sigma t) = \exp\big(\mu t + \tfrac{1}{2}\sigma^2 t^2\big). \]
We already know that $\mu + \sigma Z \sim \mathrm{N}(\mu, \sigma^2)$ (see Theorem 6.31), so the above expression gives us the moment generating function of the normal distribution with mean $\mu$ and variance $\sigma^2$. Additionally, by M2, it follows that $X \sim \mathrm{N}(\mu, \sigma^2)$ if and only if $M_X(t) = \exp\big(\mu t + \tfrac{1}{2}\sigma^2 t^2\big)$.
Example 9.15 Suppose that $X_1, \dots, X_k$ are independent random variables with $X_i \sim \mathrm{N}(\mu_i, \sigma_i^2)$. If $Y = \sum_{i=1}^{k} X_i$, then, by M4, the moment generating function of $Y$ is
\[ M_Y(t) = \prod_{i=1}^{k} \exp\big(\mu_i t + \tfrac{1}{2}\sigma_i^2 t^2\big) = \exp\big(\mu t + \tfrac{1}{2}\sigma^2 t^2\big), \]
where $\mu = \sum_{i=1}^{k} \mu_i$ and $\sigma^2 = \sum_{i=1}^{k} \sigma_i^2$. Thus, by M2, it must be that $Y \sim \mathrm{N}(\mu, \sigma^2)$.
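The conclusion of Example 9.15 can be checked empirically: sample the sum and compare its mean and variance with $\sum_i \mu_i$ and $\sum_i \sigma_i^2$. A small illustrative simulation (parameters chosen arbitrarily):

```python
import random

rng = random.Random(7)
mus = [1.0, -2.0, 3.0]   # component means; sum = 2.0
sds = [1.0, 0.5, 2.0]    # component standard deviations; sum of variances = 5.25

def sample_sum():
    """One draw of Y = X_1 + X_2 + X_3 with X_i ~ N(mu_i, sd_i^2)."""
    return sum(rng.gauss(m, s) for m, s in zip(mus, sds))

n = 50_000
ys = [sample_sum() for _ in range(n)]
mean = sum(ys) / n
var = sum((y - mean) ** 2 for y in ys) / n
print(round(mean, 3), round(var, 3))   # theory: mean 2.0, variance 5.25
```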
Bibliographical notes
The law of large numbers and the central limit theorem have long and interesting histories. The weak
law of large numbers for binomial (i.e. sums of Bernoulli) variables was first established by Jacob
Bernoulli (1654–1705) and published in 1713 [Ber13]. The name ‘law of large numbers’ was given by
Poisson. The modern version is due to Aleksandr Khinchin (1894–1959), and our Theorem 9.1 is only
a special case—the assumption on variances is unnecessary.
It was apparent to mathematicians in the mid 1700s that a more refined result than Bernoulli’s law
of large numbers could be obtained. A special case of the central limit theorem for binomial (i.e. sums
of Bernoulli) variables was first established by de Moivre in 1733, and extended by Laplace; hence the
normal approximation to the binomial (Corollary 9.5) is sometimes known as the de Moivre–Laplace
theorem. The name ‘central limit theorem’ was given by George Pólya (1887–1985) in 1920.
The first modern proof of the central limit theorem was given by Aleksandr Lyapunov (1857–1918)
around 1901 [Lya]. Lyapunov’s assumptions were relaxed by Jarl Waldemar Lindeberg (1876–1932)
in 1922 [Lin22]. Many different versions of the central limit theorem were subsequently proved. The
subject of Alan Turing’s (1912–1954) Cambridge University Fellowship Dissertation of 1934 was a
version of the central limit theorem similar to Lindeberg’s; Turing was unaware of the latter’s work.
Bibliography
[BA92] Roy Billinton and Ronald N. Allan. Reliability Evaluation of Engineering Systems: Concepts
and Techniques. Plenum Press, 2nd edition, 1992.
[Ber13] Jacob Bernoulli. Ars Conjectandi. Thurnisiorum, Basileæ, 1713.
[BH15] Joseph K. Blitzstein and Jessica Hwang. Introduction to Probability. Texts in Statistical Science Series. CRC Press, Boca Raton, FL, 2015.
[Boo54] George Boole. An Investigation of the Laws of Thought: on which are Founded the
Mathematical Theories of Logic and Probabilities. Walton and Maberly, London, 1854. Avail-
able at the internet text archive.
[dM56] Abraham de Moivre. The Doctrine of Chances: or, a Method for Calculating the Probabilities
of Events in Play. A. Millar, London, third edition, 1756. Available at the internet text archive.
[DS02] Morris H. DeGroot and Mark J. Schervish. Probability and Statistics. Addison Wesley, New
York, 3rd edition, 2002.
[Hac75] Ian Hacking. The Emergence of Probability: A Philosophical Study of Early Ideas about
Probability, Induction and Statistical Inference. Cambridge University Press, 1975.
[Háj12] Alan Hájek. Interpretations of probability. In Edward N. Zalta, editor, The Stanford
Encyclopedia of Philosophy. Winter 2012 edition, 2012. URL: http://plato.stanford.
edu/archives/win2012/entries/probability-interpret/.
[Kol50] Andrei N. Kolmogorov. Foundations of the Theory of Probability. Chelsea Publishing Com-
pany, New York, 1950.
[Lap25] Pierre Simon Laplace. Essai Philosophique sur les Probabilitiés. Bachelier, Paris, 1825.
Cinquième édition.
[Lin22] J. W. Lindeberg. Eine neue Herleitung des Exponentialgesetzes in der Wahrscheinlichkeit-
srechnung. Mathematische Zeitschrift, 15(1):211–225, 1922. doi:10.1007/BF01494395.
[Lya] Lyapunov theorem. Encyclopedia of Mathematics. URL: http://www.encyclopediaofmath.
org/index.php?title=Lyapunov_theorem&oldid=25958.
[Nel87] Edward Nelson. Radically Elementary Probability Theory. Annals of Mathematical Studies.
Princeton University Press, New Jersey, 1987. URL: https://web.math.princeton.edu/
~nelson/books/rept.pdf.
[Ros07] Jeff Rosenthal. A First Look at Rigorous Probability Theory. World Scientific, New York,
second edition, 2007.
[SHE07] Saturnino Salas, Einar Hille, and Garret Etgen. Calculus: One and Several Variables. Wiley,
10th edition, 2007.
[Sti03] David Stirzaker. Elementary Probability. Cambridge University Press, Cambridge, second edition, 2003. doi:10.1017/CBO9780511755309.
[Ven88] John Venn. The Logic of Chance: An Essay on the Foundations and Province of the Theory of
Probability, with Especial Reference to Its Application to Moral and Social Science. Macmil-
lan, London, third edition, 1888. Available at the internet text archive.
[Whi01] William A. Whitworth. Choice and Chance. Deighton Bell, Cambridge, third edition, 1901. Available at the internet text archive.
[Whi92] Peter Whittle. Probability via Expectation. Springer, New York, third edition, 1992.