Statistical Inference
Abstract
This text is a compendium for the undergraduate course MVE155 "Statistical Inference", worth 7.5 hp, which is a second course in mathematical statistics suitable for students with different backgrounds. The main prerequisite is an introductory course in probability and statistics. The course gives a deeper understanding of some traditional topics in mathematical statistics: methods based on likelihood, topics in descriptive statistics and data analysis with special attention to graphical displays, and aspects of experimental design.
The course introduces a number of standard tools of statistical inference: the bootstrap, parametric and non-parametric testing, analysis of variance, introduction to Bayesian inference, chi-squared tests, and multiple regression.
Recommended textbook: John Rice, Mathematical Statistics and Data Analysis, 3rd edition. The compendium includes a collection of solved exercises, many of which are end-of-chapter exercises from Rice's book. Do not read a solution before you have tried to solve the exercise on your own.
Please send your corrections to [email protected]. (I gratefully acknowledge multiple corrections kindly proposed by the students of the year 2019.)
Last updated: June 10, 2019
Contents

Abstract

1 Introduction
1.1 List of course topics: part I
1.2 List of course topics: part II

2 Random sampling
2.1 Point estimation
2.2 Sample mean and sample variance
2.3 Approximate confidence intervals
2.4 Stratified random sampling
2.5 Exercises

3 Parameter estimation
3.1 Method of moments
3.2 Maximum likelihood estimation
3.3 Sufficiency
3.4 Large sample properties of the maximum likelihood estimates
3.5 Gamma distribution
3.6 Exact confidence intervals
3.7 Exercises

4 Hypothesis testing
4.1 Statistical significance
4.2 Large-sample test for the proportion
4.3 Small-sample test for the proportion
4.4 Two tests for the mean
4.5 Likelihood ratio test
4.6 Pearson's chi-squared test
4.7 Example: sex ratio
4.8 Exercises

5 Bayesian inference
5.1 Conjugate priors
5.2 Bayesian estimation
5.3 Credibility interval
5.4 Bayesian hypotheses testing
5.5 Exercises

6 Summarising data
6.1 Empirical probability distribution
6.2 Density estimation
6.3 Quantiles and QQ-plots
6.4 Testing normality
6.5 Measures of location
6.6 Measures of dispersion
6.7 Exercises

8 Analysis of variance
8.1 One-way layout
8.2 Simultaneous confidence interval
8.3 Kruskal-Wallis test
8.4 Two-way layout
8.5 Example: iron retention
8.6 Randomised block design
8.7 Friedman test
8.8 Exercises

10 Multiple regression
10.1 Simple linear regression model
10.2 Residuals
10.3 Confidence intervals and hypothesis testing
10.4 Intervals for individual observations
10.5 Linear regression and ANOVA
10.6 Multiple linear regression
10.7 Exercises

12.7 Critical values for the signed rank test

13 Solutions to exercises
13.1 Solutions to Section 2 (random sampling)
13.2 Solutions to Section 3 (parameter estimation)
13.3 Solutions to Section 4 (hypothesis testing)
13.4 Solutions to Section 5 (Bayesian inference)
13.5 Solutions to Section 6 (summarising data)
13.6 Solutions to Section 7 (two samples)
13.7 Solutions to Section 8 (analysis of variance)
13.8 Solutions to Section 9 (categorical data analysis)
13.9 Solutions to Section 10 (multiple regression)
1 Introduction
Statistical analysis consists of three parts: collection of data, summarising data, and making inferences. The diagram below presents the relationship between two sister branches of mathematics: probability theory, which predicts data patterns for a given parametric model, and mathematical statistics, which performs parameter estimation and hypothesis testing for statistical models of the real world.

[Diagram: Probability Theory predicts data patterns for a given parametric model; Mathematical Statistics connects Statistical Models to the REAL WORLD through parameter estimation and hypothesis testing.]

Section 11 summarises some key formulas from probability theory considered as prerequisites for this course. An example of probability versus statistics:

Probability: previous studies showed that the drug was 80% effective. Then we can anticipate that, for a study on 100 patients, on average 80 will be cured, and at least 65 will be cured with 99.99% probability.

Statistics: it was observed that 78 out of 100 patients were cured. We are 95% confident that, for other similar studies, the drug will be effective on between 69.9% and 86.1% of patients.
The main focus of this course is on the issues of parameter estimation and hypothesis testing using properly
collected, relatively small data sets. Special attention, therefore, is paid to the basic principles of experimental
design: randomisation, blocking, and replication.
A key parametric statistical model is the normal distribution model N(µ, σ²) with mean µ and variance σ². The standard normal distribution N(0, 1) has the cumulative distribution function

Φ(z) = (1/√(2π)) ∫_{−∞}^{z} e^{−y²/2} dy.

The standard deviation σ plays the role of a scale parameter of the model, in that if X ∼ N(µ, σ²), then (X − µ)/σ ∼ N(0, 1).
The importance of the normal model is due to the central limit theorem as well as due to the remarkable
analytical properties of the normal density function. An example of the central limit theorem is the normal
approximation of the binomial distribution. With a continuity correction, the claim is that for X ∼ Bin(n, p),

P(X ≤ k − 1) = P(X < k) ≈ Φ( (k − 1/2 − np)/√(np(1 − p)) ).
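As a quick numerical check, here is a minimal Matlab sketch (binocdf and normcdf from the Statistics Toolbox; the values n = 100, p = 0.5, k = 60 are illustrative, not from the text):

n = 100; p = 0.5; k = 60;                            % illustrative values
exact = binocdf(k-1, n, p)                           % P(X <= k-1), exact
approx = normcdf((k - 0.5 - n*p)/sqrt(n*p*(1-p)))    % continuity-corrected approximation

The two outputs should nearly coincide.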
Notational agreements:
• Random variables X, Y, . . . are denoted by capital letters and their realisations x, y, . . . are denoted by small
letters (but not always). It is important to recognise the difference between the observed data values and
the observed test statistics on one hand, and on the other hand, the corresponding random variables whose
randomness is due to a random mechanism generating the data.
• Normal approximation Zn ≈ N(µn, an²) for n ≫ 1 means that, for any real x,

P(Zn ≤ x) ≈ Φ((x − µn)/an).
1.1 List of course topics: part I
Statistical inference vs probability theory. Statistical models.
Population distribution. Population mean and standard deviation, population proportion.
Randomisation.
Sampling with replacement, iid-sample.
Sampling without replacement, simple random sample.
1.2 List of course topics: part II
Two-sample t-test, pooled sample variance.
Exact confidence interval for the mean difference. Transformation of variables.
Ranks vs exact measurements. Rank sum test. Signed rank test.
Approximate confidence interval for the difference p1 − p2 .
Large sample test for two proportions.
Fisher’s exact test.
Double-blind randomised controlled experiments.
Confounding factors, Simpson’s paradox.
Categorical data.
Chi-squared tests of homogeneity and independence.
Chi-squared test for k proportions.
Prospective and retrospective studies. Matched-pairs design, McNemar’s test. Odds ratio.
2 Random sampling
A population of size N can be viewed as a set of N elements characterised by values {x1, x2, ..., xN}. If we pick one element from the population at random, then its value x is a realisation of a random variable X whose distribution is the population distribution.
In many situations finding a population distribution by enumeration is either very expensive or even impossible. However, a good guess is available using a random sample of n observations (x1, ..., xn). Such a sample will be treated as a single realisation of a random vector (X1, ..., Xn). If the sampling experiment is repeated, the new values (x′1, ..., x′n) usually will be different from (x1, ..., xn).

To reduce the problem of finding the whole probability distribution curve, one can focus on estimating the population mean and the population standard deviation

µ = E(X),   σ = √Var(X).
This works for quantitative (continuous or discrete) data. If the data are categorical, then one may translate them into numbers. An important special case of categorical data is dichotomous data. Consider the example of xi ∈ {male, female}. After converting the categorical values to the quantitative form xi ∈ {0, 1}, the population distribution becomes a Bernoulli distribution with parameter

p = P(X = 1).
There are two basic ways of random sampling. Sampling without replacement produces a so-called simple random sample. Sampling with replacement produces an iid-sample, with (X1, ..., Xn) being independent and identically distributed random variables. An iid-sample is easier to analyse, and the results give a good approximation for the simple random sample, provided n/N is small.
2.1 Point estimation

A point estimator of an unknown population parameter θ is a statistic

Θ̂ = g(X1, ..., Xn),

whose realisation θ̂ = g(x1, ..., xn) is computed from the data. Its mean square error splits into a variance term and a squared bias term:

E(Θ̂ − θ)² = E(Θ̂ − µΘ̂)² + 2E(Θ̂ − µΘ̂)(µΘ̂ − θ) + (µΘ̂ − θ)² = σ²Θ̂ + (µΘ̂ − θ)²,

where µΘ̂ = E(Θ̂), σ²Θ̂ = Var(Θ̂), and the cross term vanishes since E(Θ̂ − µΘ̂) = 0.
θ̂ is an unbiased estimate of θ if µΘ̂ = θ;
θ̂ is a consistent estimate if the mean square error E(Θ̂ − θ)² → 0 as the sample size n → ∞.
The standard error of an estimator Θ̂ is its standard deviation σΘ̂ = √Var(Θ̂).
2.2 Sample mean and sample variance

In the same way as (x1, ..., xn) is a realisation of a random vector (X1, ..., Xn), the summary statistics x̄ and s² are realisations of the random variables

X̄ = (X1 + ... + Xn)/n,   S² = (1/(n−1)) Σ_{i=1}^{n} (Xi − X̄)².

Consider an iid-sample. The sample mean x̄ and sample variance s² are unbiased and consistent estimators of the population mean µ and variance σ² respectively:

E(X̄) = µ,   Var(X̄) = σ²/n,   E(S²) = σ²,   Var(S²) = (σ⁴/n) ( E((X − µ)/σ)⁴ − (n−3)/(n−1) ).
Notice that the sample standard deviation s systematically underestimates the population standard deviation σ, since by Jensen's inequality E(S) ≤ √E(S²) = σ, with strict inequality whenever S is not constant.
Now consider simple random sampling, where there is dependence between observations. In this case the sample mean x̄ is again an unbiased and consistent estimate of the population mean. However, the sample variance s² is a biased estimate of σ², since

E(S²) = σ² · N/(N − 1),

where N is the finite population size. We also have

Var(X̄) = (σ²/n) (1 − (n−1)/(N−1)),

so that in the light of the previous equality, an unbiased estimate of Var(X̄) is

s²x̄ = (s²/n) · ((N−1)/N) · (1 − (n−1)/(N−1)) = (s²/n)(1 − n/N).

Thus, for sampling without replacement, the formulas for the estimated standard errors of X̄ and p̂ for the simple random sample take the new form

sx̄ = (s/√n) √(1 − n/N),   sp̂ = √( (p̂(1 − p̂)/(n − 1)) (1 − n/N) ).
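As an illustration, a minimal Matlab sketch of these formulas (the values of n, N, s, and p̂ below are placeholders, not data from the text):

n = 25; N = 2000;                          % sample and population sizes
s = 12; phat = 0.55;                       % sample standard deviation and proportion
fpc = 1 - n/N;                             % finite population correction
se_mean = s/sqrt(n)*sqrt(fpc)              % estimated standard error of the sample mean
se_prop = sqrt(phat*(1-phat)/(n-1)*fpc)    % estimated standard error of the sample proportion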
2.3 Approximate confidence intervals
By the Central Limit Theorem, the sample mean distribution is approximately normal
X̄ ≈ N(µ, σ²/n),

in that for large sample sizes n, we have

P( |X̄ − µ|/σX̄ > z ) ≈ 2(1 − Φ(z)).
This yields the following formulas of approximate 100(1 − α)% two-sided confidence intervals for µ and p:

Iµ = x̄ ± zα/2 · sx̄,   Ip = p̂ ± zα/2 · sp̂,

where zα stands for the normal quantile defined by Φ(zα) = 1 − α. These two formulas are valid even for sampling without replacement, due to a more advanced version of the central limit theorem. According to the normal distribution table we have

100(1 − α)%   68%    80%    90%    95%    99%    99.7%
zα/2          1.00   1.28   1.64   1.96   2.58   3.00

The higher the confidence level, the wider the confidence interval; on the other hand, the larger the sample, the narrower the confidence interval.
The exact meaning of the confidence level is a bit tricky. For example, a 95% confidence interval is a random interval: out of 100 intervals Iµ computed for 100 samples, about 95 are expected to cover the true value of µ. Notice that the random number of successful realisations of the confidence interval has a Bin(100, 0.95) distribution, which is approximately normal with mean 95 and standard deviation 2.2.
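This interpretation is easy to check by simulation. A Matlab sketch (the population N(10, 4), the sample size n = 30, and the use of the known σ are illustrative simplifications):

mu = 10; sigma = 2; n = 30; B = 100;            % illustrative settings
covered = 0;
for j = 1:B
    x = mu + sigma*randn(n,1);                  % one sample from N(mu, sigma^2)
    I = mean(x) + [-1 1]*1.96*sigma/sqrt(n);    % 95% confidence interval
    covered = covered + (I(1) < mu && mu < I(2));
end
covered                                         % expected to be close to 95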
2.4 Stratified random sampling

Suppose the population consists of k strata with known weights w1, ..., wk, stratum means µ1, ..., µk, and stratum variances σ1², ..., σk², so that the population mean is µ = w1µ1 + ... + wkµk and, by the total variance formula, the population variance is at least as large as the average variance

σ̄² = w1σ1² + ... + wkσk².

Stratified random sampling consists of taking k independent iid-samples from the strata, with sample sizes (n1, ..., nk) and sample means x̄1, ..., x̄k, combined into the stratified sample mean x̄s = w1x̄1 + ... + wkx̄k.
Observe that for any allocation (n1, ..., nk), the stratified sample mean is an unbiased estimate of µ:

E(X̄s) = w1E(X̄1) + ... + wkE(X̄k) = w1µ1 + ... + wkµk = µ.

The variance of X̄s,

σ²X̄s = w1²σ²X̄1 + ... + wk²σ²X̄k = w1²σ1²/n1 + ... + wk²σk²/nk,

is estimated by

s²x̄s = w1²s²x̄1 + ... + wk²s²x̄k = w1²s1²/n1 + ... + wk²sk²/nk,

where sj is the sample standard deviation corresponding to the sample mean x̄j.
Approximate confidence interval: Iµ = x̄s ± zα/2 · sx̄s.

Optimisation problem: allocate n = n1 + ... + nk observations among the different strata so as to minimise the sampling error of x̄s.

Optimal allocation: nj = n·wjσj/(w1σ1 + ... + wkσk), which gives the minimum variance

Var(X̄so) = (w1σ1 + ... + wkσk)²/n.
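A small Matlab sketch comparing proportional and optimal allocation (the weights and stratum standard deviations below are illustrative numbers, not taken from the text):

w = [0.5 0.3 0.2]; sig = [8 15 25]; n = 100;    % illustrative strata
n_prop = n*w                                    % proportional allocation
n_opt = n*w.*sig/sum(w.*sig)                    % optimal allocation
V_prop = sum(w.*sig.^2)/n                       % Var of stratified mean, proportional
V_opt = sum(w.*sig)^2/n                         % minimum variance, optimal allocation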
2.5 Exercises
Problem 1
Consider a population consisting of five values
1, 2, 2, 4, 8.
Find the population mean and variance. Calculate the sampling distribution of the mean of a sample of size 2 by
generating all possible such samples. From them, find the mean and variance of the sampling distribution, and
compare the results to those obtained by the formulas from this section.
Problem 2
In a simple random sample of 1500 voters, 55% said they planned to vote for a particular proposition, and 45%
said they planned to vote against it. The estimated margin of victory for the proposition is thus 10%. What is the
standard error of this estimated margin? What is an approximate 95% confidence interval for the margin?
Problem 3
This problem introduces the concept of a one-sided confidence interval. Using the central limit theorem, how should
the constant k1 be chosen so that the interval
(−∞, x̄ + k1 sx̄ )
is a 90% confidence interval for µ? How should k2 be chosen so that
(x̄ − k2 sx̄ , ∞)
is a 95% confidence interval for µ?
Problem 4
Warner (1965) introduced the method of randomised response to deal with surveys asking sensitive questions. Suppose we want to estimate the proportion q of illegal drug users among prison inmates. We are interested in the population as a whole, not in punishing particular individuals. Each of n randomly chosen inmates responded yes/no to one of two randomised statements (chosen by rolling a die):

"I use heroin" (with probability 5/6),
"I do not use heroin" (with probability 1/6).

Suggest a probability model for this experiment, and find a method of moments estimate for q and its standard error.
Problem 5
A simple random sample from a population of size 2000 yields 25 values:
104 109 11 109 87
86 80 119 88 122
91 103 99 108 96
104 98 98 83 107
79 87 94 92 97
Problem 6
For a simple random sample, take x̄² as a point estimate of µ². (This is an example of a method of moments estimate.) Compute the bias of this point estimate.
Problem 7
The following table (Cochran 1977) shows the stratification of all farms in a county by farm size and the mean and
standard deviation of the number of acres of corn in each stratum.
Farm size 0-40 41-80 81-120 121-160 161-200 201-240 241+
Number of farms Nj 394 461 391 334 169 113 148
Stratum mean µj 5.4 16.3 24.3 34.5 42.1 50.1 63.8
Stratum standard deviation σj 8.3 13.3 15.1 19.8 24.5 26.0 35.2
(a) For a sample size of 100 farms, compute the sample sizes from each stratum for proportional and optimal
allocation, and compare them.
(b) Calculate the variances of the sample mean for each allocation and compare them to each other and to the variance of an estimate formed from simple random sampling.
(c) What are the population mean and variance?
(d) Suppose that ten farms are sampled per stratum. What is Var(X̄s )? How large a simple random sample
would have to be taken to attain the same variance? Ignore the finite population correction.
(e) Repeat part (d) using proportional allocation of the 70 samples.
Problem 8
How might stratification be used in each of the following sampling problems?
Problem 9
Consider stratifying the population of Problem 1 into two strata (1,2,2) and (4,8). Assuming that one observation
is taken from each stratum, find the sampling distribution of the estimate of the population mean and the mean
and standard deviation of the sampling distribution. Check the formulas of Section 2.4.
3 Parameter estimation
Given a parametric model with unknown parameter(s) θ, we wish to estimate θ from an iid-sample. There are two basic methods of finding good point estimates:
1. method of moments,
2. maximum likelihood method.
The first is a simple intuitive method that can also be used as a first approximation for the maximum likelihood method, which is optimal for large samples.
3.1 Method of moments

Example: in an ecological study of the feeding behaviour of birds, the number of hops between flights was counted for n = 130 birds (the data are given in Section 5.5, Problem 2), yielding the sample mean x̄ = 2.79 with estimated standard error sx̄ = 0.205. An approximate 95% confidence interval for µ, the mean number of hops per bird, is given by

Iµ = x̄ ± z0.025 · sx̄ = 2.79 ± 1.96 · 0.205 = 2.79 ± 0.40.
After inspecting the histogram of the data values, we observe geometrically descending frequencies, which suggests a geometric model for the number of hops of a random bird. The geometric model X ∼ Geom(p) assumes that a bird "does not remember" the number of hops made so far: the next move of the bird is to hop again with probability 1 − p or to fly away with probability p. The method of moments estimate for the parameter θ = p of the geometric model requires a single equation arising from the expression for the first population moment,

µ = 1/p.

This expression leads to the equation x̄ = 1/p̃, which gives the method of moments estimate

p̃ = 1/x̄ = 0.36.
We can compute an approximate 95% confidence interval for p from the above interval Iµ:

Ip = ( 1/(2.79 + 0.40), 1/(2.79 − 0.40) ) = (0.31, 0.42).
To answer the question of how well the geometric distribution fits the data, let us compare the observed frequencies with the frequencies expected under the geometric distribution with parameter p̃:

j     1      2      3      4      5     6     7+
Oj    48     31     20     9      6     5     11
Ej    46.5   29.9   19.2   12.3   7.9   5.1   9.1

Expected frequencies are computed from the fitted geometric model: Ej = n·p̃(1 − p̃)^{j−1} for j = 1, ..., 6, and E7+ = n(1 − p̃)⁶.
An appropriate measure of discrepancy between the observed and expected counts is given by the chi-squared test statistic

χ² = Σ_{j=1}^{7} (Oj − Ej)²/Ej = 1.86.

As will be explained later on, the observed small value of the test statistic allows us to conclude that the geometric model fits the data well.
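The computation can be reproduced with a few Matlab lines (a sketch; small rounding differences from the value 1.86 quoted above are to be expected):

O = [48 31 20 9 6 5 11];                % observed counts, the last cell is 7+
n = sum(O); p = 1/2.79;                 % n = 130, method of moments estimate
E = [n*p*(1-p).^(0:5), n*(1-p)^6];      % expected counts under Geom(p)
chi2 = sum((O-E).^2./E)                 % chi-squared discrepancy, about 1.9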
3.2 Maximum likelihood estimation

Consider the joint probability density function of the sample,

f(y1, ..., yn|θ),

as a function of possible values (y1, ..., yn). Fixing the variables (y1, ..., yn) = (x1, ..., xn) and allowing the parameter value θ to vary, we obtain the so-called likelihood function L(θ) = f(x1, ..., xn|θ). The maximum likelihood estimate θ̂ is the value of θ that maximises L(θ). Notice that the likelihood function is usually not a density function over θ. To illustrate this construction, draw three density curves for three parameter values θ1 < θ2 < θ3, then show how, for a given observed value x, the likelihood curve connects the three points (θi, f(x|θi)) in the plane.
For example, for the binomial model X ∼ Bin(n, p), setting the derivative of the log-likelihood equal to zero,

l′(p) = x/p − (n − x)/(1 − p) = 0,

we find that the MLE of the population proportion is the sample proportion p̂ = x/n.
3.3 Sufficiency
If there is a statistic t = g(x1, ..., xn) such that

L(θ) = f(x1, ..., xn|θ) = h(t, θ) c(x1, ..., xn) ∝ h(t, θ),

where ∝ means proportional (the coefficient of proportionality c(x1, ..., xn) does not depend on θ and can be treated as a constant that does not influence the location of the maximum), then the maximum likelihood estimate θ̂ depends on the data (x1, ..., xn) only through t. Given such a factorisation property, we call t a sufficient statistic: no other statistic that can be calculated from the same sample provides any additional information about the value of the maximum likelihood estimate θ̂.
3.4 Large sample properties of the maximum likelihood estimates

For large n, the maximum likelihood estimator is approximately normal,

Θ̂ ≈ N( θ, 1/(nI(θ)) ),

where I(θ), the Fisher information in a single observation, is defined as follows. Let

g(x, θ) = (∂²/∂θ²) ln f(x|θ)

and put

I(θ) = −E[g(X, θ)] = −∫ g(x, θ) f(x|θ) dx.

Fisher information I(θ) is the curvature of the log-likelihood function at its top. The larger the information nI(θ) in n observations, the smaller the asymptotic variance (and the larger the curvature).

Approximate 100(1 − α)% confidence interval: Iθ = θ̂ ± zα/2/√(nI(θ̂)).
It turns out that maximum likelihood estimators are asymptotically unbiased, consistent, and asymptotically efficient (that is, they have the smallest possible asymptotic variance).
Example: exponential model

For the lifetimes of five batteries measured in hours, with observed total x1 + ... + x5 = 28.5, we propose an exponential model X ∼ Exp(θ), where θ is the battery death rate per hour. Method of moments estimate: from µ = 1/θ, we find

θ̃ = 1/x̄ = 5/28.5 = 0.175.

The likelihood function

L(θ) = θ⁵ e^{−θ(x1 + ... + x5)}

first grows from 0 to its maximum and then falls towards zero; the likelihood maximum is reached at θ̂ = 0.175. For the exponential model, t = x1 + ... + xn is a sufficient statistic, and the maximum likelihood estimate is

θ̂ = 1/x̄.

The Fisher information of the exponential model is I(θ) = θ⁻², which yields

Var(Θ̂) ≈ θ²/n.
3.5 Gamma distribution

The gamma distribution model is more flexible than the normal distribution model, as for different values of the shape parameter α the density curves have different shapes. If α = 1, then

Gam(1, λ) = Exp(λ),

and if X1, ..., Xk are independent Exp(λ) random variables, then

X1 + ... + Xk ∼ Gam(k, λ).

The parameter λ does not influence the shape of the density, only the scaling of the random variable. It is easy to see that

X ∼ Gam(α, λ) ⇒ λX ∼ Gam(α, 1).

Normal approximation for the gamma distribution:

Gam(α, λ) ≈ N(α/λ, α/λ²) for α ≫ 1.
Turning to the likelihood function,

L(α, λ) = ∏_{i=1}^{n} (1/Γ(α)) λ^α xi^{α−1} e^{−λxi} = (λ^{nα}/Γⁿ(α)) (x1 ⋯ xn)^{α−1} e^{−λ(x1+...+xn)} = (λ^{nα}/Γⁿ(α)) t2^{α−1} e^{−λt1},

where

(t1, t2) = (x1 + ... + xn, x1 ⋯ xn)

is a pair of sufficient statistics containing all the information from the data needed to compute the likelihood function.
To maximise the log-likelihood function l(α, λ) = ln L(α, λ), set the two derivatives

∂l/∂α = n ln λ − n Γ′(α)/Γ(α) + ln t2,
∂l/∂λ = nα/λ − t1,

equal to zero. Then we solve numerically the system of two equations

ln(α̂/x̄) = Γ′(α̂)/Γ(α̂) − (1/n) ln t2,
λ̂ = α̂/x̄,

using the method of moments estimates (α̃, λ̃) as initial values.
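A minimal Matlab sketch of this numerical step, assuming a data vector x is given (psi is Matlab's digamma function Γ′/Γ; the method of moments estimate serves as the starting value):

xbar = mean(x); lx = mean(log(x));              % statistics entering the equations
atilde = xbar^2/var(x,1);                       % method of moments estimate of alpha
eq = @(a) log(a) - psi(a) - log(xbar) + lx;     % ln(a/xbar) - psi(a) + (1/n) ln t2 = 0
ahat = fzero(eq, atilde);                       % maximum likelihood estimate of alpha
lhat = ahat/xbar                                % maximum likelihood estimate of lambda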
Parametric bootstrap
What is the standard error sα̂ of the maximum likelihood estimate α̂ = 908.76? No analytical formula is available.
If we could simulate from the true population distribution Gam(α, λ), then B samples of size n = 24 would generate
B independent estimates α̂j . Then the standard deviation of the sampling distribution would give us the desired
standard error:
s²α̂ = (1/B) Σ_{j=1}^{B} (α̂j − ᾱ)²,   ᾱ = (1/B) Σ_{j=1}^{B} α̂j.
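Since the true pair (α, λ) is unknown, the parametric bootstrap simulates instead from the fitted distribution Gam(α̂, λ̂). A Matlab sketch, assuming the estimates ahat and lhat are available (gamrnd and gamfit are Statistics Toolbox functions parametrised by the scale 1/λ):

B = 1000; n = 24;                        % bootstrap replicates and sample size
ab = zeros(B,1);
for j = 1:B
    xs = gamrnd(ahat, 1/lhat, n, 1);     % simulated sample from the fitted gamma
    ph = gamfit(xs);                     % refitted parameters, ph(1) is the shape
    ab(j) = ph(1);
end
se_alpha = std(ab)                       % bootstrap standard error of alpha-hat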
3.6 Exact confidence intervals
If we put a restrictive assumption on the population distribution and assume that an iid-sample (x1, ..., xn) is taken from a normal distribution N(µ, σ²) with unspecified parameters µ and σ, then instead of the approximate confidence interval formula for the mean we may apply an exact confidence interval formula based on the following probability theory fact. If the random variables X1, ..., Xn are independent and have the same distribution N(µ, σ²), then

(X̄ − µ)/SX̄ ∼ t_{n−1},

the so-called t-distribution with n − 1 degrees of freedom, where

SX̄ = S/√n.

Exact 100(1 − α)% confidence interval for the mean: Iµ = x̄ ± t_{n−1}(α/2) · sx̄.
A tk-distribution curve looks similar to the N(0, 1) curve. Its density function is symmetric around zero:

f(x) = ( Γ((k+1)/2)/(√(kπ) Γ(k/2)) ) (1 + x²/k)^{−(k+1)/2},   k ≥ 1.

It has a larger spread than the standard normal distribution: if the number of degrees of freedom is k ≥ 3, the variance is k/(k − 2). Connection to the standard normal distribution: if Z, Z1, ..., Zk are N(0, 1) and independent, then

Z/√( (Z1² + ... + Zk²)/k ) ∼ tk.
Let α = 0.05. The exact confidence interval for µ is wider than the approximate confidence interval x̄ ± 1.96 · sx̄, which is valid for very large n; for example, for n = 10 the exact interval uses t9(0.025) = 2.26 in place of 1.96.
Moreover, in the N(µ, σ²) case we get access to an exact confidence interval formula for the variance, thanks to the following result.

Exact distribution: (n − 1)S²/σ² ∼ χ²_{n−1}.

The chi-squared distribution with k degrees of freedom is the gamma distribution with α = k/2, λ = 1/2. It is connected to the standard normal distribution as follows: if Z1, ..., Zk are N(0, 1) and independent, then

Z1² + ... + Zk² ∼ χ²_k.

The resulting exact confidence interval for σ²,

Iσ² = ( (n−1)s²/χ²_{n−1}(α/2), (n−1)s²/χ²_{n−1}(1 − α/2) ),

is non-symmetric around s². Under the normality assumption Var(S²) = 2σ⁴/(n − 1), so the estimated standard error of s² is √(2/(n−1)) · s².
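A Matlab sketch of both exact intervals for a given data vector x (tinv and chi2inv from the Statistics Toolbox):

n = length(x); alpha = 0.05;
xbar = mean(x); s = std(x);
tq = tinv(1-alpha/2, n-1);                       % quantile t_{n-1}(alpha/2)
Imu = xbar + [-1 1]*tq*s/sqrt(n)                 % exact interval for the mean
cl = chi2inv(alpha/2, n-1); cu = chi2inv(1-alpha/2, n-1);
Isigma2 = [(n-1)*s^2/cu, (n-1)*s^2/cl]           % exact interval for the variance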
3.7 Exercises
Problem 1
The Poisson distribution has been used by traffic engineers as a model for light traffic. The following table shows the number of right turns during 300 three-minute intervals at a specific intersection. Fit a Poisson distribution. Comment on the fit by comparing observed and expected counts. It is useful to know that the 300 intervals were distributed over various hours of the day and various days of the week.
n Frequency
0 14
1 30
2 36
3 68
4 43
5 43
6 30
7 14
8 10
9 6
10 4
11 1
12 1
13+ 0
Problem 2
One of the earliest applications of the Poisson distribution was made by Student (1907) in studying errors made in counting yeast cells. In this study, yeast cells were killed and mixed with water and gelatin; the mixture was then spread on a glass slide and allowed to cool. Four different concentrations were used. Counts were made on 400 squares, and the data are summarised in the following table:
Number of cells Concent. 1 Concent. 2 Concent. 3 Concent. 4
0 213 103 75 0
1 128 143 103 20
2 37 98 121 43
3 18 42 54 53
4 3 8 30 86
5 1 4 13 70
6 0 2 2 54
7 0 0 1 37
8 0 0 0 18
9 0 0 1 10
10 0 0 0 5
11 0 0 0 2
12 0 0 0 2
(a) Estimate the parameter λ for each of the four sets of data.
(b) Find an approximate 95% confidence interval for each estimate.
(c) Compare observed and expected counts.
Problem 3
Suppose that X is a discrete random variable with

P(X = 0) = (2/3)θ,
P(X = 1) = (1/3)θ,
P(X = 2) = (2/3)(1 − θ),
P(X = 3) = (1/3)(1 − θ),

where θ ∈ [0, 1] is a parameter. The following 10 independent observations were taken from such a distribution:
(3, 0, 2, 1, 3, 2, 1, 0, 2, 1).
(a) Find the method of moments estimate of θ.
(b) Find an approximate standard error for your estimate.
(c) What is the maximum likelihood estimate of θ?
(d) What is an approximate standard error of the maximum likelihood estimate?
Problem 4
Suppose that X ∼ Bin(n, p).
(a) Show that the maximum likelihood estimate of p is p̂ = x/n.
(b) Show that p̂ = x/n attains the Cramer-Rao lower bound.
(c) If n = 10 and X = 5, plot the log-likelihood function.
Problem 5
A company has manufactured certain objects and has printed a serial number on each manufactured object. The serial numbers start at 1 and end at N, where N is the number of objects that have been manufactured. One of these objects is selected at random, and the serial number of that object is 888. Estimate N.
Problem 6
Capture-recapture method for estimating the number N of fish living in a lake:

1. capture and tag, say, n = 100 fish, then release them into the lake,
2. after a while, capture, say, k = 50 fish and count the number x of tagged fish among them.

Suppose x = 20 fish were tagged among the k = 50 fish caught. Find a maximum likelihood estimate of N after suggesting a simple parametric model.
Problem 7
The following 16 numbers came from a normal random number generator on a computer:
(a) What would you guess the mean and the variance of the generating normal distribution were?
(b) Give 90%, 95%, and 99% confidence intervals for µ and σ 2 .
(c) Give 90%, 95%, and 99% confidence intervals for σ.
(d) How much larger sample would you need to halve the length of the confidence interval for µ?
Problem 8
Let X1 , . . . , Xn be independent random variables uniformly distributed on [0, θ].
(a) Find the method of moments estimate of θ and its mean and variance.
(b) Find the maximum likelihood estimate of θ.
(c) Find the probability density of the maximum likelihood estimate and calculate its mean and variance.
Compare the variance, the bias, and the mean square error to those of the method of moments estimate.
(d) Find a modification of the maximum likelihood estimate that renders it unbiased.
Problem 9
For two factors, starchy-or-sugary and green-or-white base leaf, the following counts for the progeny of self-fertilised heterozygotes were observed (Fisher 1958):
Type Count
Starchy green x1 = 1997
Starchy white x2 = 906
Sugary green x3 = 904
Sugary white x4 = 32
According to the genetic theory, the cell probabilities are

p1 = (2 + θ)/4,   p2 = (1 − θ)/4,   p3 = (1 − θ)/4,   p4 = θ/4,

where 0 < θ < 1. In particular, if θ = 0.25, then the genes are unlinked and the genotype frequencies are

          Green   White   Total
Starchy   9/16    3/16    3/4
Sugary    3/16    1/16    1/4
Total     3/4     1/4     1
(a) Find the maximum likelihood estimate of θ and its asymptotic variance.
(b) Form an approximate 95% confidence interval for θ based on part (a).
(c) Use the bootstrap to find the approximate standard deviation of the maximum likelihood estimate and
compare to the result of part (a).
4 Hypothesis testing
4.1 Statistical significance
Often we need a rule based on data for choosing between two mutually exclusive hypotheses:

null hypothesis H0: the effect of interest is zero,
alternative H1: the effect of interest is not zero.

Here H0 represents an established theory that must be discredited in order to demonstrate some effect H1. A decision rule for hypothesis testing is based on a test statistic t = t(x1, ..., xn), a function of the data with distinct typical values under H0 and H1. The task is to find an appropriately chosen rejection region R, and to reject H0 whenever the observed test statistic falls in R.
In making a decision we can commit a type I or a type II error, as summarised below:

α = P(T ∈ R | H0): significance level of the test, the conditional probability of a type I error,
1 − α = P(T ∉ R | H0): specificity of the test,
β = P(T ∉ R | H1): the conditional probability of a type II error,
1 − β = P(T ∈ R | H1): sensitivity, or power, of the test.
Observe that the p-value depends on the data and is therefore a realisation of a random variable P. The source of randomness is the sampling procedure: if you take another sample, you obtain a different p-value. To illustrate, suppose we are testing H0: θ = θ0 versus H1: θ > θ0 with the help of a test statistic Z whose null distribution is N(0, 1). Suppose the null hypothesis is true. Given zobs = z, the p-value is p = 1 − Φ(z). Now, under H0 the observed value for different samples has distribution N(0, 1), so that

P(P > p) = P(1 − Φ(Zobs) > 1 − Φ(z)) = P(Φ(Zobs) < Φ(z)) = P(Zobs < z) = Φ(z) = 1 − p,

that is, under H0 the p-value is uniformly distributed over (0, 1).
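This uniformity is easy to see in a short Matlab simulation (normcdf from the Statistics Toolbox; 10000 replicates chosen for illustration):

z = randn(1e4,1);            % test statistics simulated under H0
p = 1 - normcdf(z);          % one-sided p-values
histogram(p)                 % approximately flat over (0,1)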
Three different composite alternative hypotheses are commonly considered: H1: θ > θ0, H1: θ < θ0, and H1: θ ≠ θ0.

Power function

Consider two simple hypotheses for the binomial model X ∼ Bin(n, p):

H0: p = p0 against H1: p = p1, where p1 > p0.

The power function of the one-sided test can be computed using the normal approximation of (X − np1)/√(np1(1−p1)) under H1:

Pw(p1) = P( (X − np0)/√(np0(1−p0)) ≥ zα | H1 )
       = P( (X − np1)/√(np1(1−p1)) ≥ (zα√(p0(1−p0)) + √n(p0 − p1))/√(p1(1−p1)) | H1 )
       ≈ 1 − Φ( (zα√(p0(1−p0)) + √n(p0 − p1))/√(p1(1−p1)) ).

Planning of the sample size: given α and β, choose the sample size n such that

√n = ( zα√(p0(1−p0)) + zβ√(p1(1−p1)) ) / |p1 − p0|.
Example. For X ∼ Bin(100, p), consider testing H0: p = 0.25 against the one-sided alternative H1: p > 0.25 at α = 5%. Since sp̂ = √(0.25 · 0.75/100) = 0.0433 under H0, the rejection region is

R = { (p̂ − 0.25)/0.0433 ≥ 1.645 } = { p̂ ≥ 0.32 } = { x ≥ 32 }.

The power of the test at p1 = 0.30 is approximately

1 − Φ( (1.645 · 0.433 − 0.5)/0.458 ) = 32%.

To reach the power 90% at p1 = 0.30, the required sample size is

n = ( (1.645 · 0.433 + 1.28 · 0.458)/0.05 )² = 675.

If the observed sample proportion is p̂ = 0.3, then

zobs = (0.3 − 0.25)/0.0433 = 1.15,

and the one-sided p-value becomes

P(Z ≥ 1.15) = 12.5%.
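These numbers can be reproduced with a few Matlab lines (normcdf from the Statistics Toolbox):

p0 = 0.25; p1 = 0.30; n = 100; za = 1.645; zb = 1.28;
power = 1 - normcdf((za*sqrt(p0*(1-p0)) + sqrt(n)*(p0-p1))/sqrt(p1*(1-p1)))
                                        % about 0.32
nreq = ((za*sqrt(p0*(1-p0)) + zb*sqrt(p1*(1-p1)))/(p1-p0))^2
                                        % about 675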
4.3 Small-sample test for the proportion
Binomial model X ∼ Bin(n, p) with H0 : p = p0 . For small n, use exact null distribution X ∼ Bin(n, p0 ).
X ∼ Bin(20, p).
For the one-sided alternative H1 : p > 0.25 and α = 5%, the rejection rule is R = {x ≥ 9}. Notice that the exact
significance level = 4.1%.
p 0.27 0.30 0.40 0.5 0.60 0.70
Power function
P(X ≥ 9) 0.064 0.113 0.404 0.748 0.934 0.995
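A Matlab sketch of the exact level and power computations (binocdf from the Statistics Toolbox):

n = 20; p0 = 0.25; c = 9;               % rejection region {x >= c}
level = 1 - binocdf(c-1, n, p0)         % exact significance level, about 0.041
p = [0.27 0.30 0.40 0.50 0.60 0.70];
power = 1 - binocdf(c-1, n, p)          % the exact power function values above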
4.4 Two tests for the mean

The one-sample t-test is used for small n, under the assumption that the population distribution is normal. Compute the rejection region for the test statistic T = (X̄ − µ0)/SX̄ using its exact null distribution:

T ∼ t_{n−1} under H0.

The rejection region can be expressed as

R = { µ0 ∉ Iµ }

in terms of a 100(1 − α)% confidence interval for the mean: having such a confidence interval, reject H0: µ = µ0 if the interval does not cover the value µ0.
4.5 Likelihood ratio test

For testing two simple hypotheses

H0: θ = θ0 against H1: θ = θ1,

use the likelihood ratio Λ = L(θ0)/L(θ1) as a test statistic. Large values of Λ suggest that H0 explains the data set better than H1, while a small Λ indicates that H1 explains the data set better. The likelihood ratio test rejects H0 for small values of Λ.

Neyman-Pearson lemma: the likelihood ratio test is optimal in the case of two simple hypotheses.
Nested hypotheses

With a pair of nested parameter sets Ω0 ⊂ Ω we get two composite alternatives

H0: θ ∈ Ω0 against H1: θ ∈ Ω \ Ω0.

It will be more convenient to recast this setting in terms of two nested hypotheses

H0: θ ∈ Ω0,   H: θ ∈ Ω,

with two maximum likelihood estimates:

θ̂0 maximises the likelihood function L(θ) over θ ∈ Ω0,
θ̂ maximises the likelihood function L(θ) over θ ∈ Ω,

giving the generalised likelihood ratio

Λ̃ = L(θ̂0)/L(θ̂).

It turns out that the test statistic −2 ln Λ̃ has a nice approximate null distribution:

−2 ln Λ̃ ≈ χ²_df under H0, where df = dim(Ω) − dim(Ω0).
4.6 Pearson's chi-squared test

Suppose each of n independent observations falls into one of J cells with probabilities (p1, ..., pJ), so that the vector of observed cell counts has a multinomial distribution,

(O1, ..., OJ) ∼ Mn(n; p1, ..., pJ),   P(O1 = k1, ..., OJ = kJ) = (n!/(k1! ⋯ kJ!)) p1^{k1} ⋯ pJ^{kJ}.

Let the null hypothesis specify the cell probabilities pj = vj(λ) in terms of an unknown parameter λ of dimension r. To see if the proposed model fits the data, compute λ̂, the maximum likelihood estimate of λ, and then the expected cell counts

Ej = n · vj(λ̂),

where "expected" means expected under the null hypothesis model. In the current setting, the likelihood ratio test statistic is approximately equal to the Pearson chi-squared statistic:

−2 ln Λ̃ ≈ χ²,   χ² = Σ_{j=1}^{J} (Oj − Ej)²/Ej.

The approximate null distribution of the chi-squared test statistic is χ²_{J−1−r}, since df = dim(Ω) − dim(Ω0) = (J − 1) − r. Since the chi-squared test is approximate, all expected counts are recommended to be at least 5; if not, you should combine small cells into larger cells and recalculate the number of degrees of freedom df.
Example. H0: the number of hops that a bird makes between flights has a geometric distribution Geom(p). Using p̂ = 0.358 and J = 7 cells, we obtain χ² = 1.86. With df = J − 1 − 1 = 5 we find the p-value 0.87, and therefore conclude a good fit of the geometric distribution model to the data.
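In Matlab the p-value is one line (chi2cdf from the Statistics Toolbox):

pvalue = 1 - chi2cdf(1.86, 5)           % about 0.87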
4.7 Example: sex ratio
An 1889 study made in Germany recorded the numbers of boys (y1, ..., yn) for n = 6115 families with 12 children each. Consider three nested models for the distribution of the number of boys Y in a family with 12 children.
4.8 Exercises
Problem 1
Suppose that X ∼ Bin(100, p). Consider a test of

H0: p = 1/2 against H1: p ≠ 1/2

that rejects H0 in favour of H1 when |x − 50| > 10. Use the normal approximation to the binomial distribution to answer the following:
(a) What is α?
(b) Graph the power as a function of p.
Problem 2
Let X have one of the following two distributions
X-values x1 x2 x3 x4
P(x|H0 ) 0.2 0.3 0.3 0.2
P(x|H1 ) 0.1 0.4 0.1 0.4
(a) Calculate the likelihood ratio Λ for each xi and order the xi according to Λ.
(b) What is the likelihood ratio test of H0 versus H1 at level α = 0.2? What is the test at level α = 0.5?
Problem 3
Let (x1 , . . . , xn ) be a sample from a Poisson distribution. Find the likelihood ratio for testing H0 : λ = λ0 against
H1 : λ = λ1 , where λ1 > λ0 . Use the fact that the sum of independent Poisson random variables follows a Poisson
distribution to explain how to determine a rejection region for a test at level α.
Problem 4
Let (x1 , . . . , x25 ) be a sample from a normal distribution having a variance of 100.
(a) Find the rejection region for a test at level α = 0.1 of H0 : µ = 0 versus H1 : µ = 1.5.
(b) What is the power of the test?
(c) Repeat for α = 0.01.
Problem 5
Under H0, a random variable has the cumulative distribution function F(x) = x², 0 ≤ x ≤ 1, and under H1, it has the cumulative distribution function F(x) = x³, 0 ≤ x ≤ 1.
Problem 6
Let (x1 , . . . , x15 ) be a sample from a normal distribution.
Problem 7
An iid-sample from N(µ, σ 2 ) gives a 99% confidence interval for µ to be (−2, 3). Test
H0 : µ = −3 against H1 : µ 6= −3
at α = 0.01.
Problem 8
Binomial model for the data value x:
X ∼ Bin(n, p).
(a) What is the generalised likelihood ratio for testing H0: p = 0.5 against H1: p ≠ 0.5?
(b) Show that the test rejects for large values of |x − n/2|.
(c) How can the significance level corresponding to the rejection region R = {|x − n/2| > k} be determined?
(d) If n = 10 and k = 2, what is the significance level of the test?
(e) Use the normal approximation to the binomial distribution to find the significance level if n = 100
and k = 10.
Problem 9
Suppose that a test statistic Z has a standard normal null-distribution.
(a) If the test rejects for large values of |z|, what is the p-value corresponding to z = 1.5?
(b) Answer the same question if the test rejects for large values of z.
Problem 10
It has been suggested that dying people may be able to postpone their death until after an important occasion,
such as a wedding or birthday. Phillips and King (1988) studied the patterns of death surrounding Passover, an
important Jewish holiday.
(a) California data 1966-1984. They compared the number of deaths during the week before Passover to the
number of deaths during the week after Passover for 1919 people who had Jewish surnames. Of these, 922 occurred
in the week before and 997 in the week after Passover.
(b) For 852 males of Chinese and Japanese ancestry, 418 died in the week before and 434 died in the week after
Passover.
Problem 11
If gene frequencies are in equilibrium, the genotypes AA, Aa, and aa occur with probabilities (1 − θ)², 2θ(1 − θ), and θ², respectively.
Plato et al. (1964) published the following data on haptoglobin type in a sample of 190 people
Genotype Hp 1-1 Hp 1-2 Hp 2-2
Observed count xi 10 68 112
Test the goodness of fit of the data to the equilibrium model.
Problem 12
US suicides in 1970: check for seasonal variation.
Month Number of suicides
Jan 1867
Feb 1789
Mar 1944
Apr 2094
May 2097
Jun 1981
Jul 1887
Aug 2024
Sep 1928
Oct 2032
Nov 1978
Dec 1859
27
Problem 13
In 1965, a newspaper carried a story about a high school student who reported getting 9207 heads and 8743 tails in 17950 coin tosses (the student tossed five coins at a time).

Are the data consistent with the hypothesis that all the coins were fair (p = 1/2)?

(c) Are the data consistent with the hypothesis that all five coins had the same probability of heads, but this probability was not necessarily 1/2?
5 Bayesian inference
The statistical tools introduced so far in this course are based on the so-called frequentist approach: in the parametric case the data x are assumed to be randomly generated by a distribution f(x|θ), and the unknown population parameter θ is estimated using the maximum likelihood estimate. This section presents basic concepts of the Bayesian approach, where it is assumed that the parameter of interest θ is itself randomly generated using a given prior distribution g(θ). The prior distribution brings into the model our knowledge (or lack of knowledge) about θ before the data x are generated using f(x|θ), which in this section is called the likelihood function.
After the data x is generated by such a two-step procedure involving the pair g(θ) and f(x|θ), we may update our knowledge of θ and compute a posterior distribution h(θ|x) using the Bayes formula

h(θ|x) = f(x|θ)g(θ)/φ(x),

where

φ(x) = ∫ f(x|θ)g(θ) dθ   or   φ(x) = Σ_θ f(x|θ)g(θ)

gives the marginal distribution of the random data X (in the continuous and discrete cases respectively). For a fixed x, the denominator φ(x) is treated as a constant, and the Bayes formula can be summarised as

posterior ∝ likelihood × prior,

where ∝ means proportional, as the coefficient of proportionality 1/φ(x) does not depend on θ.
When we have no prior knowledge of θ, the prior distribution is often modelled by the uniform distribution.
In this case of uninformative prior, given g(θ) is a constant, we have h(θ|x) ∝ f (x|θ) so that all the posterior
knowledge comes from the likelihood function.
Example: IQ measurement
A randomly chosen individual has an unknown true intelligence quotient value θ. Suppose an IQ test is calibrated in such a way that the prior distribution of θ is normal, N(100, 225): this normal distribution describes the population distribution of people's IQ, with population mean m = 100 and population standard deviation v = 15. For a person with IQ value θ, the result x of an IQ measurement is generated by another normal distribution, N(θ, 100), with no systematic error and a random error σ = 10.
Since

g(θ) = (1/(√(2π)v)) e^{−(θ−m)²/(2v²)},   f(x|θ) = (1/(√(2π)σ)) e^{−(x−θ)²/(2σ²)},

and the posterior is proportional to g(θ)f(x|θ), we get

h(θ|x) ∝ exp( −(θ−m)²/(2v²) − (x−θ)²/(2σ²) ) ∝ exp( −(θ − γm − (1−γ)x)²/(2γv²) ),

where

γ = σ²/(σ² + v²)

is the so-called shrinkage factor. We conclude that the posterior distribution is also normal,

h(θ|x) = (1/(√(2πγ)v)) e^{−(θ − γm − (1−γ)x)²/(2γv²)},

having mean γm + (1 − γ)x and variance γv².
In particular, if the observed IQ result is x = 130, then the posterior distribution becomes N(120.7, 69.2). We
see that the prior expectation m = 100 has corrected the observed result x = 130 down to 120.7. The posterior
variance 69.2 is smaller than that of the prior distribution 225 by the shrinkage factor γ = 0.308: the updated
knowledge is less uncertain than the prior knowledge.
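A minimal Matlab sketch of this normal-normal update, using the numbers of the example:

m = 100; v2 = 225;             % prior mean and variance
s2 = 100; x = 130;             % measurement variance and observed IQ result
g = s2/(s2 + v2);              % shrinkage factor, 0.308
postmean = g*m + (1-g)*x       % 120.7
postvar = g*v2                 % 69.2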
Beta distribution

The beta distribution Beta(a, b) is determined by two parameters a > 0, b > 0, which are called pseudo-counts. It has density

f(p) = ( Γ(a+b)/(Γ(a)Γ(b)) ) p^{a−1}(1−p)^{b−1},   0 < p < 1,

with mean and variance of the form

µ = a/(a+b),   σ² = µ(1−µ)/(a+b+1).

The beta distribution is a convenient prior distribution for a population frequency p ∈ (0, 1). The uniform distribution U(0, 1) is obtained with a = b = 1.
[Figure: density curves of Beta(0.5, 0.5), Beta(10, 3), Beta(1, 1), and Beta(0.8, 3).]
Exercise: verify that for given a > 1 and b > 1, the maximum of the density function f(p) is attained at

p̂ = (a − 1)/(a + b − 2).
Dirichlet distribution

The Dirichlet distribution is a multivariate extension of the beta distribution. Dir(α1, ..., αr) is a probability distribution over the vectors (p1, ..., pr) with non-negative components such that p1 + ... + pr = 1. Its density is

f(p1, ..., pr) = ( Γ(α0)/(Γ(α1) ⋯ Γ(αr)) ) p1^{α1−1} ⋯ pr^{αr−1},   α0 = α1 + ... + αr.
[Figure: four triangles with simulated Dirichlet points; see the description below.]
The figure illustrates four examples of Dir(α1 , α2 , α3 ) distribution. Each triangle contains n = 300 points generated
using different sets of parameters (α1 , α2 , α3 ):
upper left (0.3, 0.3, 0.1), upper right (13,16,15), lower left (1,1,1), lower right (3,0.1,1).
A dot in a triangle gives (x1 , x2 , x3 ) as the distances to the bottom edge of the triangle (x1 ), to the right edge of
the triangle (x2 ), and to the left edge of the triangle (x3 ).
The shrinkage factor gives the ratio of the posterior variance to the prior variance; notice that the posterior variance is always smaller than the prior variance. The list of conjugate prior models illustrates that the contribution of the prior distribution becomes smaller for larger samples. For the Binomial-Beta and Multinomial-Dirichlet models the update rule has the form

posterior pseudo-counts = prior pseudo-counts + sample counts.
Example: Binomial-Beta model

First we give a simple demonstration that the beta distribution is a conjugate prior to the binomial likelihood. Indeed, if

prior ∝ p^{a−1}(1−p)^{b−1}

and

likelihood ∝ p^x(1−p)^{n−x},

then obviously the posterior is also a beta distribution:

posterior ∝ prior × likelihood ∝ p^{a+x−1}(1−p)^{b+n−x−1}.

Suppose we are interested in the probability p of a thumbtack landing on its base. Two experiments are performed; an experiment consists of n tosses of the thumbtack, with the number of base landings X ∼ Bin(n, p) being counted.

Experiment 1: n1 = 10 tosses, counts x1 = 2, n1 − x1 = 8. We apply the uninformative prior distribution Beta(1, 1) with mean µ0 = 0.5 and standard deviation σ0 = 0.29. It gives the posterior distribution Beta(3, 9) with mean p̂ = 3/12 = 0.25 and standard deviation σ1 = 0.12.

Experiment 2: n2 = 40 tosses, counts x2 = 9, n2 − x2 = 31. As the new prior distribution we use the posterior distribution obtained from the first experiment, Beta(3, 9). The new posterior distribution becomes Beta(12, 40) with mean p̂ = 12/52 = 0.23 and standard deviation σ2 = 0.06.
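A Matlab sketch of the sequential update (only the pseudo-count arithmetic is needed):

a = 1; b = 1;                           % uninformative prior Beta(1,1)
a = a + 2; b = b + 8;                   % after experiment 1: Beta(3,9)
a = a + 9; b = b + 31;                  % after experiment 2: Beta(12,40)
pmean = a/(a+b)                         % posterior mean, 12/52
psd = sqrt(pmean*(1-pmean)/(a+b+1))     % posterior standard deviation, about 0.06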
5.2 Bayesian estimation

Zero-one loss function and maximum a posteriori probability

Zero-one loss function: l(θ, a) = 1{θ ≠ a}.

Using the zero-one loss function, we find that the posterior risk is the probability of misclassification:

R(a|x) = Σ_{θ ≠ a} h(θ|x) = 1 − h(a|x).
It follows that to minimise the risk we have to maximise the posterior probability. We define θ̂map as the value of
θ that maximises h(θ|x). Observe that with the uninformative prior, θ̂map = θ̂mle .
Using the squared error loss function l(θ, a) = (θ − a)², we find that the posterior risk is a sum of two components,

R(a|x) = E((Θ − a)²|x) = Var(Θ|x) + (E(Θ|x) − a)².

Since the first component does not depend on a, we minimise the posterior risk by taking the posterior mean estimate

θ̂pme = E(Θ|x).
Example. A die was rolled n = 18 times with the following outcomes:

2, 1, 1, 4, 5, 3, 3, 2, 4, 1, 4, 2, 3, 4, 3, 5, 1, 5.

The parameter of interest is θ = (p1, ..., p6). The maximum likelihood estimate based on the sample counts is

θ̂mle = (4/18, 3/18, 4/18, 4/18, 3/18, 0).

The maximum likelihood estimate assigns the value zero to p6, thereby excluding sixes from future observations. Now take the uninformative prior distribution Dir(1, 1, 1, 1, 1, 1) and compare the two Bayesian estimates

θ̂map = (4/18, 3/18, 4/18, 4/18, 3/18, 0),   θ̂pme = (5/24, 4/24, 5/24, 5/24, 4/24, 1/24).
5.3 Credibility interval

Example: IQ measurement

Given n = 1, we have X̄ ∼ N(µ, 100), and an exact 95% confidence interval for µ takes the form

Iµ = x ± 1.96 · 10 = 130 ± 19.6.

The posterior distribution of the mean is N(120.7, 69.2), and therefore a 95% credibility interval for µ is

Jµ = 120.7 ± 1.96 · √69.2 = 120.7 ± 16.3.
5.4 Bayesian hypotheses testing
We consider the case of two simple hypotheses. Choose between H0 : θ = θ0 and H1 : θ = θ1 using not only the
likelihoods of the data f (x|θ0 ), f (x|θ1 ) but also prior probabilities
P(H0 ) = π0 , P(H1 ) = π1 .
In terms of the rejection region R, the decision should be taken depending on a cost function having the following four cost values:

Decision             H0 true   H1 true
x ∉ R: accept H0     0         c1
x ∈ R: accept H1     c0        0

where c0 is the type I error cost and c1 is the type II error cost. For a given set R, the average cost is the weighted mean of the two values c0 and c1:

c0π0 P(X ∈ R|H0) + c1π1 P(X ∉ R|H1) = c1π1 + ∫_R ( c0π0 f(x|θ0) − c1π1 f(x|θ1) ) dx.

Thus the optimal decision rule is to reject H0 for small values of the likelihood ratio, namely when

f(x|θ0)/f(x|θ1) < c1π1/(c0π0).
Example: court case. The prior odds π0/π1 take into account the number of males who theoretically could have committed the crime, before any evidence is taken into account. There were three conditionally independent pieces of evidence:

E1: strong DNA match,
E2: defendant A is not recognised by the victim,
E3: an alibi supported by the girlfriend.

P(E1|H0) = 1/200,000,000 and P(E1|H1) = 1, so that P(E1|H0)/P(E1|H1) = 1/200,000,000: evidence in favour of H1;
P(E2|H1) = 0.1 and P(E2|H0) = 0.9, so that P(E2|H0)/P(E2|H1) = 9: evidence in favour of H0;
P(E3|H1) = 0.25 and P(E3|H0) = 0.5, so that P(E3|H0)/P(E3|H1) = 2: evidence in favour of H0.
Prosecutor's fallacy: equating P(H0|E) with P(E|H0), which is only true if P(E) = π0. Example: π0 = π1 = 1/2, P(E|H0) ≈ 0, P(E|H1) ≈ 1, so that P(E) ≈ 1/2 = π0.
5.5 Exercises
Problem 1
This is a continuation of the Problem 3 from Section 3.7.
(e) Assume uniform prior Θ ∼ U(0, 1) and find the posterior density. Plot it. What is the mode of the posterior?
Problem 2
In an ecological study of the feeding behaviour of birds, the number of hops between flights was counted for several
birds.
Number of hops j 1 2 3 4 5 6 7 8 9 10 11 12 Tot
Observed frequency Oj 48 31 20 9 6 5 4 2 1 1 2 1 130
Assume that the data were generated by a Geom(p) model and take a uniform prior for p. What then is the posterior distribution, and what are the posterior mean and standard deviation?
Problem 3
Laplace's rule of succession. Laplace claimed that when an event happens n times in a row and never fails to happen, the probability that the event will occur the next time is (n + 1)/(n + 2). Can you suggest a rationale for this claim?
Problem 4
This is a continuation of the Problem 2 from Section 4.8.
Let X have one of the following two distributions
X-values x1 x2 x3 x4
P(x|H0 ) 0.2 0.3 0.3 0.2
P(x|H1 ) 0.1 0.4 0.1 0.4
(c) If the prior probabilities are P(H0) = P(H1) = 1/2, which outcomes favour H0?
(d) What prior probabilities correspond to the decision rules with α = 0.2 and α = 0.5?
Problem 5
Suppose that under H0, a measurement X is N(0, σ²), and under H1, the measurement X is N(1, σ²). Assume that the prior probabilities satisfy

P(H0) = 2P(H1).

The hypothesis H0 will be chosen if P(H0|x) > P(H1|x). For σ² = 0.1, 0.5, 1.0, 5.0:
Problem 6
Under H0 , a random variable has a cumulative distribution function F (x) = x2 , 0 ≤ x ≤ 1, and under H1 , it has
a cumulative distribution function F (x) = x3 , 0 ≤ x ≤ 1.
(a) If the two hypotheses have equal prior probabilities, for what values of x is the posterior probability of H0
greater than that of H1 ?
6 Summarising data
6.1 Empirical probability distribution
Consider an iid-sample (x1, ..., xn) from the population distribution F(x) = P(X ≤ x).

Empirical distribution function: F̂(x) = (1/n) Σ_{i=1}^{n} 1{xi ≤ x}.

For a fixed x, the value F̂(x) = p̂ is the sample proportion of observations not exceeding x, estimating p = F(x). Let Y be a random variable whose distribution is given by F̂, so that Y takes the value xi with probability 1/n. Then E(Y) = x̄, and since

E(Y²) = (x1² + ... + xn²)/n,

we get

Var(Y) = (x1² + ... + xn²)/n − (x̄)² = ((n−1)/n) s².
It is easy to verify that F̂(·) is a cumulative distribution function with mean x̄ and variance ((n−1)/n)s², even if some of the xi coincide. We call

σ̂² = ((n−1)/n) s² = (1/n) Σ_{i=1}^{n} (xi − x̄)² = (x1² + ... + xn²)/n − (x̄)²

the empirical variance.
If the data describe life lengths, then it is more convenient to use the empirical survival function

Ŝ(x) = 1 − F̂(x),

the proportion of the data greater than x. If the life length T has distribution function F(t) = P(T ≤ t), then its survival function is

S(t) = P(T > t) = 1 − F(t).

Hazard function: h(t) = f(t)/S(t), where f(t) = F′(t) is the probability density function. The hazard function can be viewed as the negative of the slope of the log-survival function:

h(t) = −(d/dt) ln S(t) = −(d/dt) ln(1 − F(t)).
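A Matlab sketch for plotting the empirical survival function and its logarithm from a vector t of observed lifetimes (stairs produces the step-function plot):

ts = sort(t); n = length(ts);             % ordered lifetimes
S = 1 - (1:n)/n;                          % empirical survival after each death
subplot(1,2,1); stairs(ts, S)             % survival function
k = S > 0;                                % drop the final zero before taking logs
subplot(1,2,2); stairs(ts(k), log(S(k)))  % log-survival, slope estimates -h(t)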
Example: Guinea pigs
Guinea pigs were randomly divided into five treatment groups of 72 animals each and one control group of 107 animals.
The guinea pigs in the treatment groups were infected with increasing doses of tubercle bacilli (Bjerkdal, 1960).
The survival times were recorded (note that not all the animals in the lower-dosage regimens died).
Control lifetimes
18 36 50 52 86 87 89 91 102 105 114 114 115 118 119 120 149 160 165 166 167 167 173 178 189 209 212 216
273 278 279 292 341 355 367 380 382 421 421 432 446 455 463 474 506 515 546 559 576 590 603 607 608 621 634
634 637 638 641 650 663 665 688 725 735
Dose I lifetimes
76 93 97 107 108 113 114 119 136 137 138 139 152 154 154 160 164 164 166 168 178 179 181 181 183 185
194 198 212 213 216 220 225 225 244 253 256 259 265 268 268 270 283 289 291 311 315 326 326 361 373 373 376
397 398 406 452 466 592 598
Dose II lifetimes
72 72 78 83 85 99 99 110 113 113 114 114 118 119 123 124 131 133 135 137 140 142 144 145 154 156 157 162
162 164 165 167 171 176 177 181 182 187 192 196 211 214 216 216 218 228 238 242 248 256 257 262 264 267 267
270 286 303 309 324 326 334 335 358 409 473 550
Dose III lifetimes
10 33 44 56 59 72 74 77 92 93 96 100 100 102 105 107 107 108 108 108 109 112 113 115 116 120 121 122 122
124 130 134 136 139 144 146 153 159 160 163 163 168 171 172 176 183 195 196 197 202 213 215 216 222 230 231
240 245 251 253 254 254 278 293 327 342 347 361 402 432 458 555
Dose IV lifetimes
Dose V lifetimes
12 15 22 24 24 32 32 33 34 38 38 43 44 48 52 53 54 54 55 56 57 58 58 59 60 60 60 60 61 62 63 65 65 67 68
70 70 72 73 75 76 76 81 83 84 85 87 91 95 96 98 99 109 110 121 127 129 131 143 146 146 175 175 211 233 258 258
263 297 341 341 376
It is difficult to compare the groups just by looking at the numbers. The data are illuminated by two graphs: one for the survival functions and the other for the log-survival functions.
[Figure: empirical survival functions (left) and log-survival functions (right) for the control and dose groups, over the time range 0 to 800.]
The negative slopes of the curves to the right illustrate the hazard rates for different groups.
6.2 Density estimation
A histogram displays the observed counts

Oj = Σ_{i=1}^{n} 1{xi ∈ cell j}

over adjacent cells of width h. The choice of a balanced width h is important: smaller h give ragged profiles, larger h give obscured profiles. Put

fh(x) = Oj/(nh) for x ∈ cell j,

and notice that

∫ fh(x) dx = (h/(nh)) Σ_j Oj = 1.
The scaled histogram given by the graph of fh(x) is a density estimate. The kernel density estimate with bandwidth h produces a smooth curve:

fh(x) = (1/(nh)) Σ_{i=1}^{n} φ((x − xi)/h),   where φ(x) = (1/√(2π)) e^{−x²/2}.
Example: male heights

Let hm stand for the column vector of 24 male heights. For a given bandwidth h, the following Matlab code produces a plot of the kernel density estimate:

x=160:0.1:210; L=length(x);                   % grid of height values
f=normpdf((ones(24,1)*x-hm*ones(1,L))/h);     % kernel values phi((x-x_i)/h)
fh=sum(f)/(24*h); plot(x,fh)                  % sum over i, rescale, and plot
The stem-and-leaf plot for the 24 male heights indicates the distribution shape plus gives the full numerical infor-
mation:
17:056678899
18:0000112346
19:229
The quantile function Φ⁻¹ of the standard normal distribution Φ is called the probit function (from "probability unit").
Special quantiles include the median (p = 0.5) and the lower and upper quartiles (p = 0.25 and p = 0.75).
The quantile xp cuts off a proportion p of the smallest values of a random variable X with P(X ≤ x) = F(x). For the order statistics x(1) ≤ … ≤ x(n) we have
Fn(x(k)) = k/n,   Fn(x(k)−) = (k − 1)/n.
This observation leads to the following definition of empirical quantiles:
x(k) is called the empirical ((k − 0.5)/n)-quantile.
Suppose we have two independent samples (x1, …, xn) and (y1, …, yn) of equal size n taken from two population distributions FX and FY. A relevant null hypothesis H0: FX ≡ FY is equivalent to H0: QX ≡ QY. It can be tested graphically using a QQ-plot. If the QQ-plot closely follows the 45-degree line, that is, when we observe almost equal quantiles, the data give no evidence against H0: FX ≡ FY.
More generally, if the QQ-plot approximates a straight line y = a + bx, then we take this as evidence for the
linear relation
Y = a + bX in distribution.
Indeed, the latter claim means that FX(x) = FY(a + bx) for all x.
If the normal probability plot is close to a straight line y = a + bx, then we accept the hypothesis of normality and use the point estimates μ̂ = a, σ̂ = b. If normality does not hold, draw a straight line through the empirical lower and upper quartiles to detect a light-tails or heavy-tails profile.
Another simple way of testing normality relies on two summary statistics: skewness and kurtosis.
Coefficient of skewness: β1 = E[(X − μ)³]/σ³; sample skewness: b1 = (1/(n s³)) Σi=1..n (xi − x̄)³.
Depending on the sign of the coefficient of skewness we distinguish between symmetric (β1 = 0), skewed to the right (β1 > 0), and skewed to the left (β1 < 0) distributions.
Given that b1 is close to zero, kurtosis can be used as an indication that the curve profile is close to that of the normal distribution.
Kurtosis: β2 = E[(X − μ)⁴]/σ⁴; sample kurtosis: b2 = (1/(n s⁴)) Σi=1..n (xi − x̄)⁴.
For the normal distribution the kurtosis coefficient takes the value β2 = 3. Leptokurtic distribution: β2 > 3 (heavy tails). Platykurtic distribution: β2 < 3 (light tails).
For the male heights data the sample skewness is positive, implying that more than half of the heights are below the average.
6.5 Measures of location
The central point of a distribution can be defined in terms of various measures of location, for example, as the
population mean µ or the median m. The population median m is estimated by the sample median.
Sample median: m̂ = x(k) if n = 2k − 1, and m̂ = (x(k) + x(k+1))/2 if n = 2k.
The sample mean x̄ is sensitive to outliers, while the sample median m̂ is not. Therefore, we say that m̂ is a robust
estimator (robust to outliers).
Since the number of observations below the median m is Bin(n, 0.5), the interval (X(k), X(n−k+1)) covers m with probability pk = 1 − 2P(Y ≤ k − 1), where Y ∼ Bin(n, 0.5). For example, if n = 25, then from the table below we find that (X(8), X(18)) gives a 95.7% confidence interval for the median.
k 6 7 8 9 10 11 12
100 · pk 99.6 98.6 95.7 89.2 77.0 57.6 31.0
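As a sketch, the coverage probabilities pk in the table can be reproduced with Matlab (binocdf is assumed from the Statistics Toolbox):
n = 25;                                     % sample size
for k = 6:12
    pk = 1 - 2*binocdf(k-1, n, 0.5);        % P( X(k) < m < X(n-k+1) )
    fprintf('k = %2d   coverage = %.1f%%\n', k, 100*pk)
end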
Sign test
The sign test is a non-parametric test of H0: m = m0 against the two-sided alternative H1: m ≠ m0.
The sign test statistic
y0 = Σi=1..n 1{xi ≤ m0}
counts the number of observations below the null hypothesis value. It has a simple null distribution: Y0 ∼ Bin(n, 0.5) under H0. Connection to the above confidence interval formula: reject H0 if m0 falls outside the corresponding confidence interval
Im = (x(k), x(n−k+1)).
Trimmed means
A trimmed mean is a robust measure of location computed from a central portion of the data.
α-trimmed mean: x̄α = the sample mean computed without the nα/2 smallest and the nα/2 largest observations.
When summarizing data compute several measures of location and compare the results.
Nonparametric bootstrap
Replace the population distribution with the empirical distribution. A bootstrap sample is then obtained by resampling with replacement from the original sample
x1, …, xn.
Generate many bootstrap samples of size n to approximate the sampling distribution of an estimator like the trimmed mean, the sample median, or s.
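A minimal Matlab sketch of this resampling scheme for the sample median (the data vector and the number B of bootstrap samples below are placeholders):
data = randn(59,1);             % placeholder sample; use your own data here
n = length(data); B = 1000;     % B bootstrap replicates
medians = zeros(B,1);
for b = 1:B
    idx = randi(n, n, 1);       % resample indices with replacement
    medians(b) = median(data(idx));
end
se_median = std(medians)        % bootstrap standard error of the median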
6.6 Measures of dispersion
Sample variance s2 and sample range
R = x(n) − x(1)
are sensitive to outliers. Two robust measures of dispersion:
interquartile range IQR = x0.75 − x0.25 is the difference between the upper and lower quartiles,
MAD = Median of Absolute values of Deviations from sample median |xi − m̂|, i = 1, . . . , n.
Three estimates of σ under the normal model N(μ, σ²): s, IQR/1.35, and MAD/0.675. Indeed,
IQR = (μ + σΦ⁻¹(0.75)) − (μ + σΦ⁻¹(0.25)) = 2σΦ⁻¹(0.75) ≈ 1.35σ,
and similarly MAD = σΦ⁻¹(0.75) ≈ 0.675σ.
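A quick Matlab sketch comparing the three estimates of σ on simulated normal data (quantile is assumed from the Statistics Toolbox):
x = 5 + 2*randn(1000,1);                  % simulated N(5, 2^2) sample
s_hat   = std(x);
iqr_hat = quantile(x,0.75) - quantile(x,0.25);
mad_hat = median(abs(x - median(x)));
[s_hat, iqr_hat/1.35, mad_hat/0.675]      % all three should be close to 2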
Boxplots
Boxplots are convenient to use for comparing different samples. A boxplot is built of the following components:
box, whiskers and outliers.
Box
• upper edge of the box = upper quartile (UQ)
• box center = median
• lower edge of the box = lower quartile (LQ)
Whiskers
• upper whisker end = {maximal data point not
exceeding UQ + 1.5 × IQR}
• lower whisker end = {min data point ≥ LQ – 1.5 × IQR}
Outliers
• upper dots = {data points ≥ UQ + 1.5 × IQR}
• lower dots = {data points ≤ LQ – 1.5 × IQR}
6.7 Exercises
Problem 1
Suppose that (X1 , . . . , Xn ) are independent uniform U(0, 1) random variables.
(a) Sketch the population distribution function F (x) and the standard deviation of the empirical distribution
function F̂ (x).
(b) Generate many samples of size 16. For each sample, plot the difference F (x) − F̂ (x) and relate what you
see to your answer to (a).
Problem 2
Let (X1 , . . . , Xn ) be independent random variables with the same distribution F , and let F̂ denote the empirical
distribution function. Show that for u < v,
Cov(F̂(u), F̂(v)) = (1/n) F(u)(1 − F(v)).
It follows that F̂(u) and F̂(v) are positively correlated: if F̂(u) overshoots F(u), then F̂(v) will tend to overshoot
F (v).
Problem 3
A random sample x1 , . . . , xn , n = 59:
14.27 14.80 12.28 17.09 15.10 12.92 15.56 15.38 15.15 13.98
14.90 15.91 14.52 15.63 13.83 13.66 13.98 14.47 14.65 14.73
15.18 14.49 14.56 15.03 15.40 14.68 13.33 14.41 14.19 15.21
14.75 14.41 14.04 13.68 15.31 14.32 13.64 14.77 14.30 14.62
14.10 15.47 13.73 13.65 15.02 14.01 14.92 15.47 13.75 14.87
15.28 14.43 13.96 14.57 15.49 15.13 14.23 14.44 14.57
(a) Plot the empirical distribution function, a histogram, and a normal probability plot. Find the 0.9, 0.75, 0.5,
0.25, and 0.1 quantiles. Does the distribution appear Gaussian?
(b) The average percentage of hydrocarbons in a synthetic wax is 85%. Suppose that beeswax was diluted with
1% synthetic wax. Could this be detected? What about 3% and 5% dilution?
Problem 4
Calculate the hazard function for the Weibull distribution
F(t) = 1 − e^{−αt^β}, t ≥ 0.
Problem 5
Give an example of a distribution with an increasing failure rate. Give an example of a distribution with a decreasing
failure rate.
Problem 6
Of the 26 measurements of the heat of sublimation of platinum, 5 are outliers.
136.3 136.6 135.8 135.4 134.7 135 134.1 143.3 147.8 148.8
134.8 135.2 134.9 146.5 141.2 135.4 134.8 135.8 135 133.7
134.4 134.9 134.8 134.5 134.3 135.2
133:7
134:134
134:5788899
135:002244
135:88
136:36
High: 141.2, 143.3, 146.5, 147.8, 148.8
Problem 7
For the data in Problem 3.
(a) Find the mean, median, and 10% and 20% trimmed means.
(b) Find an approximate 90% confidence interval for the mean.
(c) Find a confidence interval with coverage near 90% for the median.
(d) Use the bootstrap to find approximate standard errors of the trimmed means.
(f) Find and compare the standard deviation of the measurements, the IQR, and the MAD.
(g) Use the bootstrap to approximate the standard error of the upper quartile.
Problem 8
Olson, Simpson, and Eden (1975) discuss the analysis of data obtained from a cloud seeding experiment. The
following data present the rainfall from 26 seeded and 26 control clouds.
Seeded clouds
129.6, 31.4, 2745.6, 489.1, 430, 302.8, 119, 4.1, 92.4, 17.5,
200.7, 274.7, 274.7, 7.7, 1656, 978, 198.6, 703.4, 1697.8, 334.1,
118.3, 255, 115.3, 242.5, 32.7, 40.6
Control clouds
26.1, 26.3, 87, 95, 372.4, .01, 17.3, 24.4, 11.5, 321.2,
68.5, 81.5, 47.3, 28.6, 830.1, 345.5, 1202.6, 36.6, 4.9, 4.9,
41.1, 29, 163, 244.3, 147.8, 21.7
Make a QQ-plot for rainfall versus rainfall and log rainfall versus log rainfall. What do these plots suggest about
the effect, if any, of seeding?
7 Comparing two samples
The difference (x̄ − ȳ) is an unbiased estimate of μ1 − μ2. We are interested in finding the standard error of x̄ − ȳ
and an interval estimate for the difference µ1 − µ2 , as well as testing the null hypothesis of equality
H0 : µ1 = µ2 .
Two main settings will be addressed: two independent samples and paired samples.
Var(X̄ − Ȳ) = σ²_X̄ + σ²_Ȳ = σ1²/n + σ2²/m,
and
s²_{x̄−ȳ} = s²_x̄ + s²_ȳ = s1²/n + s2²/m
gives an unbiased estimate of Var(X̄ − Ȳ). Therefore, s_{x̄−ȳ} will be called the (estimated) standard error of the point estimate x̄ − ȳ.
((X̄ − Ȳ) − (μ1 − μ2)) / √(S²_X̄ + S²_Ȳ) ≈ N(0, 1).
The hypothesis
H0 : µ1 = µ2
is tested using the test statistic
z = (x̄ − ȳ) / √(s²_x̄ + s²_ȳ).
Approximate confidence interval formula: I_{μ1−μ2} = x̄ − ȳ ± z_{α/2} · √(s²_x̄ + s²_ȳ).
Two-sample t-test
The key assumption of the two-sample t-test: the two populations are normally distributed with equal variances, σ1² = σ2² = σ². The common variance is estimated by the pooled sample variance
s_p² = ((n − 1)s1² + (m − 1)s2²) / (n + m − 2).
The two-sample t-test uses the test statistic t = ((x̄ − ȳ)/s_p) · √(nm/(n + m)) for testing H0: μ1 = μ2. The null distribution of the test statistic is
T ∼ t_{n+m−2}.
The graphs below show that the population distributions are not normal. Therefore, to test H0 : µ1 = µ2 we use
the large sample test. Using the observed value
z_obs = (x̄ − ȳ) / √(s²_x̄ + s²_ȳ) = 0.7,
and applying the normal distribution table we find an approximate two-sided p-value = 0.48.
Left panel: boxplots for percentages of Fe2+ (left) and Fe3+ (right). Right panel: two normal probability plots.
After the log transformation the data look more like normally distributed, as seen from the graphs below. For
the transformed data, we get
n = 18, x̄0 = 2.09, s01 = 0.659, sx̄0 = 0.155,
m = 18, ȳ 0 = 1.90, s02 = 0.574, sȳ0 = 0.135.
A two-sample t-test for the transformed data tests
H0: F1 = F2 against H1: F1 ≠ F2.
For n ≥ 10, m ≥ 10 apply the normal approximation for the null distributions of R1 and R2; in particular, E(R1) = n(n + m + 1)/2 and Var(R1) = nm(n + m + 1)/12.
With help of this formula we can test the null hypothesis of equality
H0 : p 1 = p 2 .
H0: p = 0.4 vs H1: p ≠ 0.4.
Fisher's exact test deals with H0: p1 = p2 when the sample sizes m and n are not sufficiently large for applying normal approximations to the binomial distributions. We summarise the data of two independent samples as a 2 × 2 table of sample counts.
Fisher's idea for this case was to use X as a test statistic conditionally on the total number of successes x + y. Under the null hypothesis, the conditional distribution of X is hypergeometric,
X ∼ Hg(N, n, p).
This null distribution should be used for determining the rejection rule of the Fisher test.
H0 : p1 = p2 no gender bias,
against
H1 : p1 > p2 males are favoured.
Fisher's test would reject H0 in favour of the one-sided alternative H1 for large values of x under the null distribution
P(X = x) = \binom{35}{x}\binom{13}{24−x} / \binom{48}{24} = \binom{35}{35−x}\binom{13}{x−11} / \binom{48}{24},   11 ≤ x ≤ 24,
so that the one-sided p-value = 0.025 and the two-sided p-value = 0.05. We conclude that there is significant evidence of sex bias, and reject the null hypothesis.
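A Matlab sketch of this computation (hygecdf is assumed from the Statistics Toolbox; the observed count x_obs = 21 is an illustrative value consistent with the quoted one-sided p-value):
x_obs = 21;                                 % hypothetical observed count of promoted males
p_one = 1 - hygecdf(x_obs - 1, 48, 35, 24)  % P(X >= x_obs), about 0.025
p_two = 2*p_one                             % two-sided p-value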
Consider now paired samples
(x1, y1), …, (xn, yn).
As before, our main question is whether the difference μ1 − μ2 is statistically significant. To this end, we turn to the one-dimensional iid sample of differences
(d1, …, dn),   di = xi − yi.
The population mean difference µ1 − µ2 is estimated by d¯ = x̄ − ȳ. This is an unbiased estimate whose variance
value takes into account dependence between X and Y . Observe that
Since Xi and Yj are independent for i ≠ j, we get
Cov(X1 + . . . + Xn , Y1 + . . . + Yn ) = Cov(X1 , Y1 ) + . . . + Cov(Xn , Yn ) = nCov(X, Y ) = nσ1 σ2 ρ,
where
ρ = Cov(X, Y)/(σ1σ2)
is the correlation coefficient of the joint population distribution of (X, Y). Thus
Var(X̄ − Ȳ) = (1/n)(σ1² + σ2² − 2σ1σ2ρ).
If the samples are independent and have equal sizes, then ρ = 0 and
Var(X̄ − Ȳ) = Var(X̄) + Var(Ȳ) = (1/n)(σ1² + σ2²).
Importantly, if ρ > 0, then
Var(X̄ − Ȳ) < Var(X̄) + Var(Ȳ),
which demonstrates that paired sampling with ρ > 0 ensures a smaller standard error for the estimate x̄ − ȳ compared with the case of two independent samples.
Signed rank test
The sign test disregards a lot of information in the data taking into account only the sign of the differences. The
signed rank test pays attention to sizes of positive and negative differences. This is a non-parametric test for the
null hypothesis of no difference. The null hypothesis consists of two parts: symmetry of the distribution of the differences around zero and m = 0. Test statistics: either
w+ = Σi=1..n rank(|di|) · 1{di > 0}
or
w− = Σi=1..n rank(|di|) · 1{di < 0}.
In all cases
w+ + w− = n(n + 1)/2.
The null distributions of W+ and W− are the same and are tabulated for smaller values of n. For n ≥ 20, one can use the normal approximation of the null distribution with mean and variance
E(W+) = n(n + 1)/4,   Var(W+) = n(n + 1)(2n + 1)/24.
The signed rank test uses more data information than the sign test
but requires symmetric distribution of differences.
Nuisance factors that a paired design helps to control include:
• placebo effect,
• time factor,
• location factor.
Simpson’s paradox
Hospital A has a higher overall death rate than hospital B. However, if we split the data into two parts, patients in good (+) and bad (−) condition, then for both parts hospital A performs better.
Hospital: A B A+ B+ A– B–
Died 63 16 6 8 57 8
Survived 2037 784 594 592 1443 192
Total 2100 800 600 600 1500 200
Death Rate .030 .020 .010 .013 .038 .040
Here, the external factor, patient condition, is an example of a confounding factor:
Hospital performance ← Patient condition → Death rate
7.5 Exercises
Problem 1
Four random numbers generated from a normal distribution
x1 = 1.1650, x2 = 0.6268, x3 = 0.0751, x4 = 0.3516,
along with five random numbers with the same variance σ² but perhaps a different mean
y1 = 0.3035, y2 = 2.6961, y3 = 1.0591, y4 = 2.7971, y5 = 1.2641.
(a) What do you think the means of the random normal number generators were? What do you think the
difference of the means was?
(b) What do you think the variance of the random number generator was?
(c) What is the estimated standard error of your estimate of the difference of the means?
(d) Form a 90% confidence interval for the difference of the means.
(e) In this situation, is it more appropriate to use a one-sided test or a two-sided test of the equality of the
means?
(f) What is the p-value of a two-sided test of the null hypothesis of equal means?
(g) Would the hypothesis that the means were the same versus a two-sided alternative be rejected at the
significance level α = 0.1?
(h) Suppose you know that the variance of the normal distribution was σ 2 = 1. How would your answers to
the preceding questions change?
Problem 2
In the ”two independent samples” setting we have two ways of estimating the variance of X̄ − Ȳ :
(a) s_p²(1/n + 1/m), if σx = σy;
(b) s_x²/n + s_y²/m, without the assumption of equal variances.
Show that if m = n, then these two estimates are identical.
Problem 3
An experiment on the efficacy of a drug for reducing high blood pressure is performed using four subjects in the following way: two of the subjects are chosen at random for the control group and two for the treatment group. During the course of the treatment with the drug, the blood pressure of each of the subjects in the treatment group is measured for ten consecutive days, as is the blood pressure of each of the subjects in the control group.
(a) In order to test whether the treatment has an effect, do you think it is appropriate to use the two-sample t
test with n = m = 20?
(b) Do you think it is appropriate to use the rank sum test?
Problem 4
Let x1 , . . . , x25 be an iid-sample drawn from N(0.3, 1). Consider testing at α = 0.05
H0 : µ = 0, H1 : µ > 0.
Compare
(a) the power of the sign test, and
(b) the power of the test based on the normal theory assuming that σ is known.
Problem 5
Suppose that n measurements are to be taken under a treatment condition and another n measurements are to be
taken independently under a control condition. It is thought that the standard deviation of a single observation is
about 10 under both conditions. How large should n be so that a 95% confidence interval for the mean difference
has a width of 2? Use the normal distribution rather than the t-distribution, since n will turn out to be rather
large.
Problem 6
Data: millions of cycles until failure for two types of engine bearings
Type I Type II
3.03 3.19
5.53 4.26
5.60 4.47
9.30 4.53
9.92 4.67
12.51 4.69
12.95 6.79
15.21 9.37
16.04 12.75
16.84 12.78
(a) Use normal theory to test the null hypothesis of no difference against the two-sided alternative
H0: μx = μy, H1: μx ≠ μy.
(b) Test the hypothesis that there is no difference between the two types of bearing using a nonparametric
method.
(c) Which of the methods (a) or (b) do you think is better in this case?
(d) Estimate π, the probability that a type I bearing will outlast a type II bearing.
(e) Use the bootstrap to estimate the sampling distribution of π̂ and its standard error.
Problem 7
Find the exact null distribution for the test statistic of the signed rank test with n = 4.
Problem 8
Turn to the two-sided signed rank test. For n = 10, 20, 25 and α = 0.05, 0.01, compare the critical values from the table with those obtained using the normal approximation of the null distribution.
Problem 9
Two population distributions with σx = σy = 10. Two samples of size n = 25 can be taken in two ways: as two independent samples or as paired samples. Compare the two designs for testing
H0: μx = μy, H1: μx > μy, α = 0.05.
Problem 10
Lin, Sutton, and Qurashi (1979) compared microbiological and hydroxylamine methods for the analysis of ampicillin
dosages. In one series of experiments, pairs of tablets were analysed by the two methods. The data in the table
give the percentages of claimed amount of ampicillin found by the two methods in several pairs of tablets.
Microbiological method Hydroxylamine method
97.2 97.2
105.8 97.8
99.5 96.2
100 101.8
93.8 88
79.2 74
72 75
72 67.5
69.5 65.8
20.5 21.2
95.2 94.8
90.8 95.8
96.2 98
96.2 99
91 100.2
What are x̄ − ȳ and sx̄−ȳ ? If the pairing had been erroneously ignored and it had been assumed that the two
samples were independent, what would have been the estimate of the standard deviation of X̄ − Ȳ ? Analyse the
data to determine if there is a systematic difference between the two methods.
Problem 11
The media often present short reports of the results of experiments. To the critical reader, such reports often raise
more questions than they answer. Comment on the following pitfalls in the interpretation of each of the following.
(a) It is reported that patients whose hospital rooms have a window recover faster than those whose rooms do
not.
(b) Nonsmoking wives whose husbands smoke have a cancer rate twice that of wives whose husbands do not
smoke.
(c) A two-year study in North Carolina found that 75% of all industrial accidents in the state happened to
workers who had skipped breakfast.
(d) A school integration program involved busing children from minority schools to majority (primarily white)
schools. Participation in the program was voluntary. It was found that the students who were bused scored lower
on standardised tests than did their peers who chose not to be bused.
(e) When a group of students were asked to match pictures of newborns with pictures of their mothers, they were correct 36% of the time.
(f) A survey found that those who drank a moderate amount of beer were healthier than those who totally
abstained from alcohol.
(g) A 15-year study of more than 45 000 Swedish soldiers revealed that heavy users of marijuana were six times
more likely than nonusers to develop schizophrenia.
(h) A University of Wisconsin study showed that within 10 years of wedding, 38% of those who had lived
together before marriage had split up, compared to 27% of those who had married without a ”trial period”.
(i) A study of nearly 4000 elderly North Carolinians has found that those who attended religious services every
week were 46% less likely to die over a six-year period than people who attended less often or not at all.
8 Analysis of variance
The two-sample setting of the previous section is the case of a single main factor having two levels. In this section, we extend the setting first to a single main factor with arbitrarily many levels (one-way layout) and then to two main factors (two-way layout); the latter generalises the paired sample setting of the previous section.
Y = μ(X) + ε,   E(ε) = 0,
where
μ(x) = E(Y | X = x)
is the expected value of the response conditioned on the value of X, while the noise component ε (incorporating all factors other than the main one) is assumed to be independent of X.
We choose particular values x1 , . . . , xI of the explanatory variable X, and denote the corresponding conditional
expectations by
µi = µ(xi ), i = 1, . . . , I.
For each of the I levels of the main factor A, we independently collect an iid sample (yi1, …, yiJ) of the same size J. Having such I independent samples we want a utility test of
H0: μ1 = … = μI.
If the collected response compares I different treatments, then the null hypothesis claims that all treatments have the same effect (so that the factor A has no influence on the response Y and the suggested model is not useful).
The data is summarised below in the form of seven boxplots. Ordered means
Lab i 1 3 7 2 5 6 4
Mean µi 4.062 4.003 3.998 3.997 3.957 3.955 3.920
Here the null hypothesis of interest states that there is no significant difference between the output of the seven
laboratories.
[Figure: boxplots of the measurements from the seven laboratories.]
Each observation is a sum of the overall mean, the differential main effect, and the noise:
yij = μ + αi + εij.
Using the maximum likelihood approach we find the point estimates μ̂ = ȳ.. and α̂i = ȳi. − ȳ.. It follows that
yij = μ̂ + α̂i + ε̂ij,   ε̂ij = yij − ȳi.,   Σi=1..I α̂i = 0,
and the total sum of squares decomposes as SST = SSA + SSE, where
SST = Σi Σj (yij − ȳ..)² is the total sum of squares for the pooled sample with dfT = IJ − 1,
SSA = J Σi (ȳi. − ȳ..)² is the between-group sum of squares with dfA = I − 1,
SSE = Σi Σj (yij − ȳi.)² is the within-group sum of squares with dfE = I(J − 1).
This decomposition says that the total variation in the response values is the sum of the between-group variation and the within-group variation. After normalising by the numbers of degrees of freedom, we obtain the so-called mean squares
MSA = SSA/dfA,   MSE = SSE/dfE.
If treated as random variables, these lead to the following formulas for the expected values:
E(MSA) = σ² + (J/(I − 1)) Σi αi²,   E(MSE) = σ²,
which suggests looking at the ratio MSA/MSE of the two mean squares to find evidence against the null hypothesis
H0: α1 = … = αI = 0.
One-way F -test
The pooled sample variance
s_p² = MSE = (1/(I(J − 1))) Σi Σj (yij − ȳi.)²
is an unbiased estimate of σ². F-test rejection rule: use F = MSA/MSE as a test statistic and reject H0 for large values of F based on the null distribution
F ∼ F_{n1,n2} under H0, where n1 = I − 1, n2 = I(J − 1).
The F-distribution F_{n1,n2} with degrees of freedom (n1, n2) is the distribution of the ratio
(X1/n1)/(X2/n2), where X1 ∼ χ²_{n1} and X2 ∼ χ²_{n2} are two independent random variables.
Labs 1–4 1–6 1–5 3–4 7–4 2–4 1–2 1–7 1–3 5–4
Diff 0.142 0.107 0.105 0.083 0.078 0.077 0.065 0.064 0.059 0.047
The multiple comparison problem: the above confidence interval formula is aimed at a single difference, and may
produce false discoveries. We need a simultaneous confidence interval formula for all 21 pairwise comparisons.
Bonferroni method
Think of a statistical test repeatedly applied to k independent samples of size n. The overall result is positive if we get at least one positive result among these k tests. The overall significance level α is obtained if each single test is performed at significance level α0 = α/k. Indeed, assuming the null hypothesis is true, the number of positive results is X ∼ Bin(k, α0), so for small values of α0,
P(X ≥ 1) = 1 − (1 − α0)^k ≈ kα0 = α.
This yields Bonferroni's formula of a 100(1 − α)% simultaneous confidence interval which can be used (as an approximation) for the k = \binom{I}{2} pairwise differences (μu − μv):
B_{μu−μv} = ȳu. − ȳv. ± t_df(α/(2k)) · s_p√(2/J),   1 ≤ u < v ≤ I,
For the laboratory data, df = 63, k = 21, and t63(0.0012) = 3.17; this detects 3 significant differences, between labs (1,4), (1,5), (1,6).
Tukey method
If I independent samples (yi1, …, yiJ) taken from N(μi, σ²) have the same size J, then
Zi = Ȳi. − μi ∼ N(0, σ²/J)
are independent, and we consider their range
R = max{Z1, …, ZI} − min{Z1, …, ZI}.
The normalised range has a distribution that is free from the parameter σ:
R/(S_p/√J) ∼ SR(I, df),   df = I(J − 1).
The so-called studentised range distribution SR has two parameters: the number of samples and the number of degrees of freedom used in the variance estimate s_p². Tukey's simultaneous confidence interval is built using an appropriate quantile q_{I,df}(α) of the studentised range distribution. It takes into account the dependences between the differences μu − μv.
Tukey's 100(1 − α)% simultaneous confidence interval: T_{μu−μv} = ȳu. − ȳv. ± q_{I,df}(α) · s_p/√J,
which brings four significant pairwise differences: (1,4), (1,5), (1,6), (3,4).
Extending the idea of the rank-sum test, consider the pooled sample of size N = IJ. Let rij be the pooled ranks of the sample values yij, so that
Σi Σj rij = 1 + 2 + … + N = N(N + 1)/2,
implying that the mean rank is r̄.. = (N + 1)/2.
Kruskal-Wallis test statistic: W = (12J/(N(N + 1))) Σi=1..I (r̄i. − (N + 1)/2)².
Reject H0 for large W using the null distribution table. For I = 3, J ≥ 5 or I ≥ 4, J ≥ 4, use the approximate null distribution W ≈ χ²_{I−1} under H0.
Labs 1 2 3 4 5 6 7
70 4 35 6 46 48 38
63 3 45 7 21 5 50
53 65 40 13 47 22 52
64 69 41 20 8 28 58
59 66 57 16 14 37 68
54 39 32 26 42 2 1
43 44 51 17 9 31 15
61 56 25 11 10 34 23
67 24 29 27 33 49 60
55 19 30 12 36 18 62
Means 58.9 38.9 38.5 15.5 26.6 27.4 42.7
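A Matlab sketch computing the Kruskal-Wallis statistic from the mean ranks in the bottom row of the table (chi2cdf is assumed from the Statistics Toolbox):
I = 7; J = 10; N = I*J;
rbar = [58.9 38.9 38.5 15.5 26.6 27.4 42.7];    % mean pooled ranks per lab
W = 12*J/(N*(N+1)) * sum((rbar - (N+1)/2).^2)   % about 28.1
pval = 1 - chi2cdf(W, I-1)                      % approximate p-value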
8.4 Two-way layout
Suppose the data values are influenced by two explanatory variables (X1, X2) and noise ε:
Y = μ(X1, X2) + ε,   E(ε) = 0,
where μ(x1, x2) is the conditional expectation of the response variable. Choosing I values for the variable X1 and J values for the variable X2, we generate IJ independent samples, each of size K. We say that the main factor A has I levels, the main factor B has J levels, and denote by μij the expected response at level combination (i, j). These are decomposed as
μij = μ + αi + βj + δij,
where
• µ is the grand mean,
• (α1 , . . . , αI ) is the main A-effect, with α1 + . . . + αI = 0,
• (β1 , . . . , βJ ) is the main B-effect, with β1 + . . . + βJ = 0,
• {δij, i = 1, …, I, j = 1, …, J} is the AB-interaction effect, with Σi δij = 0 for each j and Σj δij = 0 for each i.
When an interaction effect is present, the impact of one factor depends on the level of the other factor. Without
the interaction effect, the joint impact of two factors is described by an additive model:
µij = µ + αi + βj .
The observations are modelled as
yijk = μ + αi + βj + δij + εijk,
where the εijk ∼ N(0, σ²) are independent and have the same variance. Here we assume that for each combination of levels (i, j) of the two main factors, K ≥ 2 replications are performed. The maximum likelihood estimates are
μ̂ = ȳ... = (1/(IJK)) Σi Σj Σk yijk,
α̂i = ȳi.. − ȳ...,   ȳi.. = (1/(JK)) Σj Σk yijk,
β̂j = ȳ.j. − ȳ...,   ȳ.j. = (1/(IK)) Σi Σk yijk,
δ̂ij = ȳij. − ȳ... − α̂i − β̂j = ȳij. − ȳi.. − ȳ.j. + ȳ...,   ȳij. = (1/K) Σk yijk.
[Figure: boxplots of the six raw samples of iron retention percentages.]
The raw data yijk are the percentages of iron retained in mice. Factor A has I = 2 levels representing two iron forms, while factor B has J = 3 levels representing dosage concentrations. Six samples are collected with K = 18 replications for each (iron form, dosage) combination. Six boxplots of these samples (see above) show that the raw data are not normally distributed and have different variances across the six samples.
Fe3+ (10.2) Fe3+ (1.2) Fe3+ (0.3) Fe2+ (10.2) Fe2+ (1.2) Fe2+ (0.3)
.71 2.2 2.25 2.2 4.04 2.71
1.66 2.93 3.93 2.69 4.16 5.43
2.01 3.08 5.08 3.54 4.42 6.38
2.16 3.49 5.82 3.75 4.93 6.38
2.42 4.11 5.84 3.83 5.49 8.32
2.42 4.95 6.89 4.08 5.77 9.04
2.56 5.16 8.5 4.27 5.86 9.56
2.6 5.54 8.56 4.53 6.28 10.01
3.31 5.68 9.44 5.32 6.97 10.08
3.64 6.25 10.52 6.18 7.06 10.62
3.74 7.25 13.46 6.22 7.78 13.8
3.74 7.9 13.57 6.33 9.23 15.99
4.39 8.85 14.76 6.97 9.34 17.9
4.5 11.96 16.41 6.97 9.91 18.25
5.07 15.54 16.96 7.52 13.46 19.32
5.26 15.89 17.56 8.36 18.4 19.87
8.15 18.3 22.82 11.65 23.89 21.6
8.24 18.59 29.13 12.45 26.39 22.25
However, the transformed data y′ijk = ln(yijk) produce more satisfactory boxplots.
[Figure: boxplots of the six log-transformed samples.]
The six sample means for the transformed data (ȳij.):

dosage:  10.2   1.2    0.3
Fe3+     1.16   1.90   2.28
Fe2+     1.68   2.10   2.40

[Figure: the two row profiles of the mean log retention plotted against dosage.]
When depicted as two profiles (one per row), the lines are not parallel, which indicates possible interaction. The maximum likelihood estimates are
ȳ... = 1.92, α̂1 = −0.14, α̂2 = 0.14, β̂1 = −0.50, β̂2 = 0.08, β̂3 = 0.42,
and
(δ̂ij) = [ −0.12  0.04  0.08 ;  0.12  −0.04  −0.08 ].
A two-way Anova table for the transformed iron retention data:
Source df SS MS F P
Iron form 1 2.074 2.074 5.99 0.017
Dosage 2 15.588 7.794 22.53 0.000
Interaction 2 0.810 0.405 1.17 0.315
Error 102 35.296 0.346
Total 107 53.768
According to the rightmost column,
• the dosage effect is undoubtedly significant; however, this is expected,
• the interaction is not statistically significant,
• there is a significant effect due to iron form (compare with the previous analysis of the two samples).
The estimated log-scale difference α̂2 − α̂1 = ȳ2.. − ȳ1.. = 0.28 yields the multiplicative effect e^{0.28} = 1.32 on the original scale, implying that the retention percentage of Fe2+ is 1.32 times that of Fe3+.
Three F -tests
We explain the Anova table above by starting with the sums of squares decomposition SST = SSA + SSB + SSAB + SSE, where
SSA = JK Σi α̂i²,   dfA = I − 1,
SSB = IK Σj β̂j²,   dfB = J − 1,
SSAB = K Σi Σj δ̂ij²,   dfAB = (I − 1)(J − 1),
SSE = Σi Σj Σk ε̂ijk²,   dfE = IJ(K − 1).
The mean sums of squares and their expected values:
MSA = SSA/dfA,   E(MSA) = σ² + (JK/(I − 1)) Σi αi²,
MSB = SSB/dfB,   E(MSB) = σ² + (IK/(J − 1)) Σj βj²,
MSAB = SSAB/dfAB,   E(MSAB) = σ² + (K/((I − 1)(J − 1))) Σi Σj δij²,
MSE = SSE/dfE,   E(MSE) = σ².
The three null hypotheses of no A-effect, no B-effect, and no interaction are tested with the F-ratios FA = MSA/MSE, FB = MSB/MSE, FAB = MSAB/MSE; reject a null hypothesis for large values of the respective test statistic. To check the normality assumption, inspect the normal probability plot of the residuals ε̂ijk.
Additive model
For the rest of this section suppose that K = 1. With only one replication per cell we cannot estimate the interaction. This leads to the additive model without interaction:
Yij = μ + αi + βj + εij,   εij ∼ N(0, σ²).
For the given data (yij), the maximum likelihood estimates and residuals are
μ̂ = ȳ..,   α̂i = ȳi. − ȳ..,   β̂j = ȳ.j − ȳ..,   ε̂ij = yij − ȳi. − ȳ.j + ȳ..,
and the sums of squares are
SSA = J Σi α̂i²,   dfA = I − 1,
SSB = I Σj β̂j²,   dfB = J − 1,
SSE = Σi Σj ε̂ij²,   dfE = (I − 1)(J − 1),
and
MSA = SSA/dfA,   E(MSA) = σ² + (J/(I − 1)) Σi αi²,
MSB = SSB/dfB,   E(MSB) = σ² + (I/(J − 1)) Σj βj²,
MSE = SSE/dfE,   E(MSE) = σ².
We can apply two F-tests for two different null hypotheses:
HA: α1 = … = αI = 0,   FA = MSA/MSE ∼ F_{dfA,dfE} under HA,
HB: β1 = … = βJ = 0,   FB = MSB/MSE ∼ F_{dfB,dfE} under HB.
Example: itching
Data: the duration of the itching in seconds yij , with K = 1 observation per cell, I = 7 treatments to relieve
itching applied to J = 10 male volunteers aged 20-30.
Subject No Drug Placebo Papaverine Morphine Aminophylline Pentabarbital Tripelennamine
8.7 Friedman test
Here we introduce another nonparametric test, which does not require that the εij are normally distributed, for testing H0: no treatment effect. The Friedman test is based on within-block ranking. Let r1j, …, rIj be the ranks of the observations within the j-th block, so that
r1j + … + rIj = 1 + 2 + … + I = I(I + 1)/2.
For these ranks we have (1/I)(r1j + … + rIj) = (I + 1)/2 and therefore r̄.. = (I + 1)/2.
Friedman test statistic: Q = (12J/(I(I + 1))) Σi=1..I (r̄i. − (I + 1)/2)², with approximate null distribution Q ≈ χ²_{I−1} under H0.
The test statistic Q is a measure of agreement between the J rankings, so we reject H0 for large values of Q.
Example: itching
From the rank values rij and r̄i. given in the next table and (I + 1)/2 = 4, we find the Friedman test statistic value to be Q = 14.86. Using the chi-squared distribution table with df = 6, the p-value is approximately 2.14%. We reject the null hypothesis of no effect even in this non-parametric setting.
BG 5 7 1 6 3.5 2 3.5
JF 6 5 1 2 3 7 4
BS 7 6 4 2 1 5 3
SI 6 7 1 4 3 2 5
BW 3 4 2 5 1 7 6
TS 7 3 1 5 2 4 6
GM 7 5 1 2 3 6 4
SS 1 2 5 3 7 6 4
MU 5 3 2 4 6 7 1
OS 4 7 5 2 1 3 6
r̄i. 5.10 4.90 2.30 3.50 3.05 4.90 4.25
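A Matlab sketch reproducing Q from the mean ranks in the bottom row (chi2cdf is assumed from the Statistics Toolbox):
I = 7; J = 10;
rbar = [5.10 4.90 2.30 3.50 3.05 4.90 4.25];   % mean within-block ranks
Q = 12*J/(I*(I+1)) * sum((rbar - (I+1)/2).^2)  % gives Q = 14.86
pval = 1 - chi2cdf(Q, I-1)                     % about 0.021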
8.8 Exercises
Problem 1
A study on the tensile strength of aluminium rods is conducted. Forty identical rods are randomly divided into
four groups, each of size 10. Each group is subjected to a different heat treatment, and the tensile strength, in
thousands of pounds per square inch, of each rod is determined. The following data result:
Treatment 1 2 3 4 Combined data
18.9 18.3 21.3 15.9 18.9 18.3 21.3 15.9
20.0 19.2 21.5 16.0 20.0 19.2 21.5 16.0
20.5 17.8 19.9 17.2 20.5 17.8 19.9 17.2
20.6 18.4 20.2 17.5 20.6 18.4 20.2 17.5
19.3 18.8 21.9 17.9 19.3 18.8 21.9 17.9
19.5 18.6 21.8 16.8 19.5 18.6 21.8 16.8
21.0 19.9 23.0 17.7 21.0 19.9 23.0 17.7
22.1 17.5 22.5 18.1 22.1 17.5 22.5 18.1
20.8 16.9 21.7 17.4 20.8 16.9 21.7 17.4
20.7 18.0 21.9 19.0 20.7 18.0 21.9 19.0
mean 20.34 18.34 21.57 17.35 19.40
variance 0.88 0.74 0.88 0.89 3.58
skewness 0.16 0.14 -0.49 -0.08 0.05
kurtosis 2.51 2.59 2.58 2.46 1.98
Consider the null hypothesis of equality between the four treatment means of tensile strength.
(a) Test the null hypothesis applying an ANOVA test. Show clearly how all the sums of squares are computed
using the sample means and variances given in the table.
(b) What are the assumptions of the ANOVA model you used? Do they appear fulfilled?
(c) The Bonferroni method suggests the following formula for computing simultaneous 95% confidence intervals for the six pairwise differences between the four treatment means:
B_{μu−μv} = ȳu. − ȳv. ± t36(0.05/12) · s_p√(2/10).
Explain this formula and, using it, check which of the pairs of treatments have significantly different means.
Problem 2
For a one-way analysis of variance with two treatment groups, show that the F statistic is t2 , where t is the test
statistic for a two-sample t-test.
Problem 3
Derive the likelihood ratio test for the null hypothesis of the one-way layout, and show that it is equivalent to the
F-test.
Problem 4
Suppose in a one-way layout there are 10 treatments and seven observations under each treatment. What is the
ratio of the length of a simultaneous confidence interval for the difference of two means formed by Tukey’s method
to that of one formed by the Bonferroni method? How do both of these compare in length to an interval based on
the t-distribution that does not take account of multiple comparisons?
Problem 5
During each of four experiments on the use of carbon tetrachloride as a worm killer, ten rats were infested with
larvae (Armitage 1983). Eight days later, five rats were treated with carbon tetrachloride; the other five were kept
as controls. After two more days, all the rats were killed and the numbers of worms were counted. The table below
gives the counts of worms for the four control groups.
Significant differences among the control groups, although not expected, might be attributable to changes in the ex-
perimental conditions. A finding of significant differences could result in more carefully controlled experimentation
and thus greater precision in later work.
Use both graphical techniques and the F-test to test whether there are significant differences among the four
groups. Use a nonparametric technique as well.
Problem 6
The concentrations (in nanograms per milliliter) of plasma epinephrine were measured for 10 dogs under isofluorane,
halothane, and cyclopropane anesthesia. The measurements are given in the following table (Perry et al. 1974).
Dog 1 Dog 2 Dog 3 Dog 4 Dog 5 Dog 6 Dog 7 Dog 8 Dog 9 Dog 10
Isofluorane 0.28 0.51 1.00 0.39 0.29 0.36 0.32 0.69 0.17 0.33
Halothane 0.30 0.39 0.63 0.68 0.38 0.21 0.88 0.39 0.51 0.32
Cyclopropane 1.07 1.35 0.69 0.28 1.24 1.53 0.49 0.56 1.02 0.30
Problem 7
The following table gives the survival times (in hours) for animals in an experiment whose design consisted of three
poisons, four treatments, and four observations per cell.
Treatment A Treatment B Treatment C Treatment D
Poison I 3.1 4.5 8.2 11.0 4.3 4.5 4.5 7.1
4.6 4.3 8.8 7.2 6.3 7.6 6.6 6.2
Poison II 3.6 2.9 9.2 6.1 4.4 3.5 5.6 10.0
4.0 2.3 4.9 12.4 3.1 4.0 7.1 3.8
Poison III 2.2 2.1 3.0 3.7 2.3 2.5 3.0 3.6
1.8 2.3 3.8 2.9 2.4 2.2 3.1 3.3
(a) Conduct a two-way analysis of variance to test the effects of the two main factors and their interaction.
(b) Box and Cox (1964) analysed the reciprocals of the data, pointing out that the reciprocal of a survival time
can be interpreted as the rate of death. Conduct a two-way analysis of variance, and compare to the results of part
(a). Comment on how well the standard two-way ANOVA model fits and on the interaction in both analyses.
9 Analysis of categorical data
Consider a cross-classification for a pair of categorical factors A and B. If factor A has I levels and factor B has J levels, then the population distribution of a single cross-classification event has the form
b1 b2 ... bJ Total
a1 π11 π12 ... π1J π1·
a2 π21 π22 ... π2J π2·
... ... ... ... ... ...
aI πI1 πI2 ... πIJ πI·
Total π·1 π·2 ... π·J 1
Here
πij = P(A = ai, B = bj)
are the joint probabilities, with marginal probabilities πi· and π·j. The null hypothesis of independence claims that there is no relationship between factors A and B:
H0: πij = πi· π·j for all pairs (i, j).
Dividing each cell probability by its column total gives the conditional probabilities πi|j = πij/π·j:
b1 b2 ... bJ
a1 π1|1 π1|2 ... π1|J
a2 π2|1 π2|2 ... π2|J
... ... ... ... ...
aI πI|1 πI|2 ... πI|J
Total 1 1 ... 1
9.1 Chi-squared test of homogeneity
Consider a table of I × J observed counts obtained from J independent samples taken from J population distribu-
tions:
Pop. 1 Pop. 2 ... Pop. J Total
Category 1 n11 n12 ... n1J n1·
Category 2 n21 n22 ... n2J n2·
... ... ... ... ... ...
Category I nI1 nI2 ... nIJ nI·
Sample sizes n·1 n·2 ... n·J n··
The total number of degrees of freedom for J independent samples over I categories is J(I − 1).
Under the hypothesis of homogeneity H0 : πi|j = πi for all (i, j), the maximum likelihood estimates of πi are
the pooled sample proportions
π̂i = ni· /n·· , i = 1, . . . , I.
These estimates consume (I − 1) degrees of freedom, since their sum is 1. Using the maximum likelihood estimates we compute the expected cell counts
Eij = n·j π̂i = ni· n·j / n··
and the chi-squared test statistic
χ² = Σi Σj (nij − Eij)²/Eij.
We reject H0 for large values of χ² using the approximate null distribution χ² ≈ χ²_df with df = J(I − 1) − (I − 1) = (I − 1)(J − 1). In the example summarised above, χ² = 27.24 with df = (3 − 1)(3 − 1) = 4. After comparing χ² with the table value χ²4(0.005) = 14.86, we reject the hypothesis of homogeneity at the 0.5% significance level. Persons who saw themselves as cautious conservatives are more likely to express a favourable opinion of small cars.
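As a sketch, the same chi-squared computation in Matlab, applied for illustration to the diabetes allele counts of Problem 1 in the exercises below:
n = [12 4; 39 49];                      % observed I x J counts
E = sum(n,2)*sum(n,1)/sum(n(:));        % expected counts n_i. * n_.j / n_..
chi2 = sum((n(:) - E(:)).^2 ./ E(:))    % about 5.10
df = (size(n,1)-1)*(size(n,2)-1);
pval = 1 - chi2cdf(chi2, df)            % about 0.024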
b1 b2 ... bJ Total
a1 n11 n12 ... n1J n1.
a2 n21 n22 ... n2J n2.
... ... ... ... ... ...
aI nI1 nI2 ... nIJ nI.
Total n.1 n.2 ... n.J n..
For the chi-squared test of independence, the whole table of counts (nij) comes from a single sample of size n··, and its joint distribution is multinomial. Under H0 the expected cell counts are estimated by
Eij = n·· π̂i· π̂·j = ni· n·j / n··
with the same
df = (IJ − 1) − (I − 1 + J − 1) = (I − 1)(J − 1).
The chi-squared rejection rule is the same for the homogeneity test and the independence test.
The observed chi-squared test statistic is χ² = 16.01. With df = 1 we can use the normal distribution table, since Z ∼ N(0, 1) is equivalent to Z² ∼ χ²1. Thus
P(χ²1 ≥ 16.01) = 2(1 − Φ(√16.01)) = 2(1 − Φ(4.0)) ≈ 0.00006.
We see that the p-value of the test is less than 0.1%, and we reject the null hypothesis of independence. College-educated women, once they marry, are less likely to divorce.
The four counts of such a 2 × 2 table distinguish among sampled individuals who are exposed or not (X versus X̄) and affected or not (D versus D̄). Two sampling designs should be contrasted:
• a prospective study (take an X-sample and a control X̄-sample, then watch who gets affected) fixes the exposure totals n1 and n2 in advance,
• a retrospective study (take a D-sample and a control D̄-sample, then find who had been exposed) fixes the disease totals and gives informative counts.
Two retrospective case-control studies produced opposite results. Study A (Vianna, Greenwald, Davis, 1971) gave the cross-classification table
Study A X X̄
D 67 34
D̄ 43 64
The chi-squared test of homogeneity was applied. With χ²_A = 14.29 and df = 1, the p-value was found to be very small:
P(χ²_A ≥ 14.29) ≈ 2(1 − Φ(√14.29)) = 0.0002.
Study B (Johnson and Johnson, 1972) was summarised by a table
Study B X X̄
D 41 44
D̄ 33 52
and the chi-squared test of homogeneity was applied. With χ²_B = 1.53 and df = 1, the p-value was strikingly different:
P(χ²_B ≥ 1.53) ≈ 2(1 − Φ(√1.53)) = 0.215.
It turned out that study B was based on a matched-pairs design violating the assumptions of the chi-squared test of homogeneity. The sample consisted of m = 85 sibling pairs of the same sex and close age: one of the siblings was affected, the other not. A proper summary of the study B sample distinguishes among four groups of sibling pairs: (X, X), (X, X̄), (X̄, X), (X̄, X̄).
McNemar’s test
Consider data obtained by matched-pairs design for the population distribution
unaffected X unaffected X̄ Total
affected X p11 p12 p1.
affected X̄ p21 p22 p2.
p.1 p.2 1
The relevant null hypothesis is not the hypothesis of independence but rather
H0: p12 = p21.
The McNemar test statistic (n12 − n21)²/(n12 + n21) has an approximate χ² null distribution with
df = 4 − 1 − 2 = 1,
because 2 independent parameters are estimated from the data. We reject H0 for large values of the test statistic.
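A Matlab sketch, using as illustration the discordant matched-pair counts from Problem 6 of the exercises below (113 and 15):
n12 = 113; n21 = 15;                 % discordant pair counts
chi2 = (n12 - n21)^2/(n12 + n21)     % about 75.0
pval = 1 - chi2cdf(chi2, 1)          % essentially zero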
9.4 Odds ratios
Odds and probability of a random event A:
odds(A) = P(A)/P(Ā)   and   P(A) = odds(A)/(1 + odds(A)).
Conditional odds:
odds(A|B) = P(A|B)/P(Ā|B) = P(AB)/P(ĀB).
The odds ratio of A given B versus B̄:
Δ_AB = odds(A|B)/odds(A|B̄) = P(AB)P(ĀB̄) / (P(ĀB)P(AB̄)),
In the disease-exposure notation, the population table of conditional probabilities and the sample table of counts are

        X        X̄       Total            X     X̄    Total
D    P(X|D)   P(X̄|D)      1         D    n11   n12   n1·
D̄    P(X|D̄)  P(X̄|D̄)     1         D̄   n21   n22   n2·

and the odds ratio is estimated by Δ̂ = n11 n22 / (n12 n21).
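A Matlab sketch computing the estimated odds ratio for Study A above, together with the standard approximate confidence interval based on ln Δ̂:
n = [67 34; 43 64];                  % Study A counts: rows D, Dbar; columns X, Xbar
OR = n(1,1)*n(2,2)/(n(1,2)*n(2,1))   % about 2.93
se = sqrt(sum(1./n(:)));             % standard error of log(OR)
ci = exp(log(OR) + [-1 1]*1.96*se)   % approximate 95% CI for the odds ratio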
9.5 Exercises
Problem 1
Adult-onset diabetes is known to be highly genetically determined. A study was done comparing frequencies of a
particular allele in a sample of such diabetics and a sample of nondiabetics. The data is shown in the following
table:
Diabetic Normal Total
Bb or bb 12 4 16
BB 39 49 88
Total 51 53 104
Are the relative frequencies of the alleles significantly different in the two groups?
Problem 2
Overfield and Klauber (1980) published the following data on the incidence of tuberculosis in relation to blood
groups in a sample of Eskimos. Is there any association of the disease and blood group within the ABO system or
within the MN system?
O A AB B
Moderate 7 5 3 13
Minimal 27 32 8 18
Not present 55 50 7 24
MM MN NN
Moderate 21 6 1
Minimal 54 27 5
Not present 74 51 11
Problem 3
It is conventional wisdom in military squadrons that pilots tend to father more girls than boys. Snyder (1961) gathered data for military fighter pilots. The sex of the pilots' offspring was tabulated for three kinds of flight
duty during the month of conception, as shown in the following table.
Girl Boy
Flying fighter 51 38
Flying transport 14 16
Not flying 38 46
(a) Is there any significant difference between the three groups?
(b) In the United States in 1950, 105.37 males were born for every 100 females. Are the data consistent with
this sex ratio?
Problem 4
A randomised double-blind experiment compared the effectiveness of several drugs in ameliorating postoperative
nausea. All patients were anesthetized with nitrous oxide and ether. The following table shows the incidence of
nausea during the first four hours for each of several drugs and a placebo (Beecher 1959).
Number of patients Incidence of nausea
Placebo 165 95
Chlorpromazine 152 52
Dimenhydrinate 85 52
Pentobarbital (100 mg) 67 35
Pentobarbital (150 mg) 85 37
Compare the drugs to each other and to the placebo.
Problem 5
In a study of the relation of blood type to various diseases, the following data were gathered in London and
Manchester (Woolf 1955).
London Control Peptic Ulcer Manchester Control Peptic Ulcer
Group A 4219 579 Group A 3775 246
Group O 4578 911 Group O 4532 361
First, consider the two tables separately. Is there a relationship between blood type and propensity to peptic ulcer?
If so, evaluate the strength of the relationship. Are the data from London and Manchester comparable?
Problem 6
Records of 317 patients at least 48 years old who were diagnosed as having endometrial carcinoma were obtained
from two hospitals (Smith et al. 1975). Matched controls for each case were obtained from the two institutions:
the controls had cervical cancer, ovarian cancer, or carcinoma of the vulva. Each control was matched by age
at diagnosis (within four years) and year of diagnosis (within two years) to a corresponding case of endometrial
carcinoma.
The following table gives the number of cases and controls who had taken estrogen for at least 6 months prior
to the diagnosis of cancer.
Controls: estrogen used Controls: estrogen not used Total
Cases: estrogen used 39 113 152
Cases: estrogen not used 15 150 165
Total 54 263 317
(a) Is there a significant relationship between estrogen use and endometrial cancer?
(b) This sort of design, called a retrospective case-control study, is frequently used in medical investigations
where a randomised experiment is not possible. Do you see any possible weak points in a retrospective case-control
design?
Problem 7
A psychological experiment was done to investigate the effect of anxiety on a person’s desire to be alone or in
company (Lehman 1975). A group of 30 subjects was randomly divided into two groups of sizes 13 and 17. The
subjects were told that they would be subjected to some electric shocks, but one group (high-anxiety) was told
that the shocks would be quite painful and the other group (low-anxiety) was told that they would be mild and
painless. Both groups were told that there would be a 10-min wait before the experiment began, and each subject
was given the choice of waiting alone or with the other subjects. The following are the results:
Wait Together Wait Alone Total
High-Anxiety 12 5 17
Low-Anxiety 4 9 13
Total 16 14 30
Use Fisher’s exact test to test whether there is a significant difference between the high- and low-anxiety groups.
What is a reasonable one-sided alternative?
Problem 8
Hill and Barton (2005): red against blue outfits - does it matter in combat sports? Although other colors are also
present in animal displays, it is specifically the presence and intensity of red coloration that correlates with male
dominance and testosterone levels. Increased redness during aggressive interactions may reflect relative dominance.
In the 2004 Olympic Games, contestants in four combat sports were randomly assigned red and blue outfits.
The winner counts in different sports
Red Blue Total
Boxing 148 120 268
Freestyle wrestling 27 24 51
Greco-Roman wrestling 25 23 48
Tae Kwon Do 45 35 80
Total 245 202 447
(a) Let πR denote the probability that the contestant wearing red wins. Test the null hypothesis that πR = 0.5
versus the alternative hypothesis that πR is the same in each sport, but πR ≠ 0.5.
(b) Test the null hypothesis that πR = 0.5 versus the alternative hypothesis that allows πR to be different in
different sports, but not equal to 0.5.
(c) Are these hypothesis tests equivalent to that which would test the null hypothesis πR = 0.5 versus the
alternative hypothesis πR ≠ 0.5, using as data the total numbers of wins summed over all the sports?
(d) Is there any evidence that wearing red is more favourable in some of the sports than others?
Problem 9
Suppose that a company wishes to examine the relationship of gender to job satisfaction, grouping job satisfaction
into four categories: very satisfied, somewhat satisfied, somewhat dissatisfied, and very dissatisfied. The company
plans to ask the opinion of 100 employees. Should you, the company’s statistician, carry out a chi-square test of
independence or a test of homogeneity?
10 Multiple regression
Pearson’s father-son data. The following scatter diagram shows the heights of 1,078 fathers and their full-grown
sons, in England, circa 1900. There is one dot for each father-son pair.
Focussing on 6 feet tall fathers (above average height), we see that their sons on average are shorter than their
fathers. Francis Galton called this phenomenon regression to mediocrity.
Y = β0 + β1X + ε,   ε ∼ N(0, σ²),
where ε is the noisy part of the response that is not explained by the model. The key assumption of the model requires that ε has a normal distribution N(0, σ²) independent of X. This assumption is called homoscedasticity, meaning that the noise size σ is the same for all possible levels of the predictor variable.
Given observations
yi = β0 + β1xi + ei,   i = 1, …, n,
the linear regression model is characterised by the likelihood function of the three-dimensional parameter θ = (β0, β1, σ²):
L(θ) = Πi=1..n (1/(√(2π)σ)) exp{−(yi − β0 − β1xi)²/(2σ²)} f(xi) = Cσ⁻ⁿ e^{−S(β0,β1)/(2σ²)},
where f(x) is the probability density function of the random variable X and
C = (2π)^{−n/2} f(x1)⋯f(xn),   S(β0, β1) = Σi=1..n (yi − β0 − β1xi)².
This implies the following expression for the log-likelihood function l(θ) = ln L(θ):
l(θ) = ln C − n ln σ − S(β0, β1)/(2σ²).
Observe that
n⁻¹S(β0, β1) = n⁻¹ Σi=1..n (yi − β0 − β1xi)² = β0² + 2β0β1x̄ − 2β0ȳ − 2β1\overline{xy} + β1²\overline{x^2} + \overline{y^2},
where
x̄ = (x1 + … + xn)/n,   ȳ = (y1 + … + yn)/n,   \overline{x^2} = (x1² + … + xn²)/n,   \overline{y^2} = (y1² + … + yn²)/n,   \overline{xy} = (x1y1 + … + xnyn)/n,
and
s_x² = (1/(n − 1)) Σ (xi − x̄)²,   s_y² = (1/(n − 1)) Σ (yi − ȳ)².
As a result, the fitted regression line y = b0 + b1x takes the form
y = ȳ + r (s_y/s_x)(x − x̄),
involving the sample correlation coefficient defined using the sample covariance:
r = s_xy/(s_x s_y),   s_xy = (1/(n − 1)) Σ (xi − x̄)(yi − ȳ).
Notice that the maximum likelihood estimates (b0, b1) of the parameters (β0, β1) are obtained by minimising the sum of squares S(β0, β1); therefore, they are called the least squares estimates. Warning: the least squares estimates (b0, b1) are not robust against outliers exerting leverage on the fitted line.
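A Matlab sketch of the least squares fit via the sample moments (the data vectors here are synthetic placeholders; corr is assumed from the Statistics Toolbox):
x = (1:20)'; y = 2 + 0.5*x + randn(20,1);   % synthetic example data
r  = corr(x, y);                            % sample correlation coefficient
b1 = r*std(y)/std(x);                       % slope b1 = r * sy/sx
b0 = mean(y) - b1*mean(x)                   % the line passes through (xbar, ybar)
% equivalently, polyfit(x, y, 1) returns [b1 b0]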
Putting ∂l/∂σ² = 0 and replacing (β0, β1) with (b0, b1), we find the maximum likelihood estimate of σ² to be
σ̂² = S(b0, b1)/n,
where
S(b0, b1) = Σi=1..n (yi − ŷi)²,
and
ŷi = b0 + b1 xi
are the so-called predicted responses. The maximum likelihood estimate σ̂² is a biased but asymptotically unbiased estimate of σ². An unbiased estimate of σ² is given by
s² = S(b0, b1)/(n − 2).
10.2 Residuals
The fitted regression line
y = b0 + b1x = ȳ + r (s_y/s_x)(x − x̄),
is used for prediction of the response to a given predictor value x. For the given sample (x1 , y1 ), . . . , (xn , yn ), we
now compare the actual responses yi with the predicted responses ŷi . Introducing residuals by
êi = yi − ŷi ,
we can write
yi = ŷi + êi,   i = 1, …, n.
The residuals (ê1, …, ên) are linearly connected via Σi êi = 0 and Σi êi xi = 0, so we can say that the êi are uncorrelated with the xi and with the ŷi. The residuals êi are realisations of random variables having normal distributions with zero means and
Var(êi) = σ² (1 − Σk (xk − xi)² / (n(n − 1)s_x²)),   Cov(êi, êj) = −σ² · Σk (xk − xi)(xk − xj) / (n(n − 1)s_x²).
The error sum of squares
SSE = S(b0, b1) = Σi (yi − ŷi)² = Σi êi²
can be expressed as
SSE = Σi (yi − ȳ)² − 2r (s_y/s_x) n(\overline{xy} − ȳx̄) + r² (s_y²/s_x²) Σi (xi − x̄)² = (n − 1)s_y²(1 − r²).
Exact 100(1 − α)% confidence intervals: I_{βi} = bi ± t_{n−2}(α/2) · s_{bi}.
For i = 0 or i = 1 and a given value β*, one would like to test the null hypothesis H0: βi = β*. Use the test statistic
t = (bi − β*)/s_{bi},
which is a realisation of a random variable T that has the exact null distribution
T ∼ t_{n−2}.
A special case is the model utility test of
H0: β1 = 0,
stating that there is no relationship between the predictor variable x and the response y. Compute the corresponding test statistic, often called the t-value,
t = b1/s_{b1},
and find whether this value is significant using the t-distribution with df = n − 2.
The mean response μ = β0 + β1x for a given value x of the predictor is estimated by
μ̂ = b0 + b1x,
with
Var(μ̂) = σ²/n + (σ²/(n − 1)) · ((x − x̄)/s_x)².
Exact 100(1 − α)% confidence interval: I_μ = b0 + b1x ± t_{n−2}(α/2) · s √(1/n + (1/(n − 1)) ((x − x̄)/s_x)²).
Exact 100(1 − α)% prediction interval: I = b0 + b1x ± t_{n−2}(α/2) · s √(1 + 1/n + (1/(n − 1)) ((x − x̄)/s_x)²).
The prediction interval has wider limits since it incorporates the uncertainty due to the noise term. Graphically compare the formulas for I_μ and the prediction interval I by drawing the confidence bands around the regression line, both for an individual observation Y and for the mean μ.
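A Matlab sketch drawing both bands for a synthetic data set (tinv is assumed from the Statistics Toolbox):
x = (1:20)'; y = 2 + 0.5*x + randn(20,1);     % synthetic example data
n = length(x);
b = polyfit(x, y, 1); yhat = polyval(b, x);
s = sqrt(sum((y - yhat).^2)/(n - 2));         % unbiased estimate of sigma
xg = linspace(min(x), max(x), 100)';
d = 1/n + ((xg - mean(x))/std(x)).^2/(n - 1); % variance term depending on x
t = tinv(0.975, n - 2);
hw_mu = t*s*sqrt(d);                          % half-width of the CI for mu
hw_Y  = t*s*sqrt(1 + d);                      % half-width of the prediction interval
yg = polyval(b, xg);
plot(x, y, 'o', xg, yg, xg, yg+hw_mu, '--', xg, yg-hw_mu, '--', ...
     xg, yg+hw_Y, ':', xg, yg-hw_Y, ':')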
10.5 Linear regression and ANOVA
We can represent the two independent samples case from Chapter 7 as two groups of independent observations
1. first sample: μ1 + e1, …, μ1 + en,
2. second sample: μ2 + e_{n+1}, …, μ2 + e_{n+m},
where the (e1, …, e_{n+m}) are realisations of independent random variables with a common distribution N(0, σ²).
This setting is equivalent to a simple linear regression model with the sample size n + m
yi = β0 + β1 xi + ei ,
x1 = . . . = xn = 0, xn+1 = . . . = xn+m = 1,
µ1 = β0 , µ2 = β0 + β1 .
Then the model utility test H0 : β1 = 0 becomes equivalent to the mean equality test H0 : µ1 = µ2 .
More generally, for the one-way ANOVA setting with I = p levels for the main factor and n = pJ observations, the responses are
β0 + ei,   i = 1, …, J,
β0 + β1 + ei,   i = J + 1, …, 2J,
…
β0 + β_{p−1} + ei,   i = (p − 1)J + 1, …, n.
The corresponding design matrix (see below) has the following block form
X =
[ 1 0 0 … 0
  1 1 0 … 0
  1 0 1 … 0
  … … … …
  1 0 0 … 1 ],
where each entry stands for a column block of length J whose components are all ones or all zeros.
It is very convenient to use the matrix notation
y = Xβ + e,
where
y = (y1 , . . . , yn )T , β = (β0 , . . . , βp−1 )T , e = (e1 , . . . , en )T ,
are column vectors, and X is the so called design matrix
X =
[ 1 x_{1,1} … x_{1,p−1}
  … … … …
  1 x_{n,1} … x_{n,p−1} ].
The least squares estimates minimise
S(b) = ‖y − Xb‖²,
where
‖a‖² = a1² + … + ak²,   a = (a1, …, ak).
Solving the normal equations XᵀXb = Xᵀy we find the least squares estimates
b = (XᵀX)⁻¹Xᵀy,
the fitted values ŷ = Xb = Py with P = X(XᵀX)⁻¹Xᵀ, and the residuals
ê = y − ŷ = (I − P)y.
An unbiased estimate of σ² is given by s² = SSE/(n − p), where SSE = ‖ê‖².
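A Matlab sketch of these matrix computations for a hypothetical one-way layout with p = 3 groups and J = 4 observations per group:
J = 4; p = 3; n = p*J;
X = [ones(n,1), kron([0;1;0], ones(J,1)), kron([0;0;1], ones(J,1))];
y = X*[10; 2; -1] + randn(n,1);    % synthetic responses
b = (X'*X)\(X'*y)                  % solves the normal equations X'Xb = X'y
e = y - X*b;
s2 = (e'*e)/(n - p)                % unbiased estimate of sigma^2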
Denote by
m11, m22, …, m_{pp}
the diagonal elements of the matrix M = (XᵀX)⁻¹. Then the standard error of bj is computed as
s_{bj} = s √(m_{j+1,j+1}).
Exact sampling distributions: (Bj − βj)/S_{Bj} ∼ t_{n−p}, j = 0, 1, …, p − 1.
To check the underlying normality assumption, inspect the normal probability plot of the standardised residuals
êi / (s√(1 − pii)),
where the pii are the diagonal elements of P.
The coefficient of multiple determination can be computed as in the simple linear regression model:
R² = 1 − SSE/SST,
where SST = (n − 1)s_y². The problem with R² is that it increases even if irrelevant variables are added to the model. To penalise irrelevant variables it is better to use the adjusted coefficient of multiple determination
R_a² = 1 − ((n − 1)/(n − p)) · SSE/SST = 1 − s²/s_y².
The adjustment factor (n − 1)/(n − p) gets larger as the number of predictors p grows.
Example: flow rate vs stream depth
The data in the following table were gathered for an environmental impact study that examined the relationship
between the depth of a stream and the rate of its flow (Ryan et al 1976).
Depth x .34 .29 .28 .42 .29 .41 .76 .73 .46 .40
Flow rate y .64 .32 .73 1.33 .49 .92 7.35 5.89 1.98 1.12
A bowed shape of the plot of the residuals versus depth suggests that the relation between x and y is not linear. The multiple linear regression framework can be applied to the quadratic model
y = β0 + β1x + β2x² + ε,
with x1 = x and x2 = x².
Coefficient Estimate Standard Error t value
β0 1.68 1.06 1.52
β1 −10.86 4.52 −2.40
β2 23.54 4.27 5.51
The residuals show no sign of systematic misfit. The test statistic t = 5.51 of the utility test of H0 : β2 = 0 shows
that the quadratic term in the model is statistically significant.
H-model: L = β0 + β1H + ε,   W-model: L = β0 + β1W + ε.
Both simple models can be combined into the multiple regression model
L = β0 + β1H + β2W + ε,
which gives
b0 = 21, sb0 = 8.8, b0 /sb0 = 2.39,
b1 = 0.20, sb1 = 0.36, b1 /sb1 = 0.56,
b2 = 0.19, sb2 = 0.17, b2 /sb2 = 1.12,
s = 3.9, R2 = 0.81, Ra2 = 0.77.
In contrast to the simple models, we can reject neither H1: β1 = 0 nor H2: β2 = 0. This paradox is explained by the different meaning of the slope parameters in the simple and multiple regression models: in the multiple model, β1 is the expected change in L when H is increased by one unit and W is held constant.
Collinearity problem: height and weight have a strong linear relationship. The fitted plane has a well resolved
slope along the line about which the (H, W ) points fall and poorly resolved slopes along the H and W axes.
[Figure: scatter plot of the strongly collinear (H, W) data.]
10.7 Exercises
Problem 1
Suppose we are given a two-dimensional iid sample
(x1 , y1 ), . . . , (xn , yn ).
Verify that the sample covariance is an unbiased estimate of the population covariance.
Problem 2
Ten pairs
x 0.34 1.38 -0.65 0.68 1.40 -0.88 -0.30 -1.18 0.50 -1.75
y 0.27 1.34 -0.53 0.35 1.28 -0.98 -0.72 -0.81 0.64 -1.59
Draw a scatter plot.
(a) Fit a straight line y = a + bx by the method of least squares, and sketch it on the plot.
(b) Fit a straight line x = c + dy by the method of least squares, and sketch it on the plot.
(c) Are the lines in (a) and (b) the same? If not, why not?
Problem 3
Two consecutive grades
X = the high school GPA (grade point average),
Y = the freshman GPA.
Allow two different intercepts for females
Y = βF + β1X + ε,   ε ∼ N(0, σ²),
and for males
Y = βM + β1X + ε,   ε ∼ N(0, σ²).
Give the form of the design matrix for such a model.
Problem 4
Simple linear regression model
Y = β0 + β1X + ε,   ε ∼ N(0, σ²).
Using n pairs (xi, yi) we fit the regression line y = b0 + b1x, where
Var(B0) = σ²\overline{x^2} / ((n − 1)s_x²),   Var(B1) = σ² / ((n − 1)s_x²),   Cov(B0, B1) = −σ²x̄ / ((n − 1)s_x²).
We predict the new observation
Y0 = β0 + β1x0 + ε0
by
ŷ0 = b0 + b1x0.
(a) Find an expression for the variance of Ŷ0 − Y0 and compare it to the variance of Ŷ0. Find an, the standard deviation of (Ŷ0 − Y0)/σ.
(b) Derive a formula for a 95% prediction interval I such that
P(Y0 ∈ I) = 0.95,
using
(Y0 − Ŷ0)/(S·an) ∼ tn−2.
Problem 5
Data collected for
x = midterm grade,
y = final grade,
gave
r = 0.5, x̄ = ȳ = 75, sx = sy = 10.
(a) Given x = 95, predict the final score.
(b) Given y = 85 and not knowing the midterm score, predict the midterm score.
Problem 6
Let
Y = X + βZ,
where X ∼ N(0, 1) and Z ∼ N(0, 1) are independent.
(a) Find the correlation coefficient of X and Y.
(b) Use the result of part (a) to generate bivariate samples (xi, yi) of size 20 with population correlation coefficients −0.9, −0.5, 0, 0.5, and 0.9. Compute the sample correlation coefficients.
Problem 7
The stopping distance of an automobile on a certain road was studied as a function of velocity (Brownlee 1960).
velocity of a car x (mi/h) 20.5 20.5 30.5 40.5 48.8 57.8
stopping distance y (ft) 15.4 13.3 33.9 73.1 113.0 142.6
Fit y and √y as linear functions of velocity, and examine the residuals in each case. Which fit is better? Can you suggest any physical reason that explains why?
11 Overview of basic probability theory
11.1 Probability rules
Sample space and random events
Conditional probability of A given B has occurred: P(A|B) = P(A ∩ B)/P(B).
Distribution function of a random variable X with probability mass (density) function f:
F(x) = P(X ≤ x) = Σ_{y≤x} f(y) or = ∫_{y≤x} f(y) dy.
11.3 Important parametric distributions
Discrete uniform distribution X ∼ U(N):
f(k) = 1/N, 1 ≤ k ≤ N, E(X) = (N + 1)/2, Var(X) = (N² − 1)/12.
Continuous uniform distribution X ∼ U(a, b):
f(x) = 1/(b − a), a < x < b, E(X) = (a + b)/2, Var(X) = (b − a)²/12.
Binomial distribution X ∼ Bin(n, p):
f(k) = C(n, k)·p^k(1 − p)^{n−k}, 0 ≤ k ≤ n, E(X) = np, Var(X) = np(1 − p).
where the factor 1 − (n − 1)/(N − 1) is called the finite population correction.
Poisson distribution X ∼ Pois(λ):
f(k) = (λ^k/k!)·e^{−λ}, k ≥ 0, E(X) = Var(X) = λ.
Poisson approximation
Standard normal distribution Z ∼ N(0, 1) has the following density function and distribution function respectively:
φ(z) = (1/√(2π))·e^{−z²/2}, Φ(z) = ∫_{−∞}^{z} φ(x) dx.
Normal distribution X ∼ N(µ, σ²):
(X − µ)/σ ∼ N(0, 1), f(x) = (1/σ)·φ((x − µ)/σ), E(X) = µ, Var(X) = σ².
Central Limit Theorem: if X1, . . . , Xn are independent and identically distributed with mean µ and variance σ², then for all z,
P((X1 + . . . + Xn − nµ)/(σ√n) ≤ z) → Φ(z), n → ∞.
Normal approximations:
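As a small illustration (a sketch, not part of the original text), the normal approximation of a binomial probability can be checked numerically in Matlab:
n = 100; p = 0.3;                                % X ~ Bin(100, 0.3)
exact = binocdf(35,n,p)                          % exact P(X <= 35)
approx = normcdf((35+0.5-n*p)/sqrt(n*p*(1-p)))   % normal approximation with continuity correction
For these parameters the two values agree to about two decimal places.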
11.4 Joint distributions
Given a joint probability mass (density) function fX,Y (x, y) of a random vector (X, Y ), the marginal distributions
of X and Y are respectively,
fX(x) = Σ_y fX,Y(x, y), fY(y) = Σ_x fX,Y(x, y).
Two random variables X and Y are independent if for all pairs of values (x, y)
fX,Y(x, y) = fX(x)·fY(y).
For the multinomial distribution (X1, . . . , Xk) ∼ Mn(n, p1, . . . , pk),
Cov(Xi, Xj) = −npipj, i ≠ j.
Bivariate normal distribution (X, Y) ∼ N(µ1, µ2, σ²1, σ²2, ρ) has joint density
f(x, y) = 1/(2πσ1σ2√(1 − ρ²)) · exp{ −1/(2(1 − ρ²)) · [ (x − µ1)²/σ²1 − 2ρ(x − µ1)(y − µ2)/(σ1σ2) + (y − µ2)²/σ²2 ] },
where −∞ < x, y < ∞. The marginal distributions of the bivariate normal distribution are also normal, X ∼ N(µ1, σ²1) and Y ∼ N(µ2, σ²2), and the conditional distribution of Y given X = x is normal with mean and variance
µ2 + ρ(σ2/σ1)(x − µ1), σ²2(1 − ρ²).
12 Statistical distribution tables
12.1 Normal distribution table
12.2 Chi-square distribution table
12.3 Tails of t-distribution
12.4 Critical values of the F-distribution
12.5 Critical values of the Studentised range distribution
12.6 Critical values for the rank sum test
12.7 Critical values for the signed rank test
13 Solutions to exercises
13.1 Solutions to Section 2 (random sampling)
Solution 1
Here we consider sampling with replacement. For an answer in the case of sampling without replacement consult
the book page A36.
Population distribution
Values 1 2 4 8
Probab. 1/5 2/5 1/5 1/5
The list of X̄ values (and their probabilities in brackets) for n = 2 observations taken with replacement:
1 2 4 8 Total prob.
1 1.0 (1/25) 1.5 (2/25) 2.5 (1/25) 4.5 (1/25) 1/5
2 1.5 (2/25) 2.0 (4/25) 3.0 (2/25) 5.0 (2/25) 2/5
4 2.5 (1/25) 3.0 (2/25) 4.0 (1/25) 6.0 (1/25) 1/5
8 4.5 (1/25) 5.0 (2/25) 6.0 (1/25) 8.0 (1/25) 1/5
Tot. prob. 1/5 2/5 1/5 1/5 1
Solution 2
Dichotomous data:
n = 1500, p̂ = 0.55, 1 − p̂ = 0.45, sp̂ = √(p̂(1 − p̂)/(n − 1)) = √(0.55 × 0.45/1499) = 0.013.
Solution 3
Normal approximation: (X̄ − µ)/SX̄ is asymptotically N(0, 1)-distributed. From
0.90 ≈ P((X̄ − µ)/SX̄ > −1.28) = P(−∞ < µ < X̄ + 1.28SX̄),
0.95 ≈ P((X̄ − µ)/SX̄ < 1.645) = P(X̄ − 1.645SX̄ < µ < ∞),
we find
k1 = 1.28, k2 = 1.645.
Solution 4
Randomised response method. Consider
Solution 5
Data summary:
N = 2000, n = 25, Σxi = 2351, Σx²i = 231305.
(a) Unbiased estimate of µ is
2351
x̄ = = 94.04.
25
(b) Sample variance
s² = n/(n − 1) · (\overline{x²} − x̄²) = (25/24)·(231305/25 − 94.04²) = 425.71.
Unbiased estimate of σ² is
(N − 1)/N · s² = (1999/2000) · 425.71 = 425.49.
Unbiased estimate of Var(X̄) is
s²x̄ = (s²/n)·(1 − n/N) = 16.81.
(c) An approximate 95% confidence interval for µ:
Iµ = x̄ ± 1.96sx̄ = 94.04 ± 1.96·√16.81 = 94.04 ± 8.04.
Solution 6
The bias is
E(X̄²) − µ² = E(X̄²) − (EX̄)² = Var(X̄) = (σ²/n)·(1 − (n − 1)/(N − 1)).
For large n, the bias is small.
Solution 7
Stratified population of size N = 2010 with k = 7 strata.
(a) With n = 100, we get the following answers using the relevant formulas
Stratum number j 1 2 3 4 5 6 7 Weighted mean
Stratum proportion wj 0.20 0.23 0.19 0.17 0.08 0.06 0.07
Stratum mean µj 5.4 16.3 24.3 34.5 42.1 50.1 63.8 µ = 26.15
Stratum standard deviation σj 8.3 13.3 15.1 19.8 24.5 26.0 35.2 σ̄ = 16.94
Optimal allocation n·wjσj/σ̄ 10 18 17 19 12 9 15
Proportional allocation nwj 20 23 19 17 8 6 7
(b) Since σ̄² = 286.9 and Σj wjσ²j = 339.7, we have
Var(X̄so) = σ̄²/n = 2.87, Var(X̄sp) = Σj wjσ²j/n = 3.40, Var(X̄) = σ²/n = 6.15,
where σ² is computed in the next item.
Solution 9
Stratified population with
N = 5, k = 2, w1 = 0.6, w2 = 0.4, µ1 = 1.67, µ2 = 6, σ12 = 0.21, σ22 = 4.
Given n1 = n2 = 1 and n = 2, the sampling distribution of the stratified sample mean x̄s = 0.6x1 + 0.4x2 is
x1 = 1 x1 = 2 Total prob.
x2 = 4 2.2 (1/6) 2.8 (2/6) 1/2
x2 = 8 3.8 (1/6) 4.4 (2/6) 1/2
Tot. prob. 1/3 2/3 1
We find that
E(X̄s) = 2.2·(1/6) + 2.8·(2/6) + 3.8·(1/6) + 4.4·(2/6) = 3.4,
(E(X̄s))² = 11.56,
E(X̄²s) = 2.2²·(1/6) + 2.8²·(2/6) + 3.8²·(1/6) + 4.4²·(2/6) = 12.28,
Var(X̄s) = 12.28 − 11.56 = 0.72.
These results are in agreement with the formulas
E(X̄s) = µ, Var(X̄s) = w²1σ²1/n1 + . . . + w²kσ²k/nk = 0.36σ²1 + 0.16σ²2.
13.2 Solutions to Section 3 (parameter estimation)
Solution 1
A method of moment estimate of the parameter λ for the Poisson distribution model is given by the sample mean
λ̃ = 3.9. Using this value we compute the expected counts, see table. Comparing observed and expected counts by the naked eye, we see that the Poisson model does not fit well. The sample variance is close to 5, which shows that there is overdispersion: the variance is larger than λ.
This extra variation in the data can be explained by the fact that the 300 intervals were distributed over various
hours of the day and various days of the week.
n Observed frequency Expected frequency
0 14 6.1
1 30 23.8
2 36 46.3
3 68 60.1
4 43 58.5
5 43 45.6
6 30 29.6
7 14 16.4
8 10 8.0
9 6 3.5
10 4 1.3
11 1 0.5
12 1 0.2
13+ 0 0.1
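The expected counts in this table can be recomputed in a couple of Matlab lines (a minimal sketch; poisspdf and poisscdf belong to the Statistics Toolbox):
lambda = 3.9; n = 300;
expected = n*poisspdf(0:12,lambda)    % expected counts for 0, 1, ..., 12 calls
tail13 = n*(1-poisscdf(12,lambda))    % expected count for 13 or more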
Solution 2
Number X of yeast cells on a square. Test the Poisson model X ∼ Pois(λ).
Concentration 1.
x̄ = 0.6825, \overline{x²} = 1.2775, s² = 0.8137, s = 0.9021, sx̄ = 0.0451.
Approximate 95% confidence interval
Iµ = 0.6825 ± 0.0884.
Pearson’s chi-squared test based on λ̂ = 0.6825:
x 0 1 2 3 4+ Total
Observed 213 128 37 18 4 400
Expected 202.14 137.96 47.08 10.71 2.12 400
Observed test statistic χ2 = 10.12, df = 5 − 1 − 1 = 3, p-value < 0.025. Reject the model.
Concentration 2.
x̄ = 1.3225, \overline{x²} = 3.0325, s = 1.1345, sx̄ = 0.0567.
Approximate 95% confidence interval
Iµ = 1.3225 ± 0.1112.
Pearson’s chi-squared test: observed test statistic χ2 = 3.16, df = 4, p-value > 0.10. Do not reject the model.
Concentration 3.
x̄ = 1.8000, s = 1.1408, sx̄ = 0.0701.
Approximate 95% confidence interval
Iµ = 1.8000 ± 0.1374.
Pearson’s chi-squared test: observed test statistic χ2 = 7.79, df = 5, p-value > 0.10. Do not reject the model.
Concentration 4.
n = 410, x̄ = 4.5659, s2 = 4.8820, sx̄ = 0.1091.
Approximate 95% confidence interval
Iµ = 4.566 ± 0.214.
Pearson’s chi-squared test: observed test statistic χ2 = 13.17, df = 10, p-value > 0.10. Do not reject the model.
Solution 3
Population distribution: X takes values 0, 1, 2, 3 with probabilities
p0 = (2/3)·θ, p1 = (1/3)·θ, p2 = (2/3)·(1 − θ), p3 = (1/3)·(1 − θ),
so that
p0 + p1 = θ, p2 + p3 = 1 − θ.
We are given an iid-sample with
n = 10, x̄ = 1.5, s = 1.08,
and observed counts
x 0 1 2 3 Total
Ox 2 3 3 2 10
(a) Method of moments. Using
µ = (1/3)·θ + 2·(2/3)·(1 − θ) + 3·(1/3)·(1 − θ) = 7/3 − 2θ,
derive an equation
x̄ = 7/3 − 2θ̃.
It gives an unbiased estimate
θ̃ = 7/6 − x̄/2 = 7/6 − 3/4 = 0.417.
(b) To find sθ̃, observe that
Var(Θ̃) = (1/4)·Var(X̄) = σ²/40.
Thus we need to find sθ̃, which estimates σθ̃ = σ/6.325. Next we estimate σ using two methods.
Method 1. From
σ² = E(X²) − µ² = (1/3)·θ + 4·(2/3)·(1 − θ) + 9·(1/3)·(1 − θ) − (7/3 − 2θ)² = 2/9 + 4θ − 4θ²,
we estimate σ as
√(2/9 + 4θ̃ − 4θ̃²) = 1.093.
This gives
sθ̃ = 1.093/6.325 = 0.173.
Method 2:
sθ̃ = s/6.325 = 1.08/6.325 = 0.171.
(c) Likelihood function is obtained using (O0, O1, O2, O3) ∼ Mn(n, p0, p1, p2, p3):
L(θ) = ((2/3)θ)^{O0}·((1/3)θ)^{O1}·((2/3)(1 − θ))^{O2}·((1/3)(1 − θ))^{O3} = const · θ^t(1 − θ)^{n−t},
where t = O0 + O1.
(d) We find sθ̂ using the formula for the standard error of the sample proportion:
sθ̂ = √(θ̂(1 − θ̂)/(n − 1)) = 0.167.
Solution 4
Likelihood function of X ∼ Bin(n, p) for a given n and X = x is
L(p) = C(n, x)·p^x(1 − p)^{n−x} ∝ p^x(1 − p)^{n−x}.
(a) To maximise L(p) we maximise
ln(p^x(1 − p)^{n−x}) = x ln p + (n − x) ln(1 − p).
Since
∂/∂p (x ln p + (n − x) ln(1 − p)) = x/p − (n − x)/(1 − p),
we have to solve x/p = (n − x)/(1 − p), which brings the maximum likelihood estimate formula p̂ = x/n.
(b) We have X = Y1 + . . . + Yn, where (Y1, . . . , Yn) are iid Bernoulli random variables with
f(y|p) = p^y(1 − p)^{1−y}, y = 0, 1.
By Cramer-Rao, if p̃ is an unbiased estimate of p, then
Var(P̃) ≥ 1/(nI(p)),
where
I(p) = −E(∂²/∂p² ln f(Y|p)) = 1/(p(1 − p)),
see Solution 3 (d). We conclude that the variance of the sample proportion p̂ attains the Cramer-Rao lower bound since
Var(P̂) = p(1 − p)/n.
(c) Plot L(p) = 252p⁵(1 − p)⁵. The top of the curve is in the middle, p̂ = 0.5.
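A two-line Matlab sketch of this plot:
p = 0:0.01:1;
plot(p,252*p.^5.*(1-p).^5)   % binomial likelihood for x = 5 successes in n = 10 trials; maximum at p = 0.5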
Solution 5
The observed serial number x = 888 can be modeled by the discrete uniform distribution X ∼ U(N ).
Solution 6
Statistical model: x is the number of black balls obtained by sampling k balls without replacement from an urn
with N balls of which n balls are black. Hypergeometric distribution
P(X = 20) = C(n, 20)·C(N − n, 30)/C(N, 50).
L(N̂)/L(N̂ − 1) = 1 ⇔ (N̂ − 100)(N̂ − 50) = N̂(N̂ − 130),
which gives N̂ = 250.
Solution 7
An iid-sample of size n = 16 from a normal distribution.
(d) To find the sample size x that halves the confidence interval length we set up an equation using the exact confidence interval formula for the mean:
t15(α/2)·s/√16 = 2·tx−1(α/2)·s′/√x,
where s′ is the sample standard deviation for the sample of size x. A simplistic version of this equation, 1/4 = 2/√x, implies x ≈ (2·4)² = 64. Further adjustment for a 95% confidence interval is obtained using
Solution 8
An iid-sample (x1 , . . . , xn ) from the uniform distribution U(0, θ) with density
f(x|θ) = (1/θ)·1{0≤x≤θ}.
(a) Method of moments estimate θ̃ is unbiased:
µ = θ/2, θ̃ = 2x̄, E(Θ̃) = θ, Var(Θ̃) = 4σ²/n = θ²/(3n).
(b) The maximum likelihood estimate θ̂ = x(n) is biased but asymptotically unbiased. Notice the unusual asymptotics indicating that the conditions on the parametric model implying Θ̂ ≈ N(θ, 1/(nI(θ))) are violated:
Var(Θ̂) ≈ θ²/n², n → ∞.
(c) Compare the two mean square errors:
MSE(Θ̂) = E(Θ̂ − θ)² = (θ/(n + 1))² + nθ²/((n + 1)²(n + 2)) = 2θ²/((n + 1)(n + 2)),
MSE(Θ̃) = θ²/(3n).
(d) Corrected maximum likelihood estimate
θ̂c = ((n + 1)/n)·x(n)
becomes unbiased, E(Θ̂c) = θ, with Var(Θ̂c) = θ²/(n(n + 2)).
Solution 9
Data
x1 = 1997, x2 = 906, x3 = 904, x4 = 32
and model
p1 = (2 + θ)/4, p2 = (1 − θ)/4, p3 = (1 − θ)/4, p4 = θ/4.
(a) Sample counts (X1, X2, X3, X4) ∼ Mn(n, p1, p2, p3, p4) with n = 3839. Given a realisation
(x1, x2, x3, x4) with x1 + x2 + x3 + x4 = n,
the likelihood function is
L(θ) ∝ (2 + θ)^{x1}(1 − θ)^{x2+x3}θ^{x4},
where ∝ means that we drop factors depending only on (n, x1, x2, x3, x4). The last expression reveals that we have a case of two sufficient statistics (x1, x4). Putting
d/dθ ln L(θ) = x1/(2 + θ) + x4/θ − (n − x1 − x4)/(1 − θ)
equal to zero, we arrive at the equation
x1/(2 + θ) + x4/θ = (n − x1 − x4)/(1 − θ),
or equivalently
θ²n + θu − 2x4 = 0,
where
u = 2x2 + 2x3 + x4 − x1 = 2n − x4 − 3x1 .
We find the maximum likelihood estimate to be
θ̂ = (−u + √(u² + 8nx4))/(2n) = 0.0357.
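The same estimate can be obtained numerically; a minimal Matlab sketch solving the quadratic equation above:
n = 3839; x1 = 1997; x4 = 32;
u = 2*n - x4 - 3*x1;        % u = 1655
th = roots([n u -2*x4]);    % roots of n*theta^2 + u*theta - 2*x4 = 0
th(th>0)                    % the positive root, 0.0357, is the MLE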
Asymptotic variance
Var(Θ̂) ≈ 1/(nI(θ)), I(θ) = −E(g(Y1, Y2, Y3, Y4, θ)),
where (Y1, Y2, Y3, Y4) ∼ Mn(1, p1, p2, p3, p4) with
f(y1, y2, y3, y4|θ) = p1^{y1}·p2^{y2}·p3^{y3}·p4^{y4} = (2 + θ)^{y1}(1 − θ)^{y2+y3}θ^{y4}·4^{−1}
and
g(y1, y2, y3, y4, θ) = ∂²/∂θ² ln f(y1, y2, y3, y4|θ) = −y1/(2 + θ)² − (y2 + y3)/(1 − θ)² − y4/θ².
13.3 Solutions to Section 4 (hypothesis testing)
Solution 1
and the significance level in question is (using a continuity correction)
(b) The power of the test is a function of the parameter value p (without continuity correction)
Pw(p) = P(|X − 50| > 10) = P(X < 40) + P(X > 60)
= P(Z < (40 − 100p)/(10√(p(1 − p)))) + P(Z > (60 − 100p)/(10√(p(1 − p))))
≈ Φ((4 − 10p)/√(p(1 − p))) + Φ((10p − 6)/√(p(1 − p))).
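The power curve is easy to draw in Matlab (a sketch based on the approximation above):
p = 0.3:0.005:0.7;
Pw = normcdf((4-10*p)./sqrt(p.*(1-p))) + normcdf((10*p-6)./sqrt(p.*(1-p)));
plot(p,Pw)   % the power is smallest near p = 0.5 and grows towards the tails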
Solution 2
Data: one observation of X = x. Likelihood ratio test: reject for small values of Λ = P(x|H0)/P(x|H1).
X-values x4 x2 x1 x3
P(x|H0) 0.2 0.3 0.2 0.3
P(x|H1) 0.4 0.4 0.1 0.1
Likelihood ratio Λ 0.5 0.75 2 3
Cumulative probab. 0.2 0.5 0.7 1
Solution 3
Likelihood function
L(λ) = ∏_{i=1}^{n} (1/xi!)·λ^{xi}·e^{−λ} = e^{−λn}·λ^y·∏_{i=1}^{n} (1/xi!),
where
y = x1 + . . . + xn
is a sufficient statistic. Reject H0 for small values of the likelihood ratio
L(λ0)/L(λ1) = e^{−n(λ0−λ1)}·(λ0/λ1)^y.
If λ1 > λ0 , then we reject H0 for large values of y. Test statistic Y has null distribution Pois(nλ0 ).
Solution 4
We have an iid-sample from N(µ, 100) of size n = 25. Two simple hypotheses
H0 : µ = 0, H1 : µ = 1.5
X̄ ∼ N(µ, 4).
(a) The rejection region at α = 0.1 is {x̄ > x}, where x solves P(X̄ > x | H0) = 0.1, that is Φ(x/2) = 0.9. From the normal distribution table we find x/2 = 1.28, so that x = 2.56 and the rejection region is {x̄ > 2.56}.
The corresponding confidence interval method uses the one-sided 90% confidence interval (x̄ − 2.56, ∞) for the mean. We reject H0 if the interval does not cover µ0 = 0, that is when x̄ − 2.56 > 0.
Solution 5
We have a pair of beta-densities with likelihood ratio
Λ = f(x|H0)/f(x|H1) = 2/(3x), 0 ≤ x ≤ 1.
The corresponding likelihood ratio test of H0 versus H1 rejects H0 for large values of x.
(c) The rejection region of a level α test is computed from the equation P(X > xcrit | H0) = α, that is
1 − x²crit = α.
We conclude that
R = {x : x > √(1 − α)}.
(d) The power of the test is
P(X > √(1 − α) | H1) = 1 − (1 − α)^{3/2}.
Solution 6
Let s2 be the sample variance computed from (x1 , . . . , x15 ).
(b) We reject H0 : σ = 1 if the value 1 falls outside the confidence interval Iσ², so that
Solution 7
Using the confidence interval method of hypotheses testing we reject H0 in favour of the two-sided alternative,
since the value µ = −3 is not covered by the two-sided confidence interval (−2, 3).
Solution 8
The analysis is the basis of the sign test.
(b) The generalised likelihood ratio test rejects H0 for small values of
where
y = |x − n/2|.
The function a(y) is monotonely increasing over y ∈ [0, n/2], since
a′(y) = ln((n/2 + y)/(n/2 − y)) > 0.
We conclude that the test rejects for large values of y.
(c) Compute the significance level for the rejection region |x − n/2| > k:
α = P(|X − n/2| > k | H0) = 2·Σ_{i < n/2 − k} C(n, i)·2^{−n}.
(d) Using the normal approximation for n = 100 and k = 10, we find
Solution 9
(a) Two-sided p-value = 0.134.
Solution 10
We are supposed to test
H0 : death cannot be postponed,
H1 : death can be postponed until after an important date.
(a) Jewish data: n = 1919 death dates
x = 922 deaths during the week before Passover,
n − x = 997 deaths during the week after Passover.
Under the binomial model X ∼ Bin(n, p), we would like to test
H0 : p = 0.5 against H1 : p < 0.5.
We apply the large sample test for proportion. Observed test statistic
z = (x − np0)/√(np0(1 − p0)) = (922 − 1919·0.5)/(0.5·√1919) = −1.712.
One-sided p-value of the test
Φ(−1.712) = 1 − Φ(1.712) = 1 − 0.9564 = 0.044.
Reject H0 in favour of one-sided H1 at the significance level 5%.
(b) To control for the seasonal effect the Chinese and Japanese data were studied
n = 852, x = 418, n − x = 434, z = −0.548.
One-sided p-value is 29%, showing no significant effect.
Solution 11
Multinomial model
(X1 , X2 , X3 ) ∼ Mn(190, p1 , p2 , p3 ).
Composite null hypothesis (Hardy-Weinberg Equilibrium)
H0 : p1 = (1 − θ)2 , p2 = 2θ(1 − θ), p3 = θ2 .
Likelihood function and maximum likelihood estimate:
L(θ) = (190 choose 10, 68, 112)·2^{68}·θ^{292}(1 − θ)^{88}, θ̂ = 292/380 = 0.768.
Pearson’s chi-squared test:
cell 1 2 3 Total
observed 10 68 112 190
expected 10.23 67.71 112.07 190
Observed chi-squared test statistic χ² = 0.0065, df = 1, p-value = 2(1 − Φ(√0.0065)) = 0.94.
Conclusion: the Hardy-Weinberg Equilibrium model fits the haptoglobin data well.
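The whole computation fits in a few Matlab lines (a sketch; chi2cdf is from the Statistics Toolbox):
obs = [10 68 112];
th = 292/380;                             % MLE of theta
expc = 190*[(1-th)^2 2*th*(1-th) th^2];   % expected counts 10.23, 67.71, 112.07
chi2 = sum((obs-expc).^2./expc)           % 0.0065
pval = 1-chi2cdf(chi2,1)                  % df = 3 - 1 - 1 = 1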
Solution 12
Month Oj Days Ej Oj − Ej
Jan 1867 31 1994 −127
Feb 1789 28 1801 −12
Mar 1944 31 1994 −50
Apr 2094 30 1930 164
May 2097 31 1994 103
Jun 1981 30 1930 51
Jul 1887 31 1994 -107
Aug 2024 31 1994 30
Sep 1928 30 1930 -2
Oct 2032 31 1994 38
Nov 1978 30 1930 48
Dec 1859 31 1994 -135
Simple null hypothesis
H0: p1 = p3 = p5 = p7 = p8 = p10 = p12 = 31/365, p2 = 28/365, p4 = p6 = p9 = p11 = 30/365.
The total number of suicides is n = 23480, so that the expected counts are
Ej = np_j^{(0)}, j = 1, . . . , 12.
The observed test statistic χ² = Σj (Oj − Ej)²/Ej ≈ 47.3. Since df = 12 − 1 = 11 and χ²11(0.005) = 26.8, we reject H0 of no seasonal variation. Merry Christmas!
Solution 13
Number of heads
Y ∼ Bin(n, p), n = 17950.
(a) For H0 : p = 0.5 the observed z-score
z = (y − np0)/√(np0(1 − p0)) = 3.46.
θ̂map = θ̂pme = 1/2.
Solution 2
Number of bird hops X ∼ Geom(p)
f (x|p) = (1 − p)x−1 p, x = 1, 2, . . . .
(x1 , . . . , xn ), n = 130.
It is a beta distribution
Beta(n + 1, nx̄ − n + 1) = Beta(131, 234).
Posterior mean
a 131
µ= = = 0.36.
a+b 131 + 234
Observe that the posterior mean
µ = (1 + 1/n)/(x̄ + 2/n)
gets closer to the method of moments estimate of p as n → ∞. The standard deviation of the posterior distribution
σ = √(µ(1 − µ)/(a + b + 1)) = √(0.36 · 0.64/366) = 0.025.
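These posterior summaries can be double-checked with betastat (a minimal sketch):
[m,v] = betastat(131,234);   % mean and variance of the Beta(131,234) posterior
[m sqrt(v)]                  % posterior mean 0.36 and standard deviation 0.025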
Solution 3
We use the binomial model X ∼ Bin(n, p), with p being the probability that the event will occur at a given trial.
Use an uninformative conjugate prior P ∼ Beta(1, 1). Given X = n, the posterior becomes P ∼ Beta(n + 1, 1).
Since the posterior mean is (n + 1)/(n + 2), we get
p̂pme = (n + 1)/(n + 2).
Solution 4
Recall solutions of parts (a) and (b).
so that in terms of the likelihood ratio, we reject H0 when
Λ < π1/π0 = 1/π0 − 1, that is when π0 < 1/(1 + Λ).
If x = x4, then Λ = 0.5, and we reject H0, provided π0 < 2/3. This corresponds to the decision rules with α = 0.2.
If x = x2, then Λ = 0.75, and we reject H0, provided π0 < 4/7. This corresponds to the decision rules with α = 0.5.
Furthermore, if x = x1, then Λ = 2, and we reject H0, provided π0 < 1/3, and if x = x3, then Λ = 3, and we reject H0, provided π0 < 1/4.
Solution 5
For a single observation X ∼ N(µ, σ 2 ), where σ 2 is known, test H0 : µ = 0 vs H1 : µ = 1. Prior probabilities
2 1
P(H0 ) = , P(H1 ) = .
3 3
(a) Likelihood ratio
f(x|0)/f(x|1) = e^{−x²/(2σ²)}/e^{−(x−1)²/(2σ²)} = e^{(1−2x)/(2σ²)}.
We conclude that
σ 2 = 0.1 σ 2 = 0.5 σ2 = 1 σ2 = 5
Proportion of the time H0 will be chosen 0.67 0.73 0.78 0.94
Solution 6
We have a pair of beta-densities
f (x|H0 ) = 2x, f (x|H1 ) = 3x2 , 0 ≤ x ≤ 1.
(a) If the two hypotheses have equal prior probabilities, then the posterior probabilities equal
h(H0|x) = (1/2)f(x|H0)/((1/2)f(x|H0) + (1/2)f(x|H1)) = 2x/(2x + 3x²) = 2/(2 + 3x), h(H1|x) = 3x/(2 + 3x).
Therefore, the posterior probability of H0 is greater than that of H1 for x satisfying
2 > 3x, that is when x < 2/3.
x=rand(16,100);                       % 100 samples of size 16 from U(0,1)
y=sort(x)';                           % order statistics of each sample
for k=1:100
plot(y(k,:),(1:16)/16-y(k,:),'.')     % empirical cdf minus true cdf at the order statistics
hold on
end
Solution 2
We have, assuming u ≤ v,
E(F̂(u)·F̂(v)) = (1/n²)·[Σ_{i=1}^{n} E(1{Xi≤u}1{Xi≤v}) + Σ_{i=1}^{n} Σ_{j≠i} E(1{Xi≤u}1{Xj≤v})]
= (1/n²)·[Σ_{i=1}^{n} F(u) + Σ_{i=1}^{n} Σ_{j≠i} F(u)F(v)]
= (1/n)·[F(u) + (n − 1)F(u)F(v)].
Finish by using
Cov(F̂(u), F̂(v)) = E(F̂(u)·F̂(v)) − E(F̂(u))·E(F̂(v))
= (1/n)·[F(u) + (n − 1)F(u)F(v)] − F(u)F(v)
= (1/n)·F(u)(1 − F(v)).
Solution 3
Ordered sample x(1) , . . . , x(n)
(a) The figure shows the empirical distribution function and a normal probability plot.
Use Matlab commands
x=data vector;
stairs(sort(x),(1:length(x))/length(x)) % empirical cdf
hist(x) % histogram, the same as hist(x,10)
normplot(x) % normal probability plot
prctile(x,90) % 0.90-quantile
We see that the value 15.28 can not be detected as an outlier, since it coincides with the 82% sample quantile.
There is only one sample value larger than 16.69, therefore 3% dilution would be easier to detect. Obviously, 5%
dilution resulting in 18.10 is very easy to detect.
Solution 4
Taking the derivative of
1 − F(t) = e^{−αt^β}
gives the density f(t) = αβt^{β−1}·e^{−αt^β}, and dividing the density by 1 − F(t) we obtain the hazard function
h(t) = αβt^{β−1}.
Solution 5
Take the Weibull distribution with parameters α and β.
• If β > 1, then h(t) increases with t meaning that the older individuals die more often than the younger.
• If 0 < β < 1, then h(t) decreases with t meaning that the longer you live the healthier you become.
Solution 6
(a) Due to sampling with replacement we have N ∼ Bin(26, 5/26).
B · P(N ≥ 10) = 18
(d) The probability that a bootstrap sample is composed entirely of these outliers is negligibly small
Solution 7
(a) The Matlab commands
x=data vector;
trimmean(x,10)
trimmean(x,20)
m = trimmean(x,percent) calculates the trimmed mean of the values in x. For a vector input, m is the
mean of x, excluding the highest and lowest k data values, where k=n*(percent/100)/2 and where n is
the number of values in x.
Iµ = 14.58 ± 1.645·0.78/√59 = 14.58 ± 0.17 = (14.41; 14.75).
(c) Nonparametric 90% confidence interval for the population median M is (x(k) , x(60−k) ), where P(Y < k) =
0.05 and Y ∼ Bin (59, 0.5). Applying the normal approximation for Bin (n, p) with continuity correction
!
k − 0.5 − np
P(Y < k) = P(Y ≤ k − 1) ≈ Φ p ,
np(1 − p)
we arrive at the equation
(k − 0.5 − 59/2)/√(59/4) = −1.645.
n=59; B=1000;
z=x(random(’unid’,n,n,B)); % (’unid’,n) - uniform discrete [1, n], 1000 samples of size n
t1=trimmean(z,10);
t2=trimmean(z,20);
std(t1)
std(t2)
give the standard errors 0.1034 and 0.1004 for x̄0.1 and x̄0.2 respectively.
iqr(x)                       % interquartile range
median(abs(x-median(x)))     % median absolute deviation (MAD)
q=prctile(z,75);             % upper quartiles of the bootstrap samples
hist(q)                      % their histogram
std(q)                       % bootstrap standard error of the upper quartile
Solution 8
Matlab command (x = control and y = seeded data)
qqplot(x,y)
produces a QQ-plot that fits the line y = 2.5x claiming 2.5 times more rainfall from seeded clouds. On the other
hand, Matlab command
qqplot(log(x),log(y))
(b) We have s2x = 0.2163, s2y = 1.1795, s2p = 0.7667. The latter is an unbiased estimate of σ 2 .
(d) Based on t7 -distribution, an exact 90% confidence interval for mean difference is
(f) From the observed test statistic value t = 1.8206, we find the two-sided p = 0.1115 using the Matlab com-
mand 2*tcdf(-1.8206,7).
b: σ² = 1,
c: sȳ−x̄ = 0.6708,
d: Iµy−µx = 1.0694 ± 1.1035,
f: z = 1.5942, two-sided p-value = 0.11.
Solution 2
If m = n, then
s²p·(1/n + 1/m) = (2/n)·[Σ_{i=1}^{n}(xi − x̄)² + Σ_{i=1}^{n}(yi − ȳ)²]/(2n − 2) = (s²x + s²y)/n = s²x/n + s²y/m.
Solution 3
Test the null hypothesis of no drug effect
Suggested measurement design: during the same n = 10 days take blood pressure measurements on 4 people, two
on the treatment
x11 , . . . , x1n ,
x21 , . . . , x2n ,
x31 , . . . , x3n ,
x41 , . . . , x4n .
Dependencies across the days and the people make both the two-sample t test and the rank sum test inappropriate. A proper design for 40 measurements is that of two independent samples: 20 people on the treatment and 20 controls:
x1 , . . . , x20 ,
y1 , . . . , y20 .
Solution 4
(a) The sign test statistic: t = the number of positive xi. Under H0, T ∼ Bin(25, 1/2) ≈ N(25/2, 25/4), which gives
(k − 13)/2.5 = 1.645, k = 17.
We know the true population distribution is N(0.3, 1). Since
P(Xi > 0) = Φ(0.3) = 0.62,
we can use
T ∼ Bin(25, 0.62) ≈ N(15.5, 5.89).
(b) Normal distribution model X ∼ N(µ, 1). Since (X̄ − µ)/(1/5) ∼ N(0, 1), we reject H0 for
Solution 5
Two independent samples
x1 , . . . , x n , y1 , . . . , yn ,
are taken from two population distributions with equal standard deviation σ = 10. Approximate 95% confidence interval
Iµ1−µ2 = x̄ − ȳ ± 1.96·10·√(2/n) = x̄ − ȳ ± 27.72/√n.
Solution 6
Rank Type I Type II Rank
1 3.03 3.19 2
8 5.53 4.26 3
9 5.60 4.47 4
11 9.30 4.53 5
13 9.92 4.67 6
14 12.51 4.69 7
17 12.95 6.79 10
18 15.21 9.37 12
19 16.04 12.75 15
20 16.84 12.78 16
Rank sum 130 80
(b) Rank sum test statistics Rx = 130, Ry = 80. From the table in Section 12.6 we find that the two-sided p-value lies between 0.05 and 0.10.
(c) The non-parametric test in (b) is more relevant, since both normplot(x) and normplot(y) show non-normality
of the data distribution.
(d) To estimate the probability π, that a type I bearing will outlast a type II bearing, we turn to the ordered
pooled sample
X-YYYYYY-XX-Y-X-Y-XX-YY-XXXX.
Pick a pair (X, Y ) at random, then by the division rule of probability
end,end,end
P=N/100;
hist(P,20)
std(P)
Solution 7
Model: iid-sample of the differences d1 , . . . , dn whose population distribution is symmetric around the unknown
median m. Test the null hypothesis of no difference H0 : m = 0 using the signed ranks test statistic w+ defined as
follows:
Under H0 : m = 0, on the step 3, the signs ± are assigned symmetrically at random (due to the model assumption
that the population distribution is symmetric around the median). As a result there are 16 equally likely outcomes
1 2 3 4 w+
− − − − 0
+ − − − 1
− + − − 2
+ + − − 3
− − + − 3
+ − + − 4
− + + − 5
+ + + − 6
− − − + 4
+ − − + 5
− + − + 6
+ + − + 7
− − + + 7
+ − + + 8
− + + + 9
+ + + + 10
Thus the null distribution of W+ is given by the table
k 0 1 2 3 4 5 6 7 8 9 10
pk 1/16 1/16 1/16 2/16 2/16 2/16 2/16 2/16 1/16 1/16 1/16
The smallest one-sided p-value is 1/16 = 0.06, which is higher than 5%. Thus n = 4 is too small a sample size. Therefore, the table in Section 12.7 starts from n = 5.
Solution 8
Using
W0.05(n) = n(n + 1)/4 − 1.96·√(n(n + 1)(2n + 1)/24),
W0.01(n) = n(n + 1)/4 − 2.58·√(n(n + 1)(2n + 1)/24),
we find (table/normal approximation)
n = 10 n = 20 n = 25
n(n+1)/4 27.5 105 162.5
√(n(n+1)(2n+1)/24) 9.81 26.79 37.17
α = 0.05 8/8.3 52/53.5 89/89.65
α = 0.01 3/2.2 38/36.0 68/67.6
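The normal-approximation rows come from a direct evaluation of the formulas above, e.g. in Matlab:
n = [10 20 25];
W05 = n.*(n+1)/4 - 1.96*sqrt(n.*(n+1).*(2*n+1)/24)   % compare with the alpha = 0.05 row
W01 = n.*(n+1)/4 - 2.58*sqrt(n.*(n+1).*(2*n+1)/24)   % compare with the alpha = 0.01 row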
Solution 9
(a) The variance of a difference
D̄ = X̄ − Ȳ ≈ N(µx − µy, 100/25) = N(δ, 4).
[Figure: power functions for the paired and the independent sample designs.]
Solution 10
Paired samples
If the pairing had been erroneously ignored, then the two independent samples formula would give 6 times larger
standard error
sx̄−ȳ = 7.81.
To test H0 : µx = µy against H1 : µx 6= µy assume D ∼ N(µ, σ 2 ) and apply one-sample t-test
t = d̄/sd̄ = 0.368.
With df = 14, two-sided p-value = 0.718, we can not reject H0 .
Without normality assumption we apply the signed rank test. Matlab command
signrank(x,y)
computes the two-sided p-value = 0.604. We can not reject H0 .
Solution 11
Possible explanations
(a) room with a window ← rich patient → recovers faster,
(b) besides passive smoking: smoker ← the man is a bad husband → wife gets cancer,
(c) no breakfast ← more stress → accident,
(d) choose to change the school and to be bused ← lower grades before → lower grades after,
(e) match two babies with two mothers (or even 3 babies with 3 mothers) then it is pure chance,
(f) abstain from alcohol ← poor health,
(g) marijuana ← schizophrenia,
(h) total time together = time before wedding + time after wedding,
(i) being part of a community can have a positive effect on mental health and emotional wellbeing.
boxplot(x)    % side-by-side boxplots of the columns of x
anova1(x)     % one-way ANOVA table
anova2(x)     % two-way ANOVA table
Solution 1
(a) The sums of squares: between samples, within samples and total:
SSA = 10((20.34 − 19.40)2 + (18.34 − 19.40)2 + (21.57 − 19.40)2 + (17.35 − 19.40)2 ) = 109.2
SSE = 9(0.88 + 0.74 + 0.88 + 0.89) = 30.5
SST = 3.58 · 39 = 139.7 = 109.2 + 30.5
Source SS df MS F
Treatment 109.2 3 36.4 42.9
Error 30.5 36 0.85
Total 139.7 39
Comparing the observed test statistic 42.9 with the critical value F3,27(0.001) = 7.27 shows that the result is highly significant, and we reject the null hypothesis.
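Since only the group means and variances are given, the ANOVA table can be rebuilt from these summaries; a minimal Matlab sketch:
m = [20.34 18.34 21.57 17.35]; v = [0.88 0.74 0.88 0.89]; J = 10;
SSA = J*sum((m-mean(m)).^2)   % 109.2, between-samples sum of squares
SSE = (J-1)*sum(v)            % 30.5, within-samples sum of squares
F = (SSA/3)/(SSE/36)          % 42.9, the observed F statistic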
(b) The normality assumption is supported by the four skewness and kurtosis values, with the former being close to zero and the latter close to 3. On the other hand, the four sample variances are close to each other, making the assumption of equal variances realistic.
(c) Since sp = √MSE = 0.92 and the t-distribution table gives approximately t36(0.0042) ≈ t40(0.005) = 2.7, we get
Bµu−µv = (ȳu. − ȳv.) ± 1.11.
Therefore all observed pairwise differences except (2-4) are significant:
Solution 2
Consider the one-way ANOVA test statistic
F = MSA/MSE = [ (J/(I−1))·Σ_{i=1}^{I}(ȳi· − ȳ··)² ] / [ (1/(I(J−1)))·Σ_{i=1}^{I}Σ_{j=1}^{J}(yij − ȳi·)² ].
In this two-sample setting, the F-test statistic equals t², where t = (x̄ − ȳ)/(sp√(2/n)) is the two-sample t-test statistic.
Solution 3
The null hypothesis says that the data (yij ) comes from a single normal distribution
H0 : µ1 = . . . = µI = µ
dim Ω = I + 1
since the general setting is described by parameters µ1 , . . . , µI and σ 2 . The likelihood ratio
L0 (µ̂, σ̂02 )
Λ= ,
L(µ̂1 , . . . , µ̂I , σ̂ 2 )
The likelihood ratio test rejects the null hypothesis for small values of Λ or equivalently for large values of
Solution 4
One-way layout with I = 10, J = 7,
Yij ∼ N(µi , σ 2 ).
Pooled sample variance
s²p = MSE = (1/(I(J−1)))·Σi Σj (yij − ȳi.)².
(b) Bonferroni simultaneous 95% confidence intervals for the C(10, 2) = 45 differences µu − µv:
Bµu−µv = ȳu· − ȳv· ± t60(0.025/45)·sp·√(2/J).
Solution 5
For I = 4 control groups of J = 5 mice each, test H0 : no systematic differences between groups.
[Figure: side-by-side boxplots of the four groups, with values ranging from about 200 to 450.]
Source SS df MS F P
Columns 27230 3 9078 2.271 0.12
Error 63950 16 3997
Total 91190 19
Do not reject H0 at the 10% significance level. Boxplots show non-normality. The largest difference is between the third and the fourth boxplots. Control question: why does the third boxplot have no upper whisker?
Kruskal-Wallis test. Pooled sample ranks
Solution 6
Two-way layout with I = 3 treatments on J = 10 subjects with K = 1 observations per cell. ANOVA table
Source SS df MS F P
Columns (blocks) 0.517 9 0.0574 0.4683 0.8772
Rows (treatments) 1.081 2 0.5404 4.406 0.0277
Error 2.208 18 0.1227
Total 3.806 29
Reject
H0 : no treatment effects
at 5% significance level. (Interestingly, no significant differences among the blocks.)
Dog 1 Dog 2 Dog 3 Dog 4 Dog 5 Dog 6 Dog 7 Dog 8 Dog 9 Dog 10 r̄i.
Isof 1 2 3 2 1 2 1 3 1 3 1.9
Halo 2 1 1 3 2 1 3 1 2 2 1.8
Cycl 3 3 2 1 3 3 2 2 3 1 2.3
Solution 7
Forty eight survival times: I = 3 poisons and J = 4 treatments with K = 4 observations per cell. Cell means for
the survival times
A B C D
I 4.125 8.800 5.675 6.100
II 3.200 8.150 3.750 6.625
III 2.100 3.350 2.350 3.250
Draw three profiles: I and II cross each other, and profile III is flatter. Three null hypotheses of interest:
HA : no poison effect,
HB : no treatment effect,
HAB : no interaction.
Source SS df MS F P
Columns (treatments) 91.9 3 30.63 14.01 0.0000
Rows (poisons) 103 2 51.52 23.57 0.0000
Interaction 24.75 6 4.124 1.887 0.1100
Error 78.69 36 2.186
Total 298.4 47
We reject HA and HB at the 1% significance level; we can not reject HAB even at the 10% significance level:
(b) Transformed data: death rate = 1/survival time. Cell means for the death rates
[Figure: two normal probability plots.]
A B C D
I 0.249 0.116 0.186 0.169
II 0.327 0.139 0.271 0.171
III 0.480 0.303 0.427 0.309
Draw three profiles: they look more parallel.
New data matrix y=x.^(-1). Results of anova2(y,4):
Source SS df MS F P
Columns (treatments) 0.204 3 0.068 28.41 0.0000
Rows (poisons) 0.349 2 0.174 72.84 0.0000
Interaction 0.01157 6 0.0026 1.091 0.3864
Error 0.086 36 0.0024
Total 0.6544 47
Solution 1
Test
H0 : same genotype frequencies for diabetics and normal
using the chi-squared test of homogeneity. The table below gives the expected counts along with the observed
counts:
Diabetic Normal Total
Bb or bb 12 (7.85) 4 (8.15) 16
BB 39 (43.15) 49 (44.85) 88
Total 51 53 104
Observed χ2 =5.10, df=1, p-value = 0.024. Reject H0 . Diabetics have genotype BB less often.
The exact Fisher test uses Hg(104, 51, 16/104) as the null distribution of the test statistic N11 = 12.
Normal approximation of the null distribution:
Hg(104, 51, 16/104) ≈ N(7.85, 3.41).
Since zobs = (12 − 7.85)/√3.41 = 2.245, the approximate two-sided p-value = 0.025.
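For comparison, the exact tail probability is available from the hypergeometric distribution (a sketch; hygecdf takes the population size, the number of tagged items, and the sample size):
pval = 2*(1-hygecdf(11,104,16,51))   % P(N11 >= 12), doubled; compare with the normal approximation above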
Solution 2
(a) H0 : no association of the disease and the ABO blood group:
O A AB B Total
Moderate 7 (10.4) 5 (9.8) 3 (2.0) 13 (6.2) 28
Minimal 27 (30.4) 32 (29.7) 8 (6.1) 18 (18.8) 85
Not present 55 (48.6) 50 (47.5) 7 (9.8) 24 (30.0) 136
Total 89 87 18 55 249
MM MN NN Total
Moderate 21 (16.7) 6 (9.4) 1 (1.9) 28
Minimal 54 (51.3) 27 (28.9) 5 (5.8) 86
Not present 74 (81.1) 51 (45.7) 11 (9.2) 136
Total 149 84 17 250
Solution 3
(a) Apply the chi-squared test of homogeneity:
(b) Goodness of fit chi-squared test for the same sex ratio for three father’s activities
Solution 4
We use the chi-squared test for homogeneity
No nausea Incidence of nausea Total
Placebo 70 (84) 95 (81) 165
Chlorpromazine 100 (78) 52 (74) 152
Dimenhydrinate 33 (43) 52 (42) 85
Pentobarbital (100 mg) 32 (34) 35 (33) 67
Pentobarbital (150 mg) 48 (43) 37 (42) 85
Total 283 271 554
The observed test statistic χ² = 35.8 with df = 4 gives p-value = 3·10⁻⁷. Comparing the observed and expected counts we conclude that Chlorpromazine is most effective in ameliorating postoperative nausea.
Solution 5
(a) H0 : no relation between blood group and disease in London:
Observed χ² = 42.40, df = 1, p-value = 0.000. Reject H0. Odds ratio Δ̂ = 1.45.
Observed χ² = 5.52, df = 1, p-value = 0.019. Reject H0. Odds ratio Δ̂ = 1.22.
(c) H0 : London Group A and Manchester Group A have the same propensity to Peptic Ulcer:
H0 : London Group O and Manchester Group O have the same propensity to Peptic Ulcer:
Solution 6
D = endometrical carcinoma, X = estrogen taken at least 6 months prior to the diagnosis of cancer.
Apply McNemar test for
H0 : π1· = π·1 vs H1 : π1· 6= π·1 .
Observed value of the test statistic
χ² = (113 − 15)²/(113 + 15) = 75
is highly significant, as √75 = 8.7 and the corresponding two-sided p-value obtained from the N(0,1) table is very small.
Solution 7
(a) The exact Fisher test uses Hg(30, 17, 16/30) as the null distribution of the test statistic whose observed value is x = 12. It gives
Solution 8
Denote
π1 = probability that red wins in boxing,
π2 = probability that red wins in freestyle wrestling,
π3 = probability that red wins in Greco-Roman wrestling,
π4 = probability that red wins in Tae Kwon Do.
(a, c) Assuming
Heq : π1 = π2 = π3 = π4 = π,
we test
H0 : π = 1/2 vs H1 : π ≠ 1/2.
We use the large sample test for proportion based on the statistic X = 245 whose null distribution is Bin(n, 1/2), n = 447. The two-sided p-value is approximated by
2(1 − Φ((245 − 447/2)/((1/2)·√447))) = 2(1 − Φ(2.034)) = 0.042.
(d) Is there evidence that wearing red is more favourable in some of the sports than others? We test
we find that the test statistic χ2 = 0.3 is not significant. We can not reject Heq , which according to (a) leads to
π̂ = 0.55.
Here we need a new chi-squared test, a chi-squared test for k proportions with k = 4 (see below). Given four
observed counts x1 = 148, x2 = 27, x3 = 25, x4 = 45, we obtain the following table of observed and expected
counts
Red Blue Total
Boxing 148 (134) 120 (134) 268
Freestyle wrestling 27 (25.5) 24 (25.5) 51
Greco-Roman wrestling 25 (24) 23 (24) 48
Tae Kwon Do 45 (40) 35 (40) 80
H0 proportions 0.5 0.5 1.00
whose null distribution is approximately χ²k. The last fact follows from
Σ_{i=1}^{k} [ (xi − nipi)²/(nipi) + (ni − xi − ni(1 − pi))²/(ni(1 − pi)) ] = Σ_{i=1}^{k} (xi − nipi)²/(nipi(1 − pi)) = Σ_{i=1}^{k} z²i,
Zi = (Xi − nipi)/√(nipi(1 − pi)), i = 1, . . . , k.
Using π̂i = xi/ni, we compute the likelihood ratio as
Λ = L(p1, . . . , pk)/L(π̂1, . . . , π̂k) = ∏_{i=1}^{k} pi^{xi}(1 − pi)^{ni−xi} / ∏_{i=1}^{k} (xi/ni)^{xi}((ni − xi)/ni)^{ni−xi} = ∏_{i=1}^{k} (nipi/xi)^{xi}·(ni(1 − pi)/(ni − xi))^{ni−xi}.
we find that
−2 ln Λ ≈ Σ_{i=1}^{k} (xi − nipi)²/(nipi(1 − pi)) = χ².
13.9 Solutions to Section 10 (multiple regression)
Solution 1
Recall that the sample covariance and the population covariance are
sxy = (1/(n − 1))·Σ_{i=1}^{n}(xi − x̄)(yi − ȳ), Cov(X, Y) = E(XY) − E(X)E(Y),
and
n²x̄ȳ = (Σ_{i=1}^{n} xi)·(Σ_{i=1}^{n} yi) = Σ_{i=1}^{n} xiyi + Σ_{i≠j} xiyj,
so that
Σ_{i=1}^{n}(xi − x̄)(yi − ȳ) = ((n − 1)/n)·Σ_{i=1}^{n} xiyi − (1/n)·Σ_{i≠j} xiyj.
Taking expectations gives (n − 1)E(XY) − (n − 1)E(X)E(Y), so that E(Sxy) = Cov(X, Y) and the sample covariance is indeed an unbiased estimate of the population covariance.
Solution 2
We have after ordering
x -1.75 -1.18 -0.88 -0.65 -0.30 0.34 0.50 0.68 1.38 1.40
y -1.59 -0.81 -0.98 -0.53 -0.72 0.27 0.64 0.35 1.34 1.28
and
x̄ = −0.046, ȳ = −0.075, sx = 1.076, sy = 0.996, r = 0.98.
(a) Simple linear regression model
Y = β0 + β1x + ε, ε ∼ N(0, σ²).
X = β0 + β1y + ε, ε ∼ N(0, σ²).
Estimated σ²:
s² = ((n − 1)/(n − 2))·s²x(1 − r²) = 0.06.
(c) First fitted line
y = −0.033 + 0.904 · x
is different from the second
y = −0.031 + 0.948 · x.
They are different since in (a) we minimise the vertical residuals, while in (b) the horizontal ones.
Solution 3
Using an extra explanatory variable f which equals 1 for females and 0 for males, we rewrite this model in the form
of a multiple regression
Y = f βF + (1 − f )βM + β1 x + = β0 + β1 x + β2 f + ,
where
β0 = βM , β2 = βF − βM .
Here p = 3 and the design matrix X has rows (1, xi, fi), i = 1, . . . , n.
After β0 , β1 , β2 are estimated, we compute
βM = β0 , βF = β0 + β2 .
Solution 4
(a) The predicted value Ŷ0 and the actual observation Y0 are independent random variables, therefore
Var(Ŷ0 − Y0) = Var(Ŷ0) + Var(Y0) = σ²·a²n,
where
a²n = 1 + (Var(B0) + x²0·Var(B1) + 2x0·Cov(B0, B1))/σ² = 1 + (\overline{x²} + x²0 − 2x̄x0)/((n − 1)s²x) = 1 + 1/n + (x0 − x̄)²/((n − 1)s²x).
(b) A 95% prediction interval I for the new observation Y0 is obtained from
(Y0 − Ŷ0)/(S·an) ∼ tn−2.
Since
0.95 = P(|Y0 − Ŷ0| ≤ tn−2(0.025)·s·an) = P(Y0 ∈ Ŷ0 ± tn−2(0.025)·s·an),
we conclude that a 95% prediction interval for the new observation Y0 is given by
I = b0 + b1x0 ± tn−2(0.025)·s·√(1 + 1/n + (x0 − x̄)²/((n − 1)s²x)).
The further x0 lies from x̄, the more uncertain the prediction becomes.
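A minimal Matlab sketch of this prediction interval, assuming data vectors x, y and a new point x0 (hypothetical names):
b = polyfit(x,y,1);                        % b(1) = slope b1, b(2) = intercept b0
n = length(x);
s = sqrt(sum((y-polyval(b,x)).^2)/(n-2));  % residual standard deviation
an = sqrt(1+1/n+(x0-mean(x))^2/((n-1)*var(x)));
I = polyval(b,x0)+[-1 1]*tinv(0.975,n-2)*s*an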
Solution 5
(a) Given x = 95, we predict the final score by ŷ = ȳ + r·(sy/sx)·(x − x̄) = 75 + 0.5·(95 − 75) = 85. Regression to mediocrity.
(b) Given y = 85 and not knowing the midterm score, we predict the midterm score by x̂ = x̄ + r·(sx/sy)·(y − ȳ) = 75 + 0.5·(85 − 75) = 80.
Solution 6
(a) Find the correlation coefficient ρ for (X, Y). Since EX = 0, we have
Cov(X, Y) = E(XY) = E(X² + βXZ) = 1, Var(Y) = Var(X) + β²·Var(Z) = 1 + β²,
and we see that the correlation coefficient is always positive:
ρ = 1/√(1 + β²).
Solution 7
Matlab commands (x and y are columns)
[b,bint,res,rint,stats]=regress(y,[ones(6,1),x])
[b,bint,res,rint,stats]=regress(sqrt(y),[ones(6,1),x])
give two sets of residuals - see the plot. Two simple linear regression models
y = −62.05 + 3.49·x, r² = 0.984,
√y = −0.88 + 0.2·x, r² = 0.993.
Kinetic energy formula explains why the second model is better.
14 Miscellaneous exercises
14.1 Exercises
Exercise 1
From Wikipedia:
”The American Psychological Association’s 1995 report Intelligence: Knowns and Unknowns stated
that the correlation between IQ and crime was -0.2. It was -0.19 between IQ scores and number of
juvenile offences in a large Danish sample; with social class controlled, the correlation dropped to -0.17.
A correlation of 0.20 means that the explained variance is less than 4%.”
Exercise 2
The Laplace distribution with a positive parameter λ is a two-sided exponential distribution. Its density function
is f (x) = λ2 e−λ|x| for x ∈ (−∞, ∞).
(a) The variance of this distribution is 2λ⁻² and kurtosis is 6. Prove this using the formula ∫₀^∞ x^k·e^{−x} dx = k!, valid for any natural number k.
(b) Take λ = √2. Plot carefully the density f(x) together with the standard normal distribution density.
(c) Use the drawn picture to explain the exact meaning of the following citation. ”Kurtosis is a measure of the
peakedness of the probability distribution of a real-valued random variable, although some sources are insistent
that heavy tails, and not peakedness, is what is really being measured by kurtosis”.
Exercise 3
The following 16 numbers came from a normal random number generator on a computer:
(a) Write down the likelihood function for the mean and variance of the generating normal distribution. (Hint:
to avoid tedious calculations on your calculator use the numbers in the next subquestion.)
(b) In what sense are the sum of the sample values (which is close to 58) and the sum of their squares (which is close to 260) sufficient statistics in this case?
(c) Turning to the log-likelihood function, compute the maximum likelihood estimates for the mean and variance.
Is the MLE for the variance unbiased?
Exercise 4
Questions concerning hypotheses testing methodology. Try to give detailed answers.
(a) Consider a hypothetical study of the effects of birth control pills. In such a case, it would be impossible to
assign women to a treatment or a placebo at random. However, a non-randomized study might be conducted by
carefully matching control to treatments on such factors as age and medical history.
The two groups might be followed up on for some time, with several variables being recorded for each subject
such as blood pressure, psychological measures, and incidences of various problems. After termination of the study,
the two groups might be compared on each of these many variables, and it might be found, say, that there was a
”significant difference” in the incidence of melanoma.
What is a common problem with such ”significant findings”?
(b) You analyse cross-classification data summarized in a two-by-two contingency table. You wanted to apply the chi-square test, but it showed that one of the expected counts was below 5. What alternative statistical test may you try applying?
(c) Why are tests like the rank sum test, the Friedman test, and the Kruskal-Wallis test often called distribution-free tests?
Exercise 5
A public policy polling group is investigating whether people living in the same household tend to make independent
political choices. They select 200 homes where exactly three voters live. The residents are asked separately for
their opinion (”yes” or ”no”) on a city charter amendment. The results of the survey are summarized in the table:
Number of saying ”yes” 0 1 2 3
Frequency 30 56 73 41
Based on these data can we claim that opinions are formed independently?
Exercise 6
Suppose you have a data of size n for which the linear regression model seems to work well. The key summary statis-
tics are represented by sample means x̄, ȳ, sample standard deviations sx , sy , and a sample correlation coefficient
r.
An important use of the linear regression model is forecasting. Assume we are interested in the response to a
particular value x of the explanatory variable.
(a) The exact 100(1 − α)% confidence interval for the mean response value is given by the formula:
b0 + b1x ± tα/2,n−2 · s · √(1/n + (1/(n−1))·((x − x̄)/sx)²).
Explain carefully the meaning and role of each of the terms.
(b) Another important formula in this context
b0 + b1x ± tα/2,n−2 · s · √(1 + 1/n + (1/(n−1))·((x − x̄)/sx)²)
is called the exact 100(1 − α)% prediction interval. Explain the difference between these two formulae. Illustrate
by a simple example.
(c) Comment on the predictor properties depending on the distance from the given value x to the sample mean
x̄. Illustrate using appropriate plots.
Exercise 7
In an experimental study two volunteer male subjects aged 23 and 25 underwent three treatments to compare a
new drug against no drug and placebo. Each volunteer had one treatment per day and the time order of these
three treatments was randomized.
Exercise 8
You have got a grant to measure the average weight of the hippopotamus at birth. You have seen in a previous
publication by Stanley and Livingstone that for male calves the distribution of weights has a mean of roughly 70
kg and a standard deviation of 10 kg, while these numbers are 60 kg and 5 kg for females, but you are interested
in a better remeasurement of the overall average.
The experimental procedure is simple: you wait for the herd of hippopotami to be sleeping, you approach a
newborn, you put it quickly on the scales, and you pray for the mother not to wake up. You managed to weigh 13
female and 23 male newborns with the following results:
Female Male
Sample mean 62.8 69.7
Sample standard deviation 6.8 11.7
(a) Test the null hypothesis of the equal sex ratio for the newborn hippopotami (meaning that the ratio of
males to females at birth is 1 to 1).
(b) Assuming the ratio of males to females at birth is 1 to 1, suggest two different unbiased point estimates for
the overall average weight of the hippopotamus at birth.
(c) Compute the standard errors for these point estimates.
(d) What assumptions do you make for these calculations?
Exercise 9
Identify important statistical terms hiding behind “bla-bla-bla” in the following extracts from the Internet (one
term per item).
(a) From a sample you can only get one value of a statistic like trimmed mean. You do not know the confidence
interval of the trimmed mean or its distribution. Bla-bla-bla samples give more detail on the sampling distribution
of the statistic of interest.
(b) Bla-bla-bla is a measure of the ”peakedness” of the probability distribution of a real-valued random variable,
although some sources are insistent that heavy tails, and not peakedness, is what is really being measured by bla-
bla-bla.
(c) Note that bla-bla-bla is the probability of finding a difference that does exist, as opposed to the likelihood
of declaring a difference that does not exist (which is known as a Type I error, or ”false positive”).
(d) The bla-bla-bla is a measure of the tendency to fail; the greater the value of the bla-bla-bla, the greater the
probability of impending failure... The bla-bla-bla is also known as the instantaneous failure rate.
(e) Naive interpretation of statistics derived from data sets that include bla-bla-bla may be misleading. For
example, if one is calculating the average temperature of 10 objects in a room, and most are between 20 and 25
degrees Celsius, but an oven is at 175 C, the median of the data may be 23 C but the mean temperature will be
between 35.5 and 40 C. In this case, the median better reflects the temperature of a randomly sampled object than
the mean; however, naively interpreting the mean as ”a typical sample”, equivalent to the median, is incorrect. As
illustrated in this case, bla-bla-bla may be indicative of data points that belong to a different population than the
rest of the sample set.
Exercise 10
Officials of a small transit system with only five buses want to evaluate four types of tires with respect to wear.
Applying a randomized block design, they decided to put one tire of each type on each of the five buses. The tires
are run for 15,000 miles, after which the tread wear, in millimeters, is measured.
(a) State the most appropriate null hypothesis by referring to a suitable parametric model. What are the main
assumptions of the parametric model?
(b) Using a non-parametric procedure test the null hypothesis of no difference between the four types of tires.
(c) What kind of external effects are controlled by the suggested randomised block design? How should the wheel positions for the different tire types be assigned on each of the five buses?
Exercise 11
A study is conducted of the association between the rate at which words are spoken and the ability of a “talking
computer” to recognise commands that it is programmed to accept. A random sample of 50 commands is spoken
first at a rate under 60 words per minute, and then the SAME commands are repeated at a rate over 60 words per
minute. In the first case the computer recognised 42 out of 50 commands while in the second case it recognised
only 35 commands. Is the observed difference statistically significant?
Exercise 12
Suppose your prior beliefs about the probability p of success have mean 1/3 and variance 1/32. What is the
posterior mean after having observed 8 successes in 20 trials?
Exercise 13
The data of the following table were gathered for an environmental impact study that examined the relationship
between the depth of a stream and the rate of its flow
Depth Flow rate
0.34 0.64
0.29 0.32
0.28 0.73
0.42 1.33
0.29 0.49
0.41 0.92
0.76 7.35
0.73 5.89
0.46 1.98
0.40 1.12
(a) Draw the scatter plot for the given data using the x axis for depth. Fit by eye a regression line and plot
the residuals against the depth. What does it say to you about the relevance of the simple linear regression model
for this particular data?
(b) The least square estimates for the parameters of the simple linear regression model are b0 = −3.98, b1 =
13.83. Given the standard deviations are sx = 0.17 and sy = 2.46 estimate the noise size (σ) and find the coefficient
of determination.
(c) The statistics for a quadratic model are given in the following table:
Coefficient Estimate Standard error t value
β0 1.68 1.06 1.59
β1 -10.86 4.52 -2.40
β2 23.54 4.27 5.51
Compute a 95 percent confidence interval for β0.
(d) Is the quadratic term statistically significant? Carefully explain.
Exercise 14
The article ”Effects of gamma radiation on juvenile and mature cuttings of quaking aspen” (Forest science, 1967)
reports the following data on exposure time to radiation (x, in kr/16 hr) and dry weight of roots (y, in mg×10−1 ):
x 0 2 4 6 8
y 110 123 119 86 62
The estimated quadratic regression function is y = 111.9 + 8.1x − 1.8x2 .
(a) What is the underlying multiple regression model? Write down the corresponding design matrix.
(b) Compute the predicted responses. Find an unbiased estimate s2 of the noise variance σ 2 .
(c) Compute the coefficient of multiple determination.
Exercise 15
The accompanying data resulted from an experiment carried out to investigate whether yield from a certain chemical
process depended either on the formulation of a particular input or on mixer speed.
Speed
60 70 80 Means
1 189.7 185.1 189.0
1 188.6 179.4 193.0 187.03
1 190.1 177.3 191.1
Formulation
2 165.1 161.7 163.3
2 165.9 159.8 166.6 164.66
2 167.6 161.6 170.3
Means 177.83 170.82 178.88 175.84
A statistical computer package gave
SSForm = 2253.44, SSSpeed = 230.81, SSForm×Speed = 18.58, SSE = 71.87.
(a) Calculate estimates of the main effects.
(b) Does there appear to be interaction between the factors? In which ways could interaction between two such factors manifest itself? Illustrate with graphs.
(c) Does yield appear to depend on either formulation or speed?
(d) Why is it important to inspect the scatter plot of residuals?
Exercise 16
A study of the relationship between facility conditions at gasoline stations and aggressiveness in the pricing of
gasoline is based on n = 441 stations.
Pricing policy
Aggressive Neutral Nonaggressive Total
Substandard condition 24 15 17 56
Standard condition 52 73 80 205
Modern condition 58 86 36 180
Total 134 174 133 441
(a) Suggest a parametric model for the data and write down the corresponding likelihood function.
(b) What is a relevant null hypothesis for the data?
(c) Properly analyse the data and draw your conclusions.
Exercise 17
Mice were injected with a bacterial solution: some of the mice were also given penicillin. The results were
(a) Find a 95% confidence interval for the difference between two probabilities of survival.
(b) Assume that both groups have the probability of survival p. How would you compute an exact credibility
interval for the population proportion p, if you could use a computer? Compute an approximate 95% credibility
interval using a normal approximation.
Exercise 18
In a controlled clinical trial which began in 1982 and ended in 1987, more than 22000 physicians participated. The
participants were randomly assigned in two groups: Aspirin and Placebo. The aspirin group have been taking
325 mg aspirin every second day. At the end of trial, the number of participants who suffered from myocardial
infarctions was assessed.
The popular measure in assessing the results in clinical trials is Risk Ratio
RR = RA/RP = (104/11037)/(189/11034) = 0.55.
(a) How would you interpret the obtained value of the risk ratio? What ratio of conditional probabilities is
estimated by RR?
(b) Is the observed value of RR significantly different from 1?
Exercise 19
Given a sample (x1, . . . , xn) of independent and identically distributed observations, we are interested in testing H0: m = m0 against the two-sided alternative H1: m ≠ m0 concerning the population median m. No parametric model is assumed. As a test statistic we take y = Σ_{i=1}^{n} 1{xi≤m0}, the number of observations below the null hypothesis value.
(a) Find the exact null distribution of Y . What are your assumptions?
(b) Suppose n = 25. Suggest an approximate confidence interval formula for m.
Exercise 20
Consider the problem of comparison of two simple hypotheses H0 : p = p0 , H1 : p = p1 with p1 > p0 using the
large-sample test for the proportion.
(a) Let Y have a binomial distribution with parameters (n, p). The power function of the one-sided test is given
by
Pw(p1) = P((Y − np0)/√(np0(1 − p0)) ≥ zα | p = p1).
Hint: under the alternative hypothesis, (Y − np1)/√(np1(1 − p1)) is approximately normally distributed with parameters (0, 1).
(c) What happens to the planned sample size if the alternatives are very close to each other? What happens if
we decrease the levels α and β?
Exercise 21
A sports statistician studied the relation between the time (Y in seconds) for a particular competitive swimming
event and the swimmer's age (X in years) for 20 swimmers with ages ranging from 8 to 18. She employed a quadratic regression model and obtained the following result.
The standard error for the curvature effect coefficient was estimated as sb2 = 0.1157.
(a) Plot the estimated regression function. Would it be reasonable to use this regression function when the
swimmer’s age is 40?
(b) Construct a 99 percent confidence interval for the curvature effect coefficient. Interpret your interval esti-
mate.
(c) Test whether or not the curvature effect can be dropped from the quadratic regression model, controlling
the α risk at 0.01. State the alternatives, the decision rule, the value of the test statistic, and the conclusion. What
is the P -value of the test?
Exercise 22
In the Bayesian estimation framework we search for an optimal action
The optimal choice depends on the particular form of the loss function l(θ, a). Bayes action minimizes the posterior risk
R(a|x) = ∫ l(θ, a)h(θ|x) dθ or R(a|x) = Σθ l(θ, a)h(θ|x).
(a) Explain the meaning of the posterior risk function. What does h(θ|x) stand for? How is h(θ|x) computed?
(b) The zero-one loss function is defined by l(θ, a) = 1{θ≠a}. Compute the posterior risk using the discrete distribution formula. Why is it called the probability of misclassification?
(c) What Bayesian estimator corresponds to the optimal action with the zero-one loss function? Compare this
estimator to the maximum likelihood estimator.
Exercise 23
See the picture to the right. From this observation we would like to estimate the amount of work required to clean a street from chewing gums.
(d) Estimate the proportion of tiles free from chewing gums using the fitted Poisson model.
Exercise 24
Miscellaneous questions.
(a) Describe a situation when a stratified sampling is more effective than a simple random sampling for esti-
mating the population mean. Which characteristics of the strata will influence your sample allocation choice?
(b) Given a dataset how do you compute kurtosis? What is the purpose of this summary statistic? Why is it
important to compute the coefficient of skewness for a proper interpretation of the kurtosis value?
(c) What is the difference between the parametric and non-parametric bootstrap methods?
(d) Suppose we are interested in the average height for a population of size 2,000,000. To what extent can a sample of 200 individuals be representative of the whole population?
Exercise 25
Three different varieties of tomato (Harvester, Pusa Early Dwarf, and Ife No. 1) and four different plant densities
(10, 20, 30, and 40 thousands plants per hectare) are being considered for planting in a particular region. To see
whether either variety or plant density affects yield, each combination of variety and plant density is used in three
different plots, resulting in the following data on yields:
Variety Density 10,000 Density 20,000 Density 30,000 Density 40,000 mean
H 10.5, 9.2, 7.9 12.8, 11.2, 13.3 12.1, 12.6, 14.0 10.8, 9.1, 12.5 11.33
Ife 8.1, 8.6, 10.1 12.7, 13.7, 11.5 14.4, 15.4, 13.7 11.3, 12.5, 14.5 12.21
P 16.1, 15.3, 17.5 16.6, 19.2, 18.5 20.8, 18.0, 21.0 18.4, 18.9, 17.2 18.13
mean 11.48 14.39 15.78 13.91 13.89
(a) Complete the following ANOVA table:

Source of variation   SS      df   MS   F
Varieties
Density
Interaction           8.03
Errors                38.04
(b) Clearly state the three pairs of hypotheses of interest. Test them using the normal theory approach.
(c) Estimate the noise size σ.
Exercise 26
For each of nine horses, a veterinary anatomist measured the density of nerve cells at specified sites in the intestine:
Animal Site I Site II
1 50.6 38.0
2 39.2 18.6
3 35.2 23.2
4 17.0 19.0
5 11.2 6.6
6 14.2 16.4
7 24.2 14.4
8 37.4 37.6
9 35.2 24.4
The null hypothesis of interest is that in the population of all horses there is no difference between the two sites.
(a) Which of the two non-parametric tests is appropriate here: the rank-sum test or the signed-rank test?
Explain your choice.
(b) On the basis of the data, would you reject the null hypothesis? Use one of the tests named in item (a).
(c) Explain the following extract from the course text book:
More precisely, with the signed rank test, H0 states that the distribution of the differences is symmetric
about zero. This will be true if the members of pairs of experimental units are assigned randomly to
treatment and control conditions, and the treatment has no effect at all.
Exercise 27
Suppose that the grades of 10 students on a midterm exam and a final exam have a correlation coefficient of 0.5,
and that both exams have an average score of 75 and a standard deviation of 10.
(a) Sketch a scatterplot illustrating performance on two exams for this group of 10 students.
(b) If Carl’s score on the midterm is 90, what would you predict his score on the final to be? How uncertain is
this prediction?
(c) If Maria scored 80 on the final, what would you guess that her score on the midterm was?
(d) Exactly what assumptions did you make in your calculations in (b) and (c)?
Exercise 28
The gamma distribution Gamma(α, λ) is a conjugate prior for the Poisson data distribution with parameter θ. If
x is a single observed value randomly sampled from the Poisson distribution, then the parameters (α′, λ′) of the
posterior gamma distribution of θ are found by the following updating rule:
- the shape parameter α′ = α + x,
- the inverse scale parameter λ′ = λ + 1.
(a) Find θ̂PME, the posterior mean estimate of θ, under the exponential prior with parameter 1, given the
following iid sample values from the Poisson(θ) population distribution:
x1 = 2, x2 = 0, x3 = 2, x4 = 5.
(b) What is the updating rule for an arbitrary sample size n? Compare the value of θ̂PME with the maximum
likelihood estimator θ̂MLE as n → ∞. Your conclusions?
Exercise 29
Extracorporeal membrane oxygenation (ECMO) is a potentially life-saving procedure that is used to treat newborn
babies who suffer from severe respiratory failure. An experiment was conducted in which 29 babies were treated
with ECMO and 10 babies were treated with conventional medical therapy (CMT). In the ECMO group only 1
patient died, while in the CMT group 4 patients died.
(a) Suggest a statistical model and compute the likelihood function for the data as a function of two parameters:
p, the probability of death under the ECMO treatment, and q, the probability of death under the CMT treatment.
(b) Write down a relevant pair of statistical hypotheses in the parametric form. Perform the exact Fisher test.
Exercise 30
Suppose that we have an iid sample of size 100 from the normal distribution with mean µ and standard deviation
σ = 10. For H0 : µ = 0 and H1 : µ ≠ 0 we use the absolute value of the sample mean T = |X̄| as the test statistic.
Denote by V the P-value of the test.
(a) Show that V = 2(1 − Φ(Tobs )), where Tobs is the observed value of the test statistic and Φ(x) is the standard
normal distribution function. Plot the null distribution curve for X̄ and graphically illustrate this formula.
(b) In what sense is the P-value V a random variable? Using (a), show that P(V ≤ 0.05) = P(|X̄| > 1.96).
(c) Suppose that the true value of the population mean is µ = 4. Using (b) show that P (V ≤ 0.05) ≈ 0.975.
Illustrate by drawing the density curve for the true distribution of X̄.
(d) Comment on the result (c) in the light of the statement: ”P values, the ’gold standard’ of statistical validity,
are not as reliable as many scientists assume”.
Exercise 31
A population with mean µ consists of three subpopulations with means µ1 , µ2 , µ3 and the same variance σ 2 . Three
independent iid samples, each of size n = 13, from the three subpopulation distributions gave the following sample
means and standard deviations:
Sample 1 Sample 2 Sample 3
Mean 6.3 5.6 6.0
SD 2.14 2.47 3.27
(a) Compute a stratified sample mean, assuming that the three subpopulation sizes have the ratios
N1 : N2 : N3 = 0.3 : 0.2 : 0.5. Prove that this is an unbiased estimate of the population mean µ.
(b) Assume that all three subpopulation distributions are normal. Write down simultaneous confidence intervals
for the three differences µ1 − µ2 , µ1 − µ3 , and µ2 − µ3 .
(c) Would you reject the null hypothesis of equality µ1 = µ2 = µ3 in this case?
Exercise 32
The following table shows admission rates for the six most popular majors at the graduate school at the University
of California at Berkeley. The numbers in the table are the number of applicants and the percentage admitted.
Men Women
Major A 825 (62%) 108 (82%)
Major B 560 (63%) 25 (68%)
Major C 325 (37%) 593 (34%)
Major D 417 (33%) 375 (35%)
Major E 191 (28%) 393 (34%)
Major F 373 (6%) 341 (7%)
(a) If the percentages admitted are compared major by major, women do not seem to be unfavourably treated.
But when the combined admission rates for all six majors are calculated, it is found that 44% of the men and only
30% of the women were admitted. How is this paradox resolved?
(b) This is an example of an observational study. Suggest a controlled experiment testing relevant statistical
hypotheses.
Exercise 33
Represent the large sample test for a proportion as a chi-square test.
Answer 2
(a) Since the mean is 0, the variance is computed as

σ² = ∫_{−∞}^{∞} x² f(x) dx = λ ∫_0^{∞} x² e^{−λx} dx = 2λ^{−2}.
(b) The Laplace curve is symmetric. Its shape is formed by two exponentially declining curves: one for positive
x and the other for negative x.
(c) For λ = √2 the mean is 0, the skewness is 0, and the kurtosis is 6. Compared to the normal curve with the
same mean but smaller kurtosis (= 3), the Laplace distribution has heavier tails. Moreover, since the variances are
equal, the two curves must cross 4 times. This implies that the Laplace curve must also have a higher peak.
Answer 3
(a) Given Σxi = 58 and Σxi² = 260 with n = 16, the likelihood function is

L(µ, σ²) = ∏_{i=1}^{n} (1/(√(2π) σ)) e^{−(xi−µ)²/(2σ²)} = (2π)^{−n/2} σ^{−n} e^{−(Σxi² − 2µΣxi + nµ²)/(2σ²)} = (2π)^{−8} σ^{−16} e^{−(260 − 116µ + 16µ²)/(2σ²)}.

(b) It is sufficient to know Σxi and Σxi² to compute the likelihood function.
(c) The MLE for the mean is x̄ = 3.63 and the MLE for the variance is σ̂² = n^{−1}Σxi² − x̄² = 3.11. These are
computed by taking the derivatives of the log-likelihood

l(µ, σ²) := ln L(µ, σ²) = −(n/2) ln(2π) − n ln σ − (Σxi² − 2µΣxi + nµ²)/(2σ²)

and solving the pair of equations

(−2Σxi + 2nµ)/(2σ²) = 0,
−n/σ + (Σxi² − 2µΣxi + nµ²)/σ³ = 0.

Since

E(n^{−1}ΣXi² − X̄²) = σ² − σ²/n = ((n − 1)/n) σ²,

σ̂² is a biased estimate of σ².
Answer 4
(a) Multiple testing.
(c) Nonparametric tests do not assume a particular form of the population distribution, such as the normal distribution.
Answer 5
The null hypothesis is that everybody votes independently. Let p be the population proportion for ’yes’. Then
the number of ’yes’ for three voters in a household has the binomial distribution model X ∼ Bin(3, p) with an
unspecified parameter p. So the null hypothesis can be expressed in the following form
H0 : p0 = (1 − p)3 , p1 = 3p(1 − p)2 , p2 = 3p2 (1 − p), p3 = p3 .
The MLE of p is the sample proportion of 'yes' votes, p̂ = 0.5417. We use the Pearson chi-square test with expected counts

E0 = n(1 − p̂)³ = 19, E1 = 3np̂(1 − p̂)² = 68, E2 = 3np̂²(1 − p̂) = 81, E3 = np̂³ = 32.
The observed chi-square test statistic is χ² = 11.8, which has a P-value smaller than 0.5% according to the
approximate null distribution χ² with df = 4 − 1 − 1 = 2.
Reject the null hypothesis of independent voting.
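For readers who want to check such a computation numerically, here is a minimal Python sketch (our own illustration, not from the textbook; the function name is ours, and the observed counts, which are not reproduced in this solution, must be supplied):

import numpy as np
from scipy.stats import chi2

def household_chi2(observed):
    """Pearson chi-square test of H0: the number of 'yes' votes per
    3-person household is Bin(3, p), with p estimated by its MLE."""
    observed = np.asarray(observed, dtype=float)   # counts of households with 0, 1, 2, 3 'yes'
    n = observed.sum()
    p_hat = observed @ np.arange(4) / (3 * n)      # MLE: mean number of 'yes' divided by 3
    probs = np.array([(1 - p_hat) ** 3,
                      3 * p_hat * (1 - p_hat) ** 2,
                      3 * p_hat ** 2 * (1 - p_hat),
                      p_hat ** 3])
    expected = n * probs
    stat = ((observed - expected) ** 2 / expected).sum()
    return stat, chi2.sf(stat, df=4 - 1 - 1)       # one parameter estimated from the data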
Answer 7
(b) Friedman's test for I = 3 treatments and J = 2 blocks. The test statistic

Q = (12J/(I(I + 1))) Σ_{i=1}^{I} (R̄i. − (I + 1)/2)²

is obtained from the ranks (Rij) given by the two subjects to the three treatments. Under the null hypothesis all
36 possible rank matrices (rows = treatments, columns = subjects; rows separated by semicolons)

(Rij) = (1 1; 2 2; 3 3), (1 1; 2 3; 3 2), (1 2; 2 1; 3 3), . . . , (3 1; 2 2; 1 3), (3 1; 2 3; 1 2), (3 3; 2 2; 1 1)

are equally likely.
are equally likely. The corresponding vector of rank averages (R̄1. , R̄2. , R̄3. ) takes 5 values (up to permutations)
A1 = (1, 2, 3), A2 = (1, 2.5, 2.5), A3 = (1.5, 1.5, 3), A4 = (1.5, 2, 2.5), A5 = (2, 2, 2)
1, 2, 3 1, 3, 2 2, 1, 3 2, 3, 1 3, 1, 2 3, 2, 1
1, 2, 3 A1 A2 A3 A4 A4 A5
1, 3, 2 A2 A1 A4 A3 A5 A4
2, 1, 3 A3 A4 A1 A5 A2 A4
2, 3, 1 A4 A3 A5 A1 A4 A2
3, 1, 2 A4 A5 A2 A4 A1 A3
3, 2, 1 A5 A4 A4 A2 A3 A1
Next we have:

value of (R̄1. , R̄2. , R̄3. )    A1    A2    A3    A4    A5
Σ_{i=1}^{3} (R̄i. − 2)²          2     1.5   1.5   0.5   0
probability                     1/6   1/6   1/6   1/3   1/6

Since here Q = 2 Σ_{i=1}^{3} (R̄i. − 2)², the null distribution of Q is

P(Q = 0) = 1/6,  P(Q = 1) = 1/3,  P(Q = 3) = 1/3,  P(Q = 4) = 1/6.
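The null distribution above can be verified by brute force. Here is a minimal Python sketch (our own illustration, not part of the course material) that enumerates all 36 rank matrices and tabulates Q:

from itertools import permutations

I, J = 3, 2                                   # treatments, blocks
null = {}
for r1 in permutations(range(1, I + 1)):      # ranks given by subject 1
    for r2 in permutations(range(1, I + 1)):  # ranks given by subject 2
        rbar = [(a + b) / 2 for a, b in zip(r1, r2)]
        q = 12 * J / (I * (I + 1)) * sum((r - (I + 1) / 2) ** 2 for r in rbar)
        null[q] = null.get(q, 0) + 1
for q in sorted(null):
    print(f"P(Q = {q:g}) = {null[q]}/36")     # Q = 0, 1, 3, 4 with counts 6, 12, 12, 6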
Answer 8
(a) Binomial model for the number of females, Y ∼ Bin(36, p). Given Yobs = 13 we have to test H0 : p = 0.5
against the two-sided alternative H1 : p ≠ 0.5. The approximate null distribution is Y ∼ N(18, 9); therefore, the
approximate two-sided P-value becomes

2(1 − Φ((18 − 13)/3)) = 2(1 − Φ(1.67)) = 2 × 0.048 = 9.6%.

With such a high P-value we cannot reject the null hypothesis of an equal sex ratio.
Computing the sample variance from the simple random sample takes some effort. First we observe that

Σ(x1i − x̄1)² = Σx1i² − n1 x̄1².

It follows that

Σx1i² = (n1 − 1)s1² + n1 x̄1² = 12 × (6.8)² + 13 × (62.8)² = 51825,

Σx2j² = (n2 − 1)s2² + n2 x̄2² = 22 × (11.7)² + 23 × (69.7)² = 114748,

s² = (Σx1i² + Σx2j²)/35 − (36/35)x̄² = 114.4,

so that sx̄ = √(114.4/36) = 1.78.
Answer 9
(a) Bootstrap.
(b) Kurtosis.
(e) Outliers.
Answer 10
(a) Under the two-way ANOVA model the most interesting null hypothesis is H0 : α1 = α2 = α3 = α4 = 0, the
hypothesis of no difference among the different types of tires.
Answer 11
Matched pairs design for 50 independent trials with four possible outcomes (correct, correct), (correct, wrong),
(wrong, correct), (wrong, wrong). Assuming that in the slow regime the ”talking computer” recognizes correctly
all correct answers made in the fast regime we can summarize the results as follows
Fast correct Fast wrong Totals
Slow correct 35 7 42
Slow wrong 0 8 8
Totals 35 15 50
McNemar's test statistic is (7 − 0)²/(7 + 0) = 7. The null distribution is approximated by the χ²-distribution with
df = 1. Since the square root of 7 is 2.65, the standard normal distribution gives a two-sided P-value of 0.8%. We
conclude that the observed difference is statistically significant.
The conclusion will be different if our assumption is wrong. In the worst case the slow-regime correct answers
are totally different and the table of outcomes looks as follows:
Fast correct Fast wrong Totals
Slow correct 27 15 42
Slow wrong 8 0 8
Totals 35 15 50
McNemar's test statistic is then (15 − 8)²/(15 + 8) = 2.13. Since the square root of 2.13 is 1.46, the standard
normal distribution gives a two-sided P-value of 14%. We cannot reject the null hypothesis in this case.
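A minimal Python sketch of this computation (our own illustration; it mirrors the normal approximation used above rather than an exact test):

from math import sqrt
from scipy.stats import norm

def mcnemar_two_sided_p(b, c):
    """McNemar statistic (b - c)^2 / (b + c) for discordant counts b, c,
    with a two-sided P-value from the chi-square(1) null distribution."""
    stat = (b - c) ** 2 / (b + c)
    return stat, 2 * (1 - norm.cdf(sqrt(stat)))

print(mcnemar_two_sided_p(7, 0))    # (7.0, ~0.008): significant
print(mcnemar_two_sided_p(15, 8))   # (~2.13, ~0.14): not significant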
Answer 12
We use a Beta prior with parameters (a, b) satisfying

a/(a + b) = 1/3,   (1/3)(1 − 1/3)/(a + b + 1) = 1/32.

The prior pseudo-counts are well approximated by a = 2 and b = 4. Thus the posterior Beta distribution has
parameters (10, 16), giving the posterior mean estimate p̂pme = 0.38.
Answer 13
(b) First we find the sample correlation coefficient r = b1 (sx/sy) = 0.96. The coefficient of determination is
r² = 0.91. Using the formula

s² = ((n − 1)/(n − 2)) sy² (1 − r²) = 0.589,

the noise size is estimated as s = √0.589 = 0.77.
(c) An exact 95% CI for β0 is b0 ± tn−3(0.025) sb0 = 1.68 ± 2.365 × 1.06 = [−0.83, 4.19].
(d) The observed test statistic t = 5.51 for the model utility test of H0 : β2 = 0 has the exact null distribution
t7. Consulting the t7-distribution table, we reject this null hypothesis at the 0.5% significance level. The quadratic
term is therefore highly statistically significant.
Answer 14
(a) Multiple regression model Yi = β0 + β1 xi + β2 xi² + εi, where the random variables εi, i = 1, . . . , 5, are
independent and have the same normal distribution N(0, σ²). The corresponding design matrix has the form

        1  0  0
        1  2  4
X =     1  4  16
        1  6  36
        1  8  64

The fitted values are

xi     0      2      4      6      8
yi     110    123    119    86     62
ŷi     111.9  120.9  115.5  95.7   61.5

and then s² = ‖y − ŷ‖²/(n − p) = 114.6/2 = 57.3. Moreover,

R² = 1 − SSE/SST = 1 − 114.6/2630 = 0.956.
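A minimal numpy sketch of the least-squares computation (our own illustration; since the fitted values quoted above are rounded, the sketch may reproduce them only approximately):

import numpy as np

x = np.array([0., 2., 4., 6., 8.])
y = np.array([110., 123., 119., 86., 62.])

X = np.column_stack([np.ones_like(x), x, x ** 2])  # the design matrix above
beta, *_ = np.linalg.lstsq(X, y, rcond=None)       # least-squares estimates b0, b1, b2
y_hat = X @ beta
sse = np.sum((y - y_hat) ** 2)
s2 = sse / (len(y) - 3)                            # s^2 = SSE / (n - p)
r2 = 1 - sse / np.sum((y - y.mean()) ** 2)         # R^2 = 1 - SSE / SST
print(beta, s2, r2)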
Answer 15
(a) In terms of the two-way ANOVA model Yijk = µ + αi + βj + δij + εijk (grand mean + main effects +
interaction + noise), we estimate the main effects from the cell means below as

α̂1 = 11.2, α̂2 = −11.2, β̂1 = 1.99, β̂2 = −5.02, β̂3 = 3.04,
Formulation        Speed 60   Speed 70   Speed 80
1                  189.7      185.1      189.0
1                  188.6      179.4      193.0
1                  190.1      177.3      191.1
1: cell means      189.5      180.6      191.0
2                  165.1      161.7      163.3
2                  165.9      159.8      166.6
2                  167.6      161.6      170.3
2: cell means      166.2      161.0      166.7
From the cell means we draw two lines of yield against speed, one for each formulation; see the left panel of the
figure below. These two lines are almost parallel, indicating the absence of interaction between the two main
factors. This is confirmed by the ANOVA table below, showing that the interaction is not significant.
[Figure: yield (160 to 190) plotted against speed (60, 70, 80); left panel: the two observed, nearly parallel lines for
formulations 1 and 2; right panel: a hypothetical pattern with interaction.]
One possible interaction effect could have the form shown in the right panel: formulation 2 interacts with the
speed factor in such a way that the yield becomes largest at speed 70.
(d) To check the normality assumption for the noise with the same variance across different values of the
explanatory variable.
Answer 16
(a) This is a single sample of size n = 441. Each of the n observations falls into one of 9 groups, which leads to
the multinomial distribution model

(n11, n12, n13, n21, n22, n23, n31, n32, n33) ∼ Mn(n, p11, p12, p13, p21, p22, p23, p31, p32, p33).
(c) The chi-square test statistic χ² = 22.5 should be compared with the critical values of the χ²-distribution
with df = 4. Even though the corresponding table is not given, we may guess that the result must be significant,
as the square root of 22.5 is quite large. We reject the null hypothesis of independence and conclude that there is
a relationship between facility conditions at gasoline stations and aggressiveness in the pricing of gasoline.
Pricing policy
Aggressive Neutral Nonaggressive Total
Substandard condition 24 (17) 15 (22) 17 (17) 56
Standard condition 52 (62.3) 73 (80.9) 80 (61.8) 205
Modern condition 58 (54.7) 86 (71) 36 (54.3) 180
Total 134 174 133 441
It looks like the standard conditions are coupled with the least aggressive pricing strategy.
Answer 17
(a) Two independent dichotomous samples with n = 56, p̂1 = 8/56 = 0.143 and m = 74, p̂2 = 12/74 = 0.162. An
asymptotic 95% confidence interval for the population difference is given by

Ip1−p2 = p̂1 − p̂2 ± 1.96 √( p̂1(1 − p̂1)/(n − 1) + p̂2(1 − p̂2)/(m − 1) ) = −0.019 ± 0.125 = [−0.144, 0.106].
(b) For a credibility interval we can use the non-informative uniform prior p ∈ Beta(1, 1). Adding the pseudo-
counts (1, 1) to the total counts (8 + 12, 48 + 62) we get p ∈ Beta(21, 111) as the posterior distribution. Using
Matlab one can find the exact 95% credibility interval [a, b] for p by finding the 2.5% and 97.5% quantiles of the
posterior distribution.

The posterior mean µ = 21/132 = 0.16 and standard deviation σ = √(0.16(1 − 0.16)/132) = 0.03 lead to a
normal approximation of the posterior distribution with mean 0.16 and standard deviation 0.03. This yields an
approximate 95% credibility interval

Jp = 0.16 ± 1.96 × 0.03 = [0.10, 0.22].
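Readers without Matlab can obtain the same quantiles from any statistical library; a minimal Python sketch (our own illustration):

from scipy.stats import beta

posterior = beta(21, 111)                      # posterior from the pooled counts above
print(posterior.ppf(0.025), posterior.ppf(0.975))
# exact 95% credibility interval, close to the normal approximation [0.10, 0.22]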
Answer 18
(a) The risk ratio compares the chances of suffering a myocardial infarction under the aspirin treatment with the
chances of suffering a myocardial infarction under the placebo treatment:

RR = P(MyoInf | Aspirin) / P(MyoInf | Placebo).
Answer 19
(a) The null distribution of Y is Bin(n, 1/2), as each observation is smaller than the true median (assuming that
the distribution is continuous) with probability 0.5.
(b) A non-parametric CI for the median M is given by (x(k), x(n−k+1)), where k is chosen so that the coverage
probability P(k ≤ Y ≤ n − k) = 1 − 2P(Y ≤ k − 1) matches the desired confidence level.
Answer 20
(a) The null distribution of Y is approximately normal with parameters (np0, np0q0), where q0 = 1 − p0. At the
significance level α, the rejection region for the one-sided alternative is

(y − np0)/√(np0q0) ≥ zα.

The power function is the probability of rejecting the null hypothesis given that the alternative is true:

Pw(p1) = P( (Y − np0)/√(np0q0) ≥ zα | p = p1 ).

(b) Since under the alternative hypothesis Y is approximately normally distributed with parameters (np1, np1q1),
we get

β ≈ Φ( (zα√(p0q0) + √n(p0 − p1)) / √(p1q1) ),

which yields the desired formula for the optimal sample size:

√n = (zα√(p0q0) + zβ√(p1q1)) / |p1 − p0|.

(c) If the alternatives are very close to each other, the denominator goes to zero and the required sample size
becomes very large. This is intuitive, as it becomes more difficult to distinguish between two close parameter
values. If we decrease the levels α and β, the values zα and zβ from the normal distribution table become larger,
and the sample size will be larger as well. Clearly, if you want more control over both types of error, you have to
pay by collecting more data.
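The sample-size formula is easy to put to work; a minimal Python sketch (our own illustration, with hypothetical numerical inputs):

from math import ceil, sqrt
from scipy.stats import norm

def sample_size(p0, p1, alpha=0.05, beta=0.2):
    """Approximate n for the one-sided large-sample proportion test,
    using sqrt(n) = (z_a sqrt(p0 q0) + z_b sqrt(p1 q1)) / |p1 - p0|."""
    za, zb = norm.ppf(1 - alpha), norm.ppf(1 - beta)
    root_n = (za * sqrt(p0 * (1 - p0)) + zb * sqrt(p1 * (1 - p1))) / abs(p1 - p0)
    return ceil(root_n ** 2)

print(sample_size(0.5, 0.6))               # moderate separation
print(sample_size(0.5, 0.52))              # close alternatives: a much larger n
print(sample_size(0.5, 0.6, 0.01, 0.05))   # stricter error levels: a larger n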
Answer 21
(a) The underlying parabola makes the unrealistic prediction ŷ40 = 139 sec, compared to ŷ10 = 63 sec and
ŷ20 = 34 sec. One should be careful about extrapolating beyond the range of the explanatory variable used in
the data.
(b) Using t17(0.005) = 2.898 we get the exact confidence interval (under the assumption of normality and
homoscedasticity for the noise component)

Ib2 = 0.2730 ± 2.898 × 0.1157 = [−0.06, 0.61].

(c) Since the confidence interval from (b) covers zero, we do not reject the null hypothesis H0 : β2 = 0 at the 1%
significance level. The observed t-test statistic 0.2730/0.1157 = 2.36 lies between the table values t17(0.025) = 2.110
and t17(0.01) = 2.567, so the two-sided P-value is between 2% and 5%.
Answer 22
(a) For a given action a, the posterior risk

R(a|x) = Σθ l(θ, a)h(θ|x) = E(l(Θ, a) | x)

is the expected loss when the unknown parameter θ is treated as a random variable Θ having the posterior
distribution h(θ|x), which is computed from the prior g(θ) and the likelihood f(x|θ) by the Bayes formula
h(θ|x) ∝ g(θ)f(x|θ).
(b) For the 0-1 loss function in the discrete distribution case,

R(a|x) = Σ_{θ≠a} h(θ|x) = 1 − h(a|x) = P(Θ ≠ a | x)
is the probability of misclassification, that is the posterior probability that the chosen action a is different from the
true value of the parameter θ.
(c) The corresponding Bayesian estimator, minimizing the risk R(a|x) = 1 − h(a|x), maximizes the posterior
probability h(a|x). It is denoted θ̂MAP and called the maximum a posteriori probability estimate. In the case
where the prior distribution is non-informative, so that the posterior distribution is proportional to the likelihood
function, we have θ̂MAP = θ̂MLE.
Answer 23
(a) The numbers of chewing gums for different tiles are summarized in the form of observed counts
Number of gums per tile 0 1 2 3 4 ≥5
Counts 11 8 2 0 1 0
with the total number of tiles n = 22. The Poisson model assumes that the numbers of gums X1, . . . , Xn are
independent random variables with the common one-parameter distribution

P(X = k) = (λ^k/k!) e^{−λ}, k = 0, 1, . . .

(b) The likelihood function is

L(λ) = ∏_{i=1}^{n} (λ^{xi}/xi!) e^{−λ} = e^{−nλ} λ^{nx̄}/(x1! · · · xn!) = e^{−22λ} λ^{16}/(2! 2! 4!).
(c) Use the Pearson chi-square test of goodness of fit, combining the cells for 2 and more gums. With λ̂ = x̄ = 16/22,
compute the expected counts

E0 = n e^{−λ̂}, E1 = n λ̂ e^{−λ̂}, E2 = n − E0 − E1.

Then find the test statistic χ² = Σ (Ok − Ek)²/Ek and use the table of the chi-square distribution with
df = 3 − 1 − 1 = 1 to see if the result is significant. For example, if √χ² > 1.96, we reject the Poisson model
hypothesis at α = 5%.
(d) Using the Poisson model we estimate p0 by p̂0 = e^{−λ̂} = 0.48, which is close to the sample proportion
11/22 = 0.50.
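The whole goodness-of-fit computation fits in a few lines; a minimal Python sketch (our own illustration, not part of the course material):

import numpy as np
from scipy.stats import chi2

counts = np.array([11, 8, 2, 0, 1, 0])       # tiles with 0, 1, 2, 3, 4, >= 5 gums
n = counts.sum()                             # 22 tiles
lam = counts @ np.arange(6) / n              # MLE lambda-hat = 16/22

p0, p1 = np.exp(-lam), lam * np.exp(-lam)
expected = np.array([n * p0, n * p1, n * (1 - p0 - p1)])
observed = np.array([11, 8, 3])              # cells 0, 1, and '>= 2' combined
stat = ((observed - expected) ** 2 / expected).sum()
print(stat, chi2.sf(stat, df=1))             # small statistic: the Poisson model is not rejected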
Answer 24
(a) When the population under investigation has a clear stratified structure, it is more effective to use stratified
sampling for estimating the overall population mean. In accordance with the optimal allocation formula

ni = n wi σi / σ̄,

the allocation of observations should follow two key rules: put more observations in the larger strata, and put
more observations in the strata with higher variation.
(c) The bootstrap is a resampling technique used to study the sampling distribution of a parameter estimator. In
the parametric bootstrap, resampling is done from the given parametric distribution with the unknown parameters
replaced by their estimates obtained from the underlying sample. In the non-parametric bootstrap, resampling is
performed with replacement directly from the underlying sample.
(d) The standard error of the sample mean is sx̄ = s/√200. Roughly: the range of heights 160–200 cm covers 95%
of the population distribution. Treating this interval as the mean plus or minus two standard deviations, we find
s ≈ 10 cm, so that sx̄ ≈ 0.7 cm. Thus a random sample of size 200 may give a decent estimate of the population
mean height.
Answer 25
(a) The completed ANOVA table:

Source of variation   SS       df   MS      F
Varieties             328.24   2    164.12  103.55
Density               86.68    3    28.89   18.23
Interaction           8.03     6    1.34    0.84
Errors                38.04    24   1.59

(b) Comparing the F column with the corresponding F-distribution critical values, we reject both null hypotheses
on the main factors and do not reject the null hypothesis on interaction.
(c) s = √1.59 = 1.26.
Answer 26
(a) This is an example of a paired sample, therefore the signed-rank test is appropriate for testing the null
hypothesis of no difference.
(b) We use the signed-rank test. The observed test statistics are W+ = 39 and W− = 6.
According to Figure 2, the two-sided P-value is larger than 5%, because the smaller test statistic w− = 6 is larger
than the critical value 5 for n = 9. Therefore, we do not reject the null hypothesis of equality in favour of the
two-sided alternative.
(c) The extract from the course textbook reminds us that the null hypothesis for the signed rank test assumes,
besides equality of the two population distributions, a symmetric distribution for the differences. It also explains
why such an assumption is reasonable.
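The same conclusion can be obtained from a library implementation; a minimal Python sketch (our own illustration, using scipy's exact signed-rank test):

import numpy as np
from scipy.stats import wilcoxon

site1 = np.array([50.6, 39.2, 35.2, 17.0, 11.2, 14.2, 24.2, 37.4, 35.2])
site2 = np.array([38.0, 18.6, 23.2, 19.0, 6.6, 16.4, 14.4, 37.6, 24.4])

stat, p = wilcoxon(site1 - site2)   # two-sided statistic = min(W+, W-) = 6
print(stat, p)                      # exact two-sided P-value about 0.055 > 0.05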
Answer 27
(b) The fitted regression line for the final score y as a function of the midterm score x is y = 37.5 + 0.5x. Given
x = 90 we get a point prediction y = 82.5. The estimate of σ² is

s² = ((n − 1)/(n − 2)) sy² (1 − r²) = 84.4.
Thus the 95% prediction interval for Carl's final score is

I = 82.5 ± t8(0.025) s √(1 + 1/10 + (90 − 75)²/(9 × 10²)) = 82.5 ± 24.6.
(c) The fitted regression line for the midterm score x as a function of the final score y is x = 37.5 + 0.5y. Given
y = 80 we get a point prediction x = 77.5.
Answer 28
(a) The exponential prior with parameter 1 is Gamma(1, 1). Applying the updating rule four times:
(1, 1) → (3, 2) → (3, 3) → (5, 4) → (10, 5),
we find the posterior distribution to be Gamma(10, 5). Therefore, θ̂PME = 10/5 = 2.
(b) The general updating rule for an arbitrary sample (x1, . . . , xn) becomes:
- the shape parameter α′ = α + nx̄,
- the inverse scale parameter λ′ = λ + n.

We have θ̂PME = (α + nx̄)/(λ + n). Comparing this to the maximum likelihood estimator θ̂MLE = x̄, we see that

θ̂PME − θ̂MLE = (α + nx̄)/(λ + n) − x̄ = (α − λx̄)/(λ + n) → 0
as n → ∞. This means that the role of the prior is less important with large sample sizes.
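The updating rule is one line of code; a minimal Python sketch (our own illustration):

def poisson_gamma_update(alpha, lam, sample):
    """Posterior Gamma(alpha', lambda') after an iid Poisson sample."""
    return alpha + sum(sample), lam + len(sample)

a, l = poisson_gamma_update(1, 1, [2, 0, 2, 5])   # exponential(1) prior = Gamma(1, 1)
print(a, l, a / l)                                # Gamma(10, 5), posterior mean 2.0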
Answer 29
(a) We have two independent samples from two binomial distributions: one with parameter p, the other with
parameter q. Using Bin(29, p) and Bin(10, q) we compute the likelihood function as

L(p, q) = (29 choose 1) p(1 − p)²⁸ × (10 choose 4) q⁴(1 − q)⁶.
(b) We test H0 : p = q against H1 : p ≠ q using the exact Fisher test.
ECMO CMT Total
Died 1 4 5
Alive 28 6 34
Total 29 10 39
The count y = 1 is our observed test statistic, whose null distribution is Hg(39, 29, 5/39). The one-sided P-value is

P(Y = 0) + P(Y = 1) = (5 choose 0)(34 choose 29)/(39 choose 29) + (5 choose 1)(34 choose 28)/(39 choose 29)
  = (10 · 9 · 8 · 7 · 6)/(39 · 38 · 37 · 36 · 35) + (10 · 9 · 8 · 7 · 5 · 29)/(39 · 38 · 37 · 36 · 35)
  = (10 · 9 · 8 · 7 · (6 + 5 · 29))/(39 · 38 · 37 · 36 · 35) = 0.011.
The two-sided p-value becomes 2% and we can reject the null hypothesis.
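A minimal Python sketch of the same computation (our own illustration; note that scipy's two-sided fisher_exact P-value is obtained by summing all tables at least as extreme, so it may differ slightly from doubling the one-sided value):

from scipy.stats import hypergeom, fisher_exact

rv = hypergeom(M=39, n=5, N=29)          # 39 patients, 5 deaths, 29 on ECMO
print(rv.cdf(1))                         # one-sided P(Y <= 1), about 0.011

print(fisher_exact([[1, 4], [28, 6]]))   # library version of the exact test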
Answer 30
(a) Under H0 the sample mean X̄ is standard normal, since n = 100 and σ = 10 give X̄ ∼ N(0, 1). The P-value of
the test is the probability, under the null distribution, that |X̄| exceeds the observed value tobs. Thus

V = P(|X̄| > tobs | H0) = 2(1 − Φ(tobs)).
(b) Different samples give different observed values tobs = |x̄obs|; in this sense the P-value

V = 2(1 − Φ(Tobs))

is a random variable. We have

P(V ≤ 0.05) = P(1 − Φ(|X̄|) ≤ 0.025) = P(Φ(|X̄|) ≥ 0.975) = P(|X̄| > 1.96).
(c) If the true value of the population mean is µ = 4, then X̄ has distribution N(4, 1). Using (b) we find
P(V ≤ 0.05) ≈ P(X̄ > 2) = 1 − Φ(2 − 4) = Φ(2) ≈ 0.975.
(d) We see from (c) that even with such a big separation between the null hypothesis and the true parameter
value, there is a 2.5% probability that the P-value will exceed 5%. One has to be aware of this variability when
interpreting the P-value produced by a statistical analysis.
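The variability of the P-value is easy to see by simulation; a minimal Python sketch (our own illustration):

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
xbar = rng.normal(4, 1, size=100_000)   # X-bar ~ N(4, 1) when mu = 4, n = 100, sigma = 10
V = 2 * (1 - norm.cdf(np.abs(xbar)))    # simulated P-values
print(np.mean(V <= 0.05))               # about 0.98, in line with (c)
print(np.mean(V > 0.05))                # a few percent of experiments still 'fail'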
Answer 31
(a) Stratified sample mean x̄s = 0.3 · 6.3 + 0.2 · 5.6 + 0.5 · 6.0 = 6.01.
(b) We are in the one-way ANOVA setting with I = 3 and J = 13. The 95% Bonferroni simultaneous confidence
intervals for the three differences µ1 − µ2, µ1 − µ3, and µ2 − µ3 are computed as

Bµu−µv = x̄u − x̄v ± t36(0.05/6) sp √(2/13),

where sp is the pooled sample standard deviation.
Answer 32
(a) This is another example of the Simpson paradox. The confounding factor here is the difficulty of entering the
different programmes: men tend to apply to easier programs, while women more often apply to programs with low
admission rates.
(b) A simple hypothetical experimental study could be based on two independent random samples. Focus on
one major, say major F. Take n randomly chosen female candidates and n randomly chosen male candidates, and
ask all of them to apply for major F. Then compare the two sample proportions of admitted applicants. Of course,
such an experiment would be impossible to perform in practice!
Answer 33
Let X ∼ Bin(n, p). The two observed counts (x, n − x) are used to test H0 : p = p0. The corresponding chi-square
test statistic is

Σ_{j=1}^{2} (Oj − Ej)²/Ej = (x − np0)²/(np0) + (n − x − n(1 − p0))²/(n(1 − p0)) = (x − np0)²/(np0(1 − p0)) = z²,

where

z = (x − np0)/√(np0(1 − p0))

is the test statistic of the large-sample test for a proportion.
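A minimal Python sketch verifying the identity numerically (our own illustration, with hypothetical inputs):

import numpy as np

def chi2_vs_z(x, n, p0):
    """The 2-cell chi-square statistic and z^2 for testing H0: p = p0."""
    observed = np.array([x, n - x])
    expected = np.array([n * p0, n * (1 - p0)])
    chisq = ((observed - expected) ** 2 / expected).sum()
    z = (x - n * p0) / np.sqrt(n * p0 * (1 - p0))
    return chisq, z ** 2

print(chi2_vs_z(13, 36, 0.5))   # both entries equal: (2.78, 2.78)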