
Statistical Inference

Serik Sagitov, Chalmers University of Technology and Gothenburg University

Abstract

This text is a compendium for the undergraduate course MVE155 "Statistical Inference", worth 7.5 hp, which is a second course in mathematical statistics suitable for students with different backgrounds. The main prerequisite is an introductory course in probability and statistics. The course gives a deeper understanding of some traditional topics in mathematical statistics, such as methods based on likelihood, topics in descriptive statistics and data analysis with special attention to graphical displays, and aspects of experimental design.
The course introduces a number of standard tools of statistical inference, including the bootstrap, parametric and non-parametric testing, analysis of variance, an introduction to Bayesian inference, chi-squared tests, and multiple regression.
Recommended textbook: John Rice, Mathematical Statistics and Data Analysis, 3rd edition. The compendium includes a collection of solved exercises, many of which are end-of-chapter exercises from Rice's book. Do not read a solution before you have tried to solve the exercise on your own.
Please send your corrections to [email protected]. (I gratefully acknowledge multiple corrections kindly
proposed by the students of the year 2019.)
Last updated: June 10, 2019

Contents

Abstract  1

1 Introduction  2
1.1 List of course topics: part I  3
1.2 List of course topics: part II  3

2 Random sampling  5
2.1 Point estimation  5
2.2 Sample mean and sample variance  6
2.3 Approximate confidence intervals  7
2.4 Stratified random sampling  7
2.5 Exercises  8

3 Parameter estimation  10
3.1 Method of moments  10
3.2 Maximum likelihood estimation  11
3.3 Sufficiency  12
3.4 Large sample properties of the maximum likelihood estimates  12
3.5 Gamma distribution  13
3.6 Exact confidence intervals  15
3.7 Exercises  16

4 Hypothesis testing  19
4.1 Statistical significance  19
4.2 Large-sample test for the proportion  19
4.3 Small-sample test for the proportion  21
4.4 Two tests for the mean  21
4.5 Likelihood ratio test  21
4.6 Pearson's chi-squared test  22
4.7 Example: sex ratio  23
4.8 Exercises  23

5 Bayesian inference  27
5.1 Conjugate priors  27
5.2 Bayesian estimation  29
5.3 Credibility interval  30
5.4 Bayesian hypotheses testing  31
5.5 Exercises  32

6 Summarising data  34
6.1 Empirical probability distribution  34
6.2 Density estimation  36
6.3 Quantiles and QQ-plots  36
6.4 Testing normality  37
6.5 Measures of location  38
6.6 Measures of dispersion  39
6.7 Exercises  39

7 Comparing two samples  41
7.1 Two independent samples: comparing population means  41
7.2 Two independent samples: comparing population proportions  44
7.3 Paired samples  45
7.4 External and confounding factors  47
7.5 Exercises  48

8 Analysis of variance  51
8.1 One-way layout  51
8.2 Simultaneous confidence interval  53
8.3 Kruskal-Wallis test  54
8.4 Two-way layout  55
8.5 Example: iron retention  55
8.6 Randomised block design  57
8.7 Friedman test  59
8.8 Exercises  59

9 Categorical data analysis  61
9.1 Chi-squared test of homogeneity  62
9.2 Chi-squared test of independence  62
9.3 Matched-pairs designs  63
9.4 Odds ratios  65
9.5 Exercises  65

10 Multiple regression  68
10.1 Simple linear regression model  68
10.2 Residuals  69
10.3 Confidence intervals and hypothesis testing  70
10.4 Intervals for individual observations  71
10.5 Linear regression and ANOVA  72
10.6 Multiple linear regression  72
10.7 Exercises  75

11 Overview of basic probability theory  77
11.1 Probability rules  77
11.2 Random variables  77
11.3 Important parametric distributions  78
11.4 Joint distributions  79

12 Statistical distribution tables  80
12.1 Normal distribution table  80
12.2 Chi-square distribution table  81
12.3 Tails of t-distribution  82
12.4 Critical values of the F-distribution  83
12.5 Critical values of the Studentised range distribution  86
12.6 Critical values for the rank sum test  88
12.7 Critical values for the signed rank test  89

13 Solutions to exercises  91
13.1 Solutions to Section 2 (random sampling)  91
13.2 Solutions to Section 3 (parameter estimation)  94
13.3 Solutions to Section 4 (hypothesis testing)  99
13.4 Solutions to Section 5 (Bayesian inference)  104
13.5 Solutions to Section 6 (summarising data)  106
13.6 Solutions to Section 7 (two samples)  110
13.7 Solutions to Section 8 (analysis of variance)  115
13.8 Solutions to Section 9 (categorical data analysis)  119
13.9 Solutions to Section 10 (multiple regression)  124

14 Miscellaneous exercises  127
14.1 Exercises  127
14.2 Numerical answers to miscellaneous exercises  135

"Unfortunately, considerable propaganda is now being developed to convince the public that cigarette smoking is dangerous."

Sir Ronald Aylmer Fisher (Nature, 1958)

1 Introduction
Statistical analysis consists of three parts: collection of data, summarising data, and making inferences. The diagram below presents the relationship between the two sister branches of mathematics: probability theory and mathematical statistics.

[Diagram: probability theory takes a statistical model and its parameters and predicts data patterns in the real world; mathematical statistics goes in the opposite direction, from real-world data back to the parameters of the model, via parameter estimation and hypothesis testing.]

Section 11 summarises some key formulas from probability theory considered as a prerequisite for this course. An example of probability versus statistics:

Probability: previous studies showed that the drug is 80% effective. Then we can anticipate that in a study on 100 patients, on average 80 will be cured, and at least 65 will be cured with 99.99% probability.

Statistics: it was observed that 78 out of 100 patients were cured. We are 95% confident that in other similar studies the drug will be effective on between 69.9% and 86.1% of patients.

The main focus of this course is on the issues of parameter estimation and hypothesis testing using properly
collected, relatively small data sets. Special attention, therefore, is paid to the basic principles of experimental
design: randomisation, blocking, and replication.

Statistical modelling: a complicated process is represented by a


simple process governed by a few parameters, plus random noise.

A key parametric statistical model is the normal distribution model N(µ, σ 2 ) with mean µ and variance σ 2 .
The standard normal distribution N(0, 1) has the cumulative distribution function

Φ(z) = (1/√(2π)) ∫_{−∞}^{z} e^{−y²/2} dy.

The standard deviation σ plays the role of a scale parameter of the model, in that if X ∼ N(µ, σ²), then (X − µ)/σ ∼ N(0, 1).
The importance of the normal model is due to the central limit theorem as well as to the remarkable analytical properties of the normal density function. An example of the central limit theorem is the normal approximation of the binomial distribution. With a continuity correction, the claim is that for X ∼ Bin(n, p),

P(X ≤ k − 1) = P(X < k) ≈ Φ((k − 1/2 − np)/√(np(1 − p))).
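
As a quick numerical illustration (a minimal Python sketch using scipy; the particular values of n, p and k are made up and not from the compendium), one can compare the exact binomial probability with its continuity-corrected normal approximation:

    from math import sqrt
    from scipy.stats import binom, norm

    # Compare P(X <= k-1) for X ~ Bin(n, p) with the continuity-corrected
    # normal approximation Phi((k - 1/2 - n*p) / sqrt(n*p*(1-p))).
    n, p, k = 100, 0.25, 20          # hypothetical values
    exact = binom.cdf(k - 1, n, p)
    approx = norm.cdf((k - 0.5 - n * p) / sqrt(n * p * (1 - p)))
    print(f"exact = {exact:.4f}, normal approximation = {approx:.4f}")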

Notational agreements:

• Random variables X, Y, . . . are denoted by capital letters and their realisations x, y, . . . are denoted by small
letters (but not always). It is important to recognise the difference between the observed data values and
the observed test statistics on one hand, and on the other hand, the corresponding random variables whose
randomness is due to a random mechanism generating the data.

• Normal approximation Zn ≈ N(µn, aₙ²) for n ≫ 1 means that for any real x,

P(Zn ≤ x) ≈ Φ((x − µn)/an),

or more precisely, for any real z,

P((Zn − µn)/an ≤ z) → Φ(z), as n → ∞.

1.1 List of course topics: part I
Statistical inference vs probability theory. Statistical models.
Population distribution. Population mean and standard deviation, population proportion.
Randomisation.
Sampling with replacement, iid-sample.
Sampling without replacement, simple random sample.

Point estimate, sampling distribution.


Mean square error, systematic error and random (sampling) error.
Sample mean, sample variance, sample standard deviation, sample proportion.
Finite population correction.
Standard error of the sample mean and sample proportion.
Approximate confidence interval for the mean.
Stratified random sampling. Optimal allocation of observations, proportional allocation.

Parametric models, population parameters.


Binomial, geometric, Poisson, discrete uniform models.
Continuous uniform, exponential, gamma models.
Normal distribution, central limit theorem, continuity correction.
Method of moments for point estimation.
Maximum likelihood estimate (MLE). Likelihood function.
Normal approximation for the sampling distribution of MLE.
Sufficient statistics for population parameters.
Exact confidence intervals for the mean and variance. Chi-squared and t-distributions.

Statistical hypotheses, simple and composite, null and alternative.


Rejection region. Two types of error.
Significance level, test power.
P-value of the test, one-sided and two-sided p-values.
Large-sample test for the proportion. Small-sample test for the proportion.
Large-sample test for the mean. One-sample t-test.
Nested hypotheses, generalised likelihood ratio test.
Pearson’s chi-squared test, its approximate nature. Multinomial distribution.

Bayes formulas for probabilities and densities.


Prior and posterior distributions.
Loss function, posterior risk, 0-1 loss function and squared error loss.
Conjugate priors. Normal-normal model.
Beta and Dirichlet distributions. Beta-binomial model and Dirichlet-multinomial model.
Bayesian estimation, MAP and PME. Credibility interval.
Posterior odds.

1.2 List of course topics: part II


Empirical cumulative distribution function. Empirical variance.
Survival function and hazard function. Weibull distribution. Empirical survival function.
Kernel density estimate. Stem-and-leaf plot.
Population quantiles. Ordered sample and empirical quantiles.
QQ-plots, normal probability plot.
Coefficient of skewness and kurtosis. Light tails and heavy tails of probability distributions.
Leptokurtic and platykurtic distributions.

Population mean, mode, and median. Sample median, outliers.


Sign test and non-parametric confidence interval for the median.
Trimmed means.
Parametric and non-parametric bootstraps.
Sample range, quartiles, IQR and MAD. Boxplots.

Two independent versus paired samples.


Approximate confidence interval and large sample test for the mean difference.

Two-sample t-test, pooled sample variance.
Exact confidence interval for the mean difference. Transformation of variables.
Ranks vs exact measurements. Rank sum test. Signed rank test.
Approximate confidence interval for the difference p1 − p2 .
Large sample test for two proportions.
Fisher’s exact test.
Double-blind randomised controlled experiments.
Confounding factors, Simpson’s paradox.

One-way ANOVA, sums of squares and mean squares.


Normal theory model, F-test, F-distribution.
Normal probability plots for the residuals.
Multiple comparison or multiple testing problem.
Simultaneous CI, Bonferroni’s method and Tukey’s method.
Kruskal-Wallis test.
Two-way ANOVA, main effects and interaction. Three F-tests.
Additive model. Randomised block design.
Friedman’s test.

Categorical data.
Chi-squared tests of homogeneity and independence.
Chi-squared test for k proportions.
Prospective and retrospective studies. Matched-pairs design, McNemar’s test. Odds ratio.

Simple linear regression model. Normal equations. Least square estimates.


Sample correlation coefficient, sample covariance.
Corrected MLE of the noise variance. Coefficient of determination.
Confidence intervals and hypothesis testing for the intercept and slope. Model utility test.
Prediction interval for a new observation.
Standardised residuals.
Linear regression and ANOVA.
Multiple regression. Design matrix.
Coefficient of multiple determination. Adjusted coefficient of multiple determination.
Collinearity problem.

2 Random sampling
A population of size N can be viewed as a set of N elements characterised by values {x1, x2, . . . , xN}.

If we pick at random one element from the population, then its value x is a
realisation of a random variable X whose distribution is the population distribution.

In many situations finding a population distribution by enumeration is either very expensive or even impossible.
However, a good guess is available using a random sample of n observations (x1 , . . . , xn ). Such a sample will be
treated as a single realisation of a random vector (X1, . . . , Xn). If the sampling experiment is repeated, the new
values (x′1, . . . , x′n) will usually differ from (x1, . . . , xn).

Randomisation is a guard against the investigator's biases, even unconscious ones.

Instead of trying to find the whole population distribution curve, one can focus on estimating the population mean and population standard deviation

µ = E(X),  σ = √Var(X).

This works for the quantitative (continuous or discrete) data. If the data is categorical, then one may translate
it to numbers. An important special case of categorical data is dichotomous data. Consider the example of
xi ∈ {male, female}. After converting categorical values to a quantitative form with xi ∈ {0, 1}, the population
distribution becomes a Bernoulli distribution with parameter

p = P(X = 1),

called a population proportion.

There are two basic ways of random sampling. Sampling without replacement produces a so called simple
random sample. Sampling with replacement produces an iid-sample, with (X1 , . . . , Xn ) being independent and
identically distributed random variables. An iid-sample is easier to analyse, and the results give a good approxi-
mation to the simple random sample, provided n/N is small.

Example: in-class experiment

We collect data on students' heights and gender using a two-colour histogram. The collected height values form a random sample taken from the population of Gothenburg students. Motivating questions:

- can this group be viewed as a simple random sample?
- what is the shape of the population distribution of heights?
- what is an estimate of the population proportion of women?

2.1 Point estimation


To estimate a population parameter θ based on a given random sample (x1, . . . , xn), we need a sensible point estimate θ̂ = g(x1, . . . , xn). Observe that, in the same way as (x1, . . . , xn) is a realisation of a random vector (X1, . . . , Xn), the point estimate θ̂ is a realisation of a random variable

Θ̂ = g(X1, . . . , Xn)

which we will call a point estimator of θ. The sampling distribution of Θ̂ around the unknown θ describes how different values of θ̂ are observed for different samples. The sampling distribution has mean and variance

µ_Θ̂ = E(Θ̂),  σ²_Θ̂ = E(Θ̂ − µ_Θ̂)².

The quality of the point estimator Θ̂ is measured by the mean square error

E(Θ̂ − θ)² = E(Θ̂ − µ_Θ̂)² + 2E(Θ̂ − µ_Θ̂)(µ_Θ̂ − θ) + (µ_Θ̂ − θ)² = σ²_Θ̂ + (µ_Θ̂ − θ)².

The mean square error has two components:

µ_Θ̂ − θ is the size of the systematic error (bias), the lack of accuracy,
σ_Θ̂ is the size of the random error, the lack of precision.

Desired properties of point estimates:

θ̂ is an unbiased estimate of θ, that is µ_Θ̂ = θ,
θ̂ is a consistent estimate, in that the mean square error E(Θ̂ − θ)² → 0 as the sample size n → ∞.

The standard error of an estimator Θ̂ is its standard deviation σ_Θ̂ = √Var(Θ̂).

The estimated standard error of the point estimate θ̂ is denoted s_θ̂; it is a point estimate of σ_Θ̂ computed from the data.

2.2 Sample mean and sample variance


For a given random sample (x1, . . . , xn), the most basic summary statistics are the sample mean and sample variance

x̄ = (x1 + . . . + xn)/n,  s² = (1/(n−1)) Σ (xi − x̄)².

An alternative formula for the sample variance is

s² = (n/(n−1)) (\overline{x²} − x̄²),  where \overline{x²} = (x1² + . . . + xn²)/n.

In the same way as (x1, . . . , xn) is a realisation of a random vector (X1, . . . , Xn), the summary statistics x̄ and s² are realisations of the random variables

X̄ = (X1 + . . . + Xn)/n,  S² = (1/(n−1)) Σ (Xi − X̄)².

Consider an iid-sample. The sample mean x̄ and sample variance s² are unbiased and consistent estimators of the population mean µ and variance σ² respectively:

E(X̄) = µ,  Var(X̄) = σ²/n,  E(S²) = σ²,  Var(S²) = (σ⁴/n)·(E((X − µ)/σ)⁴ − (n−3)/(n−1)).

Notice that the sample standard deviation s systematically underestimates the population standard deviation σ, since

E(S²) = σ² and (E S)² < E(S²).

We will pay special attention to the dichotomous case, when X takes values 1 or 0 with probabilities p and 1 − p, so that

µ = p,  σ² = p(1 − p).

Here the sample mean turns into the sample proportion p̂ = x̄, giving an unbiased and consistent estimate of p. In the dichotomous case,

s² = (n/(n−1)) · p̂(1 − p̂).

Estimated standard errors for the sample mean and the sample proportion:

s_x̄ = s/√n,  s_p̂ = √(p̂(1 − p̂)/(n − 1)).

Now consider simple random sampling, when there is dependence between observations. In this case the sample mean x̄ is again an unbiased and consistent estimate of the population mean. However, the sample variance s² is a biased estimate of σ², since

E(S²) = σ² · N/(N − 1),

where N is the finite population size. We also have

Var(X̄) = (σ²/n)·(1 − (n−1)/(N−1)),

so that in the light of the previous equality, an unbiased estimate of Var(X̄) is

s²_x̄ = (s²/n)·((N−1)/N)·(1 − (n−1)/(N−1)) = (s²/n)·(1 − n/N).

Thus, for sampling without replacement, the estimated standard errors of X̄ and p̂ for the simple random sample take the new form

s_x̄ = (s/√n)·√(1 − n/N),  s_p̂ = √(p̂(1 − p̂)/(n − 1))·√(1 − n/N).
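
These formulas are easy to sketch in Python (the sample values and the population size N below are made up for illustration, not from the compendium):

    import numpy as np

    def mean_and_se(sample, N=None):
        """Sample mean, sample variance and estimated standard error of the mean.
        If the population size N is given, the finite population correction
        sqrt(1 - n/N) for sampling without replacement is applied."""
        x = np.asarray(sample, dtype=float)
        n = len(x)
        xbar = x.mean()
        s2 = x.var(ddof=1)               # sample variance with the 1/(n-1) factor
        se = np.sqrt(s2 / n)
        if N is not None:
            se *= np.sqrt(1 - n / N)
        return xbar, s2, se

    print(mean_and_se([5, 7, 3, 8, 6, 4]))          # hypothetical iid-sample
    print(mean_and_se([5, 7, 3, 8, 6, 4], N=50))    # same data, drawn without replacement from N = 50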

2.3 Approximate confidence intervals
By the central limit theorem, the distribution of the sample mean is approximately normal,

X̄ ≈ N(µ, σ²/n),

in that for large sample sizes n we have

P(|X̄ − µ|/σ_X̄ > z) ≈ 2(1 − Φ(z)),  where σ_X̄ = σ/√n.

Since S_X̄ ≈ σ_X̄, we derive that

P(X̄ − zS_X̄ < µ < X̄ + zS_X̄) = P(|X̄ − µ|/S_X̄ < z) ≈ P(|X̄ − µ|/σ_X̄ < z) ≈ 2Φ(z) − 1.

This yields the following formulas of approximate 100(1 − α)% two-sided confidence intervals for µ and p:

Iµ = x̄ ± zα/2 · s_x̄,
Ip = p̂ ± zα/2 · s_p̂,

where zα stands for the normal quantile determined by Φ(zα) = 1 − α. These two formulas are valid even for sampling without replacement, due to a more advanced version of the central limit theorem. According to the normal distribution table we have

100(1 − α)%:  68%   80%   90%   95%   99%   99.7%
zα/2:         1.00  1.28  1.64  1.96  2.58  3.00

The higher the confidence level, the wider the confidence interval. On the other hand, the larger the sample, the narrower the confidence interval.
The exact meaning of the confidence level is a bit tricky. For example, a 95% confidence interval is a random interval such that, out of 100 intervals Iµ computed for 100 samples, about 95 are expected to cover the true value of µ. Notice that the random number of successful realisations of the confidence interval has distribution Bin(100, 0.95), which is approximately normal with mean 95 and standard deviation 2.2.
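
The meaning of the confidence level can also be checked by simulation. A minimal Python sketch (the exponential population below is an arbitrary choice, not taken from the compendium):

    import numpy as np

    rng = np.random.default_rng(1)

    # Coverage of the approximate 95% interval x-bar +/- 1.96*s/sqrt(n),
    # estimated from repeated samples drawn from an Exp(1) population (mu = 1).
    mu, n, reps = 1.0, 50, 10_000
    covered = 0
    for _ in range(reps):
        x = rng.exponential(scale=mu, size=n)
        half = 1.96 * x.std(ddof=1) / np.sqrt(n)
        covered += (x.mean() - half < mu < x.mean() + half)
    print(f"empirical coverage: {covered / reps:.3f}")   # should be close to 0.95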

2.4 Stratified random sampling


Given additional information on the population structure, one can reduce the sampling error using the method of stratified sampling. Assume that a population consists of k strata with known strata fractions (w1, . . . , wk) such that

w1 + . . . + wk = 1.

(Example: the Swedish population consists of two strata, females and males, with k = 2 and w1 = w2 = 0.5.) In terms of the unknown strata means and standard deviations

(µj, σj), j = 1, . . . , k,

we have the following expressions for the population mean and variance:

µ = w1µ1 + . . . + wkµk,

σ² = \overline{σ²} + Σ_{j=1}^{k} wj(µj − µ)²,

where the last equality is due to the total variance formula, with

\overline{σ²} = w1σ1² + . . . + wkσk²

being the average variance. Stratified random sampling consists of taking k independent iid-samples from each stratum, with sample sizes (n1, . . . , nk) and sample means x̄1, . . . , x̄k.

Stratified sample mean: x̄s = w1x̄1 + . . . + wkx̄k.

Observe that for any allocation (n1, . . . , nk), the stratified sample mean is an unbiased estimate of µ:

E(X̄s) = w1E(X̄1) + . . . + wkE(X̄k) = w1µ1 + . . . + wkµk = µ.

The variance of X̄s,

σ²_X̄s = w1²σ²_X̄1 + . . . + wk²σ²_X̄k = w1²σ1²/n1 + . . . + wk²σk²/nk,

is estimated by

s²_x̄s = w1²s²_x̄1 + . . . + wk²s²_x̄k = w1²s1²/n1 + . . . + wk²sk²/nk,

where sj is the sample standard deviation corresponding to the sample mean x̄j.

Approximate confidence interval: Iµ = x̄s ± zα/2 · s_x̄s.

Optimisation problem: allocate n = n1 + . . . + nk observations among the different strata so as to minimise the sampling error of x̄s.

Optimal allocation: nj = n · wjσj/σ̄ gives the minimum variance Var(X̄_s^o) = σ̄²/n.

Here σ̄² is the squared average standard deviation,

σ̄ = w1σ1 + . . . + wkσk.

The optimal allocation assigns more observations to larger strata and to strata with larger variation. The major drawback of the optimal allocation formula is that it requires knowledge of the standard deviations σj. If the σj are unknown, then a sensible choice is to allocate observations proportionally to the strata sizes.

Proportional allocation: nj = nwj,  Var(X̄_s^p) = \overline{σ²}/n.

Comparing the three unbiased estimates of the population mean, we see that their variances are ordered in the following way:

Var(X̄_s^o) ≤ Var(X̄_s^p) ≤ Var(X̄).

Variability of σj across strata makes optimal allocation more effective than proportional allocation:

Var(X̄_s^p) − Var(X̄_s^o) = (1/n)(\overline{σ²} − σ̄²) = (1/n) Σ wj(σj − σ̄)².

Variability of µj across strata makes proportional allocation more effective than an iid-sample:

Var(X̄) − Var(X̄_s^p) = (1/n)(σ² − \overline{σ²}) = (1/n) Σ wj(µj − µ)².
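
The two allocation rules are easy to compare numerically. A minimal Python sketch with made-up strata fractions and standard deviations (not data from the compendium):

    import numpy as np

    # Compare proportional and optimal allocation of n observations over k strata.
    w = np.array([0.5, 0.3, 0.2])        # strata fractions w_j (hypothetical)
    sigma = np.array([8.0, 15.0, 30.0])  # strata standard deviations sigma_j (hypothetical)
    n = 100

    n_prop = n * w                                # proportional: n_j = n*w_j
    n_opt = n * w * sigma / np.sum(w * sigma)     # optimal: n_j = n*w_j*sigma_j / sigma-bar

    var_prop = np.sum(w * sigma**2) / n           # average variance divided by n
    var_opt = np.sum(w * sigma)**2 / n            # squared average standard deviation divided by n
    print(np.round(n_prop, 1), np.round(n_opt, 1))
    print(f"Var proportional = {var_prop:.3f}, Var optimal = {var_opt:.3f}")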

2.5 Exercises
Problem 1
Consider a population consisting of five values
1, 2, 2, 4, 8.
Find the population mean and variance. Calculate the sampling distribution of the mean of a sample of size 2 by
generating all possible such samples. From them, find the mean and variance of the sampling distribution, and
compare the results to those obtained by the formulas from this section.

Problem 2
In a simple random sample of 1500 voters, 55% said they planned to vote for a particular proposition, and 45%
said they planned to vote against it. The estimated margin of victory for the proposition is thus 10%. What is the
standard error of this estimated margin? What is an approximate 95% confidence interval for the margin?

Problem 3
This problem introduces the concept of a one-sided confidence interval. Using the central limit theorem, how should
the constant k1 be chosen so that the interval
(−∞, x̄ + k1 sx̄ )
is a 90% confidence interval for µ? How should k2 be chosen so that
(x̄ − k2 sx̄ , ∞)
is a 95% confidence interval for µ?

Problem 4
Warner (1965) introduced the method of randomised response to deal with surveys asking sensitive questions.
Suppose we want to estimate the proportion q of illegal drug users among prison inmates. We are interested in
the population as a whole - not in punishing particular individuals. Randomly chosen n inmates have responded
yes/no to a randomised statement (after rolling a die):
“I use heroin” (with probability 5/6)
“I do not use heroin” (with probability 1/6).
Suggest a probability model for this experiment, find a method of moments estimate for q and its standard error.

Problem 5
A simple random sample from a population of size 2000 yields the 25 values
104 109 11 109 87
86 80 119 88 122
91 103 99 108 96
104 98 98 83 107
79 87 94 92 97

(a) Calculate an unbiased estimate of the population mean.


(b) Calculate unbiased estimates of the population variance and of Var(X̄).
(c) Give an approximate 95% confidence interval for the population mean.

Problem 6
For a simple random sample, take x̄² as a point estimate of µ². (This is an example of a method of moments
estimate.) Compute the bias of this point estimate.

Problem 7
The following table (Cochran 1977) shows the stratification of all farms in a county by farm size and the mean and
standard deviation of the number of acres of corn in each stratum.
Farm size 0-40 41-80 81-120 121-160 161-200 201-240 241+
Number of farms Nj 394 461 391 334 169 113 148
Stratum mean µj 5.4 16.3 24.3 34.5 42.1 50.1 63.8
Stratum standard deviation σj 8.3 13.3 15.1 19.8 24.5 26.0 35.2

(a) For a sample size of 100 farms, compute the sample sizes from each stratum for proportional and optimal
allocation, and compare them.
(b) Calculate the variances of the sample mean for each allocation and compare them to each other and to the
variance of an estimate formed from simple random sampling.
(c) What are the population mean and variance?
(d) Suppose that ten farms are sampled per stratum. What is Var(X̄s )? How large a simple random sample
would have to be taken to attain the same variance? Ignore the finite population correction.
(e) Repeat part (d) using proportional allocation of the 70 observations.

Problem 8
How might stratification be used in each of the following sampling problems?

(a) A survey of household expenditures in a city.


(b) A survey to examine the lead concentration in the soil in a large plot of land.
(c) A survey to estimate the number of people who use elevators in a large building with a single bank of
elevators.
(d) A survey of programs on a television station, taken to estimate the proportion of time taken up by
advertising on Monday through Friday from 6 pm until 10 pm. Assume that 52 weeks of recorded broadcasts are
available for the analysis.

Problem 9
Consider stratifying the population of Problem 1 into two strata (1,2,2) and (4,8). Assuming that one observation
is taken from each stratum, find the sampling distribution of the estimate of the population mean and the mean
and standard deviation of the sampling distribution. Check the formulas of Section 2.4.

3 Parameter estimation
Given a parametric model with unknown parameter(s) θ, we wish to estimate θ from an iid-sample. There are two
basic methods of finding good point estimates
1. method of moments,
2. maximum likelihood method.
The first is a simple intuitive method that also can be used as a first approximation for the maximum likelihood
method, which is optimal for large samples.

3.1 Method of moments


Suppose we are given an iid-sample (x1, . . . , xn) from a population distribution characterised by a pair of parameters (θ1, θ2). Suppose we have the following formulas for the first and second population moments:

E(X) = f(θ1, θ2),  E(X²) = g(θ1, θ2).

Method of moments estimates (θ̃1, θ̃2) are found by replacing the population moments with the corresponding sample moments and solving the resulting equations

x̄ = f(θ̃1, θ̃2),  \overline{x²} = g(θ̃1, θ̃2).
This approach is justified by the law of large numbers, which says that

(X1 + . . . + Xn)/n → µ,  (X1² + . . . + Xn²)/n → E(X²),  as n → ∞.

Example: geometric model


Consider the dataset summarising an iid-sample (x1, . . . , xn) of the hop counts for n = 130 birds,

Number of hops j:       1   2   3  4  5  6  7  8  9  10  11  12  Total
Observed frequency Oj: 48  31  20  9  6  5  4  2  1   1   2   1    130

where xi is the number of hops that bird i makes between flights. The observed frequency Oj is the number of birds who hopped exactly j times,

Oj = 1{x1 = j} + . . . + 1{xn = j}.

The data produce the following summary statistics:

x̄ = (total number of hops)/(number of birds) = 363/130 = 2.79,
\overline{x²} = 1² · (48/130) + 2² · (31/130) + . . . + 11² · (2/130) + 12² · (1/130) = 13.20,
s² = (130/129)(\overline{x²} − x̄²) = 5.47,
s_x̄ = √(5.47/130) = 0.205.

An approximate 95% confidence interval for µ, the mean number of hops per bird is given by
Iµ = x̄ ± z0.025 · sx̄ = 2.79 ± 1.96 · 0.205 = 2.79 ± 0.40.
After inspecting the histogram of the data values, we observe geometrically descending frequencies which sug-
gests a geometric model for the number of jumps for a random bird. Geometric model X ∼ Geom(p) assumes
that a bird ”does not remember” the number of hops made so far, and the next move of the bird is to jump with
probability 1 − p or to fly away with probability p. Method of moment estimate for the parameter θ = p of the
geometric model requires a single equation arising from the expression for the first population moment
µ = 1/p.

This expression leads to the equation x̄ = 1/p̃, which gives the method of moments estimate

p̃ = 1/x̄ = 0.36.

We can compute an approximate 95% confidence interval for p using the above-mentioned Iµ:

Ip = (1/(2.79 + 0.40), 1/(2.79 − 0.40)) = (0.31, 0.42).
To answer the question of how well does the geometric distribution fit the data, let us compare the observed
frequencies to the frequencies expected from the geometric distribution with parameter p̃:

j:    1     2     3     4     5    6    7+
Oj:  48    31    20     9     6    5    11
Ej:  46.5  29.9  19.2  12.3  7.9  5.1   9.1

The expected frequencies Ej are computed in terms of independent geometric random variables (X1, . . . , Xn):

Ej = E(Oj) = E(1{X1 = j} + . . . + 1{Xn = j}) = nP(X = j) = n(1 − p̃)^{j−1} p̃ = 130 · (0.64)^{j−1}(0.36), j = 1, . . . , 6,
E7 = n − E1 − . . . − E6.

An appropriate measure of discrepancy between the observed and expected counts is given by the following chi-squared test statistic:

χ² = Σ_{j=1}^{7} (Oj − Ej)²/Ej = 1.86.

As will be explained later on, the small observed value of the test statistic allows us to conclude that the geometric model fits the data well.
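
The fit above can be reproduced with a few lines of Python (a sketch using numpy; it follows the same steps, so small rounding differences from the numbers quoted above are to be expected):

    import numpy as np

    # Method of moments fit of the geometric model to the hop-count data,
    # followed by the chi-squared discrepancy with the 7+ cells pooled.
    j = np.arange(1, 13)
    obs = np.array([48, 31, 20, 9, 6, 5, 4, 2, 1, 1, 2, 1])
    n = obs.sum()
    p_mm = 1 / (np.sum(j * obs) / n)                    # p~ = 1 / x-bar
    expected = n * (1 - p_mm) ** (j[:6] - 1) * p_mm     # E_1, ..., E_6
    expected = np.append(expected, n - expected.sum())  # E_7 = n - E_1 - ... - E_6
    obs7 = np.append(obs[:6], obs[6:].sum())
    chi2 = np.sum((obs7 - expected) ** 2 / expected)
    print(f"p~ = {p_mm:.2f}, chi-squared = {chi2:.2f}")  # close to 0.36 and 1.86 quoted above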

3.2 Maximum likelihood estimation


In a parametric setting, given a parameter value θ, the observed sample (x1 , . . . , xn ) is a realisation of the random
vector (X1 , . . . , Xn ) which has a certain joint distribution

f (y1 , . . . yn |θ)

as a function of possible values (y1 , . . . , yn ). Fixing the variables (y1 , . . . , yn ) = (x1 , . . . , xn ) and allowing the
parameter value θ to vary, we obtain the so-called likelihood function

L(θ) = f (x1 , . . . xn |θ).

Notice that the likelihood function usually is not a density function over θ. To illustrate this construction, draw
three density curves for three parameter values θ1 < θ2 < θ3 , then show how for a given observed value x, the
likelihood curve connects the three points on the plane

(θ1 , f (x|θ1 )), (θ2 , f (x|θ2 )), (θ3 , f (x|θ3 )).

The maximum likelihood estimate θ̂ of θ is the value of θ that maximises L(θ).

Example: binomial model


Consider the binomial distribution model X ∼ Bin(n, p), with a single observation x corresponding to n observations in the Ber(p) model. From µ = np, we see that the method of moments estimate

p̃ = x/n

is the sample proportion. To maximise the likelihood function

L(p) = \binom{n}{x} p^x (1 − p)^{n−x},

we can maximise the log-likelihood function

l(p) = ln L(p) = ln\binom{n}{x} + x ln p + (n − x) ln(1 − p).

Taking its derivative and solving the equation l′(p) = 0,

l′(p) = x/p − (n − x)/(1 − p) = 0,

we find that the MLE of the population proportion is the sample proportion p̂ = x/n.
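
A small numerical check of this result (a Python sketch using scipy; the values of n and x are hypothetical):

    from scipy.optimize import minimize_scalar
    from scipy.stats import binom

    # Maximise the Bin(n, p) log-likelihood numerically and compare with x/n.
    n, x = 10, 5
    res = minimize_scalar(lambda p: -binom.logpmf(x, n, p),
                          bounds=(1e-6, 1 - 1e-6), method="bounded")
    print(f"numerical MLE = {res.x:.4f}, closed form x/n = {x / n:.4f}")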

3.3 Sufficiency
Suppose there is a statistic t = g(x1, . . . , xn) such that

L(θ) = f(x1, . . . , xn|θ) = h(t, θ)c(x1, . . . , xn) ∝ h(t, θ),

where ∝ means proportionality: the coefficient of proportionality c(x1, . . . , xn) does not depend on θ and can be treated as a constant which does not influence the location of the maximum likelihood estimate θ̂. It follows that θ̂ depends on the data (x1, . . . , xn) only through t. Given such a factorisation property, we call t a sufficient statistic, as no other statistic that can be calculated from the same sample provides any additional information on the value of the maximum likelihood estimate θ̂.

Example: Bernoulli distribution


For a single Bernoulli trial with probability of success p, we have

f(x) = P(X = x) = p^x (1 − p)^{1−x},  x ∈ {0, 1},

and for n independent Bernoulli trials,

f(x1, . . . , xn|p) = ∏_{i=1}^{n} p^{xi}(1 − p)^{1−xi} = p^{nx̄}(1 − p)^{n−nx̄}.

This implies that for the Bernoulli model, the number of successes

t = x1 + . . . + xn = nx̄

is a sufficient statistic, whose distribution is T ∼ Bin(n, p).

Example: normal distribution model


The two-parameter normal distribution model N(µ, σ²) has a two-dimensional sufficient statistic (t1, t2), where

t1 = Σ_{i=1}^{n} xi,  t2 = Σ_{i=1}^{n} xi²,

which follows from

L(µ, σ) = ∏_{i=1}^{n} (1/(σ√(2π))) e^{−(xi − µ)²/(2σ²)} = (1/(σ^n (2π)^{n/2})) e^{−(t2 − 2µt1 + nµ²)/(2σ²)}.

3.4 Large sample properties of the maximum likelihood estimates


For an iid-sample from a parametric population distribution f (x|θ), the likelihood function is given by the product
L(θ) = f (x1 |θ) · · · f (xn |θ)
due to independence. This implies that the log-likelihood function can be treated as a sum of independent and
identically distributed random variables ln f (Xi |θ). Using the central limit theorem argument one can derive a
normal approximation for the maximum likelihood estimator.
Normal approximation: Θ̂ ≈ N(θ, 1/(nI(θ))) for n ≫ 1, or in other words √n(Θ̂ − θ) ≈ N(0, 1/I(θ)),

where I(θ), the Fisher information in a single observation, is defined as follows. Let

g(x, θ) = ∂²/∂θ² ln f(x|θ)

and put

I(θ) = −E[g(X, θ)] = −∫ g(x, θ)f(x|θ) dx.

The Fisher information I(θ) is the curvature of the log-likelihood function at its maximum. The larger the information nI(θ) in n observations, the smaller the asymptotic variance (and the larger the curvature).

Approximate 100(1 − α)% confidence interval: Iθ = θ̂ ± zα/2/√(nI(θ̂)).

It turns out that the maximum likelihood estimators are asymptotically unbiased, consistent, and asymptotically efficient (have minimal variance) in the following sense.

Cramer-Rao inequality: if θ* is an unbiased estimator of θ, then Var(Θ*) ≥ 1/(nI(θ)).

Example: exponential model
For lifetimes of five batteries measured in hours

x1 = 0.5, x2 = 14.6, x3 = 5.0, x4 = 7.2, x5 = 1.2,

we propose an exponential model X ∼ Exp(θ), where θ is the battery death rate per hour. Method of moments estimate: from µ = 1/θ, we find

θ̃ = 1/x̄ = 5/28.5 = 0.175.

The likelihood function

L(θ) = θe^{−θx1} · θe^{−θx2} · θe^{−θx3} · θe^{−θx4} · θe^{−θx5} = θ^n e^{−θ(x1 + . . . + xn)} = θ^5 e^{−28.5θ}

first grows from 0 to 2.2 · 10^{−7} and then falls towards zero. The likelihood maximum is reached at θ̂ = 0.175.
For the exponential model, t = x1 + . . . + xn is a sufficient statistic, and the maximum likelihood estimate

θ̂ = 1/x̄

is biased but asymptotically unbiased, since

E(Θ̂) ≈ θ

for large samples. The latter holds due to the law of large numbers, X̄ ≈ µ.
Fisher information for the exponential model is easy to compute:

g(x, θ) = ∂²/∂θ² ln f(x|θ) = −1/θ²,  I(θ) = −E[g(X, θ)] = 1/θ².

This yields

Var(Θ̂) ≈ θ²/n,

and we get an approximate 95% confidence interval

Iθ = 0.175 ± 1.96 · 0.175/√5 = 0.175 ± 0.153.
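
In Python the same computation reads (a sketch using numpy; it reproduces the estimate and interval above up to rounding):

    import numpy as np

    # MLE of the rate in the Exp(theta) model and the approximate 95% confidence
    # interval based on the Fisher information I(theta) = 1/theta^2.
    x = np.array([0.5, 14.6, 5.0, 7.2, 1.2])    # battery lifetimes in hours
    n = len(x)
    theta_hat = 1 / x.mean()
    se = theta_hat / np.sqrt(n)                 # sqrt(1 / (n * I(theta_hat)))
    print(f"theta_hat = {theta_hat:.3f}, 95% CI: {theta_hat:.3f} +/- {1.96 * se:.3f}")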

3.5 Gamma distribution


The gamma distribution Gam(α, λ) is described by two parameters: a shape parameter α > 0 and an inverse scale parameter λ > 0. The gamma density function

f(x) = (1/Γ(α)) λ^α x^{α−1} e^{−λx},  x > 0,

involves the gamma function

Γ(α) = ∫_0^∞ x^{α−1} e^{−x} dx,

which is an extension of the function (α − 1)! to non-integer α, in that

Γ(k) = (k − 1)! for k = 1, 2, . . .

It gives the mean and variance values

µ = α/λ,  σ² = α/λ².

The gamma distribution model is more flexible than the normal distribution model, as for different values of the shape parameter α the density curves have different shapes. If α = 1, then

Gam(1, λ) = Exp(λ).

If α = k is an integer, and Xi ∼ Exp(λ) are independent, then

X1 + . . . + Xk ∼ Gam(k, λ).

The parameter λ does not influence the shape of the density, only the scaling of the random variable. It is easy to see that

X ∼ Gam(α, λ) ⇒ λX ∼ Gam(α, 1).

Normal approximation for the gamma distribution:

Gam(α, λ) ≈ N(α/λ, α/λ²) for α ≫ 1.

Turning to the likelihood function,

L(α, λ) = ∏_{i=1}^{n} (1/Γ(α)) λ^α xi^{α−1} e^{−λxi} = (λ^{nα}/Γ^n(α)) (x1 · · · xn)^{α−1} e^{−λ(x1 + . . . + xn)} = (λ^{nα}/Γ^n(α)) t2^{α−1} e^{−λt1},

where

(t1, t2) = (x1 + . . . + xn, x1 · · · xn)

is a pair of sufficient statistics containing all the information from the data needed to compute the likelihood function. To maximise the log-likelihood function

l(α, λ) = ln L(α, λ),

set the two derivatives

∂l(α, λ)/∂α = n ln λ − n Γ′(α)/Γ(α) + ln t2,
∂l(α, λ)/∂λ = nα/λ − t1

equal to zero. Then we solve numerically the system of two equations

ln(α̂/x̄) = −(1/n) ln t2 + Γ′(α̂)/Γ(α̂),
λ̂ = α̂/x̄,

using the method of moments estimates (α̃, λ̃) as the initial values.

Example: male heights


We illustrate the gamma model by applying it to a male height sample of size n = 24 given below in an ascending
order:
170, 175, 176, 176, 177, 178, 178, 179, 179,
180, 180, 180, 180, 180, 181, 181, 182, 183, 184, 186, 187, 192, 192, 199.
We would like to estimate the parameters (λ, α) from the data. First, compute the summary statistics

x̄ = 181.46,  \overline{x²} = 32964.2,  \overline{x²} − x̄² = 37.08.

To apply the method of moments we use the formulas

E(X) = α/λ,  E(X²) − (E X)² = α/λ².

Replacing here E(X) by x̄ and E(X²) by \overline{x²}, we get

α̃ = x̄²/(\overline{x²} − x̄²) = 887.96,  λ̃ = α̃/x̄ = 4.89.

The Mathematica command

FindRoot[Log[a] == 0.00055 + Gamma'[a]/Gamma[a], {a, 887.96}]

gives the maximum likelihood estimates

α̂ = 908.76,  λ̂ = 5.01.

Parametric bootstrap
What is the standard error sα̂ of the maximum likelihood estimate α̂ = 908.76? No analytical formula is available.
If we could simulate from the true population distribution Gam(α, λ), then B samples of size n = 24 would generate
B independent estimates α̂j . Then the standard deviation of the sampling distribution would give us the desired
standard error:
s²_α̂ = (1/B) Σ_{j=1}^{B} (α̂j − ᾱ)²,  ᾱ = (1/B) Σ_{j=1}^{B} α̂j.

Parametric bootstrap approach: use Gam(α̂, λ̂) as a substitute of Gam(α, λ).


Example: male heights
I have simulated B = 1000 samples of size n = 24 from Gam(908.76; 5.01) and found
ᾱ = 1039.0,  s_α̂ = √((1/1000) Σ (α̂j − ᾱ)²) = 331.3.

The standard error is large because of the small sample size n = 24.
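
The parametric bootstrap described above can be sketched in Python as follows (scipy's gamma fit with the location fixed at zero plays the role of the FindRoot computation; expect an answer of the same order as the 331.3 reported above, not identical, since the resampling is random):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)

    # Parametric bootstrap: resample from the fitted Gam(alpha-hat, lambda-hat)
    # and measure the spread of the re-estimated shape parameter.
    alpha_hat, lambda_hat, n, B = 908.76, 5.01, 24, 1000
    boot = np.empty(B)
    for j in range(B):
        y = rng.gamma(shape=alpha_hat, scale=1 / lambda_hat, size=n)
        boot[j], _, _ = stats.gamma.fit(y, floc=0)    # MLE of the shape, location fixed at 0
    print(f"bootstrap standard error of alpha-hat: {boot.std():.1f}")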

3.6 Exact confidence intervals
If we put a restrictive assumption on the population distribution and assume that an iid-sample (x1 , . . . , xn ) is
taken from a normal distribution N(µ, σ 2 ) with unspecified parameters µ and σ, then instead of the approximate
confidence interval formula for the mean we may apply an exact confidence interval formula based on the following
probability theory fact. If random variables X1 , . . . , Xn are independent and have the same distribution N(µ, σ 2 ),
then
(X̄ − µ)/S_X̄ ∼ t_{n−1},

i.e. the ratio has the so-called t-distribution with n − 1 degrees of freedom. Here

S_X̄ = S/√n

stands for the random variable whose realisation is s_x̄.

Exact 100(1 − α)% confidence interval for the mean: Iµ = x̄ ± t_{n−1}(α/2) · s_x̄.

A t_k-distribution curve looks similar to the N(0, 1) curve. Its density function is symmetric around zero:

f(x) = (Γ((k+1)/2)/(√(kπ) Γ(k/2))) (1 + x²/k)^{−(k+1)/2},  k ≥ 1.

It has a larger spread than the standard normal distribution. If the number of degrees of freedom is k ≥ 3, then the variance is k/(k − 2). Connection to the standard normal distribution: if Z, Z1, . . . , Zk are N(0, 1) and independent, then

Z/√((Z1² + . . . + Zk²)/k) ∼ t_k.

Let α = 0.05. The exact confidence interval for µ is wider than the approximate confidence interval x̄ ± 1.96 · s_x̄ valid for very large n. For example,

Iµ = x̄ ± 2.26 · sx̄ for n = 10, Iµ = x̄ ± 2.13 · sx̄ for n = 16,


Iµ = x̄ ± 2.06 · sx̄ for n = 25, Iµ = x̄ ± 2.00 · sx̄ for n = 60.

Moreover, in the N(µ, σ 2 ) case we get access to an exact confidence interval formula for the variance thanks to
the following result.

Exact distribution: (n − 1)S²/σ² ∼ χ²_{n−1}.

The chi-squared distribution with k degrees of freedom is the gamma distribution with α = k/2, λ = 1/2. It is connected to the standard normal distribution as follows: if Z1, . . . , Zk are N(0, 1) and independent, then

Z1² + . . . + Zk² ∼ χ²_k.

Exact 100(1 − α)% confidence interval: Iσ² = ((n − 1)s²/χ²_{n−1}(α/2), (n − 1)s²/χ²_{n−1}(1 − α/2)).

The exact confidence interval for σ 2 is non-symmetric. Examples of 95% confidence intervals for σ 2 :

Iσ2 = (0.47s2 , 3.33s2 ) for n = 10,


Iσ2 = (0.55s2 , 2.40s2 ) for n = 16,
Iσ2 = (0.61s2 , 1.94s2 ) for n = 25,
Iσ2 = (0.72s2 , 1.49s2 ) for n = 60.

Under the normality assumption, Var(S²) = 2σ⁴/(n − 1), and the estimated standard error for s² is √(2/(n − 1)) · s².
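
A minimal Python sketch of the exact intervals (the sample values are hypothetical; note that χ²_{n−1}(α/2) above denotes the upper α/2 point, which in scipy is ppf(1 − α/2)):

    import numpy as np
    from scipy import stats

    # Exact confidence intervals for mu and sigma^2 under the normality assumption.
    x = np.array([5.3, 4.3, 3.2, 3.7, 1.6, 6.4, 3.1, 6.6])   # hypothetical sample
    n, alpha = len(x), 0.05
    xbar, s2 = x.mean(), x.var(ddof=1)

    half = stats.t.ppf(1 - alpha / 2, df=n - 1) * np.sqrt(s2 / n)
    ci_mean = (xbar - half, xbar + half)
    ci_var = ((n - 1) * s2 / stats.chi2.ppf(1 - alpha / 2, df=n - 1),
              (n - 1) * s2 / stats.chi2.ppf(alpha / 2, df=n - 1))
    print(ci_mean, ci_var)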

3.7 Exercises
Problem 1
The Poisson distribution has been used by traffic engineers as a model for light traffic. The following table shows
the number of right turns during 300 three-min intervals at a specific intersection. Fit a Poisson distribution.
Comment on the fit by comparing observed and expected counts. It is useful to know that the 300 intervals
were distributed over various hours of the day and various days of the week.
n Frequency
0 14
1 30
2 36
3 68
4 43
5 43
6 30
7 14
8 10
9 6
10 4
11 1
12 1
13+ 0

Problem 2
One of the earliest applications of the Poisson distribution was made by Student (1907) in studying errors made in
counting yeast cells. In this study, yeast cells were killed and mixed with water and gelatin; the mixture was then
spread on a glass and allowed to cool. Four different concentrations were used. Counts were made on 400 squares,
and the data are summarised in the following table:
Number of cells Concent. 1 Concent. 2 Concent. 3 Concent. 4
0 213 103 75 0
1 128 143 103 20
2 37 98 121 43
3 18 42 54 53
4 3 8 30 86
5 1 4 13 70
6 0 2 2 54
7 0 0 1 37
8 0 0 0 18
9 0 0 1 10
10 0 0 0 5
11 0 0 0 2
12 0 0 0 2
(a) Estimate the parameter λ for each of the four sets of data.
(b) Find an approximate 95% confidence interval for each estimate.
(c) Compare observed and expected counts.

Problem 3
Suppose that X is a discrete random variable with

P(X = 0) = (2/3)θ,
P(X = 1) = (1/3)θ,
P(X = 2) = (2/3)(1 − θ),
P(X = 3) = (1/3)(1 − θ),

where θ ∈ [0, 1] is parameter. The following 10 independent observations were taken from such a distribution:

(3, 0, 2, 1, 3, 2, 1, 0, 2, 1).

(a) Find the method of moments estimate of θ.
(b) Find an approximate standard error for your estimate.
(c) What is the maximum likelihood estimate of θ?
(d) What is an approximate standard error of the maximum likelihood estimate?

Problem 4
Suppose that X ∼ Bin(n, p).

(a) Show that the maximum likelihood estimate of p is p̂ = x/n.
(b) Show that p̂ = x/n attains the Cramer-Rao lower bound.
(c) If n = 10 and X = 5, plot the log-likelihood function.

Problem 5
A company has manufactured certain objects and has printed a serial number on each manufactured object. The
serial numbers start at 1 and end at N , where N is the number of objects that have been manufactured. One of
these objects is selected at random, and the serial number of that object is 888.

(a) What is the method of moments estimate of N ?


(b) What is the maximum likelihood estimate of N ?

Problem 6
Capture-recapture method for estimating the number N of fish living in a lake:

1. capture and tag say n = 100 fish, then release them in the lake,

2. recapture say k = 50 fish and count the number of tagged fish.

Suppose x = 20 fish were tagged among the k = 50 recaptured fish. Find the maximum likelihood estimate of N after suggesting a simple parametric model.

Problem 7
The following 16 numbers came from normal random generator on a computer

5.3299 4.2537 3.1502 3.7032


1.6070 6.3923 3.1181 6.5941
3.5281 4.7433 0.1077 1.5977
5.4920 1.7220 4.1547 2.2799

(a) What would you guess the mean and the variance of the generating normal distribution were?
(b) Give 90%, 95%, and 99% confidence intervals for µ and σ 2 .
(c) Give 90%, 95%, and 99% confidence intervals for σ.
(d) How much larger sample would you need to halve the length of the confidence interval for µ?

Problem 8
Let X1 , . . . , Xn be independent random variables uniformly distributed on [0, θ].

(a) Find the method of moments estimate of θ and its mean and variance.
(b) Find the maximum likelihood estimate of θ.
(c) Find the probability density of the maximum likelihood estimate and calculate its mean and variance.
Compare the variance, the bias, and the mean square error to those of the method of moments estimate.
(d) Find a modification of the maximum likelihood estimate that renders it unbiased.

Problem 9
For two factors, starchy-or-sugary and green-or-white base leaf, the following counts for the progeny of self-fertilized
heterozygotes were observed (Fisher 1958)
Type Count
Starchy green x1 = 1997
Starchy white x2 = 906
Sugary green x3 = 904
Sugary white x4 = 32
According to the genetic theory the cell probabilities are
p1 = (2 + θ)/4,  p2 = (1 − θ)/4,  p3 = (1 − θ)/4,  p4 = θ/4,
where 0 < θ < 1. In particular, if θ = 0.25, then the genes are unlinked and the genotype frequencies are
Green White Total
Starchy 9/16 3/16 3/4

Sugary 3/16 1/16 1/4

Total 3/4 1/4 1

(a) Find the maximum likelihood estimate of θ and its asymptotic variance.
(b) Find an approximate 95% confidence interval for θ based on part (a).
(c) Use the bootstrap to find the approximate standard deviation of the maximum likelihood estimate and
compare to the result of part (a).

4 Hypothesis testing
4.1 Statistical significance
Often we need a rule based on data for choosing between two mutually exclusive hypotheses
null hypothesis H0 : the effect of interest is zero,
alternative H1 : the effect of interest is not zero.
H0 represents an established theory that must be discredited in order to demonstrate some effect H1 .
A decision rule for hypothesis testing is based on a test statistic t = t(x1 , . . . , xn ), a function of the data with distinct
typical values under H0 and H1 . The task is to find an appropriately chosen rejection region R and

reject H0 in favor of H1 if and only if t ∈ R.

Making a decision we can commit type I or type II error, see the table:

                  Negative decision: do not reject H0      Positive decision: reject H0 in favor of H1
If H0 is true     True negative outcome                    False positive outcome, type I error
If H1 is true     False negative outcome, type II error    True positive outcome

Four important conditional probabilities:

α = P(T ∈ R | H0): significance level of the test, the conditional probability of a type I error,
1 − α = P(T ∉ R | H0): specificity of the test,
β = P(T ∉ R | H1): conditional probability of a type II error,
1 − β = P(T ∈ R | H1): sensitivity (power) of the test.

If the test statistic and sample size are fixed, one cannot make both α and β smaller by changing R.

A significance test tries to control the type I error:


fix an appropriate significance level α, commonly used significance levels are 5%, 1%, 0.1%,
find R from α = P(T ∈ R|H0 ) using the null distribution of the test statistic T .

P-value of the test


A p-value is the probability of obtaining a test statistic value as extreme or more extreme than the observed one,
given that H0 is true. For a given significance level α,

reject H0 , if p-value ≤ α, and do not reject H0 , if p-value > α.

Observe that the p-value depends on the data and therefore, is a realisation of a random variable P. The source of
randomness is in the sampling procedure: if you take another sample, you obtain a different p-value. To illustrate,
suppose we are testing H0 : θ = θ0 versus H1 : θ > θ0 with the help of a test statistic Z whose null distribution is
N(0,1). Suppose the null hypothesis is true. Given zobs = z, the p-value is

p = P(Z > z) = 1 − Φ(z),

and in terms of the random variables

P = P(Z > Zobs ) = 1 − Φ(Zobs ).

Now, under H0 the observed value for different samples has distribution N(0,1) so that

P(P > p) = P(1 − Φ(Zobs ) > 1 − Φ(z)) = P(Φ(Zobs ) < Φ(z)) = P(Zobs < z) = Φ(z) = 1 − p,

and we conclude that the P-value has a uniform null distribution.
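
This uniformity is easy to see by simulation (a Python sketch assuming the one-sided z-test setting of the paragraph above; not part of the compendium):

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(0)

    # Under H0 the test statistic Z_obs ~ N(0,1), so the one-sided p-value
    # P = 1 - Phi(Z_obs) should be uniform on (0, 1).
    z_obs = rng.standard_normal(100_000)
    p_values = 1 - norm.cdf(z_obs)
    decile_freq = np.histogram(p_values, bins=10, range=(0, 1))[0] / len(p_values)
    print(np.round(decile_freq, 3))   # each decile frequency should be close to 0.1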

4.2 Large-sample test for the proportion


Binomial model X ∼ Bin(n, p). The corresponding sample proportion is p̂ = x/n.

For H0: p = p0, use the test statistic

z = (x − np0)/√(np0(1 − p0)) = (p̂ − p0)/√(p0(1 − p0)/n).

Three different composite alternative hypotheses:

one-sided H1: p > p0,  one-sided H1: p < p0,  two-sided H1: p ≠ p0.

By the central limit theorem, the null distribution of the Z-score is approximately standard normal, Z ≈ N(0, 1); find zα from Φ(zα) = 1 − α using the normal distribution table.

Alternative H1    Rejection rule                  P-value
p > p0            z ≥ zα                          P(Z ≥ zobs)
p < p0            z ≤ −zα                         P(Z ≤ zobs)
p ≠ p0            z ≤ −zα/2 or z ≥ zα/2           2 · P(Z ≥ |zobs|)

Power function
Consider two simple hypotheses

H0 : p = p0 and H1 : p = p1 , assuming p1 > p0 .

The power function of the one-sided test can be computed using the normal approximation for (X − np1)/√(np1(1 − p1)) under H1:

Pw(p1) = P((X − np0)/√(np0(1 − p0)) ≥ zα | H1)
       = P((X − np1)/√(np1(1 − p1)) ≥ (zα√(p0(1 − p0)) + √n(p0 − p1))/√(p1(1 − p1)) | H1)
       ≈ 1 − Φ((zα√(p0(1 − p0)) + √n(p0 − p1))/√(p1(1 − p1))).

Planning of sample size: given α and β, choose the sample size n such that

√n = (zα√(p0(1 − p0)) + zβ√(p1(1 − p1)))/|p1 − p0|.

Example: extrasensory perception


An experiment: guess the suits of n = 100 cards chosen at random with replacement from a deck of cards with
four suits. Binomial model: the number of cards guessed correctly X ∼ Bin(100, p). Hypotheses of interest

H0 : p = 0.25 (pure guessing), H1 : p > 0.25 (ESP ability).

Rejection rule at 5% significance level

R = {(p̂ − 0.25)/0.0433 ≥ 1.645} = {p̂ ≥ 0.32} = {x ≥ 32}.

With a simple alternative H1 : p = 0.30 the power of the test is

1 − Φ((1.645 · 0.433 − 0.5)/0.458) ≈ 32%.

The sample size required for the 90% power is

n = ((1.645 · 0.433 + 1.28 · 0.458)/0.05)² ≈ 675.

Suppose the observed sample count is xobs = 30, then

zobs = (0.3 − 0.25)/0.0433 = 1.15
and the one-sided p-value becomes
P(Z ≥ 1.15) = 12.5%.

The result is not significant, we do not reject H0 .
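The calculations of this example can be reproduced with a short Matlab sketch (an illustration only, using the numbers above):

n = 100; p0 = 0.25; p1 = 0.30; alpha = 0.05; beta = 0.10;
za = norminv(1 - alpha);                      % 1.645
se0 = sqrt(p0*(1 - p0)/n);                    % 0.0433, standard error under H0
threshold = p0 + za*se0                       % reject H0 when phat >= 0.32
power = 1 - normcdf((za*sqrt(p0*(1-p0)) + sqrt(n)*(p0 - p1))/sqrt(p1*(1-p1)))   % about 0.32
zb = norminv(1 - beta);                       % 1.28
nreq = ((za*sqrt(p0*(1-p0)) + zb*sqrt(p1*(1-p1)))/abs(p1 - p0))^2               % about 675
zobs = (0.30 - p0)/se0;                       % 1.15 for xobs = 30
pvalue = 1 - normcdf(zobs)                    % one-sided p-value, about 0.125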

4.3 Small-sample test for the proportion
Binomial model X ∼ Bin(n, p) with H0 : p = p0 . For small n, use exact null distribution X ∼ Bin(n, p0 ).

Example: extrasensory perception


ESP test: guess the suits of n = 20 cards. Model: the number of cards guessed correctly is

X ∼ Bin(20, p).

For H0 : p = 0.25, the null distribution of the test statistic x is


    x           8      9      10     11
    P(X ≥ x)    .101   .041   .014   .004        (from the Bin(20, 0.25) table)

For the one-sided alternative H1 : p > 0.25 and α = 5%, the rejection rule is R = {x ≥ 9}. Notice that the exact
significance level is 4.1%. The power function takes the values

    p           0.27    0.30    0.40    0.50    0.60    0.70
    P(X ≥ 9)    0.064   0.113   0.404   0.748   0.934   0.995
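A Matlab sketch of these exact binomial calculations (an illustration only):

n = 20; p0 = 0.25;
x = 8:11;
tails = 1 - binocdf(x - 1, n, p0)       % P(X >= x) under H0, matching the table above
alpha_exact = 1 - binocdf(8, n, p0)     % exact level of the rule R = {x >= 9}, about 0.041
p = [0.27 0.30 0.40 0.50 0.60 0.70];
power = 1 - binocdf(8, n, p)            % power function at the listed alternatives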

4.4 Two tests for the mean


We wish to test H0 : µ = µ0 against either the two-sided or a one-sided alternative for continuous or discrete data.
As the test statistic we use the t-score

    t = (x̄ − µ0)/sx̄ .
Large-sample test for the mean is used when the population distribution is not necessarily normal but the sample
size n is sufficiently large. Compute the rejection region using the approximate null distribution

    T ≈ N(0, 1) under H0.

One-sample t-test is used for small n, under the assumption that the population distribution is normal. Compute
the rejection region using the exact null distribution

    T ∼ tn−1 under H0.

Confidence interval method of hypotheses testing


Observe that at significance level α the rejection rule can be expressed as

    R = {µ0 ∉ Iµ}

in terms of a 100(1 − α)% confidence interval for the mean. Having such a confidence interval, reject H0 : µ = µ0 if
the interval does not cover the value µ0 .
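A Matlab sketch of the one-sample t-test and the confidence interval method (an illustration only; the data vector x and the null value mu0 are assumed to be given):

n = length(x);
se = std(x)/sqrt(n);                        % estimated standard error of the mean
t = (mean(x) - mu0)/se;                     % t-score
pvalue = 2*(1 - tcdf(abs(t), n - 1))        % two-sided p-value from the t(n-1) null distribution
I = mean(x) + tinv(0.975, n - 1)*se*[-1 1]  % 95% confidence interval for mu
% confidence interval method: reject H0 at the 5% level if mu0 falls outside I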

4.5 Likelihood ratio test


A general method of finding asymptotically optimal tests (having the largest power for a given α) takes likelihood
ratio as the test statistic. Consider first the case of two simple hypotheses. For testing

H0 : θ = θ0 against H1 : θ = θ1 ,
use the likelihood ratio Λ = L(θ0)/L(θ1) as a test statistic. Large values of Λ suggest that H0 explains the data set better
than H1 , while a small Λ indicates that H1 explains the data set better. The likelihood ratio test rejects H0 for small
values of Λ.
Neyman-Pearson lemma: the likelihood ratio test is optimal in the case of two simple hypotheses.

Nested hypotheses
With a pair of nested parameter sets Ω0 ⊂ Ω we get two composite alternatives

H0 : θ ∈ Ω0 against H1 : θ ∈ Ω \ Ω0 .

It will be more convenient to recast this setting in terms of two nested hypotheses

H0 : θ ∈ Ω0 , H : θ ∈ Ω,

leading to two maximum likelihood estimates

θ̂0 maximises the likelihood function L(θ) over θ ∈ Ω0 ,
θ̂ maximises the likelihood function L(θ) over θ ∈ Ω.

Generalised likelihood ratio test rejects H0 for small values of

    Λ̃ = L(θ̂0)/L(θ̂),

or equivalently for large values of


− ln Λ̃ = ln L(θ̂) − ln L(θ̂0 ).

It turns out that the test statistic −2 ln Λ̃ has a nice approximate null distribution:

    −2 ln Λ̃ ≈ χ²_df under H0, where df = dim(Ω) − dim(Ω0 ).

4.6 Pearson’s chi-squared test


Suppose that each of n independent observations belongs to one of J classes with probabilities (p1 , . . . , pJ ). Such
data are summarised as the vector of observed counts whose joint distribution is multinomial

    (O1 , . . . , OJ ) ∼ Mn(n; p1 , . . . , pJ ),    P(O1 = k1 , . . . , OJ = kJ ) = [n!/(k1 ! · · · kJ !)] p1^{k1} · · · pJ^{kJ}.

Consider a parametric model for the data

H0 : (p1 , . . . , pJ ) = (v1 (λ), . . . , vJ (λ)) with unknown parameters λ = (λ1 , . . . , λr ).

To see if the proposed model fits the data, compute λ̂, the maximum likelihood estimate of λ, and then the expected
cell counts
Ej = n · vj (λ̂),

where ”expected” means expected under the null hypothesis model. In the current setting, the likelihood ratio test
statistic −2 ln Λ̃ is approximated by the so-called chi-squared test statistic

    χ² = Σ_{j=1}^{J} (Oj − Ej)²/Ej .

The approximate null distribution of the chi-squared test statistic is χ²_{J−1−r}, since

dim(Ω0 ) = r and dim(Ω) = J − 1.

A mnemonic rule for the number of degrees of freedom:

df = (number of cells) – 1 – (number of independent parameters estimated from the data).

Since the chi-squared test is approximate, all expected counts are recommended to be at least 5. If not, you
should combine small cells into larger cells and recalculate the number of degrees of freedom df.

Example: geometric model


Returning to the data on the number of hops for birds consider

H0 : number of hops that a bird does between flights has a geometric distribution Geom(p).

Using p̂ = 0.358 and J = 7 we obtain χ2 = 1.86. With df = 5 we find p-value = 0.87, therefore we conclude a good
fit of the geometric distribution model to the data.
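A Matlab sketch of this goodness-of-fit computation (an illustration only; it uses the hop counts listed in the exercise of Section 5.5 and pools the cells j ≥ 7 into one cell, which gives J = 7):

O = [48 31 20 9 6 5 4 2 1 1 2 1];              % observed counts for j = 1,...,12 hops
n = sum(O);                                    % 130 birds
phat = n/sum((1:12).*O);                       % MLE of p for the Geom(p) model, about 0.358
pj = [phat*(1 - phat).^(0:5), (1 - phat)^6];   % cell probabilities for j = 1,...,6 and j >= 7
Oc = [O(1:6), sum(O(7:12))];                   % observed counts with the cells j >= 7 pooled
E = n*pj;                                      % expected counts under the fitted model
chi2 = sum((Oc - E).^2./E)                     % about 1.86
pvalue = 1 - chi2cdf(chi2, 7 - 1 - 1)          % df = 5, p-value about 0.87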

4.7 Example: sex ratio
An 1889 study made in Germany recorded the numbers of boys (y1 , . . . , yn ) for n = 6115 families with 12 children
each. Consider three nested models for the distribution of the number of boys Y in a family with 12 children

Model 1: Y ∼ Bin(12, 0.5)



Model 2: Y ∼ Bin(12, p)

General model: pj = P(Y = j), j = 0, 1, . . . , 12.

Model 1 leads to a simple null hypothesis

    H0 : pj = C(12, j) · 2^{−12},   j = 0, 1, . . . , 12.

The expected cell counts

    Ej = 6115 · C(12, j) · 2^{−12},   j = 0, 1, . . . , 12,

are summarised in the table below. Observed chi-squared test statistic: χ² = 249.2, df = 12. Since χ²12(0.005) = 28.3,
we reject H0 at the 0.5% level.
cell j    Oj    Model 1: Ej and (Oj − Ej)²/Ej    Model 2: Ej and (Oj − Ej)²/Ej
0 7 1.5 20.2 2.3 9.6
1 45 17.9 41.0 26.1 13.7
2 181 98.5 69.1 132.8 17.5
3 478 328.4 68.1 410.0 11.3
4 829 739.0 11.0 854.2 0.7
5 1112 1182.4 4.2 1265.6 18.6
6 1343 1379.5 1.0 1367.3 0.4
7 1033 1182.4 18.9 1085.2 2.5
8 670 739.0 6.4 628.1 2.8
9 286 328.4 5.5 258.5 2.9
10 104 98.5 0.3 71.8 14.4
11 24 17.9 2.1 12.1 11.7
12 3 1.5 1.5 0.9 4.9
Total 6115 6115 χ2 = 249.2 6115 χ2 = 110.5

Model 2 is more flexible and leads to a composite null hypothesis

    H0 : pj = C(12, j) · p^j (1 − p)^{12−j},   j = 0, . . . , 12,   0 ≤ p ≤ 1.

The expected cell counts

    Ej = 6115 · C(12, j) · p̂^j (1 − p̂)^{12−j}

are computed using the maximum likelihood estimate of the proportion of boys

    p̂ = (number of boys)/(number of children) = (1 · 45 + 2 · 181 + . . . + 12 · 3)/(6115 · 12) = 0.481.
The observed chi-squared test statistic is χ² = 110.5. Since r = 1, df = 11, and the table value is χ²11(0.005) = 26.76,
we reject even model 2 at 0.5% level.
We see that what is needed is an even more flexible model addressing large variation in the observed cell counts.
Suggestion: allow the probability of a male child p to differ from family to family.
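A Matlab sketch of the Model 2 computation (an illustration only, using the observed counts Oj from the table above):

O = [7 45 181 478 829 1112 1343 1033 670 286 104 24 3];   % observed counts for j = 0,...,12 boys
n = sum(O); j = 0:12;
phat = sum(j.*O)/(12*n);                   % MLE of the proportion of boys, 0.481
E = n*binopdf(j, 12, phat);                % expected counts under Bin(12, phat)
chi2 = sum((O - E).^2./E)                  % about 110.5
critical = chi2inv(0.995, 11)              % 26.76, so Model 2 is rejected at the 0.5% level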

4.8 Exercises
Problem 1
Suppose that X ∼ Bin(100, p). Consider a test

H0 : p = 1/2, H1 : p ≠ 1/2.

that rejects H0 in favour of H1 for |x − 50| > 10. Use the normal approximation to the binomial distribution to
answer the following:
(a) What is α?
(b) Graph the power as a function of p.

Problem 2
Let X have one of the following two distributions

X-values x1 x2 x3 x4
P(x|H0 ) 0.2 0.3 0.3 0.2
P(x|H1 ) 0.1 0.4 0.1 0.4

(a) Compare the likelihood ratio, Λ, for each xi and order the xi according to Λ.
(b) What is the likelihood ratio test of H0 versus H1 at level α = 0.2? What is the test at level α = 0.5?

Problem 3
Let (x1 , . . . , xn ) be a sample from a Poisson distribution. Find the likelihood ratio for testing H0 : λ = λ0 against
H1 : λ = λ1 , where λ1 > λ0 . Use the fact that the sum of independent Poisson random variables follows a Poisson
distribution to explain how to determine a rejection region for a test at level α.

Problem 4
Let (x1 , . . . , x25 ) be a sample from a normal distribution having a variance of 100.

(a) Find the rejection region for a test at level α = 0.1 of H0 : µ = 0 versus H1 : µ = 1.5.
(b) What is the power of the test?
(c) Repeat for α = 0.01.

Problem 5
Under H0 , a random variable has a cumulative distribution function

F (x) = x2 , 0 ≤ x ≤ 1,

and under H1 , it has a cumulative distribution function

F (x) = x3 , 0 ≤ x ≤ 1.

(b) What is the form of the likelihood ratio test of H0 versus H1 ?


(c) What is the rejection region of a level α test?
(d) What is the power of the test?

Problem 6
Let (x1 , . . . , x15 ) be a sample from a normal distribution.

(a) Find a 95% confidence interval for the population variance σ 2 .


(b) Using the confidence interval method test H0 : σ = 1 versus H1 : σ ≠ 1 at α = 0.05.

Problem 7
An iid-sample from N(µ, σ 2 ) gives a 99% confidence interval for µ to be (−2, 3). Test

H0 : µ = −3 against H1 : µ ≠ −3

at α = 0.01.

Problem 8
Binomial model for the data value x:
X ∼ Bin(n, p).
(a) What is the generalised likelihood ratio for testing H0 : p = 0.5 against H1 : p ≠ 0.5?
(b) Show that the test rejects for large values of |x − n/2|.
(c) How can the significance level corresponding to the rejection region

    R = {|x − n/2| > k}

be determined?
(d) If n = 10 and k = 2, what is the significance level of the test?
(e) Use the normal approximation to the binomial distribution to find the significance level if n = 100
and k = 10.

Problem 9
Suppose that a test statistic Z has a standard normal null-distribution.

(a) If the test rejects for large values of |z|, what is the p-value corresponding to z = 1.5?
(b) Answer the same question if the test rejects for large values of z.

Problem 10
It has been suggested that dying people may be able to postpone their death until after an important occasion,
such as a wedding or birthday. Phillips and King (1988) studied the patterns of death surrounding Passover, an
important Jewish holiday.

(a) California data 1966-1984. They compared the number of deaths during the week before Passover to the
number of deaths during the week after Passover for 1919 people who had Jewish surnames. Of these, 922 occurred
in the week before and 997 in the week after Passover.
(b) For 852 males of Chinese and Japanese ancestry, 418 died in the week before and 434 died in the week after
Passover.

Problem 11
If gene frequencies are in equilibrium, the genotypes AA, Aa, and aa occur with probabilities

p1 = (1 − θ)2 , p2 = 2θ(1 − θ), p3 = θ2 .

Plato et al. (1964) published the following data on haptoglobin type in a sample of 190 people
Genotype Hp 1-1 Hp 1-2 Hp 2-2
Observed count xi 10 68 112
Test the goodness of fit of the data to the equilibrium model.

Problem 12
US suicides in 1970. Check for the seasonal variation
Month Number of suicides
Jan 1867
Feb 1789
Mar 1944
Apr 2094
May 2097
Jun 1981
Jul 1887
Aug 2024
Sep 1928
Oct 2032
Nov 1978
Dec 1859

Problem 13
In 1965, a newspaper carried a story about a high school student who reported getting 9207 heads and 8743 tails
in 17950 coin tosses.

(a) Is this a significant discrepancy from the null hypothesis H0 : p = 1/2?


(b) A statistician contacted the student and asked him exactly how he had performed the experiment (Youden
1974). To save time the student had tossed groups of five coins at a time, and a younger brother had recorded the
results, shown in the table:
number of heads 0 1 2 3 4 5 Total
observed 100 524 1080 1126 655 105 3590

Are the data consistent with the hypothesis that all the coins were fair (p = 1/2)?
(c) Are the data consistent with the hypothesis that all five coins had the same probability of heads but this
probability was not necessarily 1/2?

5 Bayesian inference
The statistical tools introduced in this course so far are based on the so-called frequentist approach. In the
parametric case the data x is assumed to be randomly generated by a distribution f (x|θ) and the unknown
population parameter θ is estimated using the maximum likelihood estimate. This section presents basic concepts
of the Bayesian approach when it is assumed that the parameter of interest θ is itself randomly generated using a
given prior distribution g(θ). The prior distribution brings into the model our knowledge (or lack of knowledge)
on θ before data x is generated using f (x|θ), which in this section is called the likelihood function.
After the data x is generated by such a two-step procedure involving the pair g(θ) and f (x|θ), we may update
our knowledge on θ and compute a posterior distribution h(θ|x) using the Bayes formula
    h(θ|x) = f(x|θ)g(θ)/φ(x),

where

    φ(x) = ∫ f(x|θ)g(θ) dθ   or   φ(x) = Σθ f(x|θ)g(θ)

gives the marginal distribution of the random data X. For a fixed x, the denominator φ(x) is treated as a constant
and the Bayes formula can be summarised as
posterior ∝ likelihood × prior
where ∝ means proportional, as the coefficient of proportionality φ(x) does not explicitly depend on θ.
When we have no prior knowledge of θ, the prior distribution is often modelled by the uniform distribution.
In this case of uninformative prior, given g(θ) is a constant, we have h(θ|x) ∝ f (x|θ) so that all the posterior
knowledge comes from the likelihood function.

Example: IQ measurement
A randomly chosen individual has an unknown true intelligence quotient value θ. Suppose an IQ test is calibrated in
such a way that the prior distribution of θ is normal N(100, 225). This normal distribution describes the population
distribution of people’s IQ with population mean of m = 100 and population standard deviation v = 15. For a
person with an IQ value θ, the result x of an IQ measurement is generated by another normal distribution N(θ,
100), with no systematic error and a random error σ = 10.
Since

    g(θ) = (1/(√(2π) v)) e^{−(θ−m)²/(2v²)},    f(x|θ) = (1/(√(2π) σ)) e^{−(x−θ)²/(2σ²)},

and the posterior is proportional to g(θ)f(x|θ), we get

    h(θ|x) ∝ exp{ −(θ − m)²/(2v²) − (x − θ)²/(2σ²) } ∝ exp{ −(θ − γm − (1 − γ)x)²/(2γv²) },

where

    γ = σ²/(σ² + v²)

is the so-called shrinkage factor. We conclude that the posterior distribution is also normal,

    h(θ|x) = (1/(√(2πγ) v)) e^{−(θ − γm − (1 − γ)x)²/(2γv²)},

having mean γm + (1 − γ)x and variance γv².
In particular, if the observed IQ result is x = 130, then the posterior distribution becomes N(120.7, 69.2). We
see that the prior expectation m = 100 has corrected the observed result x = 130 down to 120.7. The posterior
variance 69.2 is smaller than that of the prior distribution 225 by the shrinkage factor γ = 0.308: the updated
knowledge is less uncertain than the prior knowledge.
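A Matlab sketch of this posterior update (an illustration only, with the numbers of the example):

m = 100; v = 15;                      % prior N(m, v^2) for the true IQ
sigma = 10;                           % standard deviation of the measurement error
x = 130;                              % observed IQ score
gamma = sigma^2/(sigma^2 + v^2);      % shrinkage factor, about 0.308
postmean = gamma*m + (1 - gamma)*x    % about 120.7
postvar = gamma*v^2                   % about 69.2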

5.1 Conjugate priors


Suppose we have two parametric families of probability distributions G and H.
G is called a family of conjugate priors to H, if a G-prior and a H-likelihood give a G-posterior.
Below we present six models involving conjugate priors. For this we need to introduce two more parametric
distributions: Beta and Dirichlet.

Beta distribution
Beta distribution Beta(a, b) is determined by two parameters a > 0, b > 0 which are called pseudo-counts. It has
density

    f(p) = [Γ(a + b)/(Γ(a)Γ(b))] p^{a−1} (1 − p)^{b−1},   0 < p < 1,

with mean and variance of the form

    µ = a/(a + b),    σ² = µ(1 − µ)/(a + b + 1).
Beta distribution is a convenient prior distribution for a population frequency p ∈ (0, 1). The uniform distribution
U(0, 1) is obtained with a = b = 1.
[Figure: Beta density curves for Beta(0.5, 0.5), Beta(10, 3), Beta(1, 1) and Beta(0.8, 3) on the interval (0, 1).]

Exercise: verify that for given a > 1 and b > 1, the maximum of the density function f(p) is attained at

    p̂ = (a − 1)/(a + b − 2).

Dirichlet distribution
Dirichlet distribution is a multivariate extension of the Beta distribution. Dir(α1 , . . . , αr ) is a probability distribu-
tion over the vectors (p1 , . . . , pr ) with non-negative components such that

p1 + . . . + pr = 1.

Positive parameters α1 , . . . , αr are also called pseudo-counts. It has density

    f(p1 , . . . , pr) = [Γ(α0)/(Γ(α1) · · · Γ(αr))] p1^{α1−1} · · · pr^{αr−1},   α0 = α1 + . . . + αr .

The marginal distributions are Beta distributions,

    Pj ∼ Beta(αj , α0 − αj ),   j = 1, . . . , r,

and the covariances are negative:

    Cov(Pi , Pj) = −αi αj / (α0²(α0 + 1))   for i ≠ j.
[Figure: four scatter plots, each showing n = 300 points from a Dir(α1 , α2 , α3 ) distribution over the probability triangle.]
The figure illustrates four examples of Dir(α1 , α2 , α3 ) distribution. Each triangle contains n = 300 points generated
using different sets of parameters (α1 , α2 , α3 ):
upper left (0.3, 0.3, 0.1), upper right (13,16,15), lower left (1,1,1), lower right (3,0.1,1).
A dot in a triangle gives (x1 , x2 , x3 ) as the distances to the bottom edge of the triangle (x1 ), to the right edge of
the triangle (x2 ), and to the left edge of the triangle (x3 ).

List of conjugate priors


Data distribution Prior Posterior distribution
X1 , . . . , Xn ∼ N(µ, σ 2 ) µ ∼ N(µ0 , σ02 ) N(γn µ0 + (1 − γn )x̄; γn σ02 )
X ∼ Bin(n, p) p ∼ Beta(a, b) Beta(a + x, b + n − x)
(X1 , . . . , Xr ) ∼ Mn(n; p1 , . . . , pr ) (p1 , . . . , pr ) ∼ Dir(α1 , . . . , αr ) Dir(α1 + x1 , . . . , αr + xr )
X1 , . . . , Xn ∼ Geom(p) p ∼ Beta(a, b) Beta(a + n, b + nx̄ − n)
X1 , . . . , Xn ∼ Pois(µ) µ ∼ Gam(α0 , λ0 ) Gam(α0 + nx̄, λ0 + n)
X1 , . . . , Xn ∼ Gam(α, λ) λ ∼ Gam(α0 , λ0 ) Gam(α0 + αn, λ0 + nx̄)
For the Normal-Normal model, the ratio

    γn = σ²/(σ² + nσ0²) = (σ²/n) / (σ²/n + σ0²)

is called the shrinkage factor as it gives the ratio of the posterior variance to the prior variance. Notice that
the posterior variance is always smaller than the prior variance. This list of conjugate prior models illustrates
that the contribution of the prior distribution becomes smaller for larger samples. For the Binomial-Beta and
Multinomial-Dirichlet models the update rule has the form
posterior pseudo-counts = prior pseudo-counts plus sample counts
Example: Binomial-Beta model
First we show a simple demonstration that beta distribution gives a conjugate prior to the binomial likelihood.
Indeed, if

    prior ∝ p^{a−1} (1 − p)^{b−1},

and

    likelihood ∝ p^x (1 − p)^{n−x},

then obviously the posterior is also a beta distribution:

    posterior ∝ prior × likelihood ∝ p^{a+x−1} (1 − p)^{b+n−x−1}.
Suppose we are interested in the probability p of a thumbtack landing on its base. Two experiments are
performed. An experiment consists of n tosses of the thumbtack with the number of base landings X ∼ Bin(n, p)
being counted.
Experiment 1: n1 = 10 tosses, counts x1 = 2, n1 − x1 = 8. We apply the uninformative prior distribution
Beta(1, 1) with mean µ0 = 0.5 and standard deviation σ0 = 0.29. It gives a posterior distribution Beta(3, 9) with
mean p̂ = 3/12 = 0.25 and standard deviation σ1 = 0.12.
Experiment 2: n2 = 40 tosses, counts x2 = 9, n2 − x2 = 31. As a new prior distribution we use the posterior
distribution Beta(3, 9) obtained from the first experiment. The new posterior distribution becomes Beta(12, 40)
with mean p̂ = 12/52 = 0.23 and standard deviation σ2 = 0.06.
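A Matlab sketch of these two conjugate updates (an illustration only; betastat returns the mean and variance of a beta distribution):

a = 1; b = 1;                          % uninformative prior Beta(1, 1)
x1 = 2; n1 = 10;
a = a + x1; b = b + n1 - x1;           % posterior after experiment 1: Beta(3, 9)
[m1, v1] = betastat(a, b);             % mean 0.25, standard deviation sqrt(v1) = 0.12
x2 = 9; n2 = 40;
a = a + x2; b = b + n2 - x2;           % posterior after experiment 2: Beta(12, 40)
[m2, v2] = betastat(a, b);             % mean 0.23, standard deviation sqrt(v2) = 0.06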

5.2 Bayesian estimation


In the language of decision theory we are searching for an optimal action

action a = {assign value a to unknown parameter θ}.


The optimal a depends on the choice of the loss function l(θ, a). Bayes action minimises the posterior risk

    R(a|x) = E[ l(Θ, a) | x ],

so that

    R(a|x) = ∫ l(θ, a)h(θ|x) dθ   or   R(a|x) = Σθ l(θ, a)h(θ|x).
We consider two loss functions leading to two Bayesian estimators.

Zero-one loss function and maximum a posteriori probability
Zero-one loss function: l(θ, a) = 1{θ ≠ a}

Using the zero-one loss function we find that the posterior risk is the probability of misclassification

    R(a|x) = Σ_{θ ≠ a} h(θ|x) = 1 − h(a|x).

It follows that to minimise the risk we have to maximise the posterior probability. We define θ̂map as the value of
θ that maximises h(θ|x). Observe that with the uninformative prior, θ̂map = θ̂mle .

Squared error loss function and posterior mean estimate


Squared error loss: l(θ, a) = (θ − a)2

Using the squared error loss function we find that the posterior risk is a sum of two components

R(a|x) = E((Θ − a)2 |x) = Var(Θ|x) + [E(Θ|x) − a]2 .

Since the first component is independent of a, we minimise the posterior risk by putting

θ̂pme = E(Θ|x).

Example: loaded die experiment


A possibly loaded die is rolled 18 times, giving 4 ones, 3 twos, 4 threes, 4 fours, 3 fives, and 0 sixes:

2, 1, 1, 4, 5, 3, 3, 2, 4, 1, 4, 2, 3, 4, 3, 5, 1, 5.

The parameter of interest is θ = (p1 , . . . , p6 ). The maximum likelihood estimate based on the sample counts is
    θ̂mle = (4/18, 3/18, 4/18, 4/18, 3/18, 0).

The maximum likelihood estimate assigns value zero to p6 , thereby excluding sixes in future observations. Now
take the uninformative prior distribution Dir(1, 1, 1, 1, 1, 1) and compare two Bayesian estimates

    θ̂map = (4/18, 3/18, 4/18, 4/18, 3/18, 0),    θ̂pme = (5/24, 4/24, 5/24, 5/24, 4/24, 1/24).

The latter has an advantage of assigning a positive value to p6 .
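A Matlab sketch of these estimates (an illustration only):

counts = [4 3 4 4 3 0];                    % observed die counts
apost = ones(1,6) + counts;                % posterior Dir(5, 4, 5, 5, 4, 1) with a Dir(1,...,1) prior
theta_mle = counts/sum(counts)             % (4/18, 3/18, 4/18, 4/18, 3/18, 0)
theta_map = (apost - 1)/(sum(apost) - 6)   % posterior mode, equal to the MLE here
theta_pme = apost/sum(apost)               % posterior mean, (5/24, 4/24, 5/24, 5/24, 4/24, 1/24)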

5.3 Credibility interval


Let x stand for the data in hand. For a confidence interval formula

Iθ = (a1 (x), a2 (x)),

the parameter θ is an unknown constant and the confidence interval is random:

P(a1 (X) < θ < a2 (X)) = 1 − α.

A credibility interval (or credible interval)


Jθ = (b1 (x), b2 (x))
is treated as a nonrandom interval, while θ as generated by the posterior distribution of a random variable Θ. A
credibility interval is computed from the posterior distribution

P(b1 (x) < Θ < b2 (x)|x) = 1 − α.

Example: IQ measurement
Given n = 1, we have X̄ ∼ N(µ; 100) and an exact 95% confidence interval for µ takes the form

Iµ = 130 ± 1.96 · 10 = 130 ± 19.6.

Posterior distribution of the mean is N(120.7; 69.2) and therefore a 95% credibility interval for µ is

Jµ = 120.7 ± 1.96 · √69.2 = 120.7 ± 16.3.

5.4 Bayesian hypotheses testing
We consider the case of two simple hypotheses. Choose between H0 : θ = θ0 and H1 : θ = θ1 using not only the
likelihoods of the data f (x|θ0 ), f (x|θ1 ) but also prior probabilities

P(H0 ) = π0 , P(H1 ) = π1 .

In terms of the rejection region R the decision should be taken depending on a cost function with the following
four cost values

                Decision      H0 true    H1 true
    x ∉ R       Accept H0     0          c1
    x ∈ R       Accept H1     c0         0

where c0 is the type I error cost and c1 is the type II error cost. For a given set R, the average cost is the weighted
mean of the two values c0 and c1 :

    c0 π0 P(X ∈ R|H0 ) + c1 π1 P(X ∉ R|H1 ) = c1 π1 + ∫_R [c0 π0 f(x|θ0 ) − c1 π1 f(x|θ1 )] dx.

It follows that the rejection region minimising the average cost is

R = {x : c0 π0 f (x|θ0 ) < c1 π1 f (x|θ1 )}.

Thus the optimal decision rule becomes: reject H0 for small values of the likelihood ratio, when

    f(x|θ0)/f(x|θ1) < c1 π1 /(c0 π0),

or, in other terms, for small posterior odds,

    h(θ0 |x)/h(θ1 |x) < c1 /c0 .

Example of Bayesian hypothesis testing


The defendant A charged with rape, is a male of age 37 living in the area not very far from the crime place. The
jury have to choose between two alternative hypotheses H0 : A is innocent, H1 : A is guilty.
An uninformative prior probability

    π1 = 1/200000,  so that  π0 /π1 = 200000,

takes into account the number of males who theoretically could have committed the crime without any evidence
taken into account. There were three conditionally independent pieces of evidence:
E1 : strong DNA match,
E2 : defendant A is not recognised by the victim,
E3 : an alibi supported by the girlfriend.

The reliability of these pieces of evidence was quantified as

    P(E1 |H0 ) = 1/200,000,000,  P(E1 |H1 ) = 1,  so that P(E1 |H0 )/P(E1 |H1 ) = 1/200,000,000,  evidence in favour of H1 ,
    P(E2 |H1 ) = 0.1,  P(E2 |H0 ) = 0.9,  so that P(E2 |H0 )/P(E2 |H1 ) = 9,  evidence in favour of H0 ,
    P(E3 |H1 ) = 0.25,  P(E3 |H0 ) = 0.5,  so that P(E3 |H0 )/P(E3 |H1 ) = 2,  evidence in favour of H0 .

Then the posterior odds was computed as

    P(H0 |E)/P(H1 |E) = (π0 /π1 ) · P(E|H0 )/P(E|H1 ) = (π0 /π1 ) · [P(E1 |H0 )P(E2 |H0 )P(E3 |H0 )] / [P(E1 |H1 )P(E2 |H1 )P(E3 |H1 )] = 0.018.

Thus we reject H0 if the cost values are assigned so that

    c1 /c0 = (cost for an unpunished crime)/(cost for punishing an innocent) > 0.018.

Prosecutor’s fallacy: P(H0 |E) = P(E|H0 ), which is only true if P(E) = π0 .
Example: π0 = π1 = 1/2, P(E|H0 ) ≈ 0, P(E|H1 ) ≈ 1.

5.5 Exercises
Problem 1
This is a continuation of the Problem 3 from Section 3.7.

(e) Assume uniform prior Θ ∼ U(0, 1) and find the posterior density. Plot it. What is the mode of the posterior?

Problem 2
In an ecological study of the feeding behaviour of birds, the number of hops between flights was counted for several
birds.
Number of hops j 1 2 3 4 5 6 7 8 9 10 11 12 Tot
Observed frequency Oj 48 31 20 9 6 5 4 2 1 1 2 1 130

Assume that the data were generated by a Geom(p) model and take a uniform prior for p. What is then the
posterior distribution and what are the posterior mean and standard deviation?

Problem 3
Laplace’s rule of succession. Laplace claimed that when an event happens n times in a row and never fails to
happen, the probability that the event will occur the next time is (n + 1)/(n + 2). Can you suggest a rationale for this claim?

Problem 4
This is a continuation of the Problem 2 from Section 4.8.
Let X have one of the following two distributions

X-values x1 x2 x3 x4
P(x|H0 ) 0.2 0.3 0.3 0.2
P(x|H1 ) 0.1 0.4 0.1 0.4

(c) If the prior probabilities are P(H0 ) = P(H1 ) = 1/2, which outcomes favour H0 ?
(d) What prior probabilities correspond to the decision rules with α = 0.2 and α = 0.5?

Problem 5
Suppose that under H0 , a measurement X is N(0, σ 2 ), and under H1 , the measurement X is N(1, σ 2 ). Assume
that the prior probabilities satisfy
P(H0 ) = 2P(H1 ).
The hypothesis H0 will be chosen if P(H0 |x) > P(H1 |x). For σ 2 = 0.1, 0.5, 1.0, 5.0:

(a) For what values of X = x will H0 be chosen?


(b) In the long run, what proportion of the time will H0 be chosen if H0 is true 2/3 of the time?

Problem 6
Under H0 , a random variable has a cumulative distribution function F (x) = x2 , 0 ≤ x ≤ 1, and under H1 , it has
a cumulative distribution function F (x) = x3 , 0 ≤ x ≤ 1.

(a) If the two hypotheses have equal prior probabilities, for what values of x is the posterior probability of H0
greater than that of H1 ?

6 Summarising data
6.1 Empirical probability distribution
Consider an iid-sample (x1 , . . . , xn ) from the population distribution F (x) = P(X ≤ x).

Empirical distribution function: F̂(x) = (1/n) Σ_{i=1}^{n} 1{xi ≤ x}.

For a fixed x,
F̂ (x) = p̂

is the sample proportion estimating the population proportion p = F (x).


On the other hand, for variable x the function F̂(x) is a cumulative distribution function for a random variable
Y with the discrete distribution

    P(Y = xi ) = 1/n,   i = 1, . . . , n,

assuming that all sample values xi are pairwise different. Clearly,

    E(Y) = Σ_{i=1}^{n} xi /n = x̄,

and since

    E(Y²) = Σ_{i=1}^{n} xi²/n,

we get

    Var(Y) = (1/n) Σ_{i=1}^{n} xi² − (x̄)² = ((n − 1)/n) s².
It is easy to verify that F̂(·) is a cumulative distribution function with mean x̄ and variance ((n − 1)/n) s² even if
some of the xi coincide. We call

    σ̂² = ((n − 1)/n) s² = (1/n) Σ_{i=1}^{n} (xi − x̄)² = (1/n) Σ_{i=1}^{n} xi² − (x̄)²

the empirical variance.

If the data describes life lengths, then it is more convenient to use the empirical survival function

Ŝ(x) = 1 − F̂ (x),

the proportion of the data greater than x. If the life length T has distribution function F (t) = P(T ≤ t), then its
survival function is
S(t) = P(T > t) = 1 − F (t).

Hazard function: h(t) = f(t)/S(t), where f(t) = F′(t) is the probability density function.

The hazard function is the mortality rate at age t:

    P(t < T ≤ t + δ | T ≥ t) = P(t < T ≤ t + δ)/P(T ≥ t) = (F(t + δ) − F(t))/S(t) ∼ δ · h(t),   δ → 0.

The hazard function can be viewed as the negative of the slope of the log survival function:

    h(t) = −(d/dt) ln S(t) = −(d/dt) ln(1 − F(t)).

A constant hazard rate h(t) = λ corresponds to the exponential distribution Exp(λ).

Example: Guinea pigs
Guinea pigs were randomly divided in 5 treatment groups of 72 animals each and one control group of 107 animals.
The guinea pigs in the treatment groups were infected with increasing doses of tubercle bacilli (Bjerkdal, 1960).
The survival times were recorded (note that not all the animals in the lower-dosage regimens died).

Control lifetimes

18 36 50 52 86 87 89 91 102 105 114 114 115 118 119 120 149 160 165 166 167 167 173 178 189 209 212 216
273 278 279 292 341 355 367 380 382 421 421 432 446 455 463 474 506 515 546 559 576 590 603 607 608 621 634
634 637 638 641 650 663 665 688 725 735

Dose I lifetimes

76 93 97 107 108 113 114 119 136 137 138 139 152 154 154 160 164 164 166 168 178 179 181 181 183 185
194 198 212 213 216 220 225 225 244 253 256 259 265 268 268 270 283 289 291 311 315 326 326 361 373 373 376
397 398 406 452 466 592 598

Dose II lifetimes

72 72 78 83 85 99 99 110 113 113 114 114 118 119 123 124 131 133 135 137 140 142 144 145 154 156 157 162
162 164 165 167 171 176 177 181 182 187 192 196 211 214 216 216 218 228 238 242 248 256 257 262 264 267 267
270 286 303 309 324 326 334 335 358 409 473 550

Dose III lifetimes

10 33 44 56 59 72 74 77 92 93 96 100 100 102 105 107 107 108 108 108 109 112 113 115 116 120 121 122 122
124 130 134 136 139 144 146 153 159 160 163 163 168 171 172 176 183 195 196 197 202 213 215 216 222 230 231
240 245 251 253 254 254 278 293 327 342 347 361 402 432 458 555

Dose IV lifetimes

43 45 53 56 56 57 58 66 67 73 74 79 80 80 81 81 81 82 83 83 84 88 89 91 91 92 92 97 99 99 100 100 101 102


102 102 103 104 107 108 109 113 114 118 121 123 126 128 137 138 139 144 145 147 156 162 174 178 179 184 191
198 211 214 243 249 329 380 403 511 522 598

Dose V lifetimes

12 15 22 24 24 32 32 33 34 38 38 43 44 48 52 53 54 54 55 56 57 58 58 59 60 60 60 60 61 62 63 65 65 67 68
70 70 72 73 75 76 76 81 83 84 85 87 91 95 96 98 99 109 110 121 127 129 131 143 146 146 175 175 211 233 258 258
263 297 341 341 376

It is difficult to compare the groups just looking at numbers. The data is illuminated by two graphs: one for
the survival functions and the other for the log-survival functions.

[Figure: empirical survival functions (left) and log survival functions (right) for the control and dosage groups.]

The negative slopes of the curves to the right illustrate the hazard rates for different groups.
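A Matlab sketch of how such plots can be produced for one group (an illustration only; the vector t is assumed to contain the recorded lifetimes of that group):

t = sort(t); n = length(t);
S = 1 - (1:n)/n;                                 % empirical survival function just after each recorded time
subplot(1,2,1), stairs(t, S)                     % empirical survival function
subplot(1,2,2), stairs(t(1:n-1), log(S(1:n-1)))  % log survival function, last point dropped since log(0) is undefined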

6.2 Density estimation
A histogram displays the observed counts

    Oj = Σ_{i=1}^{n} 1{xi ∈ cellj}

over the adjacent cells of width h. The choice of a balanced width h is important: smaller h give ragged profiles,
larger h give obscured profiles. Put

    fh(x) = Oj /(nh),   for x ∈ cellj ,

and notice that

    ∫ fh(x) dx = (h/(nh)) Σj Oj = 1.

The scaled histogram given by the graph of fh(x) is a density estimate. A kernel density estimate with bandwidth
h produces a smooth curve

    fh(x) = (1/(nh)) Σ_{i=1}^{n} φ((x − xi)/h),   where φ(x) = (1/√(2π)) e^{−x²/2}.
Example: male heights
Let hm stand for the column of 24 male heights. For a given bandwidth h, the following Matlab code produces a
plot of the kernel density estimate (hm and h are assumed to be defined in the workspace):

x=160:0.1:210; L=length(x);                       % grid of points where the estimate is evaluated
f=normpdf((ones(24,1)*x - hm*ones(1,L))/h);       % phi((x - xi)/h) for every data point and grid point
fh=sum(f)/(24*h); plot(x,fh)                      % average the kernels and plot the density estimate

The stem-and-leaf plot for the 24 male heights indicates the shape of the distribution and retains the full numerical
information:

17:056678899
18:0000112346
19:229

6.3 Quantiles and QQ-plots


The inverse of the cumulative distribution function F (x) is called the quantile function

Q(p) = F −1 (p), 0 < p < 1.

The quantile function Φ−1 for the standard normal distribution Φ is called the probit function (from probability
unit).

For a given distribution F and 0 ≤ p ≤ 1, the p-quantile is xp = Q(p).

Special quantiles:

median m = x0.5 = Q(0.5),


lower quartile x0.25 = Q(0.25),
upper quartile x0.75 = Q(0.75).

Quantile xp cuts off proportion p of smallest values of a random variable X with P(X ≤ x) = F (x):

P(X ≤ xp ) = F (xp ) = F (Q(p)) = p.


The ordered sample values
x(1) ≤ x(2) ≤ . . . ≤ x(n)
are the jump points for the empirical distribution function. In the continuous case with

x(1) < x(2) < . . . < x(n) ,

we have

    F̂(x(k) ) = k/n,   F̂(x(k) −) = (k − 1)/n.
This observation leads to the following definition of empirical quantiles

x(k) is called the empirical ((k − 0.5)/n)-quantile

Suppose we have two independent samples (x1 , . . . , xn ) and (y1 , . . . , yn ) of equal size n which are taken from
two population distributions FX and FY . A relevant null hypothesis H0 : FX ≡ FY is equivalent to H0 : QX ≡ QY .
It can be tested graphically using a QQ-plot.

QQ-plot is a scatter plot of n dots with coordinates (x(k) , y(k) ).

If such a QQ-plot closely follows the 45 degree line, that is when we observe almost equal quantiles, we can claim
that H0 : FX ≡ FY is true.
More generally, if the QQ-plot approximates a straight line y = a + bx, then we take this as evidence for the
linear relation
Y = a + bX in distribution.
Indeed, the latter claim means that for all x,

FX (x) = FY (a + bx),

so that putting QX (p) = x, we get QY (p) = a + bx, which yields

QY (p) = a + bQX (p), 0 < p < 1.
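A Matlab sketch of a QQ-plot for two samples of equal size (an illustration only; x and y are assumed to be given):

xs = sort(x); ys = sort(y);     % empirical quantiles of the two samples
plot(xs, ys, 'o')               % QQ-plot: k-th smallest x against k-th smallest y
hold on
xl = xlim; plot(xl, xl)         % 45 degree reference line
hold off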

6.4 Testing normality


The normality hypothesis H0 states that the population distribution for an iid-sample (x1 , . . . , xn ) is normal
N(µ, σ 2 ) with unspecified parameter values. A QQ-plot used for testing this hypothesis is called a normal probability
plot. The normal probability plot is the scatter plot for

    (x(1) , y1 ), . . . , (x(n) , yn ),   where yk = Φ^{−1}((k − 0.5)/n).

If the normal probability plot is close to a straight line y = a + bx, then we accept H0 and use the point estimates
µ̂ = a, σ̂ = b. If normality does not hold, draw a straight line via empirical lower and upper quartiles to detect a
light tails profile or heavy tails profile.
Another simple way of testing normality relies on two summary statistics: skewness and kurtosis.

Coefficient of skewness: β1 = E[(X − µ)³]/σ³,   sample skewness: b1 = (1/(n s³)) Σ_{i=1}^{n} (xi − x̄)³.

Depending on the sign of the coefficient of skewness we distinguish between symmetric (β1 = 0), skewed to the
right (β1 > 0), and skewed to the left (β1 < 0) distributions.

Given that b1 is close to zero, kurtosis can be used as an indication of the curve profile to be close to that of
the normal distribution.
Kurtosis: β2 = E[(X − µ)⁴]/σ⁴,   sample kurtosis: b2 = (1/(n s⁴)) Σ_{i=1}^{n} (xi − x̄)⁴.

For the normal distribution, kurtosis coefficient takes value β2 = 3. Leptokurtic distribution: β2 > 3 (heavy tails).
Platykurtic distribution: β2 < 3 (light tails).
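A Matlab sketch computing the sample skewness, the sample kurtosis and a normal probability plot for a data vector x (an illustration only):

n = length(x); s = std(x);
b1 = sum((x - mean(x)).^3)/(n*s^3)      % sample skewness
b2 = sum((x - mean(x)).^4)/(n*s^4)      % sample kurtosis
yk = norminv(((1:n) - 0.5)/n);          % standard normal quantiles
plot(sort(x), yk, 'o')                  % normal probability plot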

Example: male heights


Summary statistics:
x̄ = 181.46, m̂ = 180, b1 = 1.05, b2 = 4.31.
Good to know: the distribution of the heights of adult males is positively skewed, so that m < µ, or in other terms,

P(X < µ) > P(X ≤ m) ≥ 0.50,

implying that more than half of heights are below the average.

The gamma distribution Gam(α, λ) is positively skewed, β1 = 2/√α, and leptokurtic, β2 = 3 + 6/α.

6.5 Measures of location
The central point of a distribution can be defined in terms of various measures of location, for example, as the
population mean µ or the median m. The population median m is estimated by the sample median.

Sample median: m̂ = x(k) if n = 2k − 1, and m̂ = (x(k) + x(k+1) )/2 if n = 2k.

The sample mean x̄ is sensitive to outliers, while the sample median m̂ is not. Therefore, we say that m̂ is a robust
estimator (robust to outliers).

Confidence interval for the median


Consider an iid-sample (x1 , . . . , xn ) without assuming any parametric model for the unknown population distribu-
tion. Let
    y = Σ_{i=1}^{n} 1{xi ≤ m}

be the number of observations below the true median. Then

    pk = P(X(k) < m < X(n−k+1) ) = P(k ≤ Y ≤ n − k) = Σ_{i=k}^{n−k} C(n, i) (1/2)^n

can be computed from the symmetric binomial distribution Y ∼ Bin(n, 0.5).


This yields the following non-parametric formula for an exact confidence interval for the median.

Im = (x(k) , x(n−k+1) ) is a 100 · pk % confidence interval for the population median m

For example, if n = 25, then from the table below we find that (X(8) , X(18) ) gives a 95.7% confidence interval for
the median.
k 6 7 8 9 10 11 12
100 · pk 99.6 98.6 95.7 89.2 77.0 57.6 31.0
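A Matlab sketch of how the coverage probabilities pk in this table can be computed (an illustration only):

n = 25; k = 6:12;
pk = binocdf(n - k, n, 0.5) - binocdf(k - 1, n, 0.5)   % P(k <= Y <= n-k) for Y ~ Bin(25, 0.5)
% for k = 8 this gives 0.957, so (x_(8), x_(18)) is a 95.7% confidence interval for the median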

Sign test
The sign test is a non-parametric test of H0 : m = m0 against the two-sided alternative H1 : m ≠ m0 .
The sign test statistic

    y0 = Σ_{i=1}^{n} 1{xi ≤ m0}

counts the number of observations below the null hypothesis value. It has a simple null distribution, Y0 ∼ Bin(n, 0.5)
under H0 . Connection to the above confidence interval formula: reject H0 if m0 falls outside the corresponding
confidence interval Im = (x(k) , x(n−k+1) ).
Trimmed means
A trimmed mean is a robust measure of location computed from a central portion of the data.

α-trimmed mean x̄α = sample mean computed without the nα/2 smallest and the nα/2 largest observations

Example: male heights


Ignoring 20% of largest and 20% of smallest observations we compute x̄0.4 =180.36. The trimmed mean is between
x̄ = 181.46 and m̂ = 180.

When summarizing data compute several measures of location and compare the results.

Nonparametric bootstrap
Substitute the population distribution by the empirical distribution. Then a bootstrap sample is obtained by
resampling with replacement from the original sample

x1 , . . . , x n .

Generate many bootstrap samples of size n to approximate the sampling distribution for an estimator like trimmed
mean, sample median, or s.
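A Matlab sketch of the nonparametric bootstrap for, say, the standard error of the sample median (an illustration only; x is the original sample and the number of bootstrap samples is an arbitrary choice):

n = length(x); B = 1000;
med = zeros(B,1);
for b = 1:B
    idx = randi(n, n, 1);            % resample indices with replacement
    med(b) = median(x(idx));         % statistic recomputed on the bootstrap sample
end
se_median = std(med)                 % bootstrap estimate of the standard error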

6.6 Measures of dispersion
Sample variance s2 and sample range
R = x(n) − x(1)
are sensitive to outliers. Two robust measures of dispersion:
interquartile range IQR = x0.75 − x0.25 is the difference between the upper and lower quartiles,

MAD = Median of Absolute values of Deviations from sample median |xi − m̂|, i = 1, . . . , n.

Three estimates of σ for the normal distribution N(µ, σ²) model: s, IQR/1.35, MAD/0.675.

Indeed, under the normality assumption we have

Φ(0.675) = 0.75, Φ−1 (0.75) = 0.675,

and
IQR = (µ + σΦ−1 (0.75)) − (µ + σΦ−1 (0.25)) = 2σΦ−1 (0.75) = 1.35σ,

MAD = 0.675σ, since P(|X − µ| ≤ 0.675σ) = (Φ(0.675) − 0.5) · 2 = 0.5.
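A Matlab sketch comparing the three estimates of σ for a data vector x (an illustration only):

s = std(x)                                                  % usual sample standard deviation
sigma_iqr = (quantile(x, 0.75) - quantile(x, 0.25))/1.35    % IQR-based estimate
sigma_mad = median(abs(x - median(x)))/0.675                % MAD-based estimate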

Boxplots
Boxplots are convenient to use for comparing different samples. A boxplot is built of the following components:
box, whiskers and outliers.
Box
• upper edge of the box = upper quartile (UQ)
• box center = median
• lower edge of the box = lower quartile (LQ)
Whiskers
• upper whisker end = {maximal data point not
exceeding UQ + 1.5 × IQR}
• lower whisker end = {min data point ≥ LQ – 1.5 × IQR}
Outliers
• upper dots = {data points ≥ UQ + 1.5 × IQR}
• lower dots = {data points ≤ LQ – 1.5 × IQR}

6.7 Exercises
Problem 1
Suppose that (X1 , . . . , Xn ) are independent uniform U(0, 1) random variables.

(a) Sketch the population distribution function F (x) and the standard deviation of the empirical distribution
function F̂ (x).
(b) Generate many samples of size 16. For each sample, plot the difference F (x) − F̂ (x) and relate what you
see to your answer to (a).

Problem 2
Let (X1 , . . . , Xn ) be independent random variables with the same distribution F , and let F̂ denote the empirical
distribution function. Show that for u < v,
    Cov(F̂(u), F̂(v)) = (1/n) F(u)(1 − F(v)).

It follows that F̂(u) and F̂(v) are positively correlated: if F̂(u) overshoots F(u), then F̂(v) will tend to overshoot
F (v).

Problem 3
A random sample x1 , . . . , xn , n = 59:

14.27 14.80 12.28 17.09 15.10 12.92 15.56 15.38 15.15 13.98
14.90 15.91 14.52 15.63 13.83 13.66 13.98 14.47 14.65 14.73
15.18 14.49 14.56 15.03 15.40 14.68 13.33 14.41 14.19 15.21
14.75 14.41 14.04 13.68 15.31 14.32 13.64 14.77 14.30 14.62
14.10 15.47 13.73 13.65 15.02 14.01 14.92 15.47 13.75 14.87
15.28 14.43 13.96 14.57 15.49 15.13 14.23 14.44 14.57

are the percentages of hydrocarbons in each sample of beeswax.

(a) Plot the empirical distribution function, a histogram, and a normal probability plot. Find the 0.9, 0.75, 0.5,
0.25, and 0.1 quantiles. Does the distribution appear Gaussian?
(b) The average percentage of hydrocarbons in a synthetic wax is 85%. Suppose that beeswax was diluted with
1% synthetic wax. Could this be detected? What about 3% and 5% dilution?

Problem 4
Calculate the hazard function for the Weibull distribution
    F(t) = 1 − e^{−αt^β},   t ≥ 0,

where α and β are two positive parameters.


(Waloddi Weibull was a Swedish engineer, scientist, and mathematician.)

Problem 5
Give an example of a distribution with an increasing failure rate. Give an example of a distribution with a decreasing
failure rate.

Problem 6
Of the 26 measurements of the heat of sublimation of platinum, 5 are outliers.

136.3 136.6 135.8 135.4 134.7 135 134.1 143.3 147.8 148.8
134.8 135.2 134.9 146.5 141.2 135.4 134.8 135.8 135 133.7
134.4 134.9 134.8 134.5 134.3 135.2

Stem and leaf display for n = 26 observations including k = 5 outliers:

133:7
134:134
134:5788899
135:002244
135:88
136:36
High: 141.2, 143.3, 146.5, 147.8, 148.8

Let N be the number of outliers in a non-parametric bootstrap sample.

(a) Explain why the distribution of N is binomial.


(b) Find P(N ≥ 10).
(c) In 1000 bootstrap samples, how many would you expect to contain 10 or more of these outliers?
(d) What is the probability that a bootstrap sample is composed entirely of these outliers?

Problem 7
For the data in Problem 3.

(a) Find the mean, median, and 10% and 20% trimmed means.

(b) Find an approximate 90% confidence interval for the mean.
(c) Find a confidence interval with coverage near 90% for the median.
(d) Use the bootstrap to find approximate standard errors of the trimmed means.
(f) Find and compare the standard deviation of the measurements, the IQR, and the MAD.
(g) Use the bootstrap to approximate the standard error of the upper quartile.

Problem 8
Olson, Simpson, and Eden (1975) discuss the analysis of data obtained from a cloud seeding experiment. The
following data present the rainfall from 26 seeded and 26 control clouds.

Seeded clouds
129.6, 31.4, 2745.6, 489.1, 430, 302.8, 119, 4.1, 92.4, 17.5,
200.7, 274.7, 274.7, 7.7, 1656, 978, 198.6, 703.4, 1697.8, 334.1,
118.3, 255, 115.3, 242.5, 32.7, 40.6
Control clouds
26.1, 26.3, 87, 95, 372.4, .01, 17.3, 24.4, 11.5, 321.2,
68.5, 81.5, 47.3, 28.6, 830.1, 345.5, 1202.6, 36.6, 4.9, 4.9,
41.1, 29, 163, 244.3, 147.8, 21.7

Make a QQ-plot for rainfall versus rainfall and log rainfall versus log rainfall. What do these plots suggest about
the effect, if any, of seeding?

7 Comparing two samples


Suppose we wish to compare two population distributions with means and standard deviations (µ1 , σ1 ) and (µ2 , σ2 )
based on two iid-samples (x1 , . . . , xn ) and (y1 , . . . , ym ) from these two populations. We start by computing two
sample means x̄, ȳ, and their standard errors
    sx̄ = s1 /√n,   s1² = (1/(n − 1)) Σ_{i=1}^{n} (xi − x̄)²,
    sȳ = s2 /√m,   s2² = (1/(m − 1)) Σ_{i=1}^{m} (yi − ȳ)².

The difference (x̄ − ȳ) is an unbiased estimate of µ1 − µ2 . We are interested in finding the standard error of x̄ − ȳ
and an interval estimate for the difference µ1 − µ2 , as well as testing the null hypothesis of equality

H0 : µ1 = µ2 .

Two main settings will be addressed: two independent samples and paired samples.

7.1 Two independent samples: comparing population means


If (X1 , . . . , Xn ) is independent from (Y1 , . . . , Ym ), then

    Var(X̄ − Ȳ) = σ²x̄ + σ²ȳ = σ1²/n + σ2²/m,

and

    s²x̄−ȳ = s²x̄ + s²ȳ = s1²/n + s2²/m

gives an unbiased estimate of Var(X̄ − Ȳ). Therefore, sx̄−ȳ will be called the (estimated) standard error of the
point estimate x̄ − ȳ.

Large sample test for the difference between two means


If n and m are large, we can use a normal approximation

    [(X̄ − Ȳ) − (µ1 − µ2 )] / √(S²X̄ + S²Ȳ) ≈ N(0, 1).

The hypothesis
H0 : µ1 = µ2
is tested using the test statistic
    z = (x̄ − ȳ)/√(s²x̄ + s²ȳ),

whose null distribution is approximated by the standard normal N(0, 1).

Approximate confidence interval formula: Iµ1−µ2 = x̄ − ȳ ± zα/2 · √(s²x̄ + s²ȳ).

Two-sample t-test
The key assumption of the two-sample t-test:

two normal population distributions X ∼ N(µ1 , σ 2 ), Y ∼ N(µ2 , σ 2 ) have equal variances.

Given σ1² = σ2² = σ², the pooled sample variance

    sp² = [Σ_{i=1}^{n} (xi − x̄)² + Σ_{i=1}^{m} (yi − ȳ)²] / (n + m − 2) = [(n − 1)/(n + m − 2)] s1² + [(m − 1)/(n + m − 2)] s2²

is an unbiased estimate of the variance, with

    E(Sp²) = [(n − 1)/(n + m − 2)] E(S1²) + [(m − 1)/(n + m − 2)] E(S2²) = σ².

In the equal variance two-sample setting, the variance

    Var(X̄ − Ȳ) = σ² · (n + m)/(nm)

has the following unbiased estimate:

    s²x̄−ȳ = sp² · (n + m)/(nm).

Exact distribution: [(X̄ − Ȳ) − (µ1 − µ2 )]/Sp · √(nm/(n + m)) ∼ tn+m−2 .

Exact confidence interval formula:

    Iµ1−µ2 = x̄ − ȳ ± tn+m−2 (α/2) · sp · √((n + m)/(nm)).

The two-sample t-test uses the test statistic t = [(x̄ − ȳ)/sp ] · √(nm/(n + m)) for testing H0 : µ1 = µ2 . The null
distribution of the test statistic is T ∼ tn+m−2 .

Example: iron retention


Percentages of Fe2+ and Fe3+ retained by mice at concentration 1.2 millimolar. (The two samples are taken
out from a larger dataset given in Section 8.5.) Summary of the data:

Fe2+ : n = 18, x̄ = 9.63, s1 = 6.69, sx̄ = 1.58


Fe3+ : m = 18, ȳ = 8.20, s2 = 5.45, sȳ = 1.28

The graphs below show that the population distributions are not normal. Therefore, to test H0 : µ1 = µ2 we use
the large sample test. Using the observed value
    zobs = (x̄ − ȳ)/√(s²x̄ + s²ȳ) = 0.7,

and applying the normal distribution table we find an approximate two-sided p-value = 0.48.

Left panel: boxplots for percentages of Fe2+ (left) and Fe3+ (right). Right panel: two normal probability plots.

After the log transformation the data look closer to normally distributed, as seen from the graphs below. For
the transformed data, we get

    n = 18, x̄′ = 2.09, s1′ = 0.659, sx̄′ = 0.155,
    m = 18, ȳ′ = 1.90, s2′ = 0.574, sȳ′ = 0.135.
Two sample t-test for the transformed data with

tobs = 0.917, df = 34, p-value = 0.366,

also results in non-significant difference.

We return to this example in Section 8.5.

Boxplots and normal probability plots for logs of percentages.
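A Matlab sketch of the two tests used in this example (an illustration only; x and y are assumed to hold the Fe2+ and Fe3+ percentages):

% large-sample z-test on the raw data
z = (mean(x) - mean(y))/sqrt(var(x)/length(x) + var(y)/length(y));
p_z = 2*(1 - normcdf(abs(z)))                 % about 0.48
% two-sample t-test with pooled variance on the log-transformed data
lx = log(x); ly = log(y); n = length(lx); m = length(ly);
sp2 = ((n-1)*var(lx) + (m-1)*var(ly))/(n + m - 2);
t = (mean(lx) - mean(ly))/sqrt(sp2*(1/n + 1/m));
p_t = 2*(1 - tcdf(abs(t), n + m - 2))         % about 0.37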

Rank sum test


The rank sum test is a nonparametric test for two independent samples, which does not assume normality of
population distributions.
Assume continuous population distributions F1 and F2 , and consider

H0 : F1 = F2 against H1 : F1 ≠ F2 .

The rank sum test procedure:


pool the samples and replace the data values by their ranks 1, 2, . . . , n + m, starting from the smallest
sample value to the largest, and then compute two test statistics r1 = sum of the ranks of x-observations,
and r2 = sum of y-ranks.
Clearly,

    r1 + r2 = 1 + 2 + . . . + (n + m) = (n + m)(n + m + 1)/2.
The null distributions for R1 and R2 depend only on the sample sizes n and m.

For n ≥ 10, m ≥ 10 apply the normal approximation for the null distributions of R1 and R2 with

    E(R1 ) = n(n + m + 1)/2,   E(R2 ) = m(n + m + 1)/2,   Var(R1 ) = Var(R2 ) = mn(n + m + 1)/12.

Example: in class experiment


Height distributions for females F1 , and males F2 . For n = m = 3, compute r1 , r2 and one-sided p-value.
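A Matlab sketch of the rank sum calculation, together with the built-in Wilcoxon rank sum test (an illustration only; x and y are the two samples):

data = [x(:); y(:)];
r = tiedrank(data);                            % ranks of the pooled sample, ties get average ranks
n = length(x); m = length(y);
r1 = sum(r(1:n))                               % rank sum of the x-observations
z = (r1 - n*(n+m+1)/2)/sqrt(m*n*(n+m+1)/12);   % normal approximation of the null distribution
p_approx = 2*(1 - normcdf(abs(z)))
p_builtin = ranksum(x, y)                      % built-in two-sided rank sum test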

7.2 Two independent samples: comparing population proportions


For the binomial model X ∼ Bin(n, p1 ), Y ∼ Bin(m, p2 ), two independently generated values (x, y) give sample
proportions
    p̂1 = x/n,   p̂2 = y/m,

which are unbiased estimates of p1 , p2 and have standard errors

    sp̂1 = √(p̂1 (1 − p̂1 )/(n − 1)),   sp̂2 = √(p̂2 (1 − p̂2 )/(m − 1)).

Large sample test for two proportions


If the sample sizes m and n are large, then an approximate 95% confidence interval for the difference p1 − p2 is
given by

    Ip1−p2 = p̂1 − p̂2 ± 1.96 · √(p̂1 (1 − p̂1 )/(n − 1) + p̂2 (1 − p̂2 )/(m − 1)).

With help of this formula we can test the null hypothesis of equality

H0 : p 1 = p 2 .

Example: opinion polls


Consider two consecutive monthly poll results p̂1 and p̂2 with n ≈ m ≈ 5000 interviews. A change in support to a
major political party from p̂1 to p̂2 , with both numbers being close to 40%, is deemed significant if
    |p̂1 − p̂2 | > 1.96 · √(2 · 0.4 · 0.6/5000) ≈ 1.9%.

This should be compared with the one-sample hypothesis testing

H0 : p = 0.4 vs H1 : p ≠ 0.4.

The approximate 95% confidence interval for p is


    Ip = p̂ ± 1.96 · √(p̂(1 − p̂)/(n − 1)),

and if p̂ ≈ 0.4, then the difference is significant if


    |p̂ − 0.4| > 1.96 · √(0.4 · 0.6/5000) ≈ 1.3%.
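A Matlab sketch of the two-proportion confidence interval (an illustration only; x, n, y, m are the success counts and sample sizes):

p1 = x/n; p2 = y/m;
se = sqrt(p1*(1 - p1)/(n - 1) + p2*(1 - p2)/(m - 1));
I = (p1 - p2) + 1.96*se*[-1 1]       % approximate 95% confidence interval for p1 - p2
% H0: p1 = p2 is rejected at the 5% level when the interval does not cover zero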

Fisher’s exact test


Fisher’s exact test deals with the null hypothesis

H0 : p 1 = p 2 ,

when the sample sizes m and n are not sufficiently large for applying normal approximations for the binomial
distributions. We summarise the data of two independent samples as a 2 × 2 table of sample counts

Sample 1 Sample 2 Total


Number of successes x y Np = x + y
Number of failures n−x m−y Nq = n + m − x − y
Sample sizes n m N =n+m

46
Fisher’s idea for this case was to use X as a test statistic conditionally on the total number of successes x + y.
Under the null hypothesis, the conditional distribution of X is hypergeometric

X ∼ Hg(N, n, p)

with parameters (N, n, p) defined by


    N = n + m,   p = (x + y)/N.

This is a discrete distribution with probability mass function

    P(X = x) = C(Np, x) C(Nq, n − x) / C(N, n),   max(0, n − Nq) ≤ x ≤ min(n, Np).

This null distribution should be used for determining the rejection rule of the Fisher test.

Example: gender bias


The following data were collected after 48 copies of the same file with 24 files labeled as “male” and the other 24
labeled as “female” were sent to 48 experts.
Male Female Total
Promote 21 14 35
Hold file 3 10 13
Total 24 24 48
Each expert decision had two possible outcomes: promote or hold file. We wish to test

H0 : p1 = p2 no gender bias,

against
H1 : p1 > p2 males are favoured.
Fisher’s test would reject H0 in favour of the one-sided alternative H1 for large values of x under the null distribution
    P(X = x) = C(35, x) C(13, 24 − x) / C(48, 24) = C(35, 35 − x) C(13, x − 11) / C(48, 24),   11 ≤ x ≤ 24.

This is a symmetric distribution with

    P(X ≤ 14) = P(X ≥ 21) = 0.025,

so that the one-sided p-value is 0.025 and the two-sided p-value is 0.05. We conclude that there is significant evidence
of gender bias, and reject the null hypothesis.
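A Matlab sketch of the exact null distribution and p-values for this table (an illustration only, using the hypergeometric probability mass function hygepdf):

N = 48; K = 35; n = 24;                      % decisions in total, promotions in total, "male" files
x = 11:24;                                   % possible numbers of promoted male files
pmf = hygepdf(x, N, K, n);                   % hypergeometric null distribution of X
p_onesided = sum(pmf(x >= 21))               % P(X >= 21), about 0.025
p_twosided = 2*p_onesided                    % about 0.05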

7.3 Paired samples


Examples of paired observations:
• different drugs for two patients matched by age, sex,
• a fruit weighed before and after shipment,
• two types of tires tested on the same car.
Two paired samples form a vector of iid-pairs

(x1 , y1 ), . . . , (xn , yn ).

As before, our main question is whether the difference µ1 − µ2 is statistically significant. To this end, we turn to the
one-dimensional iid-sample of differences

(d1 , . . . , dn ), di = xi − yi .

The population mean difference µ1 − µ2 is estimated by d̄ = x̄ − ȳ. This is an unbiased estimate whose variance
takes into account the dependence between X and Y . Observe that

    Var(X̄ − Ȳ) = Var(X̄) + Var(Ȳ) − 2Cov(X̄, Ȳ)
               = σ1²/n + σ2²/n − (2/n²) Cov(X1 + . . . + Xn , Y1 + . . . + Yn ).

Since Xi and Yj are independent for i ≠ j, we get

    Cov(X1 + . . . + Xn , Y1 + . . . + Yn ) = Cov(X1 , Y1 ) + . . . + Cov(Xn , Yn ) = nCov(X, Y) = nσ1 σ2 ρ,

where

    ρ = Cov(X, Y)/(σ1 σ2 )

is the correlation coefficient for the joint population distribution of (X, Y). Thus

    Var(X̄ − Ȳ) = (1/n)(σ1² + σ2² − 2σ1 σ2 ρ).

If the samples are independent and have equal sizes, then ρ = 0 and

    Var(X̄ − Ȳ) = Var(X̄) + Var(Ȳ) = (1/n)(σ1² + σ2²).

Importantly, if ρ > 0, then

    Var(X̄ − Ȳ) < Var(X̄) + Var(Ȳ),

which demonstrates that paired sampling with ρ > 0 ensures a smaller standard error for the estimate x̄ − ȳ
compared with the two independent samples case.

Smoking and platelet aggregation


To study the effect of cigarette smoking on platelet aggregation, Levine (1973) drew blood samples from 11 indi-
viduals before and after they smoked a cigarette and measured the extent to which the blood platelets aggregated.
Platelets are involved in the formation of blood clots, and it is known that smokers suffer more often from disorders
involving blood clots than do nonsmokers. The data are shown in the following table, which gives the maximum
percentage of all the platelets that aggregated after being exposed to a stimulus.
Before smoking yi After smoking xi di = xi − yi Rank of |di | Signed rank
25 27 2 2 +2
25 29 4 3.5 +3.5
27 37 10 6 +6
44 56 12 7 +7
30 46 16 10 +10
67 82 15 8.5 +8.5
53 57 4 3.5 +3.5
53 80 27 11 +11
52 61 9 5 +5
60 59 –1 1 –1
28 43 15 8.5 +8.5
We have n = 11 pairs of measurements of individuals. Using the data we estimate correlation as ρ ≈ 0.90.
Assuming that the population distribution for differences D ∼ N(µ, σ 2 ) is normal with µ = µ1 − µ2 , we apply the
one-sample t-test for
H0 : µ1 − µ2 = 0 against H1 : µ1 − µ2 ≠ 0.
The observed test statistic value

    tobs = d̄/sd̄ = 10.27/2.40 = 4.28

reveals a small two-sided p-value, computed in Matlab as 2*(1 - tcdf(4.28,10)) = 0.0016, so smoking has a significant
effect on platelet aggregation.
Without the assumption of normality, having an iid-sample of differences, we can apply the non-parametric sign test
for a pair of hypotheses about the median m = mD of the difference D:

    H0 : m = 0 against H1 : m ≠ 0.

Test statistics: either

    y+ = Σ 1{di > 0}   or   y− = Σ 1{di < 0}.

Assuming that the distribution of D is continuous, we find that both Y+ and Y− have the null distribution Bin(n, 0.5).
To take care of ties di = 0:
• you either discard the tied observations and reduce n respectively,
• or dissolve the ties by randomisation.
For the data on platelet aggregation, the observed test statistic is y− = 1. Thus a two-sided p-value of the sign
test is
    p-value = 2[(0.5)^11 + 11 · (0.5)^11] = 0.012.
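A Matlab sketch of this sign test (an illustration only; d holds the differences from the table):

d = [2 4 10 12 16 15 4 27 9 -1 15];          % differences, after smoking minus before
n = length(d);
yminus = sum(d < 0)                          % observed number of negative differences, here 1
pvalue = 2*binocdf(yminus, n, 0.5)           % two-sided p-value, about 0.012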

Signed rank test
The sign test disregards a lot of information in the data taking into account only the sign of the differences. The
signed rank test pays attention to sizes of positive and negative differences. This is a non-parametric test for the
null hypothesis of no difference

H0 : distribution of D is symmetric about its median m = 0.

The null hypothesis consists of two parts: symmetry of the distribution and m = 0. Test statistics: either
    w+ = Σ_{i=1}^{n} rank(|di |) · 1{di > 0}

or

    w− = Σ_{i=1}^{n} rank(|di |) · 1{di < 0}.

Assuming no ties, that is di ≠ 0 for all i, we always get

    w+ + w− = n(n + 1)/2.
The null distributions of W+ and W− are the same and tabulated for smaller values of n. For n ≥ 20, one can use
the normal approximation of the null distribution with mean and variance

    µW = n(n + 1)/4,   σ²W = n(n + 1)(2n + 1)/24.

The signed rank test uses more data information than the sign test
but requires symmetric distribution of differences.

Example (platelet aggregation)


Observed value of the test statistic w− = 1. It gives a two-sided p-value p = 0.002. The null hypothesis can be
rejected for two reasons, therefore it is important to check the symmetry property of the distribution of differences
(di ) before we conclude that there is a significant treatment effect.
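A Matlab sketch of the signed rank statistic for the same differences, together with the built-in signrank test (an illustration only):

d = [2 4 10 12 16 15 4 27 9 -1 15];
r = tiedrank(abs(d));                 % ranks of |d_i|, ties get average ranks
wminus = sum(r(d < 0))                % observed value of the test statistic, here 1
p = signrank(d)                       % built-in two-sided signed rank test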

7.4 External and confounding factors


Double-blind, randomised controlled experiments are used to balance out such external factors as

• placebo effect,

• time factor,

• background variables like temperature,

• location factor.

Example: portocaval shunt


Portocaval shunt is an operation used to lower blood pressure in the liver. People believed in its high efficiency
until the controlled experiments were performed.

Enthusiasm level Marked Moderate None


No controls 24 7 1
Nonrandomized controls 10 3 2
Randomized controls 0 1 3

Example: platelet aggregation


Further parts of the experimental design: control group 1 smoked lettuce cigarettes, control group 2 “smoked”
unlit cigarettes.

Simpson’s paradox
Hospital A has higher overall death rate than hospital B. However, if we split the data in two parts, patients in
good (+) and bad (−) conditions, for both parts hospital A performs better.

Hospital: A B A+ B+ A– B–
Died 63 16 6 8 57 8
Survived 2037 784 594 592 1443 192
Total 2100 800 600 600 1500 200
Death Rate .030 .020 .010 .013 .038 .040
Here, the external factor, patient condition, is an example of a confounding factor:
Hospital performance ← Patient condition → Death rate

7.5 Exercises
Problem 1
Four random numbers generated from a normal distribution
x1 = 1.1650, x2 = 0.6268, x3 = 0.0751, x4 = 0.3516,
along with five random numbers with the same variance σ² but perhaps a different mean
y1 = 0.3035, y2 = 2.6961, y3 = 1.0591, y4 = 2.7971, y5 = 1.2641.
(a) What do you think the means of the random normal number generators were? What do you think the
difference of the means was?
(b) What do you think the variance of the random number generator was?
(c) What is the estimated standard error of your estimate of the difference of the means?
(d) Form a 90% confidence interval for the difference of the means.
(e) In this situation, is it more appropriate to use a one-sided test or a two-sided test of the equality of the
means?
(f) What is the p-value of a two-sided test of the null hypothesis of equal means?
(g) Would the hypothesis that the means were the same versus a two-sided alternative be rejected at the
significance level α = 0.1?
(h) Suppose you know that the variance of the normal distribution was σ 2 = 1. How would your answers to
the preceding questions change?

Problem 2
In the ”two independent samples” setting we have two ways of estimating the variance of X̄ − Ȳ :
(a) s²_p (1/n + 1/m), if σx = σy,

(b) s²_x/n + s²_y/m, without the assumption of equal variances.
Show that if m = n, then these two estimates are identical.

Problem 3
An experiment of the efficacy of a drug for reducing high blood pressure is performed using four subjects in the
following way:
two of the subjects are chosen at random for the control group and two for the treatment group.
During the course of a treatment with the drug, the blood pressure of each of the subjects in the treatment group
is measured for ten consecutive days, as is the blood pressure of each of the subjects in the control group.

(a) In order to test whether the treatment has an effect, do you think it is appropriate to use the two-sample t
test with n = m = 20?
(b) Do you think it is appropriate to use the rank sum test?

Problem 4
Let x1 , . . . , x25 be an iid-sample drawn from N(0.3, 1). Consider testing at α = 0.05
H0 : µ = 0, H1 : µ > 0.
Compare
(a) the power of the sign test , and
(b) the power of the test based on the normal theory assuming that σ is known.

Problem 5
Suppose that n measurements are to be taken under a treatment condition and another n measurements are to be
taken independently under a control condition. It is thought that the standard deviation of a single observation is
about 10 under both conditions. How large should n be so that a 95% confidence interval for the mean difference
has a width of 2? Use the normal distribution rather than the t-distribution, since n will turn out to be rather
large.

Problem 6
Data: millions of cycles until failure for two types of engine bearings

Type I Type II
3.03 3.19
5.53 4.26
5.60 4.47
9.30 4.53
9.92 4.67
12.51 4.69
12.95 6.79
15.21 9.37
16.04 12.75
16.84 12.78

(a) Use normal theory to test the null hypothesis of no difference against the two-sided alternative

H0 : µx = µy , H1 : µx 6= µy .

(b) Test the hypothesis that there is no difference between the two types of bearing using a nonparametric
method.
(c) Which of the methods (a) or (b) do you think is better in this case?
(d) Estimate π, the probability that a type I bearing will outlast a type II bearing.
(e) Use the bootstrap to estimate the sampling distribution of π̂ and its standard error.

Problem 7
Find the exact null distribution for the test statistic of the signed rank test with n = 4.

Problem 8
Turn to the two-sided signed rank test. For n = 10, 20, 25 and α = 0.05, 0.01, compare the critical values from the
table and using the normal approximation of the null distribution.

Problem 9
Two population distributions with σx = σy = 10. Two samples of sizes n = 25 can be taken in two ways

(a) paired with Cov(Xi , Yi ) = 50, i = 1, . . . , 25,


(b) unpaired x1 , . . . , x25 and y1 , . . . , y25 .

Compare the power curves for testing

H0 : µx = µy , H1 : µx > µy , α = 0.05.

Problem 10
Lin, Sutton, and Qurashi (1979) compared microbiological and hydroxylamine methods for the analysis of ampicillin
dosages. In one series of experiments, pairs of tablets were analysed by the two methods. The data in the table
give the percentages of claimed amount of ampicillin found by the two methods in several pairs of tablets.

Microbiological method Hydroxylamine method
97.2 97.2
105.8 97.8
99.5 96.2
100 101.8
93.8 88
79.2 74
72 75
72 67.5
69.5 65.8
20.5 21.2
95.2 94.8
90.8 95.8
96.2 98
96.2 99
91 100.2
What are x̄ − ȳ and sx̄−ȳ ? If the pairing had been erroneously ignored and it had been assumed that the two
samples were independent, what would have been the estimate of the standard deviation of X̄ − Ȳ ? Analyse the
data to determine if there is a systematic difference between the two methods.

Problem 11
The media often present short reports of the results of experiments. To the critical reader, such reports often raise
more questions than they answer. Comment on the following pitfalls in the interpretation of each of the following.
(a) It is reported that patients whose hospital rooms have a window recover faster than those whose rooms do
not.
(b) Nonsmoking wives whose husbands smoke have a cancer rate twice that of wives whose husbands do not
smoke.
(c) A two-year study in North Carolina found that 75% of all industrial accidents in the state happened to
workers who had skipped breakfast.
(d) A school integration program involved busing children from minority schools to majority (primarily white)
schools. Participation in the program was voluntary. It was found that the students who were bused scored lower
on standardised tests than did their peers who chose not to be bused.
(e) When a group of students was asked to match pictures of newborns with pictures of their mothers,
they were correct 36% of the time.
(f) A survey found that those who drank a moderate amount of beer were healthier than those who totally
abstained from alcohol.
(g) A 15-year study of more than 45 000 Swedish soldiers revealed that heavy users of marijuana were six times
more likely than nonusers to develop schizophrenia.
(h) A University of Wisconsin study showed that within 10 years of wedding, 38% of those who had lived
together before marriage had split up, compared to 27% of those who had married without a ”trial period”.
(i) A study of nearly 4000 elderly North Carolinians has found that those who attended religious services every
week were 46% less likely to die over a six-year period than people who attended less often or not at all.

8 Analysis of variance
The two sample setting from the previous section is the case of a single main factor having two levels. In this
section, we extend the setting first to a single main factor with arbitrarily many levels (one-way layout) and then
to two main factors (two-way layout). Afterwards we introduce the randomised block design, which generalises the
paired sample setting of the previous section.

8.1 One-way layout


Consider a random variable Y , called response, which depends on several factors with one factor X being treated
as the main factor, to be labeled as A-factor. Let

Y = µ(X) + ε,    E(ε) = 0,

where
µ(x) = E(Y | X = x)

is the expected value of the response conditioned on the value of X, while the noise component ε (incorporating other
factors except the main one) is assumed to be independent of X.
We choose particular values x1 , . . . , xI of the explanatory variable X, and denote the corresponding conditional
expectations by
µi = µ(xi ), i = 1, . . . , I.

For each of the I levels of the main factor A, we independently collect an iid-sample (yi1 , . . . , yiJ ) of the same size
J. Having such I independent samples we want a utility test of

H0 : µ1 = . . . = µI,    against    H1 : µu ≠ µv for some (u, v).

If the collected response compares I different treatments, then the null hypothesis claims that all treatments
have the same effect (so that the factor A has no influence on the response Y and the suggested model is not useful).

Example: seven labs


Data: 70 measurements of chlorpheniramine maleate in tablets with a nominal dosage of 4 mg.
Seven labs made ten measurements each: I = 7, J = 10.

Lab 1 Lab 2 Lab 3 Lab 4 Lab 5 Lab 6 Lab 7


4.13 3.86 4.00 3.88 4.02 4.02 4.00
4.07 3.85 4.02 3.88 3.95 3.86 4.02
4.04 4.08 4.01 3.91 4.02 3.96 4.03
4.07 4.11 4.01 3.95 3.89 3.97 4.04
4.05 4.08 4.04 3.92 3.91 4.00 4.10
4.04 4.01 3.99 3.97 4.01 3.82 3.81
4.02 4.02 4.03 3.92 3.89 3.98 3.91
4.06 4.04 3.97 3.9 3.89 3.99 3.96
4.10 3.97 3.98 3.97 3.99 4.02 4.05
4.04 3.95 3.98 3.90 4.00 3.93 4.06

The data is summarised below in the form of seven boxplots. Ordered means

Lab i 1 3 7 2 5 6 4
Mean µi 4.062 4.003 3.998 3.997 3.957 3.955 3.920

Here the null hypothesis of interest states that there is no significant difference between the output of the seven
laboratories.

[Figure: side-by-side boxplots of the measurements from the seven labs.]

Normal theory model


Assume normally distributed observations Yij ∼ N(µi , σ 2 ) with equal variances (compare to the two sample t-test).
In other words,
Yij = µ + αi + εij,    Σ_i αi = 0,    εij ∼ N(0, σ²),

each observation is a sum of the overall mean, the differential main effect, and the noise.
Using the maximum likelihood approach we find the point estimates

µ̂ = ȳ.. ,    µ̂i = ȳi. ,    α̂i = ȳi. − ȳ.. ,

in terms of the sample means

ȳi. = (1/J) Σ_j yij,    ȳ.. = (1/I) Σ_i ȳi. = (1/(IJ)) Σ_i Σ_j yij.

It follows that

yij = µ̂ + α̂i + ε̂ij,    ε̂ij = yij − ȳi. ,    Σ_{i=1}^I α̂i = 0,

where ε̂ij are the so-called residuals.


The ANOVA tests are built around the following observation:

Decomposition of the total sum of squares: SST = SSA + SSE

where

SST = Σ_i Σ_j (yij − ȳ..)² is the total sum of squares for the pooled sample, with dfT = IJ − 1,

SSA = J Σ_i α̂i² is the factor A sum of squares, with dfA = I − 1,

SSE = Σ_i Σ_j ε̂ij² is the error sum of squares, with dfE = I(J − 1).

This decomposition says that the total variation in response values is the sum of the between-group variation and
the within-group variation. After normalising by the numbers of degrees of freedom, we obtain the so-called mean
squares

MSA = SSA/dfA,    MSE = SSE/dfE.

If treated as random variables, they lead to the following formulas for the expected values

E(MSA) = σ² + (J/(I−1)) Σ_i αi²,    E(MSE) = σ²,

which suggest looking at the ratio MSA/MSE of the two mean squares to find evidence against the null
hypothesis
H0 : α1 = . . . = αI = 0.

One-way F -test
The pooled sample variance

s²_p = MSE = (1/(I(J−1))) Σ_i Σ_j (yij − ȳi.)²

is an unbiased estimate of σ². F-test rejection rule: use F = MSA/MSE as a test statistic and reject H0 for large
values of F based on the null distribution

F ∼ F_{n1,n2} under H0,    where n1 = I − 1, n2 = I(J − 1).

The F-distribution F_{n1,n2} with degrees of freedom (n1, n2) is the distribution of the ratio

(X1/n1)/(X2/n2) ∼ F_{n1,n2},    where X1 ∼ χ²_{n1} and X2 ∼ χ²_{n2} are two independent random variables.
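The F-test can be sketched numerically as follows (a hypothetical balanced layout; scipy.stats.f_oneway performs the same test directly):

```python
import numpy as np
from scipy.stats import f, f_oneway

# hypothetical balanced one-way layout: I groups, J observations per group
y = np.array([[4.13, 4.07, 4.04, 4.07, 4.05],
              [3.86, 3.85, 4.08, 4.11, 4.08],
              [4.00, 4.02, 4.01, 4.01, 4.04]])   # shape (I, J)
I, J = y.shape

grand_mean = y.mean()
group_means = y.mean(axis=1)

ss_a = J * np.sum((group_means - grand_mean) ** 2)     # between-group sum of squares
ss_e = np.sum((y - group_means[:, None]) ** 2)         # within-group sum of squares
ms_a = ss_a / (I - 1)
ms_e = ss_e / (I * (J - 1))

F = ms_a / ms_e
p_value = f.sf(F, I - 1, I * (J - 1))
print(F, p_value)

print(f_oneway(*y))    # cross-check with the built-in one-way ANOVA
```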

Example: seven labs


The normal probability plot of the residuals ε̂ij supports the normality assumption. The noise size σ is estimated by
s_p = √0.0037 = 0.061. One-way Anova table
Source df SS MS F P
Labs 6 .125 .0210 5.66 .0001
Error 63 .231 .0037
Total 69 .356
Which of the 7 · 6/2 = 21 pairwise differences are significant?

8.2 Simultaneous confidence interval


Using the 95% confidence interval for a single difference (µu − µv) between two independent samples, we get

I_{µu−µv} = (ȳu. − ȳv.) ± t63(0.025) · s_p/√5 = (ȳu. − ȳv.) ± 0.055,

where t63 (0.025) = 2.00. This formula yields 9 significant differences:

Labs 1–4 1–6 1–5 3–4 7–4 2–4 1–2 1–7 1–3 5–4
Diff 0.142 0.107 0.105 0.083 0.078 0.077 0.065 0.064 0.059 0.047

The multiple comparison problem: the above confidence interval formula is aimed at a single difference, and may
produce false discoveries. We need a simultaneous confidence interval formula for all 21 pairwise comparisons.

Bonferroni method
Think of a statistical test repeatedly applied to k independent samples of size n. The overall result is positive if
we get at least one positive result among these k tests. Observe that the overall significance level α is obtained,
if each single test is performed at significance level α0 = α/k. Indeed, assuming the null hypothesis is true, the
number of positive results is X ∼ Bin(k, α0 ). Thus for small values of α0 ,

P(X ≥ 1 | H0) = 1 − (1 − α0)^k ≈ kα0 = k · (α/k) = α.

This yields Bonferroni's formula of a 100(1 − α)% simultaneous confidence interval, which can be used (as an
approximation) for the k = I(I−1)/2 pairwise differences (µu − µv):

B_{µu−µv} = ȳu. − ȳv. ± t_df(α/(2k)) · s_p √(2/J),    1 ≤ u < v ≤ I,

where df = I(J − 1). Warning: the pairwise differences µu − µv are not independent as required by the
Bonferroni method; for example,

µ1 − µ2 + µ2 − µ3 = µ1 − µ3.

The Bonferroni method gives slightly wider intervals than the Tukey method introduced below.

Example: seven labs


The Bonferroni 95% simultaneous confidence interval takes the form

B_{µu−µv} = (ȳu. − ȳv.) ± t63(0.05/42) · s_p/√5 = (ȳu. − ȳv.) ± 0.086,

where t63(0.0012) = 3.17. It detects 3 significant differences, between labs (1,4), (1,5), and (1,6).

Tukey method
If I independent samples (yi1 , . . . , yiJ ) taken from N(µi , σ 2 ) have the same size J, then
Zi = Ȳi. − µi ∼ N(0, σ²/J)

are independent. Consider the range of the differences between the Zi:

R = max{Z1, . . . , ZI} − min{Z1, . . . , ZI}.

The normalised range has a distribution that is free from the parameter σ:

R/(S_p/√J) ∼ SR(I, df),    df = I(J − 1).

The so-called studentised range distribution SR has two parameters: the number of samples and the number of
degrees of freedom used in the variance estimate s²_p. Tukey's simultaneous confidence interval is built using an
appropriate quantile q_{I,df}(α) of the studentised range distribution. It takes into account the dependences between
the differences µu − µv.

Tukey's 100(1 − α)% simultaneous confidence interval:    T_{µu−µv} = ȳu. − ȳv. ± q_{I,df}(α) · s_p/√J.

Example: seven labs


Using q_{7,60}(0.05) = 4.31 from the SR-distribution table as an approximation for q_{7,63}(0.05), we find

T_{µu−µv} = ȳu. − ȳv. ± q_{7,63}(0.05) · 0.061/√10 = ȳu. − ȳv. ± 0.083,

which brings four significant pairwise differences: (1,4), (1,5), (1,6), (3,4).
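A rough numerical sketch of both simultaneous half-widths, using the seven-labs values I = 7, J = 10, s_p = 0.061 quoted above (scipy.stats.studentized_range is available in recent SciPy versions):

```python
import numpy as np
from scipy.stats import t, studentized_range

# assumed setting: I groups of size J, pooled standard deviation s_p
I, J = 7, 10
df = I * (J - 1)            # 63 error degrees of freedom
s_p = 0.061
alpha = 0.05
k = I * (I - 1) // 2        # number of pairwise comparisons

# Bonferroni half-width: t quantile at level alpha/(2k)
bonf = t.ppf(1 - alpha / (2 * k), df) * s_p * np.sqrt(2 / J)

# Tukey half-width: studentised range quantile q_{I,df}(alpha)
tukey = studentized_range.ppf(1 - alpha, I, df) * s_p / np.sqrt(J)

print(bonf, tukey)          # Bonferroni is slightly wider than Tukey
```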

8.3 Kruskal-Wallis test


A nonparametric test, without assuming normality, for no treatment effect

H0 : all observations are equal in distribution.

Extending the idea of the rank-sum test, consider the pooled sample of size N = IJ. Let rij be the pooled ranks
of the sample values yij, so that

Σ_i Σ_j rij = 1 + 2 + . . . + N = N(N + 1)/2,

implying that the mean rank is r̄.. = (N + 1)/2.

Kruskal-Wallis test statistic:    W = (12J/(N(N + 1))) Σ_{i=1}^I (r̄i. − (N+1)/2)².

Reject H0 for large W using the null distribution table. For I = 3, J ≥ 5 or I ≥ 4, J ≥ 4, use the approximate
null distribution W ≈ χ²_{I−1} under H0.
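A small sketch of the Kruskal-Wallis statistic on hypothetical data; scipy.stats.kruskal applies the same chi-squared approximation (with a tie correction):

```python
import numpy as np
from scipy.stats import rankdata, chi2, kruskal

# hypothetical balanced layout: I groups with J observations each
groups = [np.array([4.13, 4.07, 4.04, 4.07, 4.05]),
          np.array([3.86, 3.85, 4.08, 4.11, 4.09]),
          np.array([3.88, 3.91, 3.95, 3.92, 3.97])]
I, J = len(groups), len(groups[0])
N = I * J

ranks = rankdata(np.concatenate(groups)).reshape(I, J)   # pooled ranks
rbar = ranks.mean(axis=1)
W = 12 * J / (N * (N + 1)) * np.sum((rbar - (N + 1) / 2) ** 2)
p_value = chi2.sf(W, I - 1)
print(W, p_value)

print(kruskal(*groups))     # built-in version of the same test
```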

Example (seven labs)


In the table below the actual measurements are replaced by their ranks 1 to 70. (There is a tie 4.06 between
laboratories 1 and 7; however, this is due to rounding.) With the observed Kruskal-Wallis test statistic W = 28.17
and df = 6, using the χ²_6-distribution table we get a p-value of approximately 0.0001.

Labs 1 2 3 4 5 6 7
70 4 35 6 46 48 38
63 3 45 7 21 5 50
53 65 40 13 47 22 52
64 69 41 20 8 28 58
59 66 57 16 14 37 68
54 39 32 26 42 2 1
43 44 51 17 9 31 15
61 56 25 11 10 34 23
67 24 29 27 33 49 60
55 19 30 12 36 18 62
Means 58.9 38.9 38.5 15.5 26.6 27.4 42.7

8.4 Two-way layout
Suppose the data values are influenced by two explanatory variables (X1, X2) and noise ε:

Y = µ(X1, X2) + ε,    E(ε) = 0,
where µ(x1 , x2 ) is the conditional expectation of the response variable. Choosing I values for variable X1 and J
values for variable X2 , we generate IJ independent samples, each of size K. We say that the main factor A has I
levels, the main factor B has J levels, and denote

µ(x1i , x2j ) = µij = µ + αi + βj + δij ,

where
• µ is the grand mean,
• (α1 , . . . , αI ) is the main A-effect, with α1 + . . . + αI = 0,
• (β1 , . . . , βJ ) is the main B-effect, with β1 + . . . + βJ = 0,
• {δij, i = 1, . . . , I, j = 1, . . . , J} is the AB-interaction effect, with Σ_i δij = 0 for each j and Σ_j δij = 0 for each i.
When an interaction effect is present, the impact of one factor depends on the level of the other factor. Without
the interaction effect, the joint impact of two factors is described by an additive model:

µij = µ + αi + βj .

Normal theory model


We assume that the data is generated in the following way

Yijk = µ + αi + βj + δij + εijk,    i = 1, . . . , I,  j = 1, . . . , J,  k = 1, . . . , K,

where εijk ∼ N(0, σ²) are independent and have the same variance. Here we assume that for each combination of
levels (i, j) of the two main factors, K ≥ 2 replications are performed. The maximum likelihood estimates

µ̂ = ȳ... = (1/(IJK)) Σ_i Σ_j Σ_k yijk,

α̂i = ȳi.. − ȳ... ,    ȳi.. = (1/(JK)) Σ_j Σ_k yijk,

β̂j = ȳ.j. − ȳ... ,    ȳ.j. = (1/(IK)) Σ_i Σ_k yijk,

δ̂ij = ȳij. − ȳ... − α̂i − β̂j = ȳij. − ȳi.. − ȳ.j. + ȳ... ,    ȳij. = (1/K) Σ_k yijk,

bring a decomposition involving residuals

yijk = µ̂ + α̂i + β̂j + δ̂ij + ε̂ijk,    ε̂ijk = yijk − ȳij. .

8.5 Example: iron retention

[Figure: boxplots of the raw iron retention percentages for the six (iron form, dosage) samples.]
The raw data yijk is the percentage of iron retained in mice. Factor A has two levels I = 2 representing two iron
forms, while factor B has three levels J = 3 representing dosage concentrations. Six samples are collected with
K = 18 replications for each (iron form, dosage) combination. Six boxplots for these six samples (see above) show
that the raw data is not normally distributed having different variances across six samples.
Fe3+ (10.2) Fe3+ (1.2) Fe3+ (0.3) Fe2+ (10.2) Fe2+ (1.2) Fe2+ (0.3)
.71 2.2 2.25 2.2 4.04 2.71
1.66 2.93 3.93 2.69 4.16 5.43
2.01 3.08 5.08 3.54 4.42 6.38
2.16 3.49 5.82 3.75 4.93 6.38
2.42 4.11 5.84 3.83 5.49 8.32
2.42 4.95 6.89 4.08 5.77 9.04
2.56 5.16 8.5 4.27 5.86 9.56
2.6 5.54 8.56 4.53 6.28 10.01
3.31 5.68 9.44 5.32 6.97 10.08
3.64 6.25 10.52 6.18 7.06 10.62
3.74 7.25 13.46 6.22 7.78 13.8
3.74 7.9 13.57 6.33 9.23 15.99
4.39 8.85 14.76 6.97 9.34 17.9
4.5 11.96 16.41 6.97 9.91 18.25
5.07 15.54 16.96 7.52 13.46 19.32
5.26 15.89 17.56 8.36 18.4 19.87
8.15 18.3 22.82 11.65 23.89 21.6
8.24 18.59 29.13 12.45 26.39 22.25
However, the transformed data y′ijk = ln(yijk) produce more satisfactory boxplots.

[Figure: boxplots of the log-transformed iron retention data for the six samples.]

The six sample means for the transformed data (ȳij.):

            10.2   1.2    0.3    Level mean
Fe3+        1.16   1.90   2.28   1.78
Fe2+        1.68   2.09   2.40   2.08
Level mean  1.42   2.00   2.34   1.92

[Figure: the two row profiles plotted against dosage.]

The two profiles for the two rows are not parallel, which indicates possible interaction. The maximum
likelihood estimates are

ȳ... = 1.92,  α̂1 = −0.14,  α̂2 = 0.14,  β̂1 = −0.50,  β̂2 = 0.08,  β̂3 = 0.42,

and

(δ̂ij) = ( −0.12   0.04   0.08
            0.12  −0.04  −0.08 ).
A two-way Anova table for the transformed iron retention data:
Source df SS MS F P
Iron form 1 2.074 2.074 5.99 0.017
Dosage 2 15.588 7.794 22.53 0.000
Interaction 2 0.810 0.405 1.17 0.315
Error 102 35.296 0.346
Total 107 53.768

According to the rightmost column,
• the dosage effect is undoubtedly significant; however, this is something expected,
• the interaction is not statistically significant,
• there is a significant effect due to iron form (compare to the previous analysis of two samples).
The estimated log-scale difference α̂2 − α̂1 = ȳ2.. − ȳ1.. = 0.28 yields a multiplicative effect of e^0.28 = 1.32 on the
original scale, implying that the retention percentage of Fe2+ is 1.32 times that of Fe3+.

Three F -tests
We explain the Anova table above by starting with the sums of squares decomposition

SST = SSA + SSB + SSAB + SSE ,


where

SST = Σ_i Σ_j Σ_k (yijk − ȳ...)²,    dfT = IJK − 1,

SSA = JK Σ_i α̂i²,    dfA = I − 1,

SSB = IK Σ_j β̂j²,    dfB = J − 1,

SSAB = K Σ_i Σ_j δ̂ij²,    dfAB = (I − 1)(J − 1),

SSE = Σ_i Σ_j Σ_k ε̂ijk²,    dfE = IJ(K − 1).

The mean sums of squares and their expected values:

MSA = SSA/dfA,    E(MSA) = σ² + (JK/(I−1)) Σ_i αi²,

MSB = SSB/dfB,    E(MSB) = σ² + (IK/(J−1)) Σ_j βj²,

MSAB = SSAB/dfAB,    E(MSAB) = σ² + (K/((I−1)(J−1))) Σ_i Σ_j δij²,

MSE = SSE/dfE,    E(MSE) = σ².

The pooled sample variance s²_p = MSE is an unbiased estimate of σ².

Null hypothesis                    No-effect property    Test statistic and null distribution
HA : α1 = . . . = αI = 0           E(MSA) = σ²           FA = MSA/MSE ∼ F_{dfA,dfE}
HB : β1 = . . . = βJ = 0           E(MSB) = σ²           FB = MSB/MSE ∼ F_{dfB,dfE}
HAB : all δij = 0                  E(MSAB) = σ²          FAB = MSAB/MSE ∼ F_{dfAB,dfE}

Reject the respective null hypothesis for large values of the corresponding test statistic.
Inspect the normal probability plot of the residuals ε̂ijk to check the model assumptions.
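A minimal sketch of the two-way sums of squares for a balanced layout stored as an array y[i, j, k]; the dimensions and data are hypothetical, and only the FA test is shown since FB and FAB are analogous:

```python
import numpy as np
from scipy.stats import f

rng = np.random.default_rng(1)
I, J, K = 2, 3, 4                       # hypothetical balanced two-way layout
y = rng.normal(size=(I, J, K))          # y[i, j, k]

m = y.mean()                            # grand mean
mi = y.mean(axis=(1, 2)) - m            # estimated alpha_i
mj = y.mean(axis=(0, 2)) - m            # estimated beta_j
mij = y.mean(axis=2) - m - mi[:, None] - mj[None, :]   # estimated delta_ij

ss_a = J * K * np.sum(mi ** 2)
ss_b = I * K * np.sum(mj ** 2)
ss_ab = K * np.sum(mij ** 2)
ss_e = np.sum((y - y.mean(axis=2, keepdims=True)) ** 2)

df_a, df_b, df_ab, df_e = I - 1, J - 1, (I - 1) * (J - 1), I * J * (K - 1)
F_a = (ss_a / df_a) / (ss_e / df_e)
print(F_a, f.sf(F_a, df_a, df_e))       # test of H_A; H_B and H_AB are analogous
```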

8.6 Randomised block design


Blocking is used to remove the effects of the most important nuisance variable. Randomisation is then used to
reduce the contaminating effects of the remaining nuisance variables.

Block what you can, randomise what you cannot.

Experimental design: randomly assign I treatments within each of J blocks.


Test the null hypothesis of no treatment effect using the two-way layout Anova.
The block effect is anticipated and is not of major interest. Examples:
Block                        Treatments                                      Observation
A homogeneous plot of land   I fertilizers, each applied to                  the yield on
divided into I subplots      a randomly chosen subplot                       subplot (i, j)
A four-wheel car             4 types of tires tested on the same car         the tire's life-length
A litter of I animals        I diets randomly assigned to the I siblings     the weight gain
Additive model
For the rest of this section suppose that K = 1. With only one replication per cell, we cannot estimate the
interaction. This leads to the additive model without interaction

Yij = µ + αi + βj + εij,    εij ∼ N(0, σ²).

For the given data (yij), the maximum likelihood estimates and residuals

µ̂ = ȳ.. ,    α̂i = ȳi. − ȳ.. ,    β̂j = ȳ.j − ȳ.. ,

ε̂ij = yij − ȳ.. − α̂i − β̂j = yij − ȳi. − ȳ.j + ȳ.. ,

yield a representation

yij = µ̂ + α̂i + β̂j + ε̂ij.

The sums of squares decomposition takes a reduced form

SST = SSA + SSB + SSE,

with

SST = Σ_i Σ_j (yij − ȳ..)²,    dfT = IJ − 1,

SSA = J Σ_i α̂i²,    dfA = I − 1,

SSB = I Σ_j β̂j²,    dfB = J − 1,

SSE = Σ_i Σ_j ε̂ij²,    dfE = (I − 1)(J − 1),

and

MSA = SSA/dfA,    E(MSA) = σ² + (J/(I−1)) Σ_i αi²,

MSB = SSB/dfB,    E(MSB) = σ² + (I/(J−1)) Σ_j βj²,

MSE = SSE/dfE,    E(MSE) = σ².

We can apply two F-tests for two different null hypotheses:

HA : α1 = . . . = αI = 0,    FA = MSA/MSE ∼ F_{dfA,dfE} under HA,

HB : β1 = . . . = βJ = 0,    FB = MSB/MSE ∼ F_{dfB,dfE} under HB.
Example: itching
Data: the duration of the itching in seconds yij , with K = 1 observation per cell, I = 7 treatments to relieve
itching applied to J = 10 male volunteers aged 20-30.
Subject No Drug Placebo Papaverine Morphine Aminophylline Pentabarbital Tripelennamine

BG 174 263 105 199 141 108 141


JF 224 213 103 143 168 341 184
BS 260 231 145 113 78 159 125
SI 225 291 103 225 164 135 227
BW 165 168 144 176 127 239 194
TS 237 121 94 144 114 136 155
GM 191 137 35 87 96 140 121
SS 100 102 133 120 222 134 129
MU 115 89 83 100 165 185 79
OS 189 433 237 173 168 188 317
Boxplots indicate violations of the assumptions of normality and equal variance. Notice much bigger variance for
the placebo group. Two-way Anova table
Source df SS MS F P
Drugs 6 53013 8835 2.85 0.018
Subjects 9 103280 11476 3.71 0.001
Error 54 167130 3096
Total 69 323422

8.7 Friedman test
Here we introduce another nonparametric test, which does not require that the εij are normally distributed, for testing
H0 : no treatment effect. The Friedman test is based on within-block ranking. Let the ranks within the j-th block be

(r1j, . . . , rIj) = ranks of (y1j, . . . , yIj),

so that

r1j + . . . + rIj = 1 + 2 + . . . + I = I(I + 1)/2.

For these ranks, we have (1/I)(r1j + . . . + rIj) = (I + 1)/2 and therefore r̄.. = (I + 1)/2.

Friedman test statistic:    Q = (12J/(I(I + 1))) Σ_{i=1}^I (r̄i. − (I+1)/2)²,  with approximate null distribution Q ≈ χ²_{I−1} under H0.

Test statistic Q is a measure of agreement between J rankings, so we reject H0 for large values of Q.
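A brief sketch with a hypothetical I × J matrix of responses (treatments in rows, blocks in columns); scipy.stats.friedmanchisquare expects one argument per treatment:

```python
import numpy as np
from scipy.stats import rankdata, chi2, friedmanchisquare

# hypothetical layout: I treatments (rows) x J blocks (columns)
y = np.array([[174., 224., 260., 225.],
              [263., 213., 231., 291.],
              [105., 103., 145., 103.]])
I, J = y.shape

r = np.apply_along_axis(rankdata, 0, y)      # rank within each block (column)
rbar = r.mean(axis=1)
Q = 12 * J / (I * (I + 1)) * np.sum((rbar - (I + 1) / 2) ** 2)
p_value = chi2.sf(Q, I - 1)
print(Q, p_value)

print(friedmanchisquare(*y))                 # built-in version, one argument per treatment
```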

Example: itching
From the rank values rij and r̄i. given in the next table and (I + 1)/2 = 4, we find the Friedman test statistic value to
be Q = 14.86. Using the chi-squared distribution table with df = 6, we obtain a p-value of approximately 2.14%.
We reject the null hypothesis of no effect even in the non-parametric setting.

Subject No Drug Placebo Papaverine Morphine Aminophylline Pentabarbital Tripelennamine

BG 5 7 1 6 3.5 2 3.5
JF 6 5 1 2 3 7 4
BS 7 6 4 2 1 5 3
SI 6 7 1 4 3 2 5
BW 3 4 2 5 1 7 6
TS 7 3 1 5 2 4 6
GM 7 5 1 2 3 6 4
SS 1 2 5 3 7 6 4
MU 5 3 2 4 6 7 1
OS 4 7 5 2 1 3 6
r̄i. 5.10 4.90 2.30 3.50 3.05 4.90 4.25

8.8 Exercises
Problem 1
A study on the tensile strength of aluminium rods is conducted. Forty identical rods are randomly divided into
four groups, each of size 10. Each group is subjected to a different heat treatment, and the tensile strength, in
thousands of pounds per square inch, of each rod is determined. The following data result:
Treatment 1 2 3 4 Combined data
18.9 18.3 21.3 15.9 18.9 18.3 21.3 15.9
20.0 19.2 21.5 16.0 20.0 19.2 21.5 16.0
20.5 17.8 19.9 17.2 20.5 17.8 19.9 17.2
20.6 18.4 20.2 17.5 20.6 18.4 20.2 17.5
19.3 18.8 21.9 17.9 19.3 18.8 21.9 17.9
19.5 18.6 21.8 16.8 19.5 18.6 21.8 16.8
21.0 19.9 23.0 17.7 21.0 19.9 23.0 17.7
22.1 17.5 22.5 18.1 22.1 17.5 22.5 18.1
20.8 16.9 21.7 17.4 20.8 16.9 21.7 17.4
20.7 18.0 21.9 19.0 20.7 18.0 21.9 19.0
mean 20.34 18.34 21.57 17.35 19.40
variance 0.88 0.74 0.88 0.89 3.58
skewness 0.16 0.14 -0.49 -0.08 0.05
kurtosis 2.51 2.59 2.58 2.46 1.98
Consider the null hypothesis of equality between the four treatment means of tensile strength.

(a) Test the null hypothesis applying an ANOVA test. Show clearly how all the sums of squares are computed
using the sample means and variances given in the table.

(b) What are the assumptions of the ANOVA model you used? Do they appear fulfilled?
(c) The Bonferroni method suggests the following formula for computing simultaneous 95% confidence intervals
for six pairwise differences between four treatment means

B_{µu−µv} = (ȳu. − ȳv.) ± t36(0.025/6) · 0.4472 · s_p.

Explain this formula and using it check which of the pairs of treatments have significantly different means.

Problem 2
For a one-way analysis of variance with two treatment groups, show that the F statistic is t2 , where t is the test
statistic for a two-sample t-test.

Problem 3
Derive the likelihood ratio test for the null hypothesis of the one-way layout, and show that it is equivalent to the
F-test.

Problem 4
Suppose in a one-way layout there are 10 treatments and seven observations under each treatment. What is the
ratio of the length of a simultaneous confidence interval for the difference of two means formed by Tukey’s method
to that of one formed by the Bonferroni method? How do both of these compare in length to an interval based on
the t-distribution that does not take account of multiple comparisons?

Problem 5
During each of four experiments on the use of carbon tetrachloride as a worm killer, ten rats were infested with
larvae (Armitage 1983). Eight days later, five rats were treated with carbon tetrachloride; the other five were kept
as controls. After two more days, all the rats were killed and the numbers of worms were counted. The table below
gives the counts of worms for the four control groups.

Group I Group II Group III Group IV


279 378 172 381
338 275 335 346
334 412 335 340
198 265 282 471
303 286 250 318

Significant differences among the control groups, although not expected, might be attributable to changes in the ex-
perimental conditions. A finding of significant differences could result in more carefully controlled experimentation
and thus greater precision in later work.
Use both graphical techniques and the F-test to test whether there are significant differences among the four
groups. Use a nonparametric technique as well.

Problem 6
The concentrations (in nanograms per millilitre) of plasma epinephrine were measured for 10 dogs under isofluorane,
halothane, and cyclopropane anesthesia. The measurements are given in the following table (Perry et al. 1974).

Dog 1 Dog 2 Dog 3 Dog 4 Dog 5 Dog 6 Dog 7 Dog 8 Dog 9 Dog 10
Isofluorane 0.28 0.51 1.00 0.39 0.29 0.36 0.32 0.69 0.17 0.33
Halothane 0.30 0.39 0.63 0.68 0.38 0.21 0.88 0.39 0.51 0.32
Cyclopropane 1.07 1.35 0.69 0.28 1.24 1.53 0.49 0.56 1.02 0.30

Is there a difference in treatment effects? Use a parametric and a nonparametric analysis.

Problem 7
The following table gives the survival times (in hours) for animals in an experiment whose design consisted of three
poisons, four treatments, and four observations per cell.

Treatment A Treatment B Treatment C Treatment D
Poison I 3.1 4.5 8.2 11.0 4.3 4.5 4.5 7.1
4.6 4.3 8.8 7.2 6.3 7.6 6.6 6.2
Poison II 3.6 2.9 9.2 6.1 4.4 3.5 5.6 10.0
4.0 2.3 4.9 12.4 3.1 4.0 7.1 3.8
Poison III 2.2 2.1 3.0 3.7 2.3 2.5 3.0 3.6
1.8 2.3 3.8 2.9 2.4 2.2 3.1 3.3

(a) Conduct a two-way analysis of variance to test the effects of the two main factors and their interaction.
(b) Box and Cox (1964) analysed the reciprocals of the data, pointing out that the reciprocal of a survival time
can be interpreted as the rate of death. Conduct a two-way analysis of variance, and compare to the results of part
(a). Comment on how well the standard two-way ANOVA model fits and on the interaction in both analyses.

9 Categorical data analysis


Categorical data appear in the form of a contingency table containing the sample counts for k various categories.
The categorical population distribution is an extension of the Bernoulli distribution with several possible outcomes.
Assuming n independent trials with the same probabilities of k outcomes (π1 , . . . , πk ) we arrive at a multinomial
statistical model Mn(n, π1 , . . . , πk ).

Consider a cross-classification for a pair of categorical factors A and B. If factor A has I levels and factor B
has J levels, then the population distribution of a single cross classification event has the form

b1 b2 ... bJ Total
a1 π11 π12 ... π1J π1·
a2 π21 π22 ... π2J π2·
... ... ... ... ... ...
aI πI1 πI2 ... πIJ πI·
Total π·1 π·2 ... π·J 1

Here
πij = P(A = ai , B = bj )

are the joint probabilities, and

πi· = P(A = ai ), π·j = P(B = bj )

are the marginal probabilities. The null hypothesis of independence claims that there is no relationship between
factors A and B
H0 : πij = πi· π·j for all pairs (i, j).

The conditional probabilities


πi|j = P(A = ai | B = bj) = πij/π·j

are summarised in the next table

b1 b2 ... bJ
a1 π1|1 π1|2 ... π1|J
a2 π2|1 π2|2 ... π2|J
... ... ... ... ...
aI πI|1 πI|2 ... πI|J
Total 1 1 ... 1

The null hypothesis of homogeneity states the equality of J population distributions

H0 : πi|j = πi for all pairs (i, j).

In this sense, the hypothesis of homogeneity is equivalent to the hypothesis of independence.

9.1 Chi-squared test of homogeneity
Consider a table of I × J observed counts obtained from J independent samples taken from J population distribu-
tions:
Pop. 1 Pop. 2 ... Pop. J Total
Category 1 n11 n12 ... n1J n1·
Category 2 n21 n22 ... n2J n2·
... ... ... ... ... ...
Category I nI1 nI2 ... nIJ nI·
Sample sizes n·1 n·2 ... n·J n··

This model is described by J multinomial distributions

(N1j , . . . , NIj ) ∼ Mn(n·j ; π1|j , . . . , πI|j ), j = 1, . . . , J.

The total number of degrees of freedom for J independent samples over I categories is J(I − 1).
Under the hypothesis of homogeneity H0 : πi|j = πi for all (i, j), the maximum likelihood estimates of πi are
the pooled sample proportions

π̂i = ni·/n·· ,    i = 1, . . . , I.

These estimates consume (I − 1) degrees of freedom, since their sum is 1. Using these maximum likelihood
estimates we compute the expected cell counts

Eij = n·j · π̂i = ni· n·j / n··

and the chi-squared test statistic becomes

χ² = Σ_{i=1}^I Σ_{j=1}^J (nij − ni· n·j/n··)² / (ni· n·j/n··).

We reject H0 for large values of χ² using the approximate null distribution χ² ≈ χ²_df with

df = J(I − 1) − (I − 1) = (I − 1)(J − 1).
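In practice the statistic can be computed with scipy.stats.chi2_contingency; the sketch below uses the counts of the small-cars example that follows:

```python
import numpy as np
from scipy.stats import chi2_contingency

# observed I x J counts (rows = categories, columns = populations)
counts = np.array([[79, 58, 49],
                   [10,  8,  9],
                   [10, 34, 42]])

chi2_stat, p_value, df, expected = chi2_contingency(counts)
print(chi2_stat, df, p_value)
print(expected)      # E_ij = n_i. n_.j / n..
```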

Example: small cars and personality


A car company studies how customers’ attitude toward small cars relates to different personality types. The next
table summarises the observed (expected) counts:

Cautious Middle-of-the-road Explorer Total


Favourable 79(61.6) 58(62.2) 49(62.2) 186
Neutral 10(8.9) 8(9.0) 9(9.0) 27
Unfavourable 10(28.5) 34(28.8) 42(28.8) 86
Total 99 100 100 299

The chi-squared test statistic is

χ2 = 27.24 with df = (3 − 1) · (3 − 1) = 4.

After comparing χ2 with the table value χ24 (0.005) = 14.86, we reject the hypothesis of homogeneity at 0.5%
significance level. Persons who saw themselves as cautious conservatives are more likely to express a favourable
opinion of small cars.

9.2 Chi-squared test of independence


Data: a single cross-classifying sample is summarised in terms of the observed counts,

b1 b2 ... bJ Total
a1 n11 n12 ... n1J n1.
a2 n21 n22 ... n2J n2.
... ... ... ... ... ...
aI nI1 nI2 ... nIJ nI.
Total n.1 n.2 ... n.J n..

whose joint distribution is multinomial

(N11 , . . . , NIJ ) ∼ Mn(n·· ; π11 , . . . , πIJ )


The maximum likelihood estimates of πi· and π·j are π̂i· = ni·/n·· and π̂·j = n·j/n··. Therefore, under the hypothesis
of independence, π̂ij = ni· n·j / n··², implying the same expected cell counts as before,

Eij = n·· π̂ij = ni· n·j / n·· ,

with the same

df = (IJ − 1) − (I − 1 + J − 1) = (I − 1)(J − 1).

The chi-squared test rejection rule is the same for the homogeneity test and for the independence test.

Example: marital status and educational level


A sample is drawn from a population of married women. Each observation is placed in a 2 × 2 contingency table
depending on woman’s educational level and her marital status.

Married only once Married more than once Total


College 550 (523.8) 61(87.2) 611
No college 681(707.2) 144(117.8) 825
Total 1231 205 1436

The observed chi-squared test statistic is χ2 = 16.01. With df = 1 we can use the normal distribution table, since
Z ∼ N(0, 1) is equivalent to Z 2 ∼ χ21 . Thus

P(χ2 > 16.01) ≈ P(|Z| > 4.001) = 2(1 − Φ(4.001)).

We see that the p-value of the test is less than 0.1%, and we reject the null hypothesis of independence. College-
educated women, once they marry, are less likely to divorce.

9.3 Matched-pairs designs


We start with an illuminating example concerning Hodgkin's disease, which has a very low incidence of 2 in 10,000.

Example: Hodgkin’s disease and tonsillectomy


To test a possible influence of tonsillectomy on the onset of Hodgkin’s disease, researchers use cross-classification
data of the form
X X̄
D n11 n12
D̄ n21 n22

where the four counts distinguish among sampled individuals who are

either D = affected (have the Disease) or D̄ = unaffected,


and either X = eXposed (had tonsillectomy) or X̄ = non-exposed.

Three possible sampling designs:

• simple random sampling would give counts like
    ( 0  0 )
    ( 0  n ),

• a prospective study (take an X-sample and a control X̄-sample, then watch who gets affected) would give
    ( 0   0  )
    ( n1  n2 ),

• a retrospective study (take a D-sample and a control D̄-sample, then find who had been exposed) would give
  informative counts.

Two retrospective case-control studies produced opposite results. Study A (Vianna, Greenwald, Davis,
1971) gave a cross classification table

Study A X X̄
D 67 34
D̄ 43 64
The chi-squared test of homogeneity was applied. With χ²_A = 14.29 and df = 1, the p-value was found to be very
small:
P(χ²_A ≥ 14.29) ≈ 2(1 − Φ(√14.29)) = 0.0002.
Study B (Johnson and Johnson, 1972) was summarised by a table
Study B X X̄
D 41 44
D̄ 33 52
and the chi-squared test of homogeneity was applied. With χ²_B = 1.53 and df = 1, the p-value was strikingly
different:
P(χ²_B ≥ 1.53) ≈ 2(1 − Φ(√1.53)) = 0.215.
It turned out that study B was based on a matched-pairs design violating the assumption of the chi-squared
test of homogeneity. The sample consisted of m = 85 sibling pairs of the same sex and close age: one of the
siblings was affected, the other not. A proper summary of the study B sample distinguishes among four groups of
sibling pairs: (X, X), (X, X̄), (X̄, X), (X̄, X̄).

unaffected X unaffected X̄ Total


affected X m11 = 26 m12 = 15 41
affected X̄ m21 = 7 m22 = 37 44
Total 33 52 85
Notice that this contingency table contains more information than the previous one.
An appropriate test in this setting is McNemar's test (see below). For the data of study B, McNemar's test
statistic is

χ² = (m12 − m21)²/(m12 + m21) = 2.91,

giving a p-value of

P(χ² ≥ 2.91) ≈ 2(1 − Φ(√2.91)) = 0.09.

The correct p-value is much smaller than the value 0.215 computed using the test of homogeneity. Since there are
very few informative observations (only m12 + m21 = 22), more data is required.

McNemar’s test
Consider data obtained by matched-pairs design for the population distribution
unaffected X unaffected X̄ Total
affected X p11 p12 p1.
affected X̄ p21 p22 p2.
p.1 p.2 1
The relevant null hypothesis is not the hypothesis of independence but rather

H0 : p1. = p.1 , or equivalently, H0 : p12 = p21 = p for an unspecified p.


The maximum likelihood estimates of the population frequencies under the null hypothesis are

p̂11 = m11/m,    p̂22 = m22/m,    p̂12 = p̂21 = p̂ = (m12 + m21)/(2m).

These yield the McNemar test statistic of the form

χ² = Σ_i Σ_j (mij − m p̂ij)²/(m p̂ij) = (m12 − m21)²/(m12 + m21),

whose approximate null distribution is χ²_1, where

df = 4 − 1 − 2 = 1

because 2 independent parameters are estimated from the data. We reject H0 for large values of the test
statistic.
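A tiny sketch of McNemar's statistic for a matched-pairs table, using the study B counts from above:

```python
from scipy.stats import chi2

# matched-pairs counts: m12 and m21 are the informative (discordant) pairs
m11, m12, m21, m22 = 26, 15, 7, 37        # study B counts

stat = (m12 - m21) ** 2 / (m12 + m21)     # McNemar test statistic
p_value = chi2.sf(stat, df=1)             # approximate chi-squared(1) null distribution
print(stat, p_value)                      # roughly 2.91 and 0.09
```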

9.4 Odds ratios
Odds and probability of a random event A:

odds(A) = P(A)/P(Ā)    and    P(A) = odds(A)/(1 + odds(A)).

Notice that for small P(A),

odds(A) ≈ P(A).

The conditional odds for A given B are defined as

odds(A|B) = P(A|B)/P(Ā|B) = P(AB)/P(ĀB).

The odds ratio for a pair of events, defined by

∆AB = odds(A|B)/odds(A|B̄) = P(AB)P(ĀB̄)/(P(ĀB)P(AB̄)),

has the properties

∆AB = ∆BA,    ∆AB̄ = 1/∆AB.

The odds ratio is a measure of dependence between a pair of random events:

if ∆AB = 1, then events A and B are independent,

if ∆AB > 1, then P(A|B) > P(A|B̄), so that B increases the probability of A,

if ∆AB < 1, then P(A|B) < P(A|B̄), so that B decreases the probability of A.

Odds ratios for case-control studies


Return to conditional probabilities and observed counts

X X̄ Total X X̄ Total
D P(X|D) P(X̄|D) 1 D n11 n12 n1·
D̄ P(X|D̄) P(X̄|D̄) 1 D̄ n21 n22 n2·

The corresponding odds ratio

∆DX = P(X|D)P(X̄|D̄) / (P(X̄|D)P(X|D̄))

quantifies the influence of eXposure to a certain factor on the onset of the Disease in question. This odds ratio
can be estimated using the observed counts as

∆̂DX = (n11/n1·)(n22/n2·) / ((n12/n1·)(n21/n2·)) = n11 n22 / (n12 n21).
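A small numerical sketch of this estimate, using the study A counts; the confidence interval relies on the standard large-sample variance 1/n11 + 1/n12 + 1/n21 + 1/n22 of the log odds ratio, which is not derived in this text:

```python
import numpy as np
from scipy.stats import norm

# 2x2 counts from a retrospective study: rows D, D-bar; columns X, X-bar
n11, n12, n21, n22 = 67, 34, 43, 64       # study A counts

odds_ratio = (n11 * n22) / (n12 * n21)

# approximate 95% confidence interval via the log odds ratio
se_log = np.sqrt(1/n11 + 1/n12 + 1/n21 + 1/n22)
z = norm.ppf(0.975)
ci = np.exp(np.log(odds_ratio) + np.array([-z, z]) * se_log)
print(odds_ratio, ci)                      # the estimate is about 2.93
```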

Example: Hodgkin’s disease


Study A gives the odds ratio

∆̂DX = (67 · 64)/(43 · 34) = 2.93.

Conclusion: tonsillectomy increases the odds of Hodgkin's onset by a factor of 2.93.

9.5 Exercises
Problem 1
Adult-onset diabetes is known to be highly genetically determined. A study was done comparing frequencies of a
particular allele in a sample of such diabetics and a sample of nondiabetics. The data is shown in the following
table:
Diabetic Normal Total
Bb or bb 12 4 16
BB 39 49 88
Total 51 53 104
Are the relative frequencies of the alleles significantly different in the two groups?

Problem 2
Overfield and Klauber (1980) published the following data on the incidence of tuberculosis in relation to blood
groups in a sample of Eskimos. Is there any association of the disease and blood group within the ABO system or
within the MN system?
O A AB B
Moderate 7 5 3 13
Minimal 27 32 8 18
Not present 55 50 7 24

MM MN NN
Moderate 21 6 1
Minimal 54 27 5
Not present 74 51 11

Problem 3
It is conventional wisdom in military squadrons that pilots tend to father more girls than boys. Snyder (1961)
gathered data for military fighter pilots. The sex of the pilots’ offspring were tabulated for three kinds of flight
duty during the month of conception, as shown in the following table.
Girl Boy
Flying fighter 51 38
Flying transport 14 16
Not flying 38 46
(a) Is there any significant difference between the three groups?
(b) In the United States in 1950, 105.37 males were born for every 100 females. Are the data consistent with
this sex ratio?

Problem 4
A randomised double-blind experiment compared the effectiveness of several drugs in ameliorating postoperative
nausea. All patients were anesthetized with nitrous oxide and ether. The following table shows the incidence of
nausea during the first four hours for each of several drugs and a placebo (Beecher 1959).
Number of patients Incidence of nausea
Placebo 165 95
Chlorpromazine 152 52
Dimenhydrinate 85 52
Pentobarbital (100 mg) 67 35
Pentobarbital (150 mg) 85 37
Compare the drugs to each other and to the placebo.

Problem 5
In a study of the relation of blood type to various diseases, the following data were gathered in London and
Manchester (Woolf 1955).
London Control Peptic Ulcer Manchester Control Peptic Ulcer
Group A 4219 579 Group A 3775 246
Group O 4578 911 Group O 4532 361
First, consider the two tables separately. Is there a relationship between blood type and propensity to peptic ulcer?
If so, evaluate the strength of the relationship. Are the data from London and Manchester comparable?

Problem 6
Record of 317 patients at least 48 years old who were diagnosed as having endometrial carcinoma were obtained
from two hospitals (Smith et al. 1975). Matched controls for each case were obtained from the two institutions:
the controls had cervical cancer, ovarian cancer, or carcinoma of the vulva. Each control was matched by age
at diagnosis (within four years) and year of diagnosis (within two years) to a corresponding case of endometrial
carcinoma.

The following table gives the number of cases and controls who had taken estrogen for at least 6 months prior
to the diagnosis of cancer.
Controls: estrogen used Controls: estrogen not used Total
Cases: estrogen used 39 113 152
Cases: estrogen not used 15 150 165
Total 54 263 317

(a) Is there a significant relationship between estrogen use and endometrial cancer?
(b) This sort of design, called a retrospective case-control study, is frequently used in medical investigations
where a randomised experiment is not possible. Do you see any possible weak points in a retrospective case-control
design?

Problem 7
A psychological experiment was done to investigate the effect of anxiety on a person’s desire to be alone or in
company (Lehman 1975). A group of 30 subjects was randomly divided into two groups of sizes 13 and 17. The
subjects were told that they would be subjected to some electric shocks, but one group (high-anxiety) was told
that the shocks would be quite painful and the other group (low-anxiety) was told that they would be mild and
painless. Both groups were told that there would be a 10-min wait before the experiment began, and each subject
was given the choice of waiting alone or with the other subjects. The following are the results:
Wait Together Wait Alone Total
High-Anxiety 12 5 17
Low-Anxiety 4 9 13
Total 16 14 30
Use Fisher’s exact test to test whether there is a significant difference between the high- and low-anxiety groups.
What is a reasonable one-sided alternative?

Problem 8
Hill and Barton (2005): red against blue outfits - does it matter in combat sports? Although other colors are also
present in animal displays, it is specifically the presence and intensity of red coloration that correlates with male
dominance and testosterone levels. Increased redness during aggressive interactions may reflect relative dominance.
In the 2004 Olympic Games, contestants in four combat sports were randomly assigned red and blue outfits.
The winner counts in different sports
Red Blue Total
Boxing 148 120 268
Freestyle wrestling 27 24 51
Greco-Roman wrestling 25 23 48
Tae Kwon Do 45 35 80
Total 245 202 447

(a) Let πR denote the probability that the contestant wearing red wins. Test the null hypothesis that πR = 0.5
versus the alternative hypothesis that πR is the same in each sport, but πR 6= 0.5.
(b) Test the null hypothesis that πR = 0.5 versus the alternative hypothesis that allows πR to be different in
different sports, but not equal to 0.5.
(c) Are these hypothesis tests equivalent to that which would test the null hypothesis πR = 0.5 versus the
alternative hypothesis πR 6= 0.5, using as data the total numbers of wins summed over all the sports?
(d) Is there any evidence that wearing red is more favourable in some of the sports than others?

Problem 9
Suppose that a company wishes to examine the relationship of gender to job satisfaction, grouping job satisfaction
into four categories: very satisfied, somewhat satisfied, somewhat dissatisfied, and very dissatisfied. The company
plans to ask the opinion of 100 employees. Should you, the company’s statistician, carry out a chi-square test of
independence or a test of homogeneity?

10 Multiple regression
Pearson’s father-son data. The following scatter diagram shows the heights of 1,078 fathers and their full-grown
sons, in England, circa 1900. There is one dot for each father-son pair.

Focussing on 6 feet tall fathers (above average height), we see that their sons on average are shorter than their
fathers. Francis Galton called this phenomenon regression to mediocrity.

10.1 Simple linear regression model


A simple linear regression model connects two random variables (X, Y ), where Y is called response and X is called
predictor variable, by a linear relation

Y = β0 + β1 X + ε,    ε ∼ N(0, σ²),

where ε is the noisy part of the response that is not explained by the model. The key assumption of the model
requires that ε has a normal distribution N(0, σ²) independent of X. This assumption is called homoscedasticity,
meaning that the noise size σ is the same for all possible levels of the predictor variable.

For a given iid-sample (x1 , y1 ), . . . , (xn , yn ),

yi = β0 + β1 xi + ei,    i = 1, . . . , n,

the linear regression model is characterised by the likelihood function of the three-dimensional parameter θ =
(β0, β1, σ²):

L(θ) = Π_{i=1}^n (1/√(2πσ²)) exp{−(yi − β0 − β1 xi)²/(2σ²)} f(xi) = C σ^{−n} e^{−S(β0,β1)/(2σ²)},

where f(x) is the probability density function of the random variable X and

C = (2π)^{−n/2} f(x1) · · · f(xn),    S(β0, β1) = Σ_{i=1}^n (yi − β0 − β1 xi)².

This implies the following expression for the log-likelihood function l(θ) = ln L(θ):

l(θ) = ln C − n ln σ − S(β0, β1)/(2σ²).

Observe that

n^{−1} S(β0, β1) = n^{−1} Σ_{i=1}^n (yi − β0 − β1 xi)² = β0² + 2β0β1 x̄ − 2β0 ȳ − 2β1 \overline{xy} + β1² \overline{x²} + \overline{y²},

where

x̄ = (x1 + . . . + xn)/n,  ȳ = (y1 + . . . + yn)/n,  \overline{x²} = (x1² + . . . + xn²)/n,  \overline{y²} = (y1² + . . . + yn²)/n,  \overline{xy} = (x1 y1 + . . . + xn yn)/n

form a set of five sufficient statistics.


To obtain the maximum likelihood estimates of θ = (β0, β1, σ²), compute the derivatives

∂l/∂β0 = −(1/(2σ²)) ∂S/∂β0,

∂l/∂β1 = −(1/(2σ²)) ∂S/∂β1,

∂l/∂σ² = −n/(2σ²) + S(β0, β1)/(2σ⁴),

and set them equal to zero. Putting ∂S/∂β0 = 0 and ∂S/∂β1 = 0, we get the so-called normal equations

b0 + b1 x̄ = ȳ,    b0 x̄ + b1 \overline{x²} = \overline{xy},

implying

b1 = (\overline{xy} − x̄ȳ)/(\overline{x²} − x̄²) = r sy/sx,    b0 = ȳ − b1 x̄,

where

s²x = (1/(n−1)) Σ (xi − x̄)²,    s²y = (1/(n−1)) Σ (yi − ȳ)².

As a result, the fitted regression line y = b0 + b1 x takes the form

y = ȳ + r (sy/sx)(x − x̄),

involving the sample correlation coefficient defined using the sample covariance:

r = sxy/(sx sy),    sxy = (1/(n−1)) Σ (xi − x̄)(yi − ȳ).

Notice that the maximum likelihood estimates (b0, b1) of the parameters (β0, β1) are obtained by minimising the sum
of squares S(β0, β1). Therefore, they are called the least squares estimates. Warning: the least squares estimates
(b0, b1) are not robust against outliers exerting leverage on the fitted line.

Putting ∂l/∂σ² = 0, and replacing (β0, β1) with (b0, b1), we find the maximum likelihood estimate of σ² to be

σ̂² = S(b0, b1)/n,

where

S(b0, b1) = Σ_{i=1}^n (yi − ŷi)²,

and

ŷi = b0 + b1 xi

are the so-called predicted responses. The maximum likelihood estimate σ̂² is a biased but asymptotically
unbiased estimate of σ². An unbiased estimate of σ² is given by

s² = S(b0, b1)/(n − 2).
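A compact numerical sketch of the least squares fit; the ten (x, y) pairs are those of Problem 2 in the exercises below, and np.polyfit or scipy.stats.linregress would return the same slope and intercept:

```python
import numpy as np

x = np.array([0.34, 1.38, -0.65, 0.68, 1.40, -0.88, -0.30, -1.18, 0.50, -1.75])
y = np.array([0.27, 1.34, -0.53, 0.35, 1.28, -0.98, -0.72, -0.81, 0.64, -1.59])
n = len(x)

sx, sy = x.std(ddof=1), y.std(ddof=1)
sxy = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)   # sample covariance
r = sxy / (sx * sy)                                       # sample correlation

b1 = r * sy / sx                              # slope
b0 = y.mean() - b1 * x.mean()                 # intercept
y_hat = b0 + b1 * x
s2 = np.sum((y - y_hat) ** 2) / (n - 2)       # unbiased estimate of sigma^2

print(b0, b1, r, np.sqrt(s2))
```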

10.2 Residuals
The fitted regression line
y = b0 + b1 x = ȳ + r (sy/sx)(x − x̄),
is used for prediction of the response to a given predictor value x. For the given sample (x1 , y1 ), . . . , (xn , yn ), we
now compare the actual responses yi with the predicted responses ŷi . Introducing residuals by

êi = yi − ŷi ,

we can write
yi = ŷi + êi , i = 1, . . . , n,
The residuals (ê1 , . . . , ên ) are linearly connected via

ê1 + . . . + ên = 0, x1 ê1 + . . . + xn ên = 0, ŷ1 ê1 + . . . + ŷn ên = 0,

so we can say that the êi are uncorrelated with the xi and with the ŷi. The residuals êi are realisations of
random variables having normal distributions with zero means and

Var(êi) = σ²(1 − Σ_k (xk − xi)²/(n(n−1)s²x)),    Cov(êi, êj) = −σ² · Σ_k (xk − xi)(xk − xj)/(n(n−1)s²x).
The error sum of squares

SSE = S(b0, b1) = Σ_i (yi − ŷi)² = Σ_i êi²

can be expressed as

SSE = Σ_i (yi − ȳ)² − 2r (sy/sx) n(\overline{xy} − x̄ȳ) + r² (s²y/s²x) Σ_i (xi − x̄)² = (n − 1)s²y (1 − r²).

This leads to the following useful expression


s² = ((n − 1)/(n − 2)) s²y (1 − r²).
Using
yi − ȳ = ŷi − ȳ + êi
we obtain a decomposition
SST = SSR + SSE ,
where

SST = Σ_i (yi − ȳ)² = (n − 1)s²y

is the total sum of squares, and

SSR = Σ_i (ŷi − ȳ)² = (n − 1)b1² s²x

is the regression sum of squares. Combining these relations, we find that

r² = SSR/SST = 1 − SSE/SST.
This relation explains why the squared sample correlation coefficient r2 is called the coefficient of determination.
Coefficient of determination r2 is the proportion of variation in the response variable explained by the variation of
the predictor. Therefore, r2 has a more intuitive meaning than the sample correlation coefficient r.
To test the normality assumption, use the normal probability plot of the standardised residuals

ẽi = êi/si,    i = 1, . . . , n,

where

si = s √(1 − Σ_k (xk − xi)²/(n(n−1)s²x))

are the estimated standard deviations of the Êi. Within the simple linear regression model, the scatter plot of the
standardised residuals versus xi should look like a horizontal blur. If the linear model is not valid, it will show up
as a somewhat bowed shape of the residual scatter plot. In some cases, the non-linearity problem can be fixed by a
log-log transformation of the data.

10.3 Confidence intervals and hypothesis testing


The least squares estimators (b0, b1) are unbiased and consistent. Due to the normality assumption we have the
following exact distributions:

B0 ∼ N(β0, σ0²),    σ0² = σ² Σ xi² / (n(n−1)s²x),    s²b0 = s² Σ xi² / (n(n−1)s²x),    (B0 − β0)/S_B0 ∼ t_{n−2},

B1 ∼ N(β1, σ1²),    σ1² = σ² / ((n−1)s²x),    s²b1 = s² / ((n−1)s²x),    (B1 − β1)/S_B1 ∼ t_{n−2}.

There is a weak correlation between the two estimators:

Cov(B0, B1) = −σ² x̄ / ((n−1)s²x),

which is negative if x̄ > 0, and positive if x̄ < 0.

Exact 100(1 − α)% confidence intervals:    I_βi = bi ± t_{n−2}(α/2) · s_bi.

For i = 0 or i = 1 and a given value β*, one would like to test the null hypothesis H0 : βi = β*. Use the test
statistic

t = (bi − β*)/s_bi,

which is a realisation of a random variable T that has the exact null distribution

T ∼ t_{n−2}.

Two important examples.

1. Model utility test is built around the null hypothesis

H0 : β1 = 0

stating that there is no relationship between the predictor variable x and the response y. The corresponding
test statistic, often called the t-value, is

t = b1/s_b1.

The corresponding null distribution is tn−2 .

2. Zero-intercept test aims at


H0 : β0 = 0.

Compute its t-value


t = b0 /sb0 ,

and find whether this value is significant using the t-distribution with df = n − 2.
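A hedged sketch of the model utility test and the confidence interval for the slope, reusing the same ten pairs as before; scipy.stats.linregress is printed as a cross-check:

```python
import numpy as np
from scipy.stats import t as t_dist, linregress

x = np.array([0.34, 1.38, -0.65, 0.68, 1.40, -0.88, -0.30, -1.18, 0.50, -1.75])
y = np.array([0.27, 1.34, -0.53, 0.35, 1.28, -0.98, -0.72, -0.81, 0.64, -1.59])
n = len(x)

b1 = np.cov(x, y)[0, 1] / x.var(ddof=1)
b0 = y.mean() - b1 * x.mean()
s = np.sqrt(np.sum((y - b0 - b1 * x) ** 2) / (n - 2))
s_b1 = s / np.sqrt((n - 1) * x.var(ddof=1))

t_value = b1 / s_b1                                # model utility test of H0: beta1 = 0
p_value = 2 * t_dist.sf(abs(t_value), n - 2)
ci = b1 + np.array([-1, 1]) * t_dist.ppf(0.975, n - 2) * s_b1
print(t_value, p_value, ci)

print(linregress(x, y))                            # cross-check
```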

10.4 Intervals for individual observations


Given the earlier sample of size n, consider a new value x of the predictor variable. We wish to say something about
the response value

Y = β0 + β1 x + ε.

Its expected value

µ = β0 + β1 x

is estimated by

µ̂ = b0 + b1 x.

The standard error of µ̂ is computed as the square root of

Var(µ̂) = σ²/n + (σ²/(n−1)) · ((x − x̄)/sx)².

Exact 100(1 − α)% confidence interval:    Iµ = b0 + b1 x ± t_{n−2}(α/2) · s √(1/n + (1/(n−1))((x − x̄)/sx)²).

Exact 100(1 − α)% prediction interval:    I = b0 + b1 x ± t_{n−2}(α/2) · s √(1 + 1/n + (1/(n−1))((x − x̄)/sx)²).

The prediction interval has wider limits since it also contains the uncertainty due to the noise factor:

Var(Y − µ̂) = Var(µ + ε − µ̂) = σ² + Var(µ̂) = σ²(1 + 1/n + (1/(n−1))((x − x̄)/sx)²).

Graphically compare the formulas for Iµ and the prediction interval I by drawing the confidence bands around the
regression line both for the individual observation Y and for the mean µ.
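A brief sketch of both intervals at a new predictor value x0; all fitted quantities below are hypothetical placeholders:

```python
import numpy as np
from scipy.stats import t as t_dist

# assumed quantities from a fitted simple regression: b0, b1, s, x-bar, s_x, n
b0, b1, s = 0.05, 0.90, 0.25
xbar, sx, n = 0.0, 1.0, 10
x0 = 1.5                                  # new predictor value
alpha = 0.05

se_mean = s * np.sqrt(1/n + ((x0 - xbar) / sx) ** 2 / (n - 1))
se_pred = s * np.sqrt(1 + 1/n + ((x0 - xbar) / sx) ** 2 / (n - 1))
q = t_dist.ppf(1 - alpha / 2, n - 2)

mu_hat = b0 + b1 * x0
ci = (mu_hat - q * se_mean, mu_hat + q * se_mean)    # interval for the mean response
pi = (mu_hat - q * se_pred, mu_hat + q * se_pred)    # wider interval for a new observation
print(ci, pi)
```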

10.5 Linear regression and ANOVA
We can represent the two independent samples case from Chapter 7 as two groups of independent observations

1. first sample µ1 + e1 , . . . , µ1 + en ,

2. second sample µ2 + en+1 , . . . , µ2 + en+m ,

where the (e1 , . . . , en+m ) are realisations of independent random variables with a common distribution N(0, σ 2 ).
This setting is equivalent to a simple linear regression model with the sample size n + m

yi = β0 + β1 xi + ei ,

and fixed values for the predictors

x1 = . . . = xn = 0, xn+1 = . . . = xn+m = 1,

where the intercept and slope are found from

µ1 = β0 , µ2 = β0 + β1 .

Then the model utility test H0 : β1 = 0 becomes equivalent to the mean equality test H0 : µ1 = µ2 .
More generally, for the one-way ANOVA setting with I = p levels for the main factor and n = pJ observations

β0 + e i , i = 1, . . . , J,
β0 + β1 + e i , i = J + 1, . . . , 2J,
...
β0 + βp−1 + ei , i = (p − 1)J + 1, . . . , n,

we need a multiple linear regression model

yi = β0 + β1 xi,1 + . . . + βp−1 xi,p−1 + ei , i = 1, . . . , n

with dummy variables xi,j taking values 0 and 1 so that

xi,1 = 1 only for i = J + 1, . . . 2J,


xi,2 = 1 only for i = 2J + 1, . . . 3J,
...
xi,p−1 = 1 only for i = (p − 1)J + 1, . . . , n.

The corresponding design matrix (see below) has the following block form:

        ( 1  0  0  ...  0 )
        ( 1  1  0  ...  0 )
    X = ( 1  0  1  ...  0 ),
        ( ... ... ... ... )
        ( 1  0  0  ...  1 )

where each block is a column vector of length J whose components are either all ones or all zeros.
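A small sketch constructing this block design matrix for hypothetical sizes p and J, and recovering the group means by least squares:

```python
import numpy as np

p, J = 3, 4                                   # hypothetical: 3 treatment levels, 4 replications
n = p * J

# intercept column plus p-1 dummy columns
X = np.ones((n, p))
for j in range(1, p):
    X[:, j] = 0.0
    X[j * J:(j + 1) * J, j] = 1.0

rng = np.random.default_rng(0)
y = np.repeat([10.0, 12.0, 9.0], J) + rng.normal(scale=0.5, size=n)

b, *_ = np.linalg.lstsq(X, y, rcond=None)
print(b)        # b[0] estimates mu_1, and b[0] + b[j] estimates mu_{j+1}
```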

10.6 Multiple linear regression


A more flexible linear regression model is obtained by replacing a single predictor X with p − 1 predictors
(X1 , . . . , Xp−1 )
Y = β0 + β1 X1 + . . . + βp−1 Xp−1 + ε,    ε ∼ N(0, σ²),
again assuming that the noise is independent of the predictors. This is an extension of the simple linear regression
model corresponding to p = 2 and X1 = X.
The corresponding data set consists of n independent vectors (xi,1 , . . . , xi,p−1 , yi ) with n > p and

y1 = β0 + β1 x1,1 + . . . + βp−1 x1,p−1 + e1 ,


...
yn = β0 + β1 xn,1 + . . . + βp−1 xn,p−1 + en .

It is very convenient to use the matrix notation

y = Xβ + e,

where
y = (y1 , . . . , yn )T , β = (β0 , . . . , βp−1 )T , e = (e1 , . . . , en )T ,
are column vectors, and X is the so-called design matrix

        ( 1   x1,1   . . .   x1,p−1 )
    X = ( ... ...    . . .   ...    )
        ( 1   xn,1   . . .   xn,p−1 )

assumed to have rank p.


The machinery developed for the simple linear regression model works well for the multiple regression. Least
square estimates b = (b0 , . . . , bp−1 )T minimise

S(b) = ky − Xbk2 ,

where
kak2 = a21 + . . . + a2k , a = (a1 , . . . , ak ).
Solving the normal equations XT Xb = XT y we find the least squares estimates:

b = MXT y, M = (XT X)−1 .

Least squares multiple regression: predicted responses ŷ = Xb = Py, where P = XMXT .

Covariance matrix for the least square estimates

E{(B − β)(B − β)T } = σ 2 M

is a p × p matrix with elements Cov(Bi , Bj ). The vector of residuals

ê = y − ŷ = (I − P)y

has a covariance matrix σ 2 (I − P).

An unbiased estimate of σ² is given by s² = SSE/(n − p), where SSE = ‖ê‖².

Denote by
m11 , m22 , . . . , mp−1,p−1 , mpp
the diagonal elements of matrix M. Then the standard error of bj is computed as

sbj = s √m_{j+1,j+1}.

Exact sampling distributions:    (Bj − βj)/S_Bj ∼ t_{n−p},    j = 1, . . . , p − 1.

To check the underlying normality assumption, inspect the normal probability plot of the standardised residuals
êi/(s √(1 − pii)), where pii are the diagonal elements of P.

The coefficient of multiple determination can be computed similarly to the simple linear regression model as

R² = 1 − SSE/SST,

where SST = (n − 1)s²y. The problem with R² is that it increases even if irrelevant variables are added to the
model. To punish for irrelevant variables it is better to use the adjusted coefficient of multiple determination

R²a = 1 − ((n−1)/(n−p)) · SSE/SST = 1 − s²/s²y.

The adjustment factor (n−1)/(n−p) gets larger for larger numbers of predictors p.
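A compact matrix-form sketch of the formulas above on simulated data; in practice np.linalg.lstsq or a dedicated library is preferable to forming (XᵀX)⁻¹ explicitly:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 20, 3                                  # hypothetical sample size and number of columns
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])   # design matrix with intercept
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.3, size=n)

M = np.linalg.inv(X.T @ X)
b = M @ X.T @ y                               # least squares estimates
y_hat = X @ b
sse = np.sum((y - y_hat) ** 2)
s2 = sse / (n - p)                            # unbiased estimate of sigma^2
se_b = np.sqrt(s2 * np.diag(M))               # standard errors s_bj

sst = (n - 1) * y.var(ddof=1)
r2 = 1 - sse / sst
r2_adj = 1 - (n - 1) / (n - p) * sse / sst
print(b, se_b, r2, r2_adj)
```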

Example: flow rate vs stream depth
The data in the following table were gathered for an environmental impact study that examined the relationship
between the depth of a stream and the rate of its flow (Ryan et al 1976).
Depth x .34 .29 .28 .42 .29 .41 .76 .73 .46 .40
Flow rate y .64 .32 .73 1.33 .49 .92 7.35 5.89 1.98 1.12
A bowed shape of the plot of the residuals versus depth suggests that the relation between x and y is not linear.
The multiple linear regression framework can be applied to the quadratic model

y = β0 + β1 x + β2 x² + ε,

with x1 = x and x2 = x².
Coefficient Estimate Standard Error t value
β0 1.68 1.06 1.52
β1 −10.86 4.52 −2.40
β2 23.54 4.27 5.51
The residuals show no sign of systematic misfit. The test statistic t = 5.51 of the utility test of H0 : β2 = 0 shows
that the quadratic term in the model is statistically significant.

An empirical relationship developed in one region might break down
if extrapolated to a wider region in which no data have been observed.

Example: catheter length


Doctors want predictions on heart catheter length depending on child’s height and weight. The data consist of
n = 12 observations for the distance to pulmonary artery:

Height (in)   Weight (lb)   Distance (cm)
42.8          40.0          37.0
63.5          93.5          49.5
37.5          35.5          34.5
39.5          30.0          36.0
45.5          52.0          43.0
38.5          17.0          28.0
43.0          38.5          37.0
22.5           8.5          20.0
37.0          33.0          33.5
23.5           9.5          30.5
33.0          21.0          38.5
58.0          79.0          47.0

[Figure: scatter plot of the catheter data (distance against height and weight).]
We start with two simple linear regressions

H-model: L = β0 + β1H + ε,    W-model: L = β0 + β1W + ε.

The analysis of these two models is summarised as follows


Estimate H-model t value W-model t value
b0 (sb0 ) 12.1(4.3) 2.8 25.6(2.0) 12.8
b1 (sb1 ) 0.60(0.10) 6.0 0.28(0.04) 7.0
s 4.0 3.8
r2 0.78 0.80
Ra2 0.76 0.78
The plots of standardised residuals do not contradict the normality assumptions.
These two simple regression models should be compared to the multiple regression model

L = β0 + β1H + β2W + ε,

which gives

b0 = 21, sb0 = 8.8, b0 /sb0 = 2.39,
b1 = 0.20, sb1 = 0.36, b1 /sb1 = 0.56,
b2 = 0.19, sb2 = 0.17, b2 /sb2 = 1.12,
s = 3.9, R2 = 0.81, Ra2 = 0.77.
In contrast to the simple models, we can reject neither H1: β1 = 0 nor H2: β2 = 0. This paradox is explained by the different meaning of the slope parameters in the simple and multiple regression models. In the multiple model, β1 is the expected change in L when H is increased by one unit and W is held constant.
Collinearity problem: height and weight have a strong linear relationship. The fitted plane has a well resolved
slope along the line about which the (H, W ) points fall and poorly resolved slopes along the H and W axes.

[Scatter plot of weight (0–100 lb) against height (20–65 in): the (H, W) points fall close to a straight line]

Conclusion: since the simple W-model

L = β0 + β1W + ε

gives the highest adjusted coefficient of determination, there is little or no gain from adding H to the regression model with the single explanatory variable W.
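A possible Matlab sketch for comparing the three models by the adjusted coefficient of determination, using the data columns from the table above (the anonymous function adjR2 is introduced here for illustration only):

% sketch: adjusted R^2 for the H-model, W-model and the full model
H = [42.8 63.5 37.5 39.5 45.5 38.5 43.0 22.5 37.0 23.5 33.0 58.0]';
W = [40.0 93.5 35.5 30.0 52.0 17.0 38.5  8.5 33.0  9.5 21.0 79.0]';
L = [37.0 49.5 34.5 36.0 43.0 28.0 37.0 20.0 33.5 30.5 38.5 47.0]';
n = length(L);
adjR2 = @(X,y) 1 - (n-1)/(n-size(X,2)) * ...
        sum((y - X*((X'*X)\(X'*y))).^2) / ((n-1)*var(y));
[adjR2([ones(n,1) H], L)  adjR2([ones(n,1) W], L)  adjR2([ones(n,1) H W], L)]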

10.7 Exercises
Problem 1
Suppose we are given a two-dimensional iid-sample
(x1 , y1 ), . . . , (xn , yn ).
Verify that the sample covariance is an unbiased estimate of the population covariance.

Problem 2
Ten pairs
x 0.34 1.38 -0.65 0.68 1.40 -0.88 -0.30 -1.18 0.50 -1.75
y 0.27 1.34 -0.53 0.35 1.28 -0.98 -0.72 -0.81 0.64 -1.59
Draw a scatter plot.
(a) Fit a straight line y = a + bx by the method of least squares, and sketch it on the plot.
(b) Fit a straight line x = c + dy by the method of least squares, and sketch it on the plot.
(c) Are the lines on (a) and (b) the same? If not, why not?

Problem 3
Two consecutive grades
X = the high school GPA (grade point average),
Y = the freshman GPA.
Allow two different intercepts for females
Y = βF + β1X + ε,   ε ∼ N(0, σ²)
and for males
Y = βM + β1X + ε,   ε ∼ N(0, σ²).
Give the form of the design matrix for such a model.

Problem 4
Simple linear regression model
Y = β0 + β1X + ε,   ε ∼ N(0, σ²).
Using n pairs of (xi , yi ) we fit a regression line by

y = b0 + b1x,   Var(B0) = σ² \overline{x²}/((n − 1)s²x),   Var(B1) = σ²/((n − 1)s²x),   Cov(B0, B1) = −σ² x̄/((n − 1)s²x).

For a given x = x0 , we wish to predict the value of a new observation

Y0 = β0 + β1x0 + ε0

by
ŷ0 = b0 + b1 x0 .
(a) Find an expression for the variance of Ŷ0 − Y0, and compare it to the variance of Ŷ0. Find a_n, the standard deviation of (Ŷ0 − Y0)/σ.
(b) Derive a formula for 95% prediction interval I such that

P(Y0 ∈ I) = 0.95

using
(Y0 − Ŷ0)/(S·a_n) ∼ tn−2.

Problem 5
Data collected for

x = midterm grade,
y = final grade,
gave
r = 0.5, x̄ = ȳ = 75, sx = sy = 10.
(a) Given x = 95, predict the final score.
(b) Given y = 85 and not knowing the midterm score, predict the midterm score.

Problem 6
Let
Y = X + βZ,
where X ∈ N(0, 1) and Z ∈ N(0, 1) are independent.

(a) Show that the correlation coefficient for X and Y is


ρ = 1/√(1 + β²).

(b) Use the result of part (a) to generate bivariate samples (xi , yi ) of size 20 with population correlation
coefficients −0.9, −0.5, 0, 0.5, and 0.9. Compute the sample correlation coefficients.
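One way to carry out the simulation in part (b) is sketched below; the construction follows part (a), with the sign of X flipped to obtain negative correlations (this particular recipe is an assumption, not part of the problem statement):

% sketch for part (b): a sample of size 20 with population correlation rho
rho = 0.5;                            % try -0.9, -0.5, 0, 0.5, 0.9
n = 20;
x = randn(n,1); z = randn(n,1);
if rho == 0
    y = z;                            % independent of x
else
    beta = sqrt(1/rho^2 - 1);         % then 1/sqrt(1+beta^2) = |rho|
    y = sign(rho)*x + beta*z;         % flip the sign of x for negative rho
end
R = corrcoef(x,y);
r = R(1,2)                            % sample correlation coefficient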

Problem 7
The stopping distance of an automobile on a certain road was studied as a function of velocity (Brownee 1960)
velocity of a car x (mi/h) 20.5 20.5 30.5 40.5 48.8 57.8
stopping distance y (ft) 15.4 13.3 33.9 73.1 113.0 142.6

Fit y and √y as linear functions of velocity, and examine the residuals in each case. Which fit is better? Can you
suggest any physical reason that explains why?

11 Overview of basic probability theory
11.1 Probability rules
Sample space and random events

Ω = the set of all possible outcomes in a random experiment,


A, B ⊂ Ω, A ∩ B = {A and B}, A ∪ B = {A or B or both}.
Division rule: P(A) = (number of favourable outcomes)/(total number of equally likely outcomes)

Addition rule: P(A ∪ B) = P(A) + P(B) − P(A ∩ B)

Complementary event probability: P(Ā) = 1 − P(A), Ā = {A has not occurred}

Conditional probability of A given that B has occurred: P(A|B) = P(A ∩ B)/P(B)

A and B are independent events if P(A ∩ B) = P(A)P(B)

Multiplication rule: P(A ∩ B) = P(A|B)P(B)

Law of Total Probability: if {B1, . . . , Bk} is a partition of Ω, then P(A) = Σ_{i=1}^k P(Bi)P(A|Bi)

Bayes' Probability Law: prior probabilities P(Bi) and posterior probabilities P(Bi|A) = P(A|Bi)P(Bi)/P(A)

11.2 Random variables


A discrete random variable X is characterised by a probability mass function

f (x) = P(X = x).

A continuous random variable X is characterised by a probability density function

f(x) = lim_{δ→0} P(x < X < x + δ)/δ.

Cumulative distribution function

F(x) = P(X ≤ x) = Σ_{y ≤ x} f(y)   or   F(x) = ∫_{y ≤ x} f(y) dy.

Expected value (mean or average) of X:


µ = E(X) = Σ_x x f(x)   or   µ = ∫ x f(x) dx.

The expectation operator is linear

E(X + Y ) = E(X) + E(Y ), E(c · X) = c · E(X).

Variance and standard deviation


2
p
σX = Var(X) = E((X − µ)2 ) = E(X 2 ) − µ2 , σX = Var(X),
X Z
E(X 2 ) = x2 f (x) or E(X 2 ) = x2 f (x)dx,

Var(c · X) = c2 · Var(X), σcX = c · σX .

11.3 Important parametric distributions
Discrete uniform distribution X ∼ U(N ):

f(k) = 1/N, 1 ≤ k ≤ N,   E(X) = (N + 1)/2,   Var(X) = (N² − 1)/12.
Continuous uniform distribution X ∼ U(a, b):

f(x) = 1/(b − a), a < x < b,   E(X) = (a + b)/2,   Var(X) = (b − a)²/12.
Binomial distribution X ∼ Bin(n, p):

f(k) = (n choose k) p^k (1 − p)^{n−k}, 0 ≤ k ≤ n,   E(X) = np,   Var(X) = np(1 − p).

Bernoulli distribution Ber(p) = Bin(1, p).

Hypergeometric distribution X ∼ Hg(N, n, p):

f(k) = (Np choose k)(N(1 − p) choose n − k) / (N choose n),   max(0, n − N(1 − p)) ≤ k ≤ min(n, Np),
E(X) = np,
Var(X) = np(1 − p)(1 − (n − 1)/(N − 1)),

where the factor 1 − (n − 1)/(N − 1) is called the finite population correction.

Geometric distribution X ∼ Geom(p):


f(k) = p(1 − p)^{k−1}, k ≥ 1,   E(X) = 1/p,   Var(X) = (1 − p)/p².

Exponential distribution X ∼ Exp(λ):

f (x) = λe−λx , x > 0, E(X) = σX = 1/λ.

Poisson distribution X ∼ Pois(λ):

f(k) = (λ^k/k!) e^{−λ}, k ≥ 0,   E(X) = Var(X) = λ.

Poisson approximation

Bin(n, p) ≈ Pois(np) if n ≥ 100 and p ≤ 0.01 and np ≤ 20.

Standard normal distribution Z ∼ N(0, 1) has the following density function and distribution function respectively
φ(z) = (1/√(2π)) e^{−z²/2},   Φ(z) = ∫_{−∞}^{z} φ(x) dx.

Normal distribution X ∼ N(µ, σ 2 ):

(X − µ)/σ ∼ N(0, 1),   f(x) = (1/σ) φ((x − µ)/σ),   E(X) = µ,   Var(X) = σ².
Central Limit Theorem: if X1 , . . . , Xn are independent and identically distributed with mean µ and variance σ 2 ,
then for all z,
P((X1 + . . . + Xn − nµ)/(σ√n) ≤ z) → Φ(z),   n → ∞.
Normal approximations:

Bin(n, p) ≈ N(np, np(1 − p)), if both np ≥ 5 and n(1 − p) ≥ 5,

Pois(λ) ≈ N(λ, λ), for λ ≥ 5,

Hg(N, n, p) ≈ N(np, np(1 − p)(N − n)/(N − 1)), if both np ≥ 5 and n(1 − p) ≥ 5.
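A quick numerical check of these rules of thumb can be done along the following lines (a sketch; the chosen values n = 1000, p = 0.01 are just an example satisfying both conditions):

% sketch: exact binomial probabilities vs Poisson and normal approximations
n = 1000; p = 0.01; k = 0:25;         % np = 10 satisfies both rules of thumb
exact  = binopdf(k,n,p);
pois   = poisspdf(k,n*p);             % Pois(np)
normal = normcdf(k+0.5,n*p,sqrt(n*p*(1-p))) ...
       - normcdf(k-0.5,n*p,sqrt(n*p*(1-p)));   % N(np,np(1-p)), continuity corrected
[exact' pois' normal']                % the three columns should be close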

11.4 Joint distributions
Given a joint probability mass (density) function fX,Y (x, y) of a random vector (X, Y ), the marginal distributions
of X and Y are respectively,
fX(x) = Σ_y fX,Y(x, y),   fY(y) = Σ_x fX,Y(x, y),

or in the continuous case,


fX(x) = ∫_{−∞}^{∞} fX,Y(x, y) dy,   fY(y) = ∫_{−∞}^{∞} fX,Y(x, y) dx.

Two random variables X and Y are independent if for all pairs of values (x, y)

fX,Y (x, y) = fX (x)fY (y).

Conditional distribution of (Y |X = x):


fY|X(y|x) = fX,Y(x, y)/fX(x).
The resulting conditional expectations E(Y |X) and Var(Y |X) are random variables such that

E(E(Y |X)) = E(Y ),


Var(Y ) = Var(E(Y |X)) + E(Var(Y |X)),

the latter rule is called the law of total variance.

Addition rule for variance


Var(X + Y ) = Var(X) + Var(Y ) + 2Cov(X, Y ).
Covariance, a measure of association between X and Y

Cov(X, Y ) = E(X − µX )(Y − µY ) = E(XY ) − µX µY .

Correlation coefficient is unit free


ρ = Cov(X, Y)/(σX σY),   −1 ≤ ρ ≤ 1.
Independent X and Y are uncorrelated: ρ = 0. If X and Y are uncorrelated, then

Var(X + Y ) = Var(X) + Var(Y ).

Multinomial distribution (X1, . . . , Xr) ∼ Mn(n; p1, . . . , pr) is an extension of the binomial distribution:

P(X1 = k1, . . . , Xr = kr) = (n choose k1, . . . , kr) p1^{k1} · · · pr^{kr}.

The marginal distributions are binomial Xi ∼ Bin(n; pi ), and

Cov(Xi , Xj ) = −npi pj , i 6= j.

Bivariate normal distribution (X, Y ) ∼ N(µ1 , µ2 , σ12 , σ22 , ρ) has joint density

f(x, y) = (1/(2πσ1σ2√(1 − ρ²))) exp{ −(1/(2(1 − ρ²))) [ (x − µ1)²/σ1² + (y − µ2)²/σ2² − 2ρ(x − µ1)(y − µ2)/(σ1σ2) ] },

where −∞ < x, y < ∞. The marginal distributions of the bivariate normal distribution are also normal

X ∼ N(µ1 , σ12 ), Y ∼ N(µ2 , σ22 ).

Moreover, the conditional distribution of (Y |X = x) is normal with mean and variance

µ2 + ρ(σ2/σ1)(x − µ1),   σ2²(1 − ρ²).
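As an illustrative sketch (not part of the course material), the slope ρσ2/σ1 of the conditional mean can be checked by simulation, assuming the Statistics Toolbox function mvnrnd is available:

% sketch: check the conditional mean slope rho*sigma2/sigma1 by simulation
mu = [1 2]; s1 = 2; s2 = 3; rho = 0.6;              % hypothetical parameters
Sigma = [s1^2 rho*s1*s2; rho*s1*s2 s2^2];
xy = mvnrnd(mu, Sigma, 10000);
b = polyfit(xy(:,1), xy(:,2), 1);                   % fitted slope of y on x
[b(1)  rho*s2/s1]                                   % should be close to 0.9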

12 Statistical distribution tables

12.1 Normal distribution table

12.2 Chi-square distribution table

12.3 Tails of t-distribution

12.4 Critical values of the F-distribution

12.5 Critical values of the Studentised range distribution

12.6 Critical values for the rank sum test

12.7 Critical values for the signed rank test
13 Solutions to exercises
13.1 Solutions to Section 2 (random sampling)
Solution 1
Here we consider sampling with replacement. For an answer in the case of sampling without replacement consult
the book page A36.
Population distribution
Values    1    2    4    8
Probab.  1/5  2/5  1/5  1/5

Population mean and variance are computed in three steps


µ = 1 · (1/5) + 2 · (2/5) + 4 · (1/5) + 8 · (1/5) = 3.4,
E(X²) = 1 · (1/5) + 4 · (2/5) + 16 · (1/5) + 64 · (1/5) = 17.8,
σ² = 17.8 − µ² = 6.24.

The list of X̄ values (and their probabilities in brackets) for n = 2 observations taken with replacement:

1 2 4 8 Total prob.
1 1.0 (1/25) 1.5 (2/25) 2.5 (1/25) 4.5 (1/25) 1/5
2 1.5 (2/25) 2.0 (4/25) 3.0 (2/25) 5.0 (2/25) 2/5
4 2.5 (1/25) 3.0 (2/25) 4.0 (1/25) 6.0 (1/25) 1/5
8 4.5 (1/25) 5.0 (2/25) 6.0 (1/25) 8.0 (1/25) 1/5
Tot. prob. 1/5 2/5 1/5 1/5 1

This yields the following sampling distribution of X̄:

Values    1     1.5   2     2.5   3     4     4.5   5     6     8
Probab.  1/25  4/25  4/25  2/25  4/25  1/25  2/25  4/25  2/25  1/25

Using the same three steps we find


E(X̄) = 1 · (1/25) + 1.5 · (4/25) + 2 · (4/25) + 2.5 · (2/25) + 3 · (4/25) + 4 · (1/25) + 4.5 · (2/25) + 5 · (4/25) + 6 · (2/25) + 8 · (1/25) = 3.4,
E(X̄²) = 1 · (1/25) + (1.5)² · (4/25) + 4 · (4/25) + (2.5)² · (2/25) + 9 · (4/25) + 16 · (1/25) + (4.5)² · (2/25) + 25 · (4/25) + 36 · (2/25) + 64 · (1/25) = 14.68,
Var(X̄) = 14.68 − (3.4)² = 3.12.

We see that indeed,


E(X̄) = µ,   Var(X̄) = 3.12 = σ²/n.

Solution 2
Dichotomous data
n = 1500,   p̂ = 0.55,   1 − p̂ = 0.45,   s_p̂ = √(p̂(1 − p̂)/(n − 1)) = √(0.55 × 0.45/1499) = 0.013.

Population margin of victory


v = p − (1 − p) = 2p − 1.
Estimated margin of victory
v̂ = p̂ − (1 − p̂) = 2p̂ − 1 = 0.1.
(a) Since
Var(V̂ ) = Var(2P̂ ),
the standard error of v̂ is twice the standard error of p̂

sv̂ = 2sp̂ = 0.026.

(b) Approximate 95% confidence interval for v is

Iv = v̂ ± 1.96sv̂ = 0.10 ± 0.05.

Solution 3
Normal approximation: (X̄ − µ)/S_X̄ is asymptotically N(0, 1)-distributed. From

0.90 ≈ P((X̄ − µ)/S_X̄ > −1.28) = P(−∞ < µ < X̄ + 1.28 S_X̄),

0.95 ≈ P((X̄ − µ)/S_X̄ < 1.645) = P(X̄ − 1.645 S_X̄ < µ < ∞),

we find
k1 = 1.28, k2 = 1.645.

Solution 4
Randomised response method. Consider

x = number of ”yes” responses for n inmates.

Then X has Bin (n, p) distribution, where


p = P(a "yes" answer) = (5/6) · q + (1/6) · (1 − q) = (1 + 4q)/6.
Replacing p by p̂ = x/n we get an equation
p̂ = (1 + 4q̃)/6,
whose solution gives a method of moments estimate q̃ of the population proportion q
q̃ = (6p̂ − 1)/4.
The estimate is unbiased
E(Q̃) = (6p − 1)/4 = q.
Its variance equals
Var(Q̃) = (9/4) · Var(P̂) = (9/4) · p(1 − p)/n = (1 + 4q)(5 − 4q)/(16n).
Take for example n = 40, x = 8. Then p̂ = 0.2 and
q̃ = (6p̂ − 1)/4 = 0.05.
The estimated standard error is
s_q̃ = √((1 + 4q̃)(5 − 4q̃)/(16n)) = 0.095.
The estimate is unreliable. We have to increase the sample size.

Solution 5
Data summary: N = 2000, n = 25, Σ xi = 2351, Σ xi² = 231305.
(a) Unbiased estimate of µ is
2351
x̄ = = 94.04.
25
(b) Sample variance
s² = (n/(n − 1))(\overline{x²} − x̄²) = (25/24)(231305/25 − (94.04)²) = 425.71.
Unbiased estimate of σ 2 is
((N − 1)/N) s² = (1999/2000) · 425.71 = 425.49.
Unbiased estimate of Var(X̄) is
s²x̄ = (s²/n)(1 − n/N) = 16.81.
(c) An approximate 95% confidence interval for µ

Iµ = x̄ ± 1.96 sx̄ = 94.04 ± 1.96·√16.81 = 94.04 ± 8.04.
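As a sketch, the numbers in (a)–(c) can be reproduced from the data summary:

% sketch: point estimates and 95% confidence interval from the data summary
N = 2000; n = 25; sx = 2351; sxx = 231305;
xbar = sx/n;                              % 94.04
s2 = n/(n-1)*(sxx/n - xbar^2);            % 425.71
se = sqrt(s2/n*(1 - n/N));                % sqrt(16.81)
[xbar - 1.96*se, xbar + 1.96*se]          % approximate 95% confidence interval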

Solution 6
The bias is
E(X̄²) − µ² = E(X̄²) − (E X̄)² = Var(X̄) = (σ²/n)(1 − (n − 1)/(N − 1)).
For large n, the bias is small.

Solution 7
Stratified population of size N = 2010 with k = 7 strata.

(a) With n = 100, we get the following answers using the relevant formulas
Stratum number j 1 2 3 4 5 6 7 Weighted mean
Stratum proportion wj 0.20 0.23 0.19 0.17 0.08 0.06 0.07
Stratum mean µj 5.4 16.3 24.3 34.5 42.1 50.1 63.8 µ = 26.15
Stratum standard deviation σj 8.3 13.3 15.1 19.8 24.5 26.0 35.2 σ̄ = 16.94
Optimal allocation n·wjσj/σ̄      10   18   17   19   12    9   15
Proportional allocation nwj 20 23 19 17 8 6 7
(b) Since σ̄² = 286.9 and \overline{σ²} = 339.7, we have
Var(X̄so) = σ̄²/n = 2.87,   Var(X̄sp) = \overline{σ²}/n = 3.40,   Var(X̄) = σ²/n = 6.15,
where σ 2 is computed in the next item.

(c) We have µ = 26.15, and


Σ_{j=1}^k wj(µj − µ)² = 274.9.
Therefore
σ 2 = 339.7 + 274.9 = 614.6, σ = 24.8.
(d) If n1 = . . . = n7 = 10 and n = 70, then Var(X̄s ) = 4.45. The requested sample size x = 140 is found from
the equation
Var(X̄) = σ²/x = 4.45.
(e) If n = 70, then Var(X̄sp ) = 4.92. Solving the equation
Var(X̄) = σ²/x = 4.92,
we find that the requested sample size is x = 127, which is smaller than the answer in (d).

Solution 9
Stratified population with
N = 5, k = 2, w1 = 0.6, w2 = 0.4, µ1 = 1.67, µ2 = 6, σ12 = 0.21, σ22 = 4.
Given n1 = n2 = 1 and n = 2, the sampling distribution of the stratified sample mean x̄s = 0.6x1 + 0.4x2 is
x1 = 1 x1 = 2 Total prob.
x2 = 4 2.2 (1/6) 2.8 (2/6) 1/2
x2 = 8 3.8 (1/6) 4.4 (2/6) 1/2
Tot. prob. 1/3 2/3 1
We find that
E(X̄s) = 2.2 · (1/6) + 2.8 · (2/6) + 3.8 · (1/6) + 4.4 · (2/6) = 3.4,
(E(X̄s))² = 11.56,
E(X̄s²) = (2.2)² · (1/6) + (2.8)² · (2/6) + (3.8)² · (1/6) + (4.4)² · (2/6) = 12.28,
Var(X̄s) = 12.28 − 11.56 = 0.72.
These results are in agreement with the formulas
E(X̄s) = µ,   Var(X̄s) = w1²σ1²/n1 + . . . + wk²σk²/nk = 0.36σ1² + 0.16σ2².

13.2 Solutions to Section 3 (parameter estimation)
Solution 1
A method of moment estimate of the parameter λ for the Poisson distribution model is given by the sample mean
λ̃ = 3.9. Using this value we compute the expected counts, see the table below. Comparing the observed and expected counts by eye we see that the Poisson model does not fit well. The sample variance is close to 5, which shows that there is overdispersion: the variance is larger than the mean.
This extra variation in the data can be explained by the fact that the 300 intervals were distributed over various
hours of the day and various days of the week.
n   Observed frequency   Expected frequency
0 14 6.1
1 30 23.8
2 36 46.3
3 68 60.1
4 43 58.5
5 43 45.6
6 30 29.6
7 14 16.4
8 10 8.0
9 6 3.5
10 4 1.3
11 1 0.5
12 1 0.2
13+ 0 0.1

Solution 2
Number X of yeast cells on a square. Test the Poisson model X ∼ Pois(λ).

Concentration 1.
x̄ = 0.6825, x2 = 1.2775, s2 = 0.8137, s = 0.9021, sx̄ = 0.0451.
Approximate 95% confidence interval
Iµ = 0.6825 ± 0.0884.
Pearson’s chi-squared test based on λ̂ = 0.6825:
x 0 1 2 3 4+ Total
Observed 213 128 37 18 4 400
Expected 202.14 137.96 47.08 10.71 2.12 400
Observed test statistic χ2 = 10.12, df = 5 − 1 − 1 = 3, p-value < 0.025. Reject the model.

Concentration 2.
x̄ = 1.3225, x2 = 3.0325, s = 1.1345, sx̄ = 0.0567.
Approximate 95% confidence interval
Iµ = 1.3225 ± 0.1112.
Pearson’s chi-squared test: observed test statistic χ2 = 3.16, df = 4, p-value > 0.10. Do not reject the model.

Concentration 3.
x̄ = 1.8000, s = 1.1408, sx̄ = 0.0701.
Approximate 95% confidence interval for
Iµ = 1.8000 ± 0.1374.
Pearson’s chi-squared test: observed test statistic χ2 = 7.79, df = 5, p-value > 0.10. Do not reject the model.

Concentration 4.
n = 410, x̄ = 4.5659, s2 = 4.8820, sx̄ = 0.1091.
Approximate 95% confidence interval
Iµ = 4.566 ± 0.214.
Pearson’s chi-squared test: observed test statistic χ2 = 13.17, df = 10, p-value > 0.10. Do not reject the model.

Solution 3
Population distribution: X takes values 0, 1, 2, 3 with probabilities
p0 = (2/3)·θ,   p1 = (1/3)·θ,   p2 = (2/3)·(1 − θ),   p3 = (1/3)·(1 − θ),
so that
p0 + p1 = θ, p2 + p3 = 1 − θ.
We are given an iid-sample with
n = 10, x̄ = 1.5, s = 1.08,
and observed counts
x 0 1 2 3 Total
Ox 2 3 3 2 10
(a) Method of moments. Using
µ = (1/3)·θ + 2·(2/3)·(1 − θ) + 3·(1/3)·(1 − θ) = 7/3 − 2θ,
derive an equation
x̄ = 7/3 − 2θ̃.
It gives an unbiased estimate
θ̃ = 7/6 − x̄/2 = 7/6 − 3/4 = 0.417.
(b) To find s_θ̃, observe that
Var(Θ̃) = (1/4)·Var(X̄) = σ²/40.
Thus we need to find s_θ̃, which estimates σ_θ̃ = σ/6.325. Next we estimate σ using two methods.

Method 1. From
σ² = E(X²) − µ² = (1/3)·θ + 4·(2/3)·(1 − θ) + 9·(1/3)·(1 − θ) − (7/3 − 2θ)² = 2/9 + 4θ − 4θ²,
we estimate σ as
√(2/9 + 4θ̃ − 4θ̃²) = 1.093.
This gives
s_θ̃ = 1.093/6.325 = 0.173.
Method 2:
s_θ̃ = s/6.325 = 1.08/6.325 = 0.171.
(c) The likelihood function is obtained using (O0, O1, O2, O3) ∼ Mn(n, p0, p1, p2, p3):

L(θ) = const · ((2/3)θ)^{O0} ((1/3)θ)^{O1} ((2/3)(1 − θ))^{O2} ((1/3)(1 − θ))^{O3} = const · θ^t (1 − θ)^{n−t},

where t = O0 + O1 is a sufficient statistic. Notice that T = O0 + O1 has a Bin(n, θ) distribution.


Log-likelihood and its derivative

l(θ) = const + t ln θ + (n − t) ln(1 − θ),


l′(θ) = t/θ − (n − t)/(1 − θ).
Setting the last expression to zero, we find
t/θ̂ = (n − t)/(1 − θ̂),   θ̂ = t/n = (2 + 3)/10 = 1/2.
The maximum likelihood estimate is the sample proportion, an unbiased estimate of the population proportion θ.

(d) We find s_θ̂ using the formula for the standard error of a sample proportion

s_θ̂ = √(θ̂(1 − θ̂)/(n − 1)) = 0.167.

A similar answer is obtained using the formula

s_θ̂ = 1/√(n I(θ̂)),   I(θ) = −E(g(Y, θ)),   g(y, θ) = (∂²/∂θ²) ln f(y|θ),

where Y ∼ Ber(θ). Since f(1|θ) = θ and f(0|θ) = 1 − θ, we have

g(1, θ) = (∂²/∂θ²) ln θ = −1/θ²,   g(0, θ) = (∂²/∂θ²) ln(1 − θ) = −1/(1 − θ)²,

so that

I(θ) = −E(g(Y, θ)) = −g(1, θ)f(1|θ) − g(0, θ)f(0|θ) = (1/θ²)·θ + (1/(1 − θ)²)·(1 − θ) = 1/(θ(1 − θ)).

Solution 4
Likelihood function of X ∼ Bin(n, p) for a given n and X = x is

L(p) = (n choose x) p^x (1 − p)^{n−x} ∝ p^x (1 − p)^{n−x}.
(a) To maximise L(p) we maximise
ln(p^x (1 − p)^{n−x}) = x ln p + (n − x) ln(1 − p).
Since
(∂/∂p)(x ln p + (n − x) ln(1 − p)) = x/p − (n − x)/(1 − p),
we have to solve x/p = (n − x)/(1 − p), which gives the maximum likelihood estimate formula p̂ = x/n.

(b) We have X = Y1 + . . . + Yn , where (Y1 , . . . , Yn ) are iid Bernoulli random variables with
f(y|p) = p^y (1 − p)^{1−y},   y = 0, 1.
By the Cramér–Rao inequality, if p̃ is an unbiased estimate of p, then
Var(P̃) ≥ 1/(n I(p)),
where
I(p) = −E((∂²/∂p²) ln f(Y|p)) = 1/(p(1 − p)),
see Solution 3 (d). We conclude that the variance of the sample proportion p̂ attains the Cramér–Rao lower bound, since
Var(P̂) = p(1 − p)/n.
(c) Plot L(p) = 252 p^5 (1 − p)^5. The top of the curve is in the middle, at p̂ = 0.5.
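A sketch of the plot:

% sketch of the likelihood curve in (c)
p = 0:0.01:1;
plot(p, 252*p.^5.*(1-p).^5)        % L(p) for n = 10, x = 5
xlabel('p'), ylabel('L(p)')        % maximum at p = 0.5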

Solution 5
The observed serial number x = 888 can be modeled by the discrete uniform distribution X ∼ U(N ).

(a) The method of moments estimate of N is obtained from


µ = (N + 1)/2,   888 = (Ñ + 1)/2.

It gives Ñ = 2x − 1 = 1775. This is an unbiased estimate.

(b) The likelihood function


L(N) = P(X = x) = 1{1 ≤ x ≤ N}/N = 1{N ≥ 888}/N
reaches its maximum at N̂ = 888. We see that in this case the MLE is severely biased.

Solution 6
Statistical model: x is the number of black balls obtained by sampling k balls without replacement from an urn
with N balls of which n balls are black. Hypergeometric distribution
P(X = 20) = (n choose 20)(N − n choose 30)/(N choose 50).

The likelihood function

L(N) = (100 choose 20)(N − 100 choose 30)/(N choose 50) = const · (N − 100)(N − 101) · · · (N − 129) / (N(N − 1) · · · (N − 49)).

To find the value of N = N̂ that maximises L(N ), consider the ratio

L(N)/L(N − 1) = (N − 100)(N − 50)/(N(N − 130)).

If N < N̂, then L(N)/L(N − 1) > 1, and if N > N̂, then L(N)/L(N − 1) < 1. Solving the equation

L(N̂)/L(N̂ − 1) = 1  ⇔  (N̂ − 100)(N̂ − 50) = N̂(N̂ − 130),

we arrive at the maximum likelihood estimate N̂ = 5000/20 = 250. The answer is very intuitive, as we expect that

100 : N ≈ 20 : 50.

Solution 7
An iid-sample of size n = 16 from a normal distribution.

(a) The summary statistics


x̄ = 3.6109, s2 = 3.4181, sx̄ = 0.4622
suggest an estimate for µ to be 3.6109, and an estimate for σ 2 to be 3.4181.

(b), (c) Exact confidence intervals

90% 95% 99%


Iµ 3.61 ± 0.81 3.61 ± 0.98 3.61 ± 1.36
Iσ 2 (2.05; 7.06) (1.87; 8.19) (1.56; 11.15)
Iσ (1.43; 2.66) (1.37; 2.86) (1.25; 3.34)
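A sketch of how the 95% intervals can be recomputed (tinv and chi2inv are the Student and chi-squared quantile functions of the Statistics Toolbox):

% sketch: exact confidence intervals for mu and sigma^2 at the 95% level
n = 16; xbar = 3.6109; s2 = 3.4181; alpha = 0.05;
Imu = xbar + tinv([alpha/2 1-alpha/2], n-1)*sqrt(s2/n)
Isigma2 = (n-1)*s2./chi2inv([1-alpha/2 alpha/2], n-1)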

(d) To find sample size x that halves the confidence interval length we set up an equation using the exact
confidence interval formula for the mean
t15(α/2) · s/√16 = 2 · t_{x−1}(α/2) · s′/√x,

where s′ is the sample standard deviation for the sample of size x. A simplistic version of this equation, 1/4 = 2/√x,
implies x ≈ (2 · 4)² = 64. Further adjustment for a 95% confidence interval is obtained using

t15(α/2) = 2.13,   t_{x−1}(α/2) ≈ 2,

yielding x ≈ (2 · 4 · 2/2.13)² = 56.4. We conclude that going from a sample of size 16 to a sample of size 56 would halve the length of the confidence interval for µ.

Solution 8
An iid-sample (x1 , . . . , xn ) from the uniform distribution U(0, θ) with density

f(x|θ) = (1/θ) 1{0 ≤ x ≤ θ}.

(a) The method of moments estimate θ̃ is unbiased:
µ = θ/2,   θ̃ = 2x̄,   E(Θ̃) = θ,   Var(Θ̃) = 4σ²/n = θ²/(3n).

(b) Denote x(n) = max(x1 , . . . , xn ). Likelihood function takes the form


L(θ) = f(x1|θ) · · · f(xn|θ) = θ^{−n} 1{θ ≥ x1} · · · 1{θ ≥ xn} = θ^{−n} 1{θ ≥ x(n)},

so that x(n) is a sufficient statistic. The maximum is achieved at θ̂ = x(n) .

(c) Sampling distribution of the maximum likelihood estimate θ̂ = x(n) :


P(X(n) ≤ x) = P(X1 ≤ x, . . . , Xn ≤ x) = P(X1 ≤ x) · · · P(Xn ≤ x) = (x/θ)^n,
with pdf
f_{Θ̂}(x) = (n/θ^n) · x^{n−1},   0 ≤ x ≤ θ.
The maximum likelihood estimate is biased
E(Θ̂) = (n/θ^n) ∫_0^θ x^n dx = (n/(n + 1)) θ,   E(Θ̂²) = (n/(n + 2)) θ²,   Var(Θ̂) = nθ²/((n + 1)²(n + 2)),

but asymptotically unbiased. Notice the unusual asymptotics indicating that the conditions on the parametric model implying Θ̂ ≈ N(θ, 1/(nI(θ))) are violated:

Var(Θ̂) ≈ θ²/n²,   n → ∞.
Compare the two mean square errors:

MSE(Θ̂) = E(Θ̂ − θ)² = (θ/(n + 1))² + nθ²/((n + 1)²(n + 2)) = 2θ²/((n + 1)(n + 2)),
MSE(Θ̃) = θ²/(3n).
(d) The corrected maximum likelihood estimate
θ̂c = ((n + 1)/n) · x(n)
becomes unbiased, E(Θ̂c) = θ, with Var(Θ̂c) = θ²/(n(n + 2)).
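A small simulation sketch comparing the three estimates (the chosen values θ = 1, n = 10 are arbitrary):

% sketch: mean square errors of 2*xbar, x_(n) and (n+1)/n * x_(n) for U(0,theta)
theta = 1; n = 10; B = 10000;
x = theta*rand(n,B);                       % B samples of size n
mm  = 2*mean(x);                           % method of moments
ml  = max(x);                              % maximum likelihood
mlc = (n+1)/n*max(x);                      % corrected maximum likelihood
[mean((mm-theta).^2) mean((ml-theta).^2) mean((mlc-theta).^2)]
% compare with theta^2/(3n), 2*theta^2/((n+1)(n+2)), theta^2/(n(n+2))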

Solution 9
Data
x1 = 1997, x2 = 906, x3 = 904, x4 = 32
and model
p1 = (2 + θ)/4,   p2 = (1 − θ)/4,   p3 = (1 − θ)/4,   p4 = θ/4.
(a) Sample counts (X1 , X2 , X3 , X4 ) ∼ Mn(n, p1 , p2 , p3 , p4 ) with n = 3839. Given a realisation

(x1 , x2 , x3 , x4 ) with x1 + x2 + x3 + x4 = n,

the likelihood function takes the form



L(θ) = (n choose x1, x2, x3, x4) p1^{x1} p2^{x2} p3^{x3} p4^{x4} ∝ (2 + θ)^{x1} (1 − θ)^{x2+x3} θ^{x4} 4^{−n} ∝ (2 + θ)^{x1} θ^{x4} (1 − θ)^{n−x1−x4},

where ∝ means that we drop factors depending only on (n, x1 , x2 , x3 , x4 ). The last expression reveals that we have
a case of two sufficient statistics (x1 , x4 ). Putting

(d/dθ) ln L(θ) = x1/(2 + θ) + x4/θ − (n − x1 − x4)/(1 − θ)

equal to zero, we arrive at the equation
x1/(2 + θ) + x4/θ = (n − x1 − x4)/(1 − θ),
or equivalently
θ²n + θu − 2x4 = 0,
where
u = 2x2 + 2x3 + x4 − x1 = 2n − x4 − 3x1.
We find the maximum likelihood estimate to be
θ̂ = (−u + √(u² + 8nx4))/(2n) = 0.0357.
Asymptotic variance
Var(Θ̂) ≈ 1/(nI(θ)),   I(θ) = −E(g(Y1, Y2, Y3, Y4, θ)),
where (Y1 , Y2 , Y3 , Y4 ) ∼ Mn(1, p1 , p2 , p3 , p4 ) with

f(y1, y2, y3, y4|θ) = p1^{y1} p2^{y2} p3^{y3} p4^{y4} = (2 + θ)^{y1} (1 − θ)^{y2+y3} θ^{y4} 4^{−1}

and
g(y1, y2, y3, y4, θ) = (∂²/∂θ²) ln f(y1, y2, y3, y4|θ) = −y1/(2 + θ)² − (y2 + y3)/(1 − θ)² − y4/θ².

Since E(Yi ) = pi , we find


I(θ) = 1/(4(2 + θ)) + 2/(4(1 − θ)) + 1/(4θ) = (1 + 2θ)/(2θ(2 + θ)(1 − θ)),
and get
nI(θ̂) = 29345.8,   s_θ̂ = 1/√(nI(θ̂)) = 0.0058.

(b) Iθ = 0.0357 ± 1.96 · 0.0058 = 0.0357 ± 0.0114.

(c) Parametric bootstrap using Matlab:

p1=0.5089, p2=0.2411, p3=0.2411, p4=0.0089,


n=3839; B=1000; b=ones(B,1);
x1=binornd(n,p1,B,1);
x2=binornd(n*b-x1,p2/(1-p1));
x3=binornd(n*b-x1-x2,p3/(1-p1-p2));
x4=n*b-x1-x2-x3;
u=2*x2+2*x3+x4-x1;
t=(-u+sqrt(u.^2+8*n*x4))/(2*n);
std(t)
histfit(t)

gives std(t)=0.0058 similar to the answer in (a).

13.3 Solutions to Section 4 (hypothesis testing)


Solution 1
The z-score
Z = (X − 100p)/(10√(p(1 − p)))
has a distribution that is approximated by N(0, 1).

(a) Under H0 we have


Z = (X − 100p0)/(10√(p0(1 − p0))) = (X − 50)/5,

and the significance level in question is (using a continuity correction)

α = P(|X − 50| > 10|H0) = P(|X − 50| ≥ 11|H0) ≈ P(|Z| > 10.5/5 |H0) ≈ 2(1 − Φ(2.1)) = 2 · 0.018 = 0.036.

(b) The power of the test is a function of the parameter value p (without continuity correction)

Pw(p) = P(|X − 50| > 10) = P(X < 40) + P(X > 60)
      = P(Z < (40 − 100p)/(10√(p(1 − p)))) + P(Z > (60 − 100p)/(10√(p(1 − p))))
      ≈ Φ((4 − 10p)/√(p(1 − p))) + Φ((10p − 6)/√(p(1 − p))).

Putting δ = 1/2 − p, we see that the power function

Pw(p) = Φ((10δ − 1)/√(1/4 − δ²)) + Φ(−(10δ + 1)/√(1/4 − δ²)) = Φ((10|δ| − 1)/√(1/4 − δ²)) + Φ(−(10|δ| + 1)/√(1/4 − δ²))

is symmetric around p = 1/2:

p       0.30   0.35   0.40   0.45   0.50   0.55   0.60   0.65   0.70
Pw(p)   0.986  0.853  0.500  0.159  0.046  0.159  0.500  0.853  0.986
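The row of power values can be reproduced, for instance, with the following sketch:

% sketch: power of the test that rejects when |X - 50| > 10, X ~ Bin(100,p)
p = 0.30:0.05:0.70;
Pw = normcdf((4-10*p)./sqrt(p.*(1-p))) + normcdf((10*p-6)./sqrt(p.*(1-p)));
[p' Pw']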

Solution 2
Data: one observation of X = x. Likelihood ratio test: reject for small values of Λ = P(x|H0)/P(x|H1).

(a) See the bottom line of the table:

X-values x4 x2 x1 x3
P(x|H0 ) 0.2 0.3 0.2 0.3
P(x|H1 ) 0.4 0.4 0.1 0.1
Likelihood ratio Λ = P(x|H0)/P(x|H1)   0.5   0.75   2   3

(b) The null distribution of Λ

X-values x4 x2 x1 x3
Likelihood ratio Λ 0.5 0.75 2 3
P(x|H0 ) 0.2 0.3 0.2 0.3
Cumulative probab. 0.2 0.5 0.7 1

At α = 0.2 we reject H0 only if Λ = 0.5, that is when X = x4 .


At α = 0.5 we reject H0 for Λ ≤ 0.75, that is when X = x4 or x2 .

Solution 3
Likelihood function
L(λ) = ∏_{i=1}^n (1/x_i!) λ^{x_i} e^{−λ} = e^{−λn} λ^y ∏_{i=1}^n (1/x_i!),

where
y = x1 + . . . + xn
is a sufficient statistic. Reject H0 for small values of the likelihood ratio

L(λ0)/L(λ1) = e^{−n(λ0−λ1)} (λ0/λ1)^y.

If λ1 > λ0 , then we reject H0 for large values of y. Test statistic Y has null distribution Pois(nλ0 ).

Solution 4
We have an iid-sample from N(µ, 100) of size n = 25. Two simple hypotheses

H0 : µ = 0, H1 : µ = 1.5

Test statistic and its exact sampling distribution

X̄ ∼ N(µ, 4).

Its null and alternative distributions are


under H0: X̄ ∼ N(0, 4),   under H1: X̄ ∼ N(1.5, 4).

(a) The rejection region at α = 0.1 is {x̄ > x}, where x is the solution of the equation

0.1 = P(X̄ > x|H0 ) = 1 − Φ(x/2).

From the normal distribution table we find x/2 = 1.28, so that x = 2.56 and the rejection region is

R = {x̄ > 2.56}.

The corresponding confidence interval method uses the one-sided 90% confidence interval for the mean

Iµ = (x̄ − 2.56, ∞).

We reject H0 if the interval does not cover µ0 = 0, that is when x̄ − 2.56 > 0.

(b) The power of the test (a) is

P(X̄ > 2.56|H1) = P((X̄ − 1.5)/2 > 0.53|H1) = 1 − Φ(0.53) = 1 − 0.7019 = 0.298.

(c) For α = 0.01, since 1 − Φ(2.33) = 0.01, the rejection region is

R = {x̄ > 4.66}.

The power of this test is

P(X̄ > 4.66|H1) = P((X̄ − 1.5)/2 > 1.58|H1) = 1 − Φ(1.58) = 1 − 0.9429 = 0.057.

Solution 5
We have a pair of beta-densities

f (x|H0 ) = 2x, f (x|H1 ) = 3x2 , 0 ≤ x ≤ 1.

(b) The likelihood ratio as a function of data value x is

Λ = f(x|H0)/f(x|H1) = 2/(3x),   0 ≤ x ≤ 1.

The corresponding likelihood ratio test of H0 versus H1 rejects H0 for large values of x.
(c) The rejection region of a level α test is computed from the equation

P(X > xcrit |H0 ) = α,

that is
1 − x2crit = α.
We conclude that √
R = {x : x > 1 − α}.
(d) The power of the test is √
P(X > 1 − α|H1 ) = 1 − (1 − α)3/2

Solution 6
Let s2 be the sample variance computed from (x1 , . . . , x15 ).

(a) An exact 95% confidence interval for σ 2 if n = 15 is

Iσ2 = (0.536s2 ; 2.487s2 ).

(b) We reject H0: σ = 1 if the value 1 falls outside the confidence interval Iσ², so that

R = {1 ∉ (0.536s²; 2.487s²)} = {s² > 1.866 or s² < 0.402}.

Solution 7
Using the confidence interval-method of hypotheses testing we reject H0 in favour of the two-sided alternative,
since the value µ = −3 is not covered by the two-sided confidence interval (−2, 3).

Solution 8
The analysis is the basis of the sign test.

(a) Generalised likelihood ratio

Λ = L(p0)/L(p̂) = (n choose x) p0^x (1 − p0)^{n−x} / ((n choose x) p̂^x (1 − p̂)^{n−x}) = (1/2)^n / ((x/n)^x ((n − x)/n)^{n−x}) = (n/2)^n / (x^x (n − x)^{n−x}).

(b) The generalised likelihood ratio test rejects H0 for small values of

ln Λ = n ln(n/2) − x ln x − (n − x) ln(n − x),

or equivalently, for large values of


x ln x + (n − x) ln(n − x),
or equivalently, for large values of

a(y) = (n/2 + y) ln(n/2 + y) + (n/2 − y) ln(n/2 − y),

where
y = |x − n/2|.
The function a(y) is monotonically increasing over y ∈ [0, n/2], since
a′(y) = ln((n/2 + y)/(n/2 − y)) > 0.
We conclude that the test rejects for large values of y.

(c) Compute the significance level for the rejection region |x − n/2| > k:

α = P(|X − n/2| > k|H0) = 2 Σ_{i < n/2 − k} (n choose i) 2^{−n}.

(d) In particular, for n = 10 and k = 2 we get


α = 2^{−9} Σ_{i=0}^{2} (10 choose i) = (1 + 10 + 45)/512 = 0.11.

(e) Using the normal approximation for n = 100 and k = 10, we find

α = P(|X − n/2| > k|H0) ≈ 2(1 − Φ(k/√(n/4))) = 2(1 − Φ(2)) = 0.046.

Solution 9
(a) Two-sided p-value = 0.134.

(b) One-sided p-value = 0.067.

Solution 10
We are supposed to test
H0 : death cannot be postponed,
H1 : death can be postponed until after an important date.
(a) Jewish data: n = 1919 death dates
x = 922 deaths during the week before Passover,
n − x = 997 deaths during the week after Passover.
Under the binomial model X ∼ Bin(n, p), we would like to test
H0 : p = 0.5 against H1 : p < 0.5.
We apply the large sample test for proportion. Observed test statistic
z = (x − np0)/√(np0(1 − p0)) = (922 − 1919 · 0.5)/(√1919 · 0.5) = −1.712.
One-sided p-value of the test
Φ(−1.712) = 1 − Φ(1.712) = 1 − 0.9564 = 0.044.
Reject H0 in favour of one-sided H1 at the significance level 5%.

(b) To control for the seasonal effect the Chinese and Japanese data were studied
n = 852, x = 418, n − x = 434, z = −0.548.
One-sided p-value is 29%, showing no significant effect.

(c) Overeating during the important occasion might be a contributing factor.

Solution 11
Multinomial model
(X1 , X2 , X3 ) ∼ Mn(190, p1 , p2 , p3 ).
Composite null hypothesis (Hardy-Weinberg Equilibrium)
H0 : p1 = (1 − θ)2 , p2 = 2θ(1 − θ), p3 = θ2 .
Likelihood function and maximum likelihood estimate

L(θ) = (190 choose 10, 68, 112) 2^{68} θ^{292} (1 − θ)^{88},   θ̂ = 292/380 = 0.768.
Pearson’s chi-squared test:
cell       1       2       3        Total
observed   10      68      112      190
expected   10.23   67.71   112.07   190

Observed chi-squared test statistic χ² = 0.0065, df = 1, p-value = 2(1 − Φ(√0.0065)) = 0.94.

Conclusion: the Hardy-Weinberg Equilibrium model fits well the haptoglobin data.

Solution 12
Month Oj Days Ej Oj − Ej
Jan 1867 31 1994 −127
Feb 1789 28 1801 −12
Mar 1944 31 1994 −50
Apr 2094 30 1930 164
May 2097 31 1994 103
Jun 1981 30 1930 51
Jul 1887 31 1994 -107
Aug 2024 31 1994 30
Sep 1928 30 1930 -2
Oct 2032 31 1994 38
Nov 1978 30 1930 48
Dec 1859 31 1994 -135

Simple null hypothesis
H0: p1 = p3 = p5 = p7 = p8 = p10 = p12 = 31/365,   p2 = 28/365,   p4 = p6 = p9 = p11 = 30/365.
The total number of suicides is n = 23480, so that the expected counts are
Ej = n p_j^{(0)},   j = 1, . . . , 12.

The χ2 -test statistic


χ² = Σ_j (Oj − Ej)²/Ej = 47.4.

Since df = 12 − 1 = 11 and χ²₁₁(0.005) = 26.8, we reject H0 of no seasonal variation. Merry Christmas!
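For instance, the test statistic and p-value can be recomputed directly from the table:

% sketch: chi-squared goodness of fit test for the monthly suicide counts
O = [1867 1789 1944 2094 2097 1981 1887 2024 1928 2032 1978 1859];
days = [31 28 31 30 31 30 31 31 30 31 30 31];
E = sum(O)*days/365;                      % expected counts under H0
chi2 = sum((O-E).^2./E)                   % about 47.4
pvalue = 1 - chi2cdf(chi2, 11)            % df = 12 - 1 = 11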

Solution 13
Number of heads
Y ∼ Bin(n, p), n = 17950.
(a) For H0 : p = 0.5 the observed z-score

z = (y − np0)/√(np0(1 − p0)) = 3.46.

According to the three-sigma rule this is a significant result and we reject H0 .

(b) Pearson’s chi-squared test for the simple null hypothesis

H0 : p0 = (0.5)5 = 0.031, p1 = 5 · (0.5)5 = 0.156, p2 = 10 · (0.5)5 = 0.313,


p3 = 10 · (0.5)5 = 0.313, p4 = 5 · (0.5)5 = 0.156, p5 = (0.5)5 = 0.031.

number of heads 0 1 2 3 4 5 Total


observed 100 524 1080 1126 655 105 3590
expected 112.2 560.9 1121.9 1121.9 560.9 112.2 3590

Observed χ2 = 21.58, df = 5, p-value = 0.001.

(c) Composite null hypothesis


H0: p_i = (5 choose i) p^i (1 − p)^{5−i},   i = 0, 1, 2, 3, 4, 5.

Pearson’s chi-squared test based on the maximum likelihood estimate p̂ = 0.5129

number of heads 0 1 2 3 4 5 Total


observed 100 524 1080 1126 655 105 3590
expected 98.4 518.3 1091.5 1149.3 605.1 127.4 3590

Observed χ2 = 8.74, df = 6 − 1 − 1 = 4, p-value = 0.07. Do not reject H0 at 5% level.

13.4 Solutions to Section 5 (Bayesian inference)


Solution 1
Since
f (x|θ) ∝ θ5 (1 − θ)5 ,
and the prior is flat, we get
h(θ|x) ∝ f (x|θ) ∝ θ5 (1 − θ)5 .
We conclude that the posterior distribution is Beta (6, 6). This yields

θ̂_map = θ̂_pme = 1/2.

Solution 2
Number of bird hops X ∼ Geom(p)

f (x|p) = (1 − p)x−1 p, x = 1, 2, . . . .

Data in the table summarises an iid-sample

(x1 , . . . , xn ), n = 130.

(d) Using a uniform prior P ∼ U(0, 1), we find the posterior to be

h(p|x1 , . . . , xn ) ∝ f (x1 |p) · · · f (xn |p) = (1 − p)nx̄−n pn , n = 130, nx̄ = 363.

It is a beta distribution
Beta(n + 1, nx̄ − n + 1) = Beta(131, 234).
Posterior mean
µ = a/(a + b) = 131/(131 + 234) = 0.36.
Observe that
µ = (1 + 1/n)/(x̄ + 2/n)
gets closer to the method of moments estimate of p as n → ∞. The standard deviation of the posterior distribution
σ = √(µ(1 − µ)/(a + b + 1)) = √(0.36 · 0.64/366) = 0.025.

Solution 3
We use the binomial model X ∼ Bin(n, p), with p being the probability that the event will occur at a given trial.
Use an uninformative conjugate prior P ∼ Beta(1, 1). Given X = n, the posterior becomes P ∼ Beta(n + 1, 1).
n+1
Since the posterior mean is n+2 , we get
n+1
p̂pme = .
n+2

Solution 4
Recall solutions of parts (a) and (b).

(c) By Bayes formula,


P(H0|x) = P(x|H0)P(H0)/(P(x|H0)P(H0) + P(x|H1)P(H1)) = P(x|H0)/(P(x|H0) + P(x|H1)).
Thus the posterior odds equals the likelihood ratio
P(H0|x)/P(H1|x) = Λ,
and we conclude that outcomes x1 and x3 favour H0 since with these outcomes Λ > 1.

(d) For the general prior


P(H0 ) = π0 , P(H1 ) = π1 = 1 − π0 ,
we get
P(Hi|x) = P(x|Hi)πi/(P(x|H0)π0 + P(x|H1)π1),
yielding a relation for the posterior odds
P(H0|x)/P(H1|x) = P(x|H0)π0/(P(x|H1)π1) = Λ · π0/π1.
Assuming equal costs c0 = c1 , the rejection rule is
P(H0|x)/P(H1|x) < c1/c0 = 1,

so that in terms of the likelihood ratio,
Λ < π1/π0 = 1/π0 − 1,   that is π0 < 1/(1 + Λ).
If x = x4, then Λ = 0.5, and we reject H0 provided π0 < 2/3. This corresponds to the decision rules with α = 0.2.
If x = x2, then Λ = 0.75, and we reject H0 provided π0 < 4/7. This corresponds to the decision rules with α = 0.5.
Furthermore, if x = x1, then Λ = 2, and we reject H0 provided π0 < 1/3, and if x = x3, then Λ = 3, and we reject H0 provided π0 < 1/4.

Solution 5
For a single observation X ∼ N(µ, σ 2 ), where σ 2 is known, test H0 : µ = 0 vs H1 : µ = 1. Prior probabilities
2 1
P(H0 ) = , P(H1 ) = .
3 3
(a) Likelihood ratio
f(x|0)/f(x|1) = e^{−x²/(2σ²)} / e^{−(x−1)²/(2σ²)} = e^{1/(2σ²)} e^{−x/σ²}.

Choose H0 for x such that


P(H0|x)/P(H1|x) = 2 e^{1/(2σ²)} e^{−x/σ²} > 1,   that is x < 1/2 + σ² ln 2.
We conclude that
σ 2 = 0.1 σ 2 = 0.5 σ2 = 1 σ2 = 5
Choose H0 for x < 0.57 x < 0.85 x < 1.19 x < 3.97
(b) In the long run, the proportion of the time H0 will be chosen is
P(X < 1/2 + σ² ln 2) = (2/3) P(X − µ < 1/2 + σ² ln 2) + (1/3) P(X − µ < σ² ln 2 − 1/2) = (2/3) Φ(σ ln 2 + 1/(2σ)) + (1/3) Φ(σ ln 2 − 1/(2σ)).

We conclude that
σ 2 = 0.1 σ 2 = 0.5 σ2 = 1 σ2 = 5
Proportion of the time H0 will be chosen 0.67 0.73 0.78 0.94

Solution 6
We have a pair of beta-densities
f (x|H0 ) = 2x, f (x|H1 ) = 3x2 , 0 ≤ x ≤ 1.
(a) If the two hypotheses have equal prior probabilities, then the posterior probabilities equal
h(H0|x) = (1/2)f(x|H0)/((1/2)f(x|H0) + (1/2)f(x|H1)) = 2x/(2x + 3x²) = 2/(2 + 3x),   h(H1|x) = 3x/(2 + 3x).
Therefore, the posterior probability of H0 is greater than that of H1 for x satisfying
2 > 3x, that is when x < 2/3.

13.5 Solutions to Section 6 (summarising data)


Solution 1
Recall that for a fixed x, the empirical distribution function F̂ (x) = p̂ is the sample proportion estimate of
p = F (x) = x.

(a) The variance of F̂ (x) is


σ²_F̂(x) = σ²_p̂ = p(1 − p)/n = x(1 − x)/n,
so that the standard deviation is
σ_F̂(x) = √(x(1 − x)/n),   x ∈ [0, 1].
(b) Generate 100 samples of size n = 16. Matlab code

x=rand(16,100);
y=sort(x)’;
for k=1:100
plot(y(k,:),(1:16)/16-y(k,:),’.’)
hold on
end

See the figure

Solution 2
We have

F̂(u) = (1{X1 ≤ u} + . . . + 1{Xn ≤ u})/n,   E(F̂(u)) = F(u),
F̂(v) = (1{X1 ≤ v} + . . . + 1{Xn ≤ v})/n,   E(F̂(v)) = F(v).

Assuming u < v, we get

E(F̂(u) · F̂(v)) = (1/n²)[ Σ_{i=1}^n E(1{Xi ≤ u}1{Xi ≤ v}) + Σ_{i=1}^n Σ_{j≠i} E(1{Xi ≤ u}1{Xj ≤ v}) ]
               = (1/n²)[ Σ_{i=1}^n F(u) + Σ_{i=1}^n Σ_{j≠i} F(u)F(v) ]
               = (1/n)[ F(u) + (n − 1)F(u)F(v) ].

Finish by using

Cov(F̂(u), F̂(v)) = E(F̂(u) · F̂(v)) − E(F̂(u)) · E(F̂(v))
                = (1/n)[F(u) + (n − 1)F(u)F(v)] − F(u)F(v)
                = (1/n) F(u)(1 − F(v)).

Solution 3
Ordered sample x(1) , . . . , x(n)

12.28 12.92 13.33 13.64 13.65 13.66 13.68


13.73 13.75 13.83 13.96 13.98 13.98 14.01
14.04 25% quantile
14.10 14.19 14.23 14.27 14.30 14.32 14.41
14.41 14.43 14.44 14.47 14.49 14.52 14.56
14.57 50% quantile
14.57 14.62 14.65 14.68 14.73 14.75 14.77
14.80 14.87 14.90 14.92 15.02 15.03 15.10
15.13 75% quantile
15.15 15.18 15.21 15.28 15.31 15.38 15.40
15.47 15.47 15.49 15.56 15.63 15.91 17.09

(a) The figure shows the empirical distribution function and a normal probability plot.

Use Matlab commands

x=data vector;
stairs(sort(x),(1:length(x))/length(x)) % empirical cdf
hist(x) % histogram, the same as hist(x,10)
normplot(x) % normal probability plot
prctile(x,90) % 0.90-quantile

The distribution appears to be rather close to normal. The 10% quantile

(x(6) + x(7))/2 = (13.66 + 13.68)/2 = 13.67.

(b) Expected percentages under different dilution levels:

1% dilution µ1 = 14.58 · 0.99 + 85 · 0.01 = 15.28 can not be detected


3% dilution µ3 = 14.58 · 0.97 + 85 · 0.03 = 16.69 can be detected
5% dilution µ5 = 14.58 · 0.95 + 85 · 0.05 = 18.10 can be detected

We see that the value 15.28 can not be detected as an outlier, since it coincides with the 82% sample quantile.
There is only one sample value larger than 16.69, therefore 3% dilution would be easier to detect. Obviously, 5%
dilution resulting in 18.10 is very easy to detect.

Solution 4
Taking the derivative of
1 − F(t) = e^{−αt^β},

we find the density

f(t) = αβ t^{β−1} e^{−αt^β},

and dividing the latter by the former we obtain the hazard function

h(t) = αβtβ−1 .

Solution 5
Take the Weibull distribution with parameters α and β.

• If β = 1, then h(t) = α is constant and the distribution is memoryless.

• If β > 1, then h(t) increases with t meaning that the older individuals die more often than the younger.

• If 0 < β < 1, then h(t) decreases with t meaning that the longer you live the healthier you become.

Solution 6
(a) Due to sampling with replacement we have N ∼ Bin(26, 5/26).

(b) Using the binomial distribution command we find

P(N ≤ 9) = binocdf(9, 26, 5/26) = 0.9821,


P(N ≥ 10) = 1 − 0.9821 = 0.018.

(c) In B = 1000 bootstrap samples, we expect

B · P(N ≥ 10) = 18

samples to contain 10 or more of outliers.

(d) The probability that a bootstrap sample is composed entirely of these outliers is negligibly small

P(N = 26) = (5/26)^26 = 2.4 · 10^{−19}.

Solution 7
(a) The Matlab commands

x=data vector;
trimmean(x,10)
trimmean(x,20)

give x̄0.1 = 14.586 and x̄0.2 = 14.605.

m = trimmean(x,percent) calculates the trimmed mean of the values in x. For a vector input, m is the
mean of x, excluding the highest and lowest k data values, where k=n*(percent/100)/2 and where n is
the number of values in x.

(b) An approximate 90% confidence interval for the mean is

Iµ = 14.58 ± 1.645 · 0.78/√59 = 14.58 ± 0.17 = (14.41; 14.75)

(c) Nonparametric 90% confidence interval for the population median M is (x(k) , x(60−k) ), where P(Y < k) =
0.05 and Y ∼ Bin (59, 0.5). Applying the normal approximation for Bin (n, p) with continuity correction
P(Y < k) = P(Y ≤ k − 1) ≈ Φ((k − 0.5 − np)/√(np(1 − p))),

we arrive at the equation

(k − 0.5 − 59/2)/√(59/4) = −1.645,

that gives k = 24. This yields

Im = (x(k) , x(60−k) ) = (x(24) , x(36) ) = (14.43; 14.75).

(d) The Matlab commands for the non-parametric bootstrap

n=59; B=1000;
z=x(random(’unid’,n,n,B)); % (’unid’,n) - uniform discrete [1, n], 1000 samples of size n
t1=trimmean(z,10);
t2=trimmean(z,20);
std(t1)
std(t2)

give the standard errors 0.1034 and 0.1004 for x̄0.1 and x̄0.2 respectively.

(f) Matlab commands

iqr(x)
median(abs(x-median(x)))

Warning: mad(x) in Matlab stands for the mean abs. dev.

(g) Matlab commands (vector z comes from the (d) part)

q=prctile(z,75);
hist(q)
std(q)

give the standard error 0.1332 of the upper quartile.

Solution 8
Matlab command (x = control and y = seeded data)

qqplot(x,y)

produces a QQ-plot that fits the line y = 2.5x claiming 2.5 times more rainfall from seeded clouds. On the other
hand, Matlab command

qqplot(log(x),log(y))

produces a QQ-plot that fits the line


ln y = 2 + 0.8 ln x
meaning a decreasing slope in the relationship y = 7.4x0.8 .

13.6 Solutions to Section 7 (two samples)


Solution 1
(a) x̄ = 0.5546, ȳ = 1.6240, ȳ − x̄ = 1.0694

(b) We have s2x = 0.2163, s2y = 1.1795, s2p = 0.7667. The latter is an unbiased estimate of σ 2 .

(c) sȳ−x̄ = 0.5874

(d) Based on t7 -distribution, an exact 90% confidence interval for mean difference is

Iµy −µx = 1.0694 ± 1.1128.

(e) More appropriate to use a two-sided test.

(f) From the observed test statistic value t = 1.8206, we find the two-sided p = 0.1115 using the Matlab com-
mand 2*tcdf(-1.8206,7).

(g) No, because the obtained p-value is larger than 0.1.

(h) Given σ 2 = 1, we answer differently to some of the the above questions:

b: σ 2 = 1,
c: sȳ−x̄ = 0.6708,
d: Iµy −µx = 1.0694 ± 1.1035,
f: z = 1.5942 two-sided p-value = 0.11.

Solution 2
If m = n, then
s²p (1/n + 1/m) = (2/n) · (Σ_{i=1}^n (xi − x̄)² + Σ_{i=1}^n (yi − ȳ)²)/(2n − 2) = (s²x + s²y)/n = s²x/n + s²y/m.

Solution 3
Test the null hypothesis of no drug effect

H0 : µ1 = µ2 , the drug is not effective for reducing high blood pressure.

Suggested measurement design: during the same n = 10 days take blood pressure measurements on 4 people, two
on the treatment

x11 , . . . , x1n ,
x21 , . . . , x2n ,

and two controls

x31 , . . . , x3n ,
x41 , . . . , x4n .

Dependencies across the days and the people make inappropriate both two-sample t test and rank sum test. Proper
design for 40 measurements is that of two independent samples: 20 people on the treatment and 20 controls:

x1 , . . . , x20 ,
y1 , . . . , y20 .

Solution 4
(a) The sign test statistic

t = number of positive xi,   under H0: T ∼ Bin(25, 1/2) ≈ N(25/2, 25/4).

Reject H0 for t ≥ k, where k is found from


0.05 = P(T ≥ k|H0) = P(T > k − 1|H0) ≈ 1 − Φ((k − 0.5 − 12.5)/(5/2)) = 1 − Φ((k − 13)/2.5),

which gives
(k − 13)/2.5 = 1.645,   k = 17.
We know the true population distribution is N(0.3, 1). Since

P(X > 0|N(0.3, 1)) = 1 − Φ(−0.3) = Φ(0.3) = 0.62,

we can use
T ∼ Bin(25, 0.62) ≈ N(15.5, 5.89)

to find the power of the sign test by

1 − β = P(T ≥ 17) ≈ 1 − Φ((17 − 0.5 − 15.5)/2.4) = 1 − Φ(0.41) = 0.34.

(b) Normal distribution model X ∼ N(µ, 1). Since (X̄ − µ)/(1/5) ∼ N(0, 1), we reject H0 for

5x̄ > 1.645, that is for x̄ > 0.33.

The power of the test

1 − β = P(X̄ > 0.33|µ = 0.3) = 1 − Φ((0.33 − 0.3)/(1/5)) = 1 − Φ(0.15) = 0.44

is higher than the power of the sign test.

Solution 5
Two independent samples
x1 , . . . , x n , y1 , . . . , yn ,
are taken from two population distributions with equal standard deviation σ = 10. Approximate 95% confidence
interval
Iµ1−µ2 = x̄ − ȳ ± 1.96 · 10 · √(2/n) = x̄ − ȳ ± 27.72/√n.

If the confidence interval has width 2, then
27.72/√n = 1,
implying n ≈ 768.

Solution 6
Rank Type I Type II Rank
1 3.03 3.19 2
8 5.53 4.26 3
9 5.60 4.47 4
11 9.30 4.53 5
13 9.92 4.67 6
14 12.51 4.69 7
17 12.95 6.79 10
18 15.21 9.37 12
19 16.04 12.75 15
20 16.84 12.78 16
Rank sum 130 80

(a) Two-sample t-test


x̄ = 10.693,   ȳ = 6.750,   s²x = 23.226,   s²y = 12.978,   s_x̄−ȳ = √(s²x̄ + s²ȳ) = 1.903.

Assume equal variances. The observed test statistic
t = (10.693 − 6.750)/1.903 = 2.072.
With df = 18, the two-sided p-value = 0.053 is found using the Matlab command 2*tcdf(-2.072,18).

(b) Rank sum test statistics Rx = 130, Ry = 80. From the table in Section 12.6 we find that the two-sided
p-value is between 0.05 < p-value < 0.10.

(c) The non-parametric test in (b) is more relevant, since both normplot(x) and normplot(y) show non-normality
of the data distribution.

(d) To estimate the probability π, that a type I bearing will outlast a type II bearing, we turn to the ordered
pooled sample
X-YYYYYY-XX-Y-X-Y-XX-YY-XXXX.
Pick a pair (X, Y ) at random, then by the division rule of probability

P(X < Y) = (number of pairs with xi < yj)/(total number of pairs (xi, yj)) = (10 + 4 + 4 + 3 + 2 + 2)/100 = 0.25.

This implies a point estimate π̂ = 0.75.

(e) The matlab commands


u=x(random(’unid’,10,10,1000));
v=y(random(’unid’,10,10,1000));
N=zeros(1,1000);
for k=1:1000 for i=1:10 for j=1:10
N(k)=N(k)+(u(i,k)>v(j,k));

end,end,end
P=N/100;
hist(P,20)
std(P)

estimate the sampling distribution of π̂ with sπ̂ = 0.1187.

Solution 7
Model: iid-sample of the differences d1 , . . . , dn whose population distribution is symmetric around the unknown
median m. Test the null hypothesis of no difference H0 : m = 0 using the signed ranks test statistic w+ defined as
follows:

step 1: remove signs |d1 |, . . . , |dn |,


step 2: assign ranks 1, . . . , n to |d1 |, . . . , |dn |,
step 3: attach the original signs of di to the ranks 1, . . . , n,
step 4: compute w+ as the sum of the positive ranks.

Under H0 : m = 0, on the step 3, the signs ± are assigned symmetrically at random (due to the model assumption
that the population distribution is symmetric around the median). As a result there are 16 equally likely outcomes

1 2 3 4 w+
− − − − 0
+ − − − 1
− + − − 2
+ + − − 3
− − + − 3
+ − + − 4
− + + − 5
+ + + − 6
− − − + 4
+ − − + 5
− + − + 6
+ + − + 7
− − + + 7
+ − + + 8
− + + + 9
+ + + + 10
Thus the null distribution of W+ is given by the table
k    0     1     2     3     4     5     6     7     8     9     10
pk   1/16  1/16  1/16  2/16  2/16  2/16  2/16  2/16  1/16  1/16  1/16

The smallest one-sided p-value is 1/16 = 0.06, which is higher than 5%. Thus n = 4 is too small a sample size. Therefore, the table in Section 12.7 starts from n = 5.

Solution 8
Using
W0.05(n) = n(n + 1)/4 − 1.96 · √(n(n + 1)(2n + 1)/24),
W0.01(n) = n(n + 1)/4 − 2.58 · √(n(n + 1)(2n + 1)/24),
we find (table/normal approximation)

                        n = 10    n = 20    n = 25
n(n+1)/4                27.5      105       162.5
√(n(n+1)(2n+1)/24)      9.81      26.79     37.17
α = 0.05                8/8.3     52/53.5   89/89.65
α = 0.01                3/2.2     38/36.0   68/67.6

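A sketch of how the normal-approximation entries of this table can be reproduced:

% sketch: normal approximation to the signed rank critical values
n = [10 20 25];
m = n.*(n+1)/4;  s = sqrt(n.*(n+1).*(2*n+1)/24);
[m - 1.96*s; m - 2.58*s]     % rows: alpha = 0.05 and alpha = 0.01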
Solution 9
(a) The variance of a difference

Var(D) = Var(X − Y ) = σx2 + σy2 − 2Cov(X, Y ) = 100 + 100 − 100 = 100.

Using the normal approximation we get

D̄ = X̄ − Ȳ ≈ N(µx − µy, 100/25) = N(δ, 4).

The rejection region becomes
R = {d̄/2 > 1.645} = {d̄ > 3.29}.
The power function (under the one-sided alternative δ > 0)
Pw(δ) ≈ P(D̄ > 3.29 | N(δ, 4)) = 1 − Φ((3.29 − δ)/2).

(b) Two independent samples


D̄ ≈ N(µx − µy, 100/25 + 100/25) = N(δ, 8).
The rejection region
R = {d̄/√8 > 1.645} = {d̄ > 4.65}.
The power function
Pw(δ) ≈ P(D̄ > 4.65 | N(δ, 8)) = 1 − Φ((4.65 − δ)/2.83).
The two power functions are compared graphically on the next figure.

[Figure: the two power functions, paired samples and independent samples, reaching power 0.5 at δ = 3.29 and δ = 4.65 respectively]

Solution 10
Paired samples

x̄ = 85.26, sx = 21.20, sx̄ = 5.47, nx = 15,


ȳ = 84.82, sy = 21.55, sȳ = 5.57, ny = 15,
d¯ = x̄ − ȳ = 0.44,
sd = 4.63, sx̄−ȳ = 1.20.

If the pairing had been erroneously ignored, then the two independent samples formula would give 6 times larger
standard error
sx̄−ȳ = 7.81.
To test H0: µx = µy against H1: µx ≠ µy, assume D ∼ N(µ, σ²) and apply the one-sample t-test

t = d̄/s_d̄ = 0.368.
With df = 14, two-sided p-value = 0.718, we can not reject H0 .
Without normality assumption we apply the signed rank test. Matlab command
signrank(x,y)
computes the two-sided p-value = 0.604. We can not reject H0 .

Solution 11
Possible explanations
(a) room with a window ← rich patient → recovers faster,
(b) besides passive smoking: smoker ← the man is a bad husband → wife gets cancer,
(c) no breakfast ← more stress → accident,
(d) choose to change the school and to be bused ← lower grades before → lower grades after,
(e) match two babies with two mothers (or even 3 babies with 3 mothers) then it is pure chance,
(f) abstain from alcohol ← poor health,
(g) marijuana ← schizophrenia,
(h) total time together = time before wedding + time after wedding,
(i) being part of a community can have a positive effect on mental health and emotional wellbeing.

13.7 Solutions to Section 8 (analysis of variance)


Useful Matlab commands:

boxplot(x)
anova1(x)
anova2(x)

where x is the data matrix.

Solution 1
(a) The sums of squares: between samples, within samples and total:
SSA = 10((20.34 − 19.40)2 + (18.34 − 19.40)2 + (21.57 − 19.40)2 + (17.35 − 19.40)2 ) = 109.2
SSE = 9(0.88 + 0.74 + 0.88 + 0.89) = 30.5
SST = 3.58 · 39 = 139.7 = 109.2 + 30.5

Source SS df MS F
Treatment 109.2 3 36.4 42.9
Error 30.5 36 0.85
Total 139.7 39

Comparing the observed test statistic 42.9 with the critical value F3,27(0.001) = 7.27 shows that the result is highly significant, and we reject the null hypothesis.

(b) The normality assumption is supported by the four skewness and kurtosis values, with the former being
close to zero and the latter close to 3. On the other hand, the four sample variances are close to each other making
realistic the assumption of equal variances.

(c) Since sp = √MSE = 0.92 and the t-distribution table gives approximately t36(0.0042) ≈ t40(0.005) = 2.7,
we get
Bµu −µv = (ȳu. − ȳv. ) ± 1.11.
Therefore all observed pairwise differences except (2-4) are significant:

Pairs 1-2 1-3 1-4 2-3 2-4 3-4


Differences 2.00 -1.23 2.99 -3.23 0.99 4.22

Solution 2
Consider one-way ANOVA test statistic
J
PI 2
M SA I−1 i=1 (ȳi· − ȳ·· )
F = = PI PJ
M SE 1
i=1 j=1 (yij − ȳi· )
2
I(J−1)

For I = 2 and J = n, put


ȳ1· = x̄,   ȳ2· = ȳ,   ȳ·· = (x̄ + ȳ)/2.

In this two-sample setting, the F-test statistic becomes

F = n[(x̄ − (x̄ + ȳ)/2)² + (ȳ − (x̄ + ȳ)/2)²] / ((1/(2(n − 1)))[Σ_{j=1}^n (xj − x̄)² + Σ_{j=1}^n (yj − ȳ)²]) = 2n((x̄ − ȳ)/2)²/s²p = ((x̄ − ȳ)/(sp√(2/n)))².

This equals t², where t = (x̄ − ȳ)/(sp√(2/n)) is the two-sample t-test statistic.

Solution 3
The null hypothesis says that the data (yij ) comes from a single normal distribution

H0 : µ1 = . . . = µI = µ

described by two parameters µ and σ 2 , so that dim Ω0 = 2, while

dim Ω = I + 1

since the general setting is described by parameters µ1 , . . . , µI and σ 2 . The likelihood ratio

Λ = L0(µ̂, σ̂0²)/L(µ̂1, . . . , µ̂I, σ̂²),

is expressed in terms of two likelihood functions

L(µ1, . . . , µI, σ²) = ∏_{i=1}^I ∏_{j=1}^J (1/(√(2π)σ)) e^{−(yij − µi)²/(2σ²)} ∝ σ^{−n} exp{−Σ_i Σ_j (yij − µi)²/(2σ²)},
L0(µ, σ²) = L(µ, . . . , µ, σ²) ∝ σ^{−n} exp{−Σ_i Σ_j (yij − µ)²/(2σ²)},

where n = IJ. We find the maximum likelihood estimates to be


µ̂ = ȳ··,   σ̂0² = SST/n,   µ̂i = ȳi·,   σ̂² = SSE/n,
which yields
Λ = σ̂0^{−n} exp{−ΣΣ(yij − µ̂)²/(2σ̂0²)} / (σ̂^{−n} exp{−ΣΣ(yij − µ̂i)²/(2σ̂²)}) = (σ̂²/σ̂0²)^{n/2} · exp{−SST/(2SST/n)} / exp{−SSE/(2SSE/n)} = (σ̂²/σ̂0²)^{n/2}.

The likelihood ratio test rejects the null hypothesis for small values of Λ or equivalently for large values of

σ̂0²/σ̂² = SST/SSE = 1 + SSA/SSE = 1 + (I − 1)MSA/(I(J − 1)MSE) = 1 + ((I − 1)/(I(J − 1))) · F,

that is, for large values of the F-test statistic. This leads to an asymptotic approximation of the F distribution with df = (I − 1, I(J − 1)) in terms of the chi-squared distribution with df = I − 1.

Solution 4
One-way layout with I = 10, J = 7,
Yij ∼ N(µi , σ 2 ).
Pooled sample variance
s²p = MSE = (1/(I(J − 1))) Σ_i Σ_j (yij − ȳi·)²

uses df = I(J − 1) = 60.

(a) A 95% confidence interval for a single difference µu − µv


Iµu−µv = ȳu· − ȳv· ± t60(0.025) sp √(2/J)

has the half-width of

2.82 · sp/√J.

(b) Bonferroni simultaneous 95% confidence intervals for the (10 choose 2) = 45 differences µu − µv

Bµu−µv = ȳu· − ȳv· ± t60(0.025/45) sp √(2/J)

have the half-width of

4.79 · sp/√J,

giving the ratio

4.79/2.82 = 1.7.
2.82 = 1.7.

(c) Tukey simultaneous 95% confidence intervals for the differences µu − µv

Tµu−µv = ȳu· − ȳv· ± q10,60(0.05) sp/√J

have the half-width of

4.65 · sp/√J,

giving the ratio

Bonferroni/Tukey = 4.79/4.65 = 1.03.

Solution 5
For I = 4 control groups of J = 5 mice each, test H0 : no systematic differences between groups.

[Boxplots of the measurements (roughly 200–450) in the four control groups]

One way ANOVA table

Source SS df MS F P
Columns 27230 3 9078 2.271 0.12
Error 63950 16 3997
Total 91190 19

Do not reject H0 at 10% significance level. Boxplots show non-normality. The largest difference is between the
third and the fourth boxplots. Control question: why the third boxplot has no upper whisker?
Kruskal-Wallis test. Pooled sample ranks

Group I 2 6 9 11 14 r̄1. = 8.4


Group II 4 5 8 17 19 r̄2. = 10.6
Group III 1 3 7 12.5 12.5 r̄3. = 7.2
Group IV 10 15 16 18 20 r̄4. = 15.8

Kruskal-Wallis test statistic


12 · 5
(8.4 − 10.5)2 + (10.6 − 10.5)2 + (7.2 − 10.5)2 + (15.8 − 10.5)2 = 6.20.

W =
20 · 21

Since χ23 (0.1) = 6.25, we do not reject H0 at 10% significance level.

Solution 6
Two-way layout with I = 3 treatments on J = 10 subjects with K = 1 observations per cell. ANOVA table

Source SS df MS F P
Columns (blocks) 0.517 9 0.0574 0.4683 0.8772
Rows (treatments) 1.081 2 0.5404 4.406 0.0277
Error 2.208 18 0.1227
Total 3.806 29
Reject
H0 : no treatment effects
at 5% significance level. (Interestingly, no significant differences among the blocks.)

Friedman’s test. Ranking within blocks:

Dog 1 Dog 2 Dog 3 Dog 4 Dog 5 Dog 6 Dog 7 Dog 8 Dog 9 Dog 10 r̄i.
Isof 1 2 3 2 1 2 1 3 1 3 1.9
Halo 2 1 1 3 2 1 3 1 2 2 1.8
Cycl 3 3 2 1 3 3 2 2 3 1 2.3

The observed value of the Friedman test statistic


12 · 10
(1.8 − 2)2 + (1.9 − 2)2 + (2.3 − 2)2 = 1.4.

Q=
3·4
Since χ22 (0.1) = 4.61, we can not reject H0 even at 10% significance level.

Solution 7
Forty eight survival times: I = 3 poisons and J = 4 treatments with K = 4 observations per cell. Cell means for
the survival times
A B C D
I 4.125 8.800 5.675 6.100
II 3.200 8.150 3.750 6.625
III 2.100 3.350 2.350 3.250

Draw three profiles: I and II cross each other, and profile III is more flat. Three null hypotheses of interest

HA : no poison effect,
HB : no treatment effect,
HAB : no interaction.

(a) Survival in hours x data matrix. Results of anova2(x,4)

Source SS df MS F P
Columns (treatments) 91.9 3 30.63 14.01 0.0000
Rows (poisons) 103 2 51.52 23.57 0.0000
Intercation 24.75 6 4.124 1.887 0.1100
Error 78.69 36 2.186
Total 298.4 47

Reject HA and HB at 1% significance level, we can not reject HAB even at 10% significance level:

3 poisons act differently,


4 treatments act differently,
some indication of interaction.

Analysis of the residuals

normal probability plot reveals non-normality,


skewness = 0.59,
kurtosis = 4.1.

(b) Transformed data: death rate = 1/survival time. Cell means for the death rates

[Two normal probability plots of the residuals]
Figure 1: Left panel: survival times. Right panel: death rates.

A B C D
I 0.249 0.116 0.186 0.169
II 0.327 0.139 0.271 0.171
III 0.480 0.303 0.427 0.309
Draw three profiles: they look more parallel.
New data matrix y=x.^(-1). Results of anova2(y,4):
Source SS df MS F P
Columns (treatments) 0.204 3 0.068 28.41 0.0000
Rows (poisons) 0.349 2 0.174 72.84 0.0000
Intercation 0.01157 6 0.0026 1.091 0.3864
Error 0.086 36 0.0024
Total 0.6544 47

Reject HA and HB at 1% significance level. Do not reject HAB . Conclusions

3 poisons act differently,


4 treatments act differently,
no interaction,
the normal probability plot of residuals reveals a closer fit to normality assumption.

13.8 Solutions to Section 9 (categorical data analysis)


Warning: in some of the contingency tables the expected counts are rounded. If you then will compute the
chi-squared test statistic χ2 from the table, you will often get a somewhat different value.

Solution 1
Test
H0 : same genotype frequencies for diabetics and normal
using the chi-squared test of homogeneity. The table below gives the expected counts along with the observed
counts:
Diabetic Normal Total
Bb or bb 12 (7.85) 4 (8.15) 16
BB 39 (43.15) 49 (44.85) 88
Total 51 53 104

Observed χ2 =5.10, df=1, p-value = 0.024. Reject H0 . Diabetics have genotype BB less often.

The exact Fisher test uses Hg(104, 51, 16/104) as the null distribution of the test statistic N11 = 12:

one-sided P-value: 1-hygecdf(11,104,16,51) = 0.0225,
two-sided P-value P = 0.045.

Normal approximation of the null distribution:

Hg(104, 51, 16/104) ≈ N(7.85, 3.41).

Since zobs = (12 − 7.85)/√3.41 = 2.245, the approximate two-sided p-value = 0.025.
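A small Matlab sketch of the chi-squared computation for this 2x2 table (chi2cdf assumes the Statistics Toolbox):

% Chi-squared test of homogeneity for the genotype table above
O = [12 4; 39 49];                            % observed counts
E = sum(O,2)*sum(O,1)/sum(O(:));              % expected counts under H0
chi2 = sum((O(:)-E(:)).^2./E(:))              % about 5.10
pval = 1 - chi2cdf(chi2,1)                    % about 0.024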

Solution 2
(a) H0 : no association of the disease and the ABO blood group:

O A AB B Total
Moderate 7 (10.4) 5 (9.8) 3 (2.0) 13 (6.2) 28
Minimal 27 (30.4) 32 (29.7) 8 (6.1) 18 (18.8) 85
Not present 55 (48.6) 50 (47.5) 7 (9.8) 24 (30.0) 136
Total 89 87 18 55 249

Observed χ2 =15.37, df=6, p-value = 0.018. Reject H0 .

(b) H0 : no association of the disease and the MN blood group:

MM MN NN Total
Moderate 21 (16.7) 6 (9.4) 1 (1.9) 28
Minimal 54 (51.3) 27 (28.9) 5 (5.8) 86
Not present 74 (81.1) 51 (45.7) 11 (9.2) 136
Total 149 84 17 250

Observed χ2 =4.73, df=4, p-value = 0.42. Can not reject H0 .

Solution 3
(a) Apply the chi-squared test of homogeneity:

Girl Boy Total


Flying fighter 51 (45.16) 38 (43.84) 89
Flying transport 14 (15.22) 16 (14.78) 30
Not flying 38 (42.62) 46 (41.38) 84
Total 103 100 203

Observed χ2 =2.75, df=2, p-value = 0.25. Can not reject H0 .

(b) Goodness of fit chi-squared test for the same sex ratio for three father’s activities

H0 : boys proportions p12 = p22 = p32 = 0.513.

Here 0.513 is obtained as


105.37
= 0.513.
105.37 + 100
Observed and expected counts

Girl Boy Total


Flying fighter 51 (43.34) 38 (45.66) 89
Flying transport 14 (14.61) 16 (15.39) 30
Not flying 38 (40.91) 46 (43.09) 84
Total 103 100 203

Observed χ2 = 3.09, df = 3, p-value = 0.38. Can not reject H0 .

Solution 4
We use the chi-squared test for homogeneity

122
No nausea Incidence of nausea Total
Placebo 70 (84) 95 (81) 165
Chlorpromazine 100 (78) 52 (74) 152
Dimenhydrinate 33 (43) 52 (42) 85
Pentobarbital (100 mg) 32 (34) 35 (33) 67
Pentobarbital (150 mg) 48 (43) 37 (42) 85
Total 283 271 554

The observed test statistic χ² = 35.8 according to the χ²₄-distribution table gives p-value = 3 · 10⁻⁷. Comparing
the observed and expected counts we conclude that Chlorpromazine is most effective in ameliorating postoperative
nausea.

Solution 5
(a) H0 : no relation between blood group and disease in London:

Control Peptic Ulcer Total


Group A 4219 (4103.0) 579 (695.0) 4798
Group O 4578 (4694.0) 911 (795.0) 5489
Total 8797 1490 10287

Observed χ² = 42.40, df = 1, p-value = 0.000. Reject H0. Odds ratio Δ̂ = 1.45.

(b) H0 : no relation between blood group and disease in Manchester:

Control Peptic Ulcer Total


Group A 3775 (3747.2) 246 (273.8) 4021
Group O 4532 (4559.8) 361 (333.2) 4893
Total 8307 607 8914

Observed χ² = 5.52, df = 1, p-value = 0.019. Reject H0. Odds ratio Δ̂ = 1.22.

(c) H0 : London Group A and Manchester Group A have the same propensity to Peptic Ulcer:

C and A PU and A Total


London 4219 (4349.2) 579 (448.8) 4798
Manchester 3775 (3644.8) 246 (376.2) 4021
Total 7994 825 8819

Observed χ2 =91.3, df=1, p-value = 0.000. Reject H0 .

H0 : London Group O and Manchester Group O have the same propensity to Peptic Ulcer:

C and O PU and O Total


London 4578 (4816.5) 911 (672.5) 5489
Manchester 4532 (4293.5) 361 (599.5) 4893
Total 9110 1272 10382

Observed χ2 =204.5, df=1, p-value = 0.000. Reject H0 .

Solution 6
D = endometrical carcinoma, X = estrogen taken at least 6 months prior to the diagnosis of cancer.

(a) Matched controls, retrospective case-control study

D̄X D̄X̄ Total


DX 39 113 152

DX̄ 15 150 165


Total 54 263 317

Apply McNemar test for

H0 : π1· = π·1 vs H1 : π1· ≠ π·1 .

Observed value of the test statistic

χ² = (113 − 15)²/(113 + 15) = 75

is highly significant as √75 = 8.7 and the corresponding two-sided P-value obtained from the N(0,1) table is very small.

(b) Possible weak points in a retrospective case-control design


- selection bias: some patients have died prior the study,
- information bias: have to rely on other sources of information.

Solution 7
(a) The exact Fisher test uses Hg(30, 17, 16/30) as the null distribution of the test statistic whose observed value is
x = 12. It gives

one-sided p-value: 1-hygecdf(11, 30, 16, 17) = 0.036,
two-sided p-value P = 0.071.

(b) Using normal approximation

Hg(30, 17, 16/30) ≈ N(9.1, (1.4)²)

and continuity correction, we find the one-sided p-value to be

P(X ≥ 12|H0 ) = P(X > 11|H0 ) ≈ 1 − Φ((11.5 − 9.1)/1.4) = 1 − Φ(1.71) = 0.044.

(c) Approximate chi-squared test yields: observed χ² = 4.69, df = 1, two-sided p-value

2(1 − Φ(√4.69)) = 2(1 − Φ(2.16)) = 0.03.

Solution 8
Denote
π1 = probability that red wins in boxing,
π2 = probability that red wins in freestyle wrestling,
π3 = probability that red wins in Greco-Roman wrestling,
π4 = probability that red wins in Tae Kwon Do.
(a, c) Assuming

Heq : π1 = π2 = π3 = π4 = π,

we test

H0 : π = 1/2 vs H1 : π ≠ 1/2.

We use the large sample test for proportion based on the statistic X = 245 whose null distribution is Bin(n, 1/2),
n = 447. The two-sided P-value is approximated by

2(1 − Φ((245 − 447/2)/(√447 · (1/2)))) = 2(1 − Φ(2.034)) = 0.042.

At 5% level we reject H0 : π = 1/2. The maximum likelihood estimate is π̂ = 245/447 = 0.55.

(d) Is there evidence that wearing red is more favourable in some of the sports than others? We test

Heq : π1 = π2 = π3 = π4 vs Hineq : πi ≠ πj for some i ≠ j

using the chi-squared test of homogeneity. From


Red Blue Total
Boxing 148 (147) 120 (121) 268
Freestyle wrestling 27 (28) 24 (23) 51
Greco-Roman wrestling 25 (26) 23 (22) 48
Tae Kwon Do 45 (44) 35 (36) 80
Total 245 202 447
Marginal proportions 0.55 0.45 1.00

we find that the test statistic χ2 = 0.3 is not significant. We can not reject Heq , which according to (a) leads to
π̂ = 0.55.

(b) Now we state the hypotheses of interest directly: consider

H0 : π1 = π2 = π3 = π4 = 1/2 vs H1 : (π1 , π2 , π3 , π4 ) ≠ (1/2, 1/2, 1/2, 1/2).

Here we need a new chi-squared test, a chi-squared test for k proportions with k = 4 (see below). Given four
observed counts x1 = 148, x2 = 27, x3 = 25, x4 = 45, we obtain the following table of observed and expected
counts

                          Red        Blue       Total
Boxing                    148 (134)  120 (134)  268
Freestyle wrestling       27 (25.5)  24 (25.5)  51
Greco-Roman wrestling     25 (24)    23 (24)    48
Tae Kwon Do               45 (40)    35 (40)    80
H0 proportions            0.5        0.5        1.00

It gives χ²obs = 4.4. Since χ²₄(0.1) = 7.8, we do not reject H0 : π1 = π2 = π3 = π4 = 1/2.

Chi-squared test for k proportions


Given k independent values (x1 , . . . , xk ) drawn from k binomial distributions with parameters (n1 , π1 ), . . . , (nk , πk ),
we test

H0 : π1 = p1 , . . . , πk = pk vs H1 : (π1 , . . . , πk ) ≠ (p1 , . . . , pk )

using the test statistic

χ² = Σ_{j=1}^{2k} (Oj − Ej )²/Ej = Σ_{i=1}^{k} [ (xi − ni pi )²/(ni pi ) + (ni − xi − ni (1 − pi ))²/(ni (1 − pi )) ],

whose null distribution is approximately χ²ₖ. The last fact follows from

Σ_{i=1}^{k} [ (xi − ni pi )²/(ni pi ) + (ni − xi − ni (1 − pi ))²/(ni (1 − pi )) ] = Σ_{i=1}^{k} (xi − ni pi )²/(ni pi (1 − pi )) = Σ_{i=1}^{k} zi²,

where zi are realisations of independent random variables

Zi = (Xi − ni pi )/√(ni pi (1 − pi )), i = 1, . . . , k,

which are approximately N(0,1) distributed, provided Xi ∼ Bin(ni , pi ).


We derive this test statistic using the likelihood ratio approach. The likelihood function based on the binomial
model has the form

L(π1 , . . . , πk ) = Π_{i=1}^{k} C(ni , xi ) πi^xi (1 − πi )^(ni − xi).

Using π̂i = xi /ni , we compute the likelihood ratio as

Λ = L(p1 , . . . , pk )/L(π̂1 , . . . , π̂k ) = Π_{i=1}^{k} pi^xi (1 − pi )^(ni − xi) / Π_{i=1}^{k} (xi /ni )^xi ((ni − xi )/ni )^(ni − xi)
  = Π_{i=1}^{k} (ni pi /xi )^xi (ni (1 − pi )/(ni − xi ))^(ni − xi).

Turn to the logarithms,

− ln Λ = Σ_{i=1}^{k} [ xi ln(xi /(ni pi )) + (ni − xi ) ln((ni − xi )/(ni (1 − pi ))) ],

observe that under H0 we have

xi /ni ≈ pi , (ni − xi )/ni ≈ 1 − pi .

Using a Taylor expansion

ln(xi /(ni pi )) ≈ (xi − ni pi )/(ni pi ), ln((ni − xi )/(ni (1 − pi ))) ≈ (ni pi − xi )/(ni (1 − pi )),

we find that

− ln Λ ≈ Σ_{i=1}^{k} (xi − ni pi )²/(ni pi (1 − pi )) = χ².
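A minimal Matlab sketch of this test applied to the 'red wins' counts of part (b) above (chi2inv assumes the Statistics Toolbox):

% Chi-squared test for k = 4 proportions with p1 = ... = p4 = 0.5
x = [148 27 25 45]; n = [268 51 48 80]; p = 0.5;
chi2 = sum((x-n*p).^2./(n*p*(1-p)))           % about 4.4
crit = chi2inv(0.9,4)                         % 7.78, so H0 is not rejected at 10%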

13.9 Solutions to Section 10 (multiple regression)
Solution 1
Recall that the sample covariance and the population covariance are

sxy = (1/(n − 1)) Σ_{i=1}^{n} (xi − x̄)(yi − ȳ), Cov(X, Y ) = E(XY ) − E(X)E(Y ).

It is enough to check that

E( Σ_{i=1}^{n} (Xi − X̄)(Yi − Ȳ ) ) = (n − 1)E(XY ) − (n − 1)E(X)E(Y ).

To do this, observe that

Σ_{i=1}^{n} (xi − x̄)(yi − ȳ) = Σ_{i=1}^{n} xi yi − x̄ Σ_{i=1}^{n} yi − ȳ Σ_{i=1}^{n} xi + nx̄ȳ = Σ_{i=1}^{n} xi yi − nx̄ȳ,

and

n²x̄ȳ = (Σ_{i=1}^{n} xi )(Σ_{j=1}^{n} yj ) = Σ_{i=1}^{n} xi yi + Σ_{i≠j} xi yj ,

so that

Σ_{i=1}^{n} (xi − x̄)(yi − ȳ) = ((n − 1)/n) Σ_{i=1}^{n} xi yi − (1/n) Σ_{i≠j} xi yj .

It remains to see that

E( Σ_{i=1}^{n} Xi Yi ) = nE(XY ), E( Σ_{i≠j} Xi Yj ) = n(n − 1)E(X)E(Y ).

Solution 2
We have after ordering
x -1.75 -1.18 -0.88 -0.65 -0.30 0.34 0.50 0.68 1.38 1.40
y -1.59 -0.81 -0.98 -0.53 -0.72 0.27 0.64 0.35 1.34 1.28
and
x̄ = −0.046, ȳ = −0.075, sx = 1.076, sy = 0.996, r = 0.98.
(a) Simple linear regression model

Y = β0 + β1 x + ε, ε ∼ N(0, σ²).

Fitting a straight line using

y − ȳ = r · (sy /sx )(x − x̄),

we get the predicted response

ŷ = −0.033 + 0.904 · x.

Estimated σ²:

s² = ((n − 1)/(n − 2)) sy²(1 − r²) = 0.05.
(b) Simple linear regression model

X = β0 + β1 y + ε, ε ∼ N(0, σ²).

Fitting a straight line using

x − x̄ = r · (sx /sy )(y − ȳ),

we get the predicted response

x̂ = 0.033 + 1.055 · y.

Estimated σ²:

s² = ((n − 1)/(n − 2)) sx²(1 − r²) = 0.06.

(c) First fitted line

y = −0.033 + 0.904 · x

is different from the second

y = −0.031 + 0.948 · x.

They are different since in (a) we minimise the vertical residuals, while in (b) the horizontal ones.
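A quick Matlab check of the two fits (polyfit and corrcoef are in base Matlab):

x = [-1.75 -1.18 -0.88 -0.65 -0.30 0.34 0.50 0.68 1.38 1.40]';
y = [-1.59 -0.81 -0.98 -0.53 -0.72 0.27 0.64 0.35 1.34 1.28]';
byx = polyfit(x,y,1)                          % slope and intercept of the regression of y on x
bxy = polyfit(y,x,1)                          % regression of x on y
r = corrcoef(x,y)                             % sample correlation matrix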

Solution 3
Using an extra explanatory variable f which equals 1 for females and 0 for males, we rewrite this model in the form
of a multiple regression

Y = f βF + (1 − f )βM + β1 x + ε = β0 + β1 x + β2 f + ε,

where

β0 = βM , β2 = βF − βM .

Here p = 3 and the design matrix is

X = [ 1 x1 f1 ; . . . ; 1 xn fn ],

with one row (1, xi , fi ) per observation. After β0 , β1 , β2 are estimated, we compute

βM = β0 , βF = β0 + β2 .

A null hypothesis of interest is β2 = 0.
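A sketch of such a fit in Matlab, with made-up data purely for illustration (regress assumes the Statistics Toolbox):

% Hypothetical covariate x, sex indicator f, and simulated responses
x = [1 2 3 4 5 6]'; f = [1 0 1 0 1 0]';
y = 2 + 0.5*x + 1.5*f + 0.1*randn(6,1);
X = [ones(6,1) x f];                          % design matrix with columns 1, x, f
b = regress(y,X);                             % b(1) = beta0, b(2) = beta1, b(3) = beta2
betaM = b(1), betaF = b(1) + b(3)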

Solution 4
(a) The predicted value Ŷ0 and actual observation Y0 are independent random variables, therefore

Var(Y0 − Ŷ0 ) = Var(Y0 ) + Var(Ŷ0 ) = σ² + Var(b0 + b1 x0 ) = σ² an²,

where, writing m2 = n⁻¹(x1² + · · · + xn²) for the average of the squared xi (so that m2 − x̄² = (n − 1)sx²/n) and noting that Cov(b0 , b1 ) = −σ²x̄/((n − 1)sx²),

an² = 1 + (Var(b0 ) + Var(b1 )x0² + 2x0 Cov(b0 , b1 ))/σ² = 1 + (m2 + x0² − 2x̄x0 )/((n − 1)sx²)
    = 1 + (m2 − x̄² + (x0 − x̄)²)/((n − 1)sx²) = 1 + 1/n + (x0 − x̄)²/((n − 1)sx²).

(b) A 95% prediction interval I for the new observation Y0 is obtained from

(Y0 − Ŷ0 )/(S an ) ∼ tn−2 .

Since

0.95 = P(|Y0 − Ŷ0 | ≤ tn−2 (0.025) · s · an ) = P(Y0 ∈ Ŷ0 ± tn−2 (0.025) · s · an ),

we conclude that a 95% prediction interval for the new observation Y0 is given by

I = b0 + b1 x0 ± tn−2 (0.025) · s · √( 1 + 1/n + (x0 − x̄)²/((n − 1)sx²) ).

The further x0 lies from x̄, the more uncertain the prediction becomes.
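A minimal Matlab sketch of the interval, using hypothetical summary values in place of a fitted model (tinv assumes the Statistics Toolbox):

% Hypothetical fitted line and summary statistics, for illustration only
n = 20; xbar = 5; sx = 2; s = 1.3; b0 = 0.7; b1 = 2.1; x0 = 8;
an = sqrt(1 + 1/n + (x0-xbar)^2/((n-1)*sx^2));
I = b0 + b1*x0 + [-1 1]*tinv(0.975,n-2)*s*an  % 95% prediction interval at x0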

Solution 5
(a) Given x = 95, we predict the final score by

ŷ = 75 + 0.5(95 − 75) = 85.

Regression to mediocrity.

(b) Given y = 85 and we do not know the midterm score, we predict the midterm score by

x̂ = 75 + 0.5(85 − 75) = 80.

Solution 6
(a) Find the correlation coefficient ρ for (X, Y ). Since EX = 0, we have
Cov(X, Y ) = E(XY ) = E(X² + βXZ) = 1, VarY = VarX + β²VarZ = 1 + β²,

and we see that the correlation coefficient is always positive

ρ = 1/√(1 + β²).

(b) Use (a) to generate five samples


(x1 , y1 ), . . . , (x20 , y20 )
with different
ρ = −0.9, −0.5, 0, 0.5, 0.9,
and compute the sample correlation
coefficients.

From ρ = 1/√(1 + β²), we get β = √(ρ⁻² − 1), so that

ρ = 0.5 ⇒ β = 1.73, ρ = 0.9 ⇒ β = 0.48.


How to generate a sample with ρ = −0.9 using Matlab:
X=randn(20,1);
Z=randn(20,1);
Y=-X+0.48*Z;
r=corrcoef(X,Y)
How to generate a sample with ρ = 0 using Matlab:
X=randn(20,1);
Y=randn(20,1);
r=corrcoef(X,Y)
Simulation results
ρ -0.9 -0.5 0 0.5 0.9
r -0.92 -0.45 -0.20 0.32 0.92

Solution 7
Matlab commands (x and y are columns)
[b,bint,res,rint,stats]=regress(y,[ones(6,1),x])
[b,bint,res,rint,stats]=regress(sqrt(y),[ones(6,1),x])
give two sets of residuals - see the plot. Two simple linear regression models:

y = −62.05 + 3.49 · x, r² = 0.984,

√y = −0.88 + 0.2 · x, r² = 0.993.

The kinetic energy formula explains why the second model is better.

14 Miscellaneous exercises
14.1 Exercises
Exercise 1
From Wikipedia:

”The American Psychological Association’s 1995 report Intelligence: Knowns and Unknowns stated
that the correlation between IQ and crime was -0.2. It was -0.19 between IQ scores and number of
juvenile offences in a large Danish sample; with social class controlled, the correlation dropped to -0.17.
A correlation of 0.20 means that the explained variance is less than 4%.”

Explain the last sentence.

Exercise 2
The Laplace distribution with a positive parameter λ is a two-sided exponential distribution. Its density function
is f (x) = (λ/2) e^(−λ|x|) for x ∈ (−∞, ∞).

(a) The variance of this distribution is 2λ⁻² and kurtosis is 6. Prove this using the formula ∫₀^∞ x^k e^(−x) dx = k!
valid for any natural number k.
(b) Take λ = √2. Plot carefully the density f (x) together with the standard normal distribution density.
(c) Use the drawn picture to explain the exact meaning of the following citation. ”Kurtosis is a measure of the
peakedness of the probability distribution of a real-valued random variable, although some sources are insistent
that heavy tails, and not peakedness, is what is really being measured by kurtosis”.

Exercise 3
The following 16 numbers came from normal random number generator on a computer:

5.33 4.25 3.15 3.70


1.61 6.39 3.12 6.59
3.53 4.74 0.11 1.60
5.49 1.72 4.15 2.28

(a) Write down the likelihood function for the mean and variance of the generating normal distribution. (Hint:
to avoid tedious calculations on your calculator use the numbers in the next subquestion.)
(b) In what sense the sum of the sample values (which is close to 58), and the sum of their squares (which is
close to 260) are sufficient statistics in this case?
(c) Turning to the log-likelihood function compute the maximum likelihood estimates for the mean and variance.
Is the MLE for the variance unbiased?

Exercise 4
Questions concerning hypotheses testing methodology. Try to give detailed answers.

(a) Consider a hypothetical study of the effects of birth control pills. In such a case, it would be impossible to
assign women to a treatment or a placebo at random. However, a non-randomized study might be conducted by
carefully matching control to treatments on such factors as age and medical history.
The two groups might be followed up on for some time, with several variables being recorded for each subject
such as blood pressure, psychological measures, and incidences of various problems. After termination of the study,
the two groups might be compared on each of these many variables, and it might be found, say, that there was a
”significant difference” in the incidence of melanoma.
What is a common problem with such ”significant findings”?
(b) You analyse cross-classification data summarized in a two by two contingency table. You wanted to apply
the chi-square test but it showed that one of the expected counts was below 5. What alternative statistical test
you may try applying?
(c) Why tests like rank sum test, Friedman test, and Kruskal-Wallis tests are often called distribution-free tests?

Exercise 5
A public policy polling group is investigating whether people living in the same household tend to make independent
political choices. They select 200 homes where exactly three voters live. The residents are asked separately for
their opinion (”yes” or ”no”) on a city charter amendment. The results of the survey are summarized in the table:
Number of saying ”yes” 0 1 2 3
Frequency 30 56 73 41
Based on these data can we claim that opinions are formed independently?

Exercise 6
Suppose you have a data of size n for which the linear regression model seems to work well. The key summary statis-
tics are represented by sample means x̄, ȳ, sample standard deviations sx , sy , and a sample correlation coefficient
r.
An important use of the linear regression model is forecasting. Assume we are interested in the response to a
particular value x of the explanatory variable.

(a) The exact 100(1 − α)% confidence interval for the mean response value is given by the formula:

b0 + b1 x ± tα/2,n−2 · s · √( 1/n + (1/(n − 1)) · ((x − x̄)/sx )² ).

Explain carefully the meaning and role of each of the terms.
(b) Another important formula in this context

b0 + b1 x ± tα/2,n−2 · s · √( 1 + 1/n + (1/(n − 1)) · ((x − x̄)/sx )² )

is called the exact 100(1 − α)% prediction interval. Explain the difference between these two formulae. Illustrate
by a simple example.
(c) Comment on the predictor properties depending on the distance from the given value x to the sample mean
x̄. Illustrate using appropriate plots.

Exercise 7
In an experimental study two volunteer male subjects aged 23 and 25 underwent three treatments to compare a
new drug against no drug and placebo. Each volunteer had one treatment per day and the time order of these
three treatments was randomized.

(a) Comment on the details of the experimental design.


(b) Find the exact null distribution for the test statistic of an appropriate non-parametric test.

Exercise 8
You have got a grant to measure the average weight of the hippopotamus at birth. You have seen in a previous
publication by Stanley and Livingstone that for male calves the distribution of weights has a mean of roughly 70
kg and a standard deviation of 10 kg, while these numbers are 60 kg and 5 kg for females, but you are interested
in a better remeasurement of the overall average.
The experimental procedure is simple: you wait for the herd of hippopotami to be sleeping, you approach a
newborn, you put it quickly on the scales, and you pray for the mother not to wake up. You managed to weigh 13
female and 23 male newborns with the following results:
Female Male
Sample mean 62.8 69.7
Sample standard deviation 6.8 11.7
(a) Test the null hypothesis of the equal sex ratio for the newborn hippopotami (meaning that the ratio of
males to females at birth is 1 to 1).
(b) Assuming the ratio of males to females at birth is 1 to 1, suggest two different unbiased point estimates for
the overall average weight of the hippopotamus at birth.
(c) Compute the standard errors for these point estimates.
(d) What assumptions do you make for these calculations?

Exercise 9
Identify important statistical terms hiding behind “bla-bla-bla” in the following extracts from the Internet (one
term per item).

(a) From a sample you can only get one value of a statistic like trimmed mean. You do not know the confidence
interval of the trimmed mean or its distribution. Bla-bla-bla samples give more detail on the sampling distribution
of the statistic of interest.
(b) Bla-bla-bla is a measure of the ”peakedness” of the probability distribution of a real-valued random variable,
although some sources are insistent that heavy tails, and not peakedness, is what is really being measured by bla-
bla-bla.
(c) Note that bla-bla-bla is the probability of finding a difference that does exist, as opposed to the likelihood
of declaring a difference that does not exist (which is known as a Type I error, or ”false positive”).
(d) The bla-bla-bla is a measure of the tendency to fail; the greater the value of the bla-bla-bla, the greater the
probability of impending failure... The bla-bla-bla is also known as the instantaneous failure rate.
(e) Naive interpretation of statistics derived from data sets that include bla-bla-bla may be misleading. For
example, if one is calculating the average temperature of 10 objects in a room, and most are between 20 and 25
degrees Celsius, but an oven is at 175 C, the median of the data may be 23 C but the mean temperature will be
between 35.5 and 40 C. In this case, the median better reflects the temperature of a randomly sampled object than
the mean; however, naively interpreting the mean as ”a typical sample”, equivalent to the median, is incorrect. As
illustrated in this case, bla-bla-bla may be indicative of data points that belong to a different population than the
rest of the sample set.

Exercise 10
Officials of a small transit system with only five buses want to evaluate four types of tires with respect to wear.
Applying a randomized block design, they decided to put one tire of each type on each of the five buses. The tires
are run for 15,000 miles, after which the tread wear, in millimeters, is measured.

Bus Tire 1 Tire 2 Tire 3 Tire 4 Mean


1 9.1 17.1 20.8 11.8 14.7
2 13.4 20.3 28.3 16.0 19.5
3 15.6 24.6 23.7 16.2 20.0
4 11.0 18.2 21.4 14.1 16.2
5 12.7 19.8 25.1 15.8 18.4
Mean 12.4 20.0 23.9 14.8 17.8

(a) State the most appropriate null hypothesis by referring to a suitable parametric model. What are the main
assumptions of the parametric model?
(b) Using a non-parametric procedure test the null hypothesis of no difference between the four types of tires.
(c) What kind of external effects are controlled by the suggested randomised block design? How the wheel
positions for different tire types should be assigned for each of the five buses?

Exercise 11
A study is conducted of the association between the rate at which words are spoken and the ability of a “talking
computer” to recognise commands that it is programmed to accept. A random sample of 50 commands is spoken
first at a rate under 60 words per minute, and then the SAME commands are repeated at a rate over 60 words per
minute. In the first case the computer recognised 42 out of 50 commands while in the second case it recognised
only 35 commands. Is the observed difference statistically significant?

Exercise 12
Suppose your prior beliefs about the probability p of success have mean 1/3 and variance 1/32. What is the
posterior mean after having observed 8 successes in 20 trials?

Exercise 13
The data of the following table were gathered for an environmental impact study that examined the relationship
between the depth of a stream and the rate of its flow

Depth Flow rate
0.34 0.64
0.29 0.32
0.28 0.73
0.42 1.33
0.29 0.49
0.41 0.92
0.76 7.35
0.73 5.89
0.46 1.98
0.40 1.12
(a) Draw the scatter plot for the given data using the x axis for depth. Fit by eye a regression line and plot
the residuals against the depth. What does it say to you about the relevance of the simple linear regression model
for this particular data?
(b) The least square estimates for the parameters of the simple linear regression model are b0 = −3.98, b1 =
13.83. Given the standard deviations are sx = 0.17 and sy = 2.46 estimate the noise size (σ) and find the coefficient
of determination.
(c) The statistics for a quadratic model are given in the following table:
Coefficient Estimate Standard error t value
β0 1.68 1.06 1.59
β1 -10.86 4.52 -2.40
β2 23.54 4.27 5.51
Compute a 95 percent confidence interval for β0 .
(d) Is the quadratic term statistically significant? Carefully explain.

Exercise 14
The article ”Effects of gamma radiation on juvenile and mature cuttings of quaking aspen” (Forest science, 1967)
reports the following data on exposure time to radiation (x, in kr/16 hr) and dry weight of roots (y, in mg×10−1 ):
x 0 2 4 6 8
y 110 123 119 86 62
The estimated quadratic regression function is y = 111.9 + 8.1x − 1.8x2 .

(a) What is the underlying multiple regression model? Write down the corresponding design matrix.
(b) Compute the predicted responses. Find an unbiased estimate s2 of the noise variance σ 2 .
(c) Compute the coefficient of multiple determination.

Exercise 15
The accompanying data resulted from an experiment carried out to investigate whether yield from a certain chemical
process depended either on the formulation of a particular input or on mixer speed.
Speed
60 70 80 Means
1 189.7 185.1 189.0
1 188.6 179.4 193.0 187.03
1 190.1 177.3 191.1
Formulation
2 165.1 161.7 163.3
2 165.9 159.8 166.6 164.66
2 167.6 161.6 170.3
Means 177.83 170.82 178.88 175.84
A statistical computer package gave
SSForm = 2253.44, SSSpeed = 230.81, SSForm×Speed = 18.58, SSE = 71.87.
(a) Calculate estimates of the main effects.
(b) Does there appear to be interaction between the factors? In which various ways interaction between such
two factors could manifest itself? Illustrate with graphs.
(c) Does yield appear to depend either on formulation or speed.
(d) Why is it important to inspect the scatter plot of residuals?

Exercise 16
A study of the relationship between facility conditions at gasoline stations and aggressiveness in the pricing of
gasoline is based on n = 441 stations.

Pricing policy
Aggressive Neutral Nonaggressive Total
Substandard condition 24 15 17 56
Standard condition 52 73 80 205
Modern condition 58 86 36 180
Total 134 174 133 441

(a) Suggest a parametric model for the data and write down the corresponding likelihood function.
(b) What is a relevant null hypothesis for the data?
(c) Properly analyse the data and draw your conclusions.

Exercise 17
Mice were injected with a bacterial solution: some of the mice were also given penicillin. The results were

Without penicillin With penicillin


Survived 8 12
Died 48 62

(a) Find a 95% confidence interval for the difference between two probabilities of survival.
(b) Assume that both groups have the probability of survival p. How would you compute an exact credibility
interval for the population proportion p, if you could use a computer? Compute an approximate 95% credibility
interval using a normal approximation.

Exercise 18
In a controlled clinical trial which began in 1982 and ended in 1987, more than 22000 physicians participated. The
participants were randomly assigned in two groups: Aspirin and Placebo. The aspirin group have been taking
325 mg aspirin every second day. At the end of trial, the number of participants who suffered from myocardial
infarctions was assessed.

MyoInf No MyoInf Total


Aspirin 104 10933 11037
Placebo 189 10845 11034

The popular measure in assessing the results in clinical trials is Risk Ratio

RR = RA /RP = (104/11037)/(189/11034) = 0.55.

(a) How would you interpret the obtained value of the risk ratio? What ratio of conditional probabilities is
estimated by RR?
(b) Is the observed value of RR significantly different from 1?

Exercise 19
Given a sample (x1 , . . . , xn ) of independent and identically distributed observations, we are interested in testing
H0 : m = m0 against the two-sided alternative H1 : m ≠ m0 concerning the population median m. No parametric
model is assumed. As a test statistic we take y = Σ_{i=1}^{n} 1{xi ≤ m0 }, the number of observations below the null
hypothesis value.

(a) Find the exact null distribution of Y . What are your assumptions?
(b) Suppose n = 25. Suggest an approximate confidence interval formula for m.

Exercise 20
Consider the problem of comparison of two simple hypotheses H0 : p = p0 , H1 : p = p1 with p1 > p0 using the
large-sample test for the proportion.

(a) Let Y have a binomial distribution with parameters (n, p). The power function of the one-sided test is given
by

Pw(p1 ) = P( (Y − np0 )/√(np0 (1 − p0 )) ≥ zα | p = p1 ).

Explain in detail all parts of this formula.

(b) Suppose we want to plan for the sample size n to control the sizes of two types of error at levels α and β.
Derive the following formula for the optimal sample size

√n = ( zα √(p0 (1 − p0 )) + zβ √(p1 (1 − p1 )) ) / |p1 − p0 |.

Hint: under the alternative hypothesis, (Y − np1 )/√(np1 (1 − p1 )) is approximately normally distributed with parameters (0,1).
(c) What happens to the planned sample size if the alternatives are very close to each other? What happens if
we decrease the levels α and β?

Exercise 21
A sports statistician studied the relation between the time (Y in seconds) for a particular competitive swimming
event and the swimmer’s age (X in years) for 20 swimmers with age ranging from 8 to 18. She employed quadratic
regression model and obtained the following result

Ŷ = 147 − 11.11X + 0.2730X 2 .

The standard error for the curvature effect coefficient was estimated as sb2 = 0.1157.

(a) Plot the estimated regression function. Would it be reasonable to use this regression function when the
swimmer’s age is 40?

(b) Construct a 99 percent confidence interval for the curvature effect coefficient. Interpret your interval esti-
mate.

(c) Test whether or not the curvature effect can be dropped from the quadratic regression model, controlling
the α risk at 0.01. State the alternatives, the decision rule, the value of the test statistic, and the conclusion. What
is the P -value of the test?

Exercise 22
In the Bayesian estimation framework we search for an optimal action

a = {assign value a to unknown parameter θ}.

The optimal choice depends on the particular form of the loss function l(θ, a). Bayes action minimizes the posterior
risk

R(a|x) = ∫ l(θ, a)h(θ|x)dθ or R(a|x) = Σ_θ l(θ, a)h(θ|x).

(a) Explain the meaning of the posterior risk function. What does h(θ|x) stand for? How is h(θ|x) computed?

(b) The zero-one loss function is defined by l(θ, a) = 1{θ ≠ a}. Compute the posterior risk using the discrete
distribution formula. Why is it called the probability of misclassification?

(c) What Bayesian estimator corresponds to the optimal action with the zero-one loss function? Compare this
estimator to the maximum likelihood estimator.

Exercise 23
See the picture to the right.
From this observation we would like to estimate the
amount of work required to clean a street
from chewing gums.

(a) Describe a Poisson distribution model


suitable for this particular observation.
Summarise the data in a convenient way.
(b) Write down the likelihood function for
this particular observation.
Find the maximum likelihood estimate.
(c) Without performing the required statistical test
describe how to check whether the Poisson model fits to the data.

(d) Estimate the proportion of tiles free from chewing gums using the fitted Poisson model.

Exercise 24
Miscellaneous questions.

(a) Describe a situation when a stratified sampling is more effective than a simple random sampling for esti-
mating the population mean. Which characteristics of the strata will influence your sample allocation choice?
(b) Given a dataset how do you compute kurtosis? What is the purpose of this summary statistic? Why is it
important to compute the coefficient of skewness for a proper interpretation of the kurtosis value?
(c) What is the difference between the parametric and non-parametric bootstrap methods?
(d) Suppose we are interested in the average height for a population of size 2,000,000. To what extend can a
sample of 200 individuals be representative for the whole population?

Exercise 25
Three different varieties of tomato (Harvester, Pusa Early Dwarf, and Ife No. 1) and four different plant densities
(10, 20, 30, and 40 thousands plants per hectare) are being considered for planting in a particular region. To see
whether either variety or plant density affects yield, each combination of variety and plant density is used in three
different plots, resulting in the following data on yields:

Variety Density 10,000 Density 20,000 Density 30,000 Density 40,000 mean
H 10.5, 9.2, 7.9 12.8, 11.2, 13.3 12.1, 12.6, 14.0 10.8, 9.1, 12.5 11.33
Ife 8.1, 8.6, 10.1 12.7, 13.7, 11.5 14.4, 15.4, 13.7 11.3, 12.5, 14.5 12.21
P 16.1, 15.3, 17.5 16.6, 19.2, 18.5 20.8, 18.0, 21.0 18.4, 18.9, 17.2 18.13
mean 11.48 14.39 15.78 13.91 13.89

(a) Fill in the ANOVA table for the missing numbers

Source of variation SS df MS F
Varieties
Density
Interaction 8.03
Errors 38.04

(b) Clearly state the three pairs of hypotheses of interest. Test them using the normal theory approach.
(c) Estimate the noise size σ.

Exercise 26
For each of nine horses, a veterinary anatomist measured the density of nerve cells at specified sites in the intestine:

Animal Site I Site II
1 50.6 38.0
2 39.2 18.6
3 35.2 23.2
4 17.0 19.0
5 11.2 6.6
6 14.2 16.4
7 24.2 14.4
8 37.4 37.6
9 35.2 24.4

The null hypothesis of interest is that in the population of all horses there is no difference between the two sites.

(a) Which of the two non-parametric tests is appropriate here: the rank-sum test or the signed-rank test?
Explain your choice.
(b) On the basis of the data, would you reject the null-hypothesis? Use one of the tests named in the item (a).
(c) Explain the following extract from the course text book:

More precisely, with the signed rank test, H0 states that the distribution of the differences is symmetric
about zero. This will be true if the members of pairs of experimental units are assigned randomly to
treatment and control conditions, and the treatment has no effect at all.

Exercise 27
Suppose that grades of 10 students on a midterm and a final exams have a correlation coefficient of 0.5 and both
exams have an average score of 75 and a standard deviation of 10.

(a) Sketch a scatterplot illustrating performance on two exams for this group of 10 students.
(b) If Carl’s score on the midterm is 90, what would you predict his score on the final to be? How uncertain is
this prediction?
(c) If Maria scored 80 on the final, what would you guess that her score on the midterm was?
(d) Exactly what assumptions do you make to make your calculations in (b) and (c)?

Exercise 28
The gamma distribution Gamma(α, λ) is a conjugate prior for the Poisson data distribution with a parameter θ. If
x is a single observed value randomly sampled from the Poisson distribution, then the parameters (α0 , λ0 ) for the
posterior gamma distribution of θ are found by the following updating rule:
- the shape parameter α0 = α + x,
- the inverse scale parameter λ0 = λ + 1.

(a) Find θ̂PME , the posteriori mean estimate for the θ, under the exponential prior with parameter 1, given the
following iid sample values from the Poisson(θ) population distribution

x1 = 2, x2 = 0, x3 = 2, x4 = 5.

(b) What is the updating rule for an arbitrary sample size n? Compare the value of θ̂PME with the maximum
likelihood estimator θ̂MLE as n → ∞. Your conclusions?

Exercise 29
Extracorporeal membrane oxygenation (ECMO) is a potentially life-saving procedure that is used to treat newborn
babies who suffer from severe respiratory failure. An experiment was conducted in which 29 babies were treated
with ECMO and 10 babies were treated with conventional medical therapy (CMT). In the ECMO group only 1
patient died, while in the CMT group 4 patients died.

(a) Suggest a statistical model and compute the likelihood function for the data as a function of two parameters:
p - the probability to die under the ECMO treatment and q - the probability to die under the CMT treatment.
(b) Write down a relevant pair of statistical hypotheses in the parametric form. Perform the exact Fisher test.

Exercise 30
Suppose that we have an iid sample of size 100 from the normal distribution with mean µ and standard deviation
σ = 10. For H0 : µ = 0 and H1 : µ ≠ 0 we use the absolute value of the sample mean T = |X̄| as the test statistic.
Denote by V the P-value of the test.

(a) Show that V = 2(1 − Φ(Tobs )), where Tobs is the observed value of the test statistic and Φ(x) is the standard
normal distribution function. Plot the null distribution curve for X̄ and graphically illustrate this formula.
(b) In what sense the P-value V is a random variable? Using (a) show that

P (V ≤ 0.05) = P (X̄ < −1.96) + P (X̄ > 1.96).

(c) Suppose that the true value of the population mean is µ = 4. Using (b) show that P (V ≤ 0.05) ≈ 0.975.
Illustrate by drawing the density curve for the true distribution of X̄.
(d) Comment on the result (c) in the light of the statement: ”P values, the ’gold standard’ of statistical validity,
are not as reliable as many scientists assume”.

Exercise 31
A population with mean µ consists of three subpopulations with means µ1 , µ2 , µ3 and the same variance σ 2 . Three
independent iid samples, each of size n = 13, from the three subpopulation distributions gave the following sample
means and standard deviations:
Sample 1 Sample 2 Sample 3
Mean 6.3 5.6 6.0
SD 2.14 2.47 3.27

(a) Compute a stratified sample mean, assuming that the three subpopulation sizes have the ratios N1 : N2 :
N3 = 0.3 : 0.2 : 0.5. Prove that this is an unbiased estimate for the population mean µ.
(b) Assume that all three subpopulation distributions are normal. Write down simultaneous confidence intervals
for the three differences µ1 − µ2 , µ1 − µ3 , and µ2 − µ3 .
(c) Would you reject the null hypothesis of equality µ1 = µ2 = µ3 in this case?

Exercise 32
The following table shows admission rates for the six most popular majors at the graduate school at the University
of California at Berkeley. The numbers in the table are the number of applicants and the percentage admitted.

Men Women
Major A 825 (62%) 108 (82%)
Major B 560 (63%) 25 (68%)
Major C 325 (37%) 593 (34%)
Major D 417 (33%) 375 (35%)
Major E 191 (28%) 393 (34%)
Major F 373 (6%) 341 (7%)

(a) If the percentage admitted are compared, women do not seem to be unfavourably treated. But when the
combined admission rates for all six majors are calculated, it is found that 44% of the men and only 30% of the
women were admitted. How this paradox is resolved?
(b) This is an example of an observational study. Suggest a controlled experiment testing relevant statistical
hypotheses.

Exercise 33
Represent the large sample test for a proportion as a chi-square test.

14.2 Numerical answers to miscellaneous exercises


Answer 1
Coefficient of determination is the squared sample correlation r2 = (0.2)2 = 0.04.

Answer 2
(a) Since the mean is 0, the variance is computed as

σ² = ∫_{−∞}^{∞} x² f (x)dx = λ ∫₀^∞ x² e^(−λx) dx = 2λ⁻².

The kurtosis is the scaled fourth moment

β2 = σ⁻⁴ ∫_{−∞}^{∞} x⁴ f (x)dx = (λ⁵/4) ∫₀^∞ x⁴ e^(−λx) dx = 6.

(b) The Laplace curve is symmetric. Its shape is formed by two exponentially declining curves: one for positive
x and the other for the negative x.

(c) For λ = √2 the mean is 0, the skewness is 0, and the kurtosis is 6. Compared to the normal curve with the
same mean but smaller kurtosis (=3), the Laplace distribution has heavier tails. Moreover, since the variances are
equal, the two curves should cross 4 times. This implies that the Laplace curve must also have higher peakedness.

Answer 3
(a) Given Σ_{i=1}^{n} xi = 58 and Σ_{i=1}^{n} xi² = 260, the likelihood function is

L(µ, σ²) = Π_{i=1}^{n} (1/(√(2π)σ)) e^(−(xi − µ)²/(2σ²)) = (2π)^(−n/2) σ^(−n) e^(−(Σ xi² − 2µ Σ xi + nµ²)/(2σ²)) = (2π)⁻⁸ σ⁻¹⁶ e^(−(260 − 116µ + 16µ²)/(2σ²)).

(b) It is sufficient to know Σ_{i=1}^{n} xi and Σ_{i=1}^{n} xi² to compute the likelihood function.

(c) The MLE for the mean is x̄ = 3.63 and the MLE for the variance is σ̂² = n⁻¹ Σ_{i=1}^{n} xi² − x̄² = 3.11. These are
computed by taking the derivative of the log-likelihood

l(µ, σ²) := ln L(µ, σ²) = −(n/2) ln(2π) − n ln σ − (Σ xi² − 2µ Σ xi + nµ²)/(2σ²)

and solving the pair of equations

(Σ xi − nµ)/σ² = 0,

−n/σ + (Σ xi² − 2µ Σ xi + nµ²)/σ³ = 0.

Since

E( n⁻¹ Σ_{i=1}^{n} Xi² − X̄² ) = σ² − σ²/n = ((n − 1)/n) σ²,

σ̂² is a biased estimate of σ².

Answer 4
(a) Multiple testing.

(b) Exact Fisher’s test.

(c) Nonparametric tests do not assume a particular form of the population distribution like normal distribution.

Answer 5
The null hypothesis is that everybody votes independently. Let p be the population proportion for ’yes’. Then
the number of ’yes’ answers among the three voters in a household has the binomial distribution model X ∼ Bin(3, p) with an
unspecified parameter p. So the null hypothesis can be expressed in the following form

H0 : p0 = (1 − p)³, p1 = 3p(1 − p)², p2 = 3p²(1 − p), p3 = p³.

The MLE of p is p̂ = x̄/3 = 0.5417. We use the Pearson chi-square test with expected counts

E0 = n(1 − p̂)³ = 19, E1 = 3np̂(1 − p̂)² = 68, E2 = 3np̂²(1 − p̂) = 81, E3 = np̂³ = 32.

The observed chi-square test statistic is χ² = 11.8 which has a P-value less than 0.5% according to the approximate
null distribution χ²df with df = 4 − 1 − 1 = 2.

Reject the null hypothesis of independent voting.

Answer 7
(b) Friedman’s test for I = 3 treatments and J = 2 blocks. The test statistic

Q = (12J/(I(I + 1))) Σ_{i=1}^{I} (R̄i. − (I + 1)/2)²

is obtained from the ranks given by two subjects (Rij ) to the three treatments. Under the null distribution all 36
possible rank combinations

(Rij ) = [1 1; 2 2; 3 3], [1 1; 2 3; 3 2], [1 2; 2 1; 3 3], . . . , [3 1; 2 2; 1 3], [3 1; 2 3; 1 2], [3 3; 2 2; 1 1]

are equally likely. The corresponding vector of rank averages (R̄1. , R̄2. , R̄3. ) takes 5 values (up to permutations)

A1 = (1, 2, 3), A2 = (1, 2.5, 2.5), A3 = (1.5, 1.5, 3), A4 = (1.5, 2, 2.5), A5 = (2, 2, 2)

according to the following table

1, 2, 3 1, 3, 2 2, 1, 3 2, 3, 1 3, 1, 2 3, 2, 1
1, 2, 3 A1 A2 A3 A4 A4 A5
1, 3, 2 A2 A1 A4 A3 A5 A4
2, 1, 3 A3 A4 A1 A5 A2 A4
2, 3, 1 A4 A3 A5 A1 A4 A2
3, 1, 2 A4 A5 A2 A4 A1 A3
3, 2, 1 A5 A4 A4 A2 A3 A1

Next we have
(R̄1. , R̄2. , R̄3. )            A1     A2     A3     A4     A5
Σ_{i=1}^{3} (R̄i. − 2)²         2      1.5    1.5    0.5    0
Probability                     1/6    1/6    1/6    1/3    1/6

Thus, since Q = 2 Σ_{i=1}^{3} (R̄i. − 2)², the null distribution of Q is the following one

P(Q = 0) = 1/6, P(Q = 1) = 1/3, P(Q = 3) = 1/3, P(Q = 4) = 1/6.

Answer 8
(a) Binomial model for the number of females Y ∼ Bin(36, p). Given Yobs = 13 we have to test H0 : p = 0.5
against the two-sided alternative H1 : p ≠ 0.5. The approximate null distribution is Y ∼ N(18, 9), therefore, an
approximate two-sided P-value becomes

2 × (1 − Φ((18 − 13)/3)) = 2(1 − Φ(1.67)) = 2 × 0.048 = 9.6%.

With such a high P-value we can not reject the null hypothesis of equal sex ratio.

(b) A simple sample mean

x̄ = (n1 x̄1 + n2 x̄2 )/n = (13 × 62.8 + 23 × 69.7)/36 = 67.2,

and a stratified sample mean

x̄s = (1/2)x̄1 + (1/2)x̄2 = (62.8 + 69.7)/2 = 66.3.

(c) The standard error of the stratified sample mean is

sx̄s = (1/2)√( s1²/n1 + s2²/n2 ) = (1/2)√( (6.8)²/13 + (11.7)²/23 ) = 1.54.

To compute the sample variance from the simple random sample takes some effort. First we observe that

Σ (x1i − x̄1 )² = Σ x1i² − n1 x̄1².

It follows,

Σ x1i² = Σ (x1i − x̄1 )² + n1 x̄1² = (n1 − 1)s1² + n1 x̄1² = 12 × (6.8)² + 13 × (62.8)² = 51825,

Σ x2j² = (n2 − 1)s2² + n2 x̄2² = 22 × (11.7)² + 23 × (69.7)² = 114748,

s² = (Σ x1i² + Σ x2j²)/35 − (36/35) x̄² = 114.4.

So that sx̄ = √(114.4/36) = 1.78.
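The same numbers in a short Matlab script, working only from the given summary statistics:

n1 = 13; n2 = 23; m1 = 62.8; m2 = 69.7; s1 = 6.8; s2 = 11.7;
xbar  = (n1*m1 + n2*m2)/(n1+n2)               % simple sample mean, 67.2
xbars = (m1 + m2)/2                           % stratified mean with weights 1/2, 1/2
se_s  = 0.5*sqrt(s1^2/n1 + s2^2/n2)           % standard error of the stratified mean, 1.54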

Answer 9
(a) Bootstrap.

(b) Kurtosis.

(c) Power of the test.

(d) Hazard function.

(e) Outliers.

Answer 10
(a) Under the two-way ANOVA model the most interesting hypothesis is H0 : α1 = α2 = α3 = α4 = 0, the null hypothesis
of no difference among the four types of tires.

(b) The Friedman test


Bus Tire 1 Tire 2 Tire 3 Tire 4
1 1 3 4 2
2 1 3 4 2
3 1 4 3 2
4 1 3 4 2
5 1 3 4 2
Mean rank 1.0 3.2 3.8 2.0
results in a test statistic Q = 14.04. The null distribution is approximated by a chi-square distribution with df
= 3, whose table gives a P-value less than 0.5%. Reject H0 .

Answer 11
Matched pairs design for 50 independent trials with four possible outcomes (correct, correct), (correct, wrong),
(wrong, correct), (wrong, wrong). Assuming that in the slow regime the ”talking computer” recognizes correctly
all correct answers made in the fast regime we can summarize the results as follows
Fast correct Fast wrong Totals
Slow correct 35 7 42
Slow wrong 0 8 8
Totals 35 15 50
McNemar's test statistic is (7 − 0)²/(7 + 0) = 7. The null distribution is approximated by the χ²₁-distribution. Since the
square root of 7 is 2.65, the standard normal distribution gives a (two-sided) P-value 0.8%. We conclude that the
observed difference is statistically significant.
The conclusion will be different if our assumption is wrong. In the worst case the slow regime correct answers
are totally different and the table of the outcomes looks as
Fast correct Fast wrong Totals
Slow correct 27 15 42
Slow wrong 8 0 8
Totals 35 15 50

McNemar's test statistic is then (15 − 8)²/(15 + 8) = 2.13. Since the square root of 2.13 is 1.46, the standard normal
distribution gives a two-sided p-value 14%. We can not reject the null hypothesis in this case.

Answer 12
We use a Beta prior with parameters (a, b) satisfying

a/(a + b) = 1/3, (1/3)(1 − 1/3)/(a + b + 1) = 1/32.
The prior pseudo-counts are well approximated by a = 2 and b = 4. Thus the posterior Beta distribution has
parameters (10, 16) giving the posterior mean estimate p̂pme = 0.38.

Answer 13
(b) First we find the sample correlation coefficient by r = b1 sx /sy = 0.96. The coefficient of determination is
r² = 0.91. Using the formula

s² = ((n − 1)/(n − 2)) sy²(1 − r²) = 0.589,

the noise size is estimated as s = √0.589 = 0.77.

(c) An exact 95% CI for β0 is b0 ± tn−3 (0.025)sb0 = 1.68 ± 2.365 × 1.06 = [−0.83, 4.19].

(d) The observed test statistic t = 5.51 for the model utility test for H0 : β2 = 0 has an exact null distribution
t7 . After consulting the t7 −distribution we reject this null hypothesis at 0.5% significance level. The quadratic
term is therefore highly statistically significant.

Answer 14
(a) Multiple regression model Yi = β0 + β1 xi + β2 xi² + εi , where the random variables εi , i = 1, . . . , 5, are
independent and have the same normal distribution N(0, σ²). The corresponding design matrix has the form

X = [ 1 0 0; 1 2 4; 1 4 16; 1 6 36; 1 8 64 ].

(b) Using the formula ŷi = 111.9 + 8.1xi − 1.8x2i we get

xi 0 2 4 6 8
yi 110 123 119 86 62
ŷi 111.9 120.9 115.5 95.7 61.5
and then s² = ‖y − ŷ‖²/(n − p) = 114.6/2 = 57.3.

(c) Coefficient of multiple determination

SSE 114.6
R2 = 1 − =1− = 0.956.
SST 2630

Answer 15
(a) In terms of the two-way ANOVA model Yijk = µ + αi + βj + δij + εijk (grand mean + main effects +
interaction + noise), we estimate the main effects as

α̂1 = 11.9, α̂2 = −11.8, β̂1 = 1.99, β̂2 = −5.02, β̂3 = 3.04.

(Notice the effect of rounding errors.)

(b) Compute the cell means

Speed
60 70 80
1 189.7 185.1 189.0
1 188.6 179.4 193.0
1 190.1 177.3 191.1
Cell means 189.5 180.6 191.0
2 165.1 161.7 163.3
2 165.9 159.8 166.6
2 167.6 161.6 170.3
Cell means 166.2 161.0 166.7

and draw two lines for the speed depending on two different formulations, see the left panel on the figure below.
These two lines are almost parallel, indicating the absence of interaction between the two main factors. This is
confirmed by the ANOVA table below showing that the interaction is not significant.

[Two panels: cell-mean yield (roughly 160 to 190) plotted against speed 60, 70, 80 for formulations 1 and 2.]

One possible interaction effect could have the form on the right panel. In this case the formulation 2 interacts
with the speed factor in such a way that the yield becomes largest at the speed 70.

(c) Anova-2 table

Source df SS MS F Critical values Significance


Formulation 1 2253.44 2253.44 376.2 F1,12 = 4.75 Highly significant
Speed 2 230.81 115.41 19.3 F2,12 = 3.89 Highly significant
Interaction 2 18.58 9.29 1.55 F2,12 = 3.89 Not significant
Error 12 71.87 5.99
Total 17

(d) To check the normality assumption for the noise with the same variance across different values of the
explanatory variable.

Answer 16
(a) This is a single sample of size n = 441. Each of the n observations falls into one of 9 groups. The multinomial
distribution model

(n11 , n12 , n13 , n21 , n22 , n23 , n31 , n32 , n33 ) ∼ Mn(n, p11 , p12 , p13 , p21 , p22 , p23 , p31 , p32 , p33 )

gives the likelihood function

L(p11 , p12 , p13 , p21 , p22 , p23 , p31 , p32 , p33 )
= P(n11 = 24, n12 = 15, n13 = 17, n21 = 52, n22 = 73, n23 = 80, n31 = 58, n32 = 86, n33 = 36)
= (441!/(24! 15! 17! 52! 73! 80! 58! 86! 36!)) · p11^24 · p12^15 · p13^17 · p21^52 · p22^73 · p23^80 · p31^58 · p32^86 · p33^36 .
(b) The null hypothesis of independence H0 : pij = pi. · p.j meaning that there is no relationship between facility
conditions at gasoline stations and aggressiveness in the pricing of gasoline.

(c) The chi-square test statistic χ² = 22.5 should be compared with the critical values of the χ²₄-distribution. Even
though the corresponding table is not given we may guess that the result must be significant as the square root of
22.5 is quite large. We reject the null hypothesis of independence and conclude that there is a relationship
between facility conditions at gasoline stations and aggressiveness in the pricing of gasoline.

Pricing policy
Aggressive Neutral Nonaggressive Total
Substandard condition 24 (17) 15 (22) 17 (17) 56
Standard condition 52 (62.3) 73 (80.9) 80 (61.8) 205
Modern condition 58 (54.7) 86 (71) 36 (54.3) 180
Total 134 174 133 441
It looks like the standard conditions are coupled with the least aggressive pricing strategy.

Answer 17
(a) Two independent dichotomous samples with n = 56, p̂1 = 8/56 = 0.143 and m = 74, p̂2 = 12/74 = 0.162. An
asymptotic 95% confidence interval for the population difference is given by

Ip1−p2 = p̂1 − p̂2 ± 1.96 · √( p̂1 (1 − p̂1 )/(n − 1) + p̂2 (1 − p̂2 )/(m − 1) ) = −0.019 ± 0.125 = [−0.144, 0.106].
(b) For a credibility interval we can use the non-informative uniform prior p ∈ Beta(1, 1). Adding the pseudo-
counts (1, 1) to the total counts (8 + 12, 48 + 62) we get p ∈ Beta(21, 111) as the posterior distribution. Using
Matlab one can find the exact 95% credibility interval [a, b] for p by finding the 2.5% and 97.5% quantiles of the
posterior distribution.

Posterior mean µ = 21/(21 + 111) = 0.16 and standard deviation σ = √(0.16(1 − 0.16)/132) = 0.03 lead to the normal
approximation of the posterior distribution with mean 0.16 and standard deviation 0.03. This yields an approximate
95% credibility interval

Jp = 0.16 ± 1.96 · 0.03 = [0.10, 0.22].
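The exact Matlab computation mentioned above might look as follows (betainv assumes the Statistics Toolbox):

% 2.5% and 97.5% quantiles of the Beta(21,111) posterior distribution
ci = [betainv(0.025,21,111) betainv(0.975,21,111)]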

Answer 18
(a) The risk ratio compares the chances to suffer from myocardial infarction under the aspirin treatment vs the
chances to suffer from myocardial infarction under the placebo treatment:

RR = P(MyoInf|Aspirin) / P(MyoInf|Placebo).

(b) The null hypothesis of RR = 1 is equivalent to the hypothesis of homogeneity.

MyoInf No MyoInf Total


Aspirin 104 (146.5) 10933 (10887.5) 11037
Placebo 189 (146.5) 10845 (10887.5) 11034
Total 293 21778 22071
The corresponding chi-square test statistic is

χ² = 42.5²/146.5 + 42.5²/146.5 + 42.5²/10887.5 + 42.5²/10887.5 = 25.
Since df=1 we can use the normal distribution table. The square root of 25 is 5 making the result highly significant.
Aspirin works!

Answer 19
(a) The null distribution of Y is Bin(n, 1/2) as each observation is smaller than the true median (assuming that
the distribution is continuous) with probability 0.5.

(b) A non-parametric CI for the median M is given by (x(k) , x(n−k+1) ) where k is such that

PH0 (Y > n − k) ≈ 0.025.

With n = 25 we find k using the normal approximation with continuity correction:

0.025 ≈ PH0 (Y > 25 − k) = PH0 ( (Y − 12.5)/2.5 > (13 − k)/2.5 ) ≈ P( Z > (13 − k)/2.5 ).

Thus (13 − k)/2.5 ≈ 1.96 and we get k = 8. The approximate 95% CI for M is given by (X(8) , X(18) ).

Answer 20
(a) The null distribution of Y is approximately normally distributed with parameters (np0 , np0 q0 ), where
q0 = 1 − p0 . At the significance level α, the rejection region for the one-sided alternative is

(y − np0 )/√(np0 q0 ) ≥ zα .

The power function is the probability of rejecting the null hypothesis given the alternative one is true

Pw(p1 ) = P( (Y − np0 )/√(np0 q0 ) ≥ zα | p = p1 ).

(b) To compute the required sample size observe first that

β = P( (Y − np0 )/√(np0 q0 ) < zα | p = p1 ) = P( (Y − np1 )/√(np1 q1 ) < (zα √(p0 q0 ) + √n(p0 − p1 ))/√(p1 q1 ) | p = p1 ).

Now, since under the alternative hypothesis Y is approximately normally distributed with parameters (np1 , np1 q1 ),
we get

β ≈ Φ( (zα √(p0 q0 ) + √n(p0 − p1 ))/√(p1 q1 ) ).

Combining this with

β = Φ(−zβ ),

we arrive at the equation

(zα √(p0 q0 ) + √n(p0 − p1 ))/√(p1 q1 ) = −zβ ,

which brings the desired formula for the optimal sample size

√n = ( zα √(p0 q0 ) + zβ √(p1 q1 ) ) / |p1 − p0 |.
(c) If the alternatives are very close to each other, the denominator goes to zero and the sample size becomes
very large. This is very intuitive as it becomes more difficult to distinguish between two close parameter values.
If we decrease the levels α and β, the values zα and zβ from the normal distribution table become larger and
the sample size will be larger as well. Clearly, if you want more control over both types of errors, you have to
pay by collecting more data.
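A small Matlab sketch of the sample size formula, with purely illustrative planning values of the proportions and error levels (norminv assumes the Statistics Toolbox):

p0 = 0.5; p1 = 0.6; alpha = 0.05; beta = 0.1;     % hypothetical planning values
za = norminv(1-alpha); zb = norminv(1-beta);
n = ceil(((za*sqrt(p0*(1-p0)) + zb*sqrt(p1*(1-p1)))/abs(p1-p0))^2)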

Answer 21
(a) The underlying parabola makes the unrealistic prediction ŷ40 = 139 sec compared to ŷ10 = 63 sec and
ŷ20 = 34 sec. One should be careful about extrapolating beyond the range of the explanatory variable used in the data.

(b) Using t17 (0.005) = 2.898 we get the exact confidence interval (under the assumption of normality and
homoscedasticity for the noise component)

Iµ = 0.2730 ± 2.898 · 0.1157 = (−0.0623, 0.6083).

(c) Since the confidence interval from (b) covers zero, we do not reject the null hypothesis H0 : β2 = 0 at the 1%
significance level. The observed t-test statistic 0.2730/0.1157 = 2.36 lies in (2.110, 2.567), so according to the t17-distribution
table the two-sided p-value is between 2% and 5%.

Answer 22
(a) For a given action a, the posterior risk function
R(a|x) = Σ_θ l(θ, a)h(θ|x) = E( l(Θ, a) | x )

is the expected loss when the unknown parameter θ is treated as a random variable Θ with the posterior distribution:

P(Θ = θ|x) = h(θ|x).

(b) For the 0-1 loss function in the discrete distribution case,
R(a|x) = Σ_{θ ≠ a} h(θ|x) = 1 − h(a|x) = P(Θ ≠ a|x)

is the probability of misclassification, that is the posterior probability that the chosen action a is different from the
true value of the parameter θ.

(c) The corresponding Bayesian estimator minimizing the risk function R(a|x) = 1 − h(a|x) maximizes h(a|x),
the posterior probability. It is denoted θ̂MAP and called the maximum a posteriori probability estimate. In the
case the prior distribution is non-informative, so that the posterior distribution is proportional to the likelihood
function, we have θ̂MAP = θ̂MLE .

Answer 23
(a) The numbers of chewing gums for different tiles are summarized in the form of observed counts
Number of gums per tile 0 1 2 3 4 ≥5
Counts 11 8 2 0 1 0
with the total number of tiles n = 22. The Poisson model assumes that the number of gums X1 , . . . , Xn are
independent random variables with the common one-parameter distribution

P(X = k) = (λ^k /k!) e^(−λ), k = 0, 1, . . .
(b) The likelihood function

L(λ) = Π_{i=1}^{n} e^(−λ) λ^xi /xi ! = e^(−nλ) λ^(nx̄)/(x1 ! · · · xn !) = e^(−22λ) λ^16/(2! 2! 4!).

The log likelihood

l(λ) = const − 22λ + 16 log λ.

The equation l′(λ) = 0 gives

0 = −22 + 16/λ.

The MLE becomes λ̂ = 16/22 = 0.73, which is x̄.

(c) Use the Pearson chi-square test of goodness of fit. Combine the cells for 2 and more gums. Compute the
expected counts by

E0 = n · e^(−λ̂), E1 = n · λ̂ e^(−λ̂), E2 = n − E0 − E1 .

Then find the test statistic χ² = Σ (Ok − Ek )²/Ek and use the chi-square distribution table with df = 3 − 1 − 1 = 1 to
see if the result is significant. For example, if √χ² > 1.96, we reject the Poisson model hypothesis at α = 5%.

(d) Using the Poisson model we estimate p0 by p̂0 = e^(−λ̂) = 0.48, which is close to the sample proportion
11/22 = 0.50.
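A sketch of this fit and the goodness-of-fit check in Matlab (chi2cdf assumes the Statistics Toolbox):

counts = [11 8 2 0 1];                            % tiles with 0, 1, 2, 3, 4 gums
n = sum(counts); lambda = 16/22;                  % MLE of the Poisson parameter
O = [counts(1) counts(2) sum(counts(3:end))];     % pool the cells '2 or more gums'
E = [n*exp(-lambda) n*lambda*exp(-lambda) 0]; E(3) = n - E(1) - E(2);
chi2 = sum((O-E).^2./E)
pval = 1 - chi2cdf(chi2,1)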

Answer 24
(a) When the population under investigation has a clear structure it is more effective to use stratified sampling
for estimating the overall population mean. In accordance with the optimal allocation formula:
ni = n · wi σi /σ̄,
the allocation of observations should follow the next two key rules: put more observations in the larger strata, and
put more observations in the strata with higher variation.

(b) The sample kurtosis is computed from a sample (x1 , . . . , xn ) as

b2 = (1/(n s⁴)) Σ_{i=1}^{n} (xi − x̄)⁴,

where x̄ = (x1 + . . . + xn )/n is the sample mean and s² = (1/(n − 1)) Σ_{i=1}^{n} (xi − x̄)² is the sample variance. If the
corresponding coefficient of skewness is close to zero and b2 ≈ 3, then we get an indication that the shape of the
population distribution curve is close to the normal distribution.

(c) Bootstrap is a resampling technique used to study the sampling distribution of a parameter estimator. In
the parametric bootstrap resampling is done from the given parametric distribution with the unknown parameters
replaced by their estimates obtained from the underlying sample. In the non-parametric bootstrap resampling is
performed with replacement directly from the the underlying sample.

(d) The standard error for the sample mean is sx̄ = s/√200. Roughly: the range of heights 160−200 in centimeters
covers 95% of the population distribution. Treating this interval as the mean plus-minus two standard deviations,
we find s ≈ 10 cm and sx̄ is something like 0.7 cm. Thus a random sample of size 200 may give a decent estimate
of the population mean height.

Answer 25
(a)

Source of variation SS df MS F
Varieties 328.24 2 164.12 103.55
Density 86.68 3 28.89 18.23
Interaction 8.03 6 1.34 0.84
Errors 38.04 24 1.59

(b) Using the critical values

F2,24 = 3.40, F3,24 = 3.01, F6,24 = 2.51,

we reject both null hypotheses on the main factors and do not reject the null hypothesis on interaction.

(c) s = √1.59 = 1.26.

Answer 26
(a) This is an example of a paired sample, therefore the signed-rank test is appropriate for testing the null
hypothesis of no difference.

(b) We use the signed-rank test. The observed test statistics are W+ = 39 and W− = 6.

Animal Site I Site II Difference Signed rank


1 50.6 38.0 12.6 8
2 39.2 18.6 20.6 9
3 35.2 23.2 12.0 7
4 17.0 19.0 -2.0 -2
5 11.2 6.6 4.6 4
6 14.2 16.4 -2.2 -3
7 24.2 14.4 9.8 5
8 37.4 37.6 -0.2 -1
9 35.2 24.4 10.8 6

According to Figure 2, the two-sided p-value is larger than 5% because the smaller test statistic w− = 6 is larger
than the critical value 5 for n = 9. Therefore, we do not reject the null hypothesis of equality in favour of the
two-sided alternative.
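A sketch of the same test in Python (for a two-sided alternative scipy reports the smaller of W+ and W−, here 6, together with the exact p-value):

```python
from scipy import stats

site1 = [50.6, 39.2, 35.2, 17.0, 11.2, 14.2, 24.2, 37.4, 35.2]
site2 = [38.0, 18.6, 23.2, 19.0,  6.6, 16.4, 14.4, 37.6, 24.4]

# Wilcoxon signed-rank test on the paired differences
res = stats.wilcoxon(site1, site2, alternative='two-sided')
print(res.statistic, res.pvalue)   # statistic 6, p-value above 0.05
```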

(c) The extract from the course textbook reminds us that the null hypothesis for the signed-rank test, besides the equality of the two population distributions, assumes a symmetric distribution for the differences. It also explains why such an assumption is reasonable.

Answer 27
(b) The fitted regression line for the final score y as a function of the midterm score x is y = 37.5 + 0.5x. Given
x = 90 we get a point prediction y = 82.5. The estimate of σ 2 is

$$s^2 = \frac{n-1}{n-2}\, s_y^2 (1 - r^2) = 84.4.$$

Thus the 95% prediction interval for Carl’s final score is
$$I = 82.5 \pm t_8(0.025)\, s \sqrt{1 + \tfrac{1}{10} + \tfrac{1}{9}\left(\tfrac{15}{10}\right)^2} = 82.5 \pm 24.6.$$
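A sketch of this computation, with the sample size n = 10 and the midterm mean 75 and standard deviation 10 that are implicit in the numbers above (these summary values are assumptions of the sketch, not stated in this answer):

```python
import numpy as np
from scipy import stats

# Assumed summary statistics consistent with the answer above
n, x_bar, s_x, x0 = 10, 75.0, 10.0, 90.0
y_hat = 37.5 + 0.5 * x0                     # point prediction, 82.5
s = np.sqrt(84.4)                           # residual standard deviation

half_width = (stats.t.ppf(0.975, df=n - 2) * s
              * np.sqrt(1 + 1 / n + (x0 - x_bar) ** 2 / ((n - 1) * s_x ** 2)))
print(y_hat - half_width, y_hat + half_width)   # roughly 82.5 +/- 24.6
```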

(c) The fitted regression line for the midterm score x as a function of the final score y is x = 37.5 + 0.5y. Given
y = 80 we get a point prediction x = 77.5.

Answer 28
(a) The exponential prior with parameter 1 is Gamma(1, 1). Applying the updating rule four times:
(1, 1) → (3, 2) → (3, 3) → (5, 4) → (10, 5),
we find the posterior distribution to be Gamma(10, 5). Therefore, θ̂PME = 10/5 = 2.

(b) The general updating rule for an arbitrary sample (x1, . . . , xn) becomes
- the shape parameter α′ = α + nx̄,
- the inverse scale parameter λ′ = λ + n.

We have θ̂PME = (α + nx̄)/(λ + n). Comparing this to the maximum likelihood estimator θ̂MLE = x̄, we see that
$$\hat{\theta}_{\mathrm{PME}} - \hat{\theta}_{\mathrm{MLE}} = \frac{\alpha + n\bar{x}}{\lambda + n} - \bar{x} = \frac{\alpha - \lambda\bar{x}}{\lambda + n} \to 0,$$
as n → ∞. This means that the role of the prior is less important with large sample sizes.
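A small sketch of the conjugate updating; the individual observations 2, 0, 2, 5 are read off from the updating chain in (a):

```python
from scipy import stats

alpha, lam = 1.0, 1.0            # exponential(1) prior = Gamma(1, 1)
for x in [2, 0, 2, 5]:           # observations implied by the chain (1,1) -> ... -> (10,5)
    alpha, lam = alpha + x, lam + 1

posterior = stats.gamma(a=alpha, scale=1 / lam)   # scipy parametrizes by shape and scale = 1/rate
print(alpha, lam, posterior.mean())               # 10, 5 and posterior mean 2.0
```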

Answer 29
(a) We have two independent samples from two distributions: one with parameter p, and the other with
parameter q. Using Bin(29, p) and Bin(10, q) we compute the likelihood function as

$$L(p, q) = \binom{29}{1} p (1 - p)^{28} \cdot \binom{10}{4} q^4 (1 - q)^{6}.$$
(b) We test H0 : p = q against H1 : p ≠ q using the exact Fisher test.
ECMO CMT Total
Died 1 4 5
Alive 28 6 34
Total 29 10 39
The count y = 1 is our observed test statistic whose null distribution is Hg(39, 29, 5/39). The one-sided p-value is
$$P(Y = 0) + P(Y = 1) = \frac{\binom{5}{0}\binom{34}{29}}{\binom{39}{29}} + \frac{\binom{5}{1}\binom{34}{28}}{\binom{39}{29}} = \frac{34!\,29!\,10!}{5!\,29!\,39!} + \frac{5 \cdot 34!\,29!\,10!}{6!\,28!\,39!}$$
$$= \frac{10 \cdot 9 \cdot 8 \cdot 7 \cdot 6}{39 \cdot 38 \cdot 37 \cdot 36 \cdot 35} + \frac{10 \cdot 9 \cdot 8 \cdot 7 \cdot 5 \cdot 29}{39 \cdot 38 \cdot 37 \cdot 36 \cdot 35} = \frac{10 \cdot 9 \cdot 8 \cdot 7 \cdot (6 + 5 \cdot 29)}{39 \cdot 38 \cdot 37 \cdot 36 \cdot 35} = 0.011.$$
The two-sided p-value becomes 2% and we can reject the null hypothesis.
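The same p-value can be obtained in Python, either from the hypergeometric distribution directly or via scipy's exact Fisher test:

```python
from scipy import stats

# 2x2 table with rows (Died, Alive) and columns (ECMO, CMT)
table = [[1, 4], [28, 6]]

p_one_sided = stats.hypergeom.cdf(1, 39, 5, 29)              # P(Y <= 1) under Hg(39, 29, 5/39)
_, p_fisher = stats.fisher_exact(table, alternative='less')  # one-sided exact Fisher test
print(p_one_sided, p_fisher)   # both roughly 0.011; doubling gives the two-sided 2%
```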

Answer 30
(a) The null distribution of X̄ is standard normal. The p-value of the test is the probability under the null distribution that |X̄| > tobs. Thus
V = P(|X̄| > tobs |H0 ) = 2(1 − Φ(tobs )).
(b) Different samples will give different observed values tobs = |x̄obs|; in this sense the p-value
V = 2(1 − Φ(|X̄|))
is a random variable. We have
P(V ≤ 0.05) = P(1 − Φ(|X̄|) ≤ 0.025) = P(Φ(|X̄|) ≥ 0.975) = P(|X̄| > 1.96).
(c) If the true value of the population mean is µ = 4, then X̄ has distribution N(4, 1). Using (b) we find
P(V ≤ 0.05) ≈ P(X̄ > 2) = 1 − Φ(2 − 4) = Φ(2) ≈ 0.975.
(d) We see from (c) that even with such a big separation between the null-hypothesis value and the true parameter value, there is a probability of about 2.5% that the p-value will exceed 5%. One has to be aware of this variability when interpreting the p-value produced by a statistical analysis.
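This variability is easy to see by simulation; a short sketch under the setting of (c):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Sample means drawn under the true mean mu = 4, as in (c)
xbar = rng.normal(loc=4.0, scale=1.0, size=100_000)
v = 2 * (1 - stats.norm.cdf(np.abs(xbar)))      # the p-value as a random variable

print(np.mean(v > 0.05))   # about 2%, in line with the approximation in (c)-(d)
```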

Answer 31
(a) Stratified sample mean x̄s = 0.3 · 6.3 + 0.2 · 5.6 + 0.5 · 6.0 = 6.01.

(b) We are in the one-way Anova setting with I = 3 and J = 13. The 95% Bonferroni simultaneous confidence
intervals for the three differences µ1 − µ2 , µ1 − µ3 , and µ2 − µ3 are computed as
$$B_{\mu_u - \mu_v} = \bar{x}_u - \bar{x}_v \pm t_{36}(0.05/6)\, s_p \sqrt{2/13},$$

with the pooled sample variance given by

$$s_p^2 = \frac{12 s_1^2 + 12 s_2^2 + 12 s_3^2}{36} = \frac{2.14^2 + 2.47^2 + 3.27^2}{3} = 2.67^2.$$
This yields
Bµu −µv = x̄u − x̄v ± 2.5 · 2.67 · 0.39 = x̄u − x̄v ± 2.62.
(c) We would not reject the null hypothesis of equality µ1 = µ2 = µ3 , since for all three pairwise differences the
confidence intervals contain zero:

0.7 ± 2.62, 0.3 ± 2.62, 0.4 ± 2.62.
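A sketch of the interval computation, using the group means from (a) and the standard deviations quoted above:

```python
import numpy as np
from scipy import stats
from itertools import combinations

means = np.array([6.3, 5.6, 6.0])           # sample means of the three groups
sds = np.array([2.14, 2.47, 3.27])          # their sample standard deviations
I, J = 3, 13

sp = np.sqrt(np.mean(sds ** 2))             # pooled sd with equal group sizes
t_crit = stats.t.ppf(1 - 0.05 / 6, df=I * (J - 1))   # Bonferroni over 3 comparisons
half_width = t_crit * sp * np.sqrt(2 / J)

for u, v in combinations(range(I), 2):
    d = means[u] - means[v]
    print(u + 1, v + 1, round(d - half_width, 2), round(d + half_width, 2))
```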

Answer 32
(a) This is another example of the Simpson paradox. The confounding factor here is the difficulty of entering the programmes. Men tend to apply for programmes that are easy to enter, while women more often apply for programmes with low admission rates.

(b) A simple hypothetical experimental study could be based on two independent random samples. Focus
on one major program, say major F. Take n randomly chosen female candidates and n randomly chosen male
candidates. Ask all of them to apply for major F. Compare two sample proportions of the admitted applicants.
Of course this experiment is impossible to perform in practice!

Answer 33
Let X ∼ Bin(n, p). Two observed counts (x, n − x) are used to test H0 : p = p0 . The corresponding chi-square test
statistic
$$\chi^2 = \sum_{j=1}^{2} \frac{(O_j - E_j)^2}{E_j} = \frac{(x - np_0)^2}{np_0} + \frac{(n - x - n(1 - p_0))^2}{n(1 - p_0)} = \frac{(x - np_0)^2}{np_0(1 - p_0)} = z^2,$$
where
$$z = \frac{x - np_0}{\sqrt{np_0(1 - p_0)}}$$

is the test statistic for the large sample test for a proportion.
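A quick numerical check of this identity on hypothetical numbers:

```python
import numpy as np

n, x, p0 = 50, 32, 0.5                       # hypothetical counts (32, 18), testing p0 = 0.5
observed = np.array([x, n - x])
expected = np.array([n * p0, n * (1 - p0)])

chi2 = np.sum((observed - expected) ** 2 / expected)
z = (x - n * p0) / np.sqrt(n * p0 * (1 - p0))
print(chi2, z ** 2)                          # the two values coincide (3.92)
```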
