Regn_lect_3
We are often interested in the relationship between our response and one (or more) explanatory variables:
(a) We might wish to see whether rhknow increases with age in Stepping stones
(b) We could investigate whether the time taken for the TOP procedure depends on
gestational age in the misoprostol study.
We do this using the method of “least squares” which fits the line
y= a + bx
which passes as closely as possible to our observed data points.
If ŷi = a + bxi is the i-th fitted value, then a and b are chosen to minimize the sum of the
squared vertical deviations between the observed points and the fitted line. In terms of our
simple linear regression model
yi = α + βxi + εi
we minimize ∑εi² with respect to α and β.
It turns out that the minimizing values are
b = Sxy / Sxx
where Sxy = ∑xiyi - (∑xi)(∑yi)/n is the corrected sum of products and
Sxx = ∑xi² - (∑xi)²/n is the corrected sum of squares for x
and a = mean(y) - b × mean(x).
Our fitted line y = a + bx will always pass through the point (mean(x), mean(y)).
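For illustration, these quantities can be computed from first principles in Stata. This is just a sketch, assuming the data contain variables named x and y; the scalar and variable names below are ours:

. quietly summarize x
. scalar mx = r(mean)
. quietly summarize y
. scalar my = r(mean)
. generate double dxy = (x - mx)*(y - my)    // products of deviations from the means
. generate double dxx = (x - mx)^2           // squared deviations of x
. quietly summarize dxy
. scalar Sxy = r(sum)                        // corrected sum of products
. quietly summarize dxx
. scalar Sxx = r(sum)                        // corrected sum of squares for x
. scalar b = Sxy/Sxx                         // slope
. scalar a = my - b*mx                       // intercept
. display "b = " b "   a = " a

Note that the deviation forms used here are algebraically identical to the computational formulas given above.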
Ex: The table below shows plasma volume and body weight in eight healthy men (Kirkwood &
Sterne, page 87). Our outcome variable is plasma volume and our explanatory (exposure)
variable is body weight; we wish to see whether plasma volume is related to body weight.

Subject    Body weight (kg)    Plasma volume (litres)
   1             58.0                  2.75
   2             70.0                  2.86
   3             74.0                  3.37
   4             63.5                  2.76
   5             62.0                  2.62
   6             70.5                  3.49
   7             71.0                  3.05
   8             66.0                  3.12
The plot below shows that as body weight increases, plasma volume also tends to increase i.e.
there seems to be a relationship between plasma volume and body weight, so we can estimate
the regression equation.
[Figure: scatter plot of plasma volume (litres) against body weight (kg)]
. reg plasvol wt
------------------------------------------------------------------------------
plasvol | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
wt | .0436153 .0152684 2.86 0.029 .006255 .0809757
_cons | .0857244 1.023998 0.08 0.936 -2.419909 2.591358
------------------------------------------------------------------------------
Thus plasma volume increases on average by 0.0436 litres for every 1 kg increase in body
weight (or by 10 × 0.0436 = 0.436 litres for a 10-kg increase in weight).
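After regress, Stata's lincom command reports the estimate and confidence interval for any linear combination of the coefficients, so the 10-kg effect can be obtained directly; the multiplier 10 is our choice:

. lincom 10*wt        // estimated increase in plasma volume per 10 kg, with 95% CI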
To plot the observed points with the fitted line (as done above) we use the predict command to
save the fitted values and then the relevant graph command as shown below. The predict
command is a very powerful command that can be used to estimate or predict a number of
quantities after fitting a regression model and takes the form
predict newvar , options
where newvar is the name of the new variable which will contain the predicted value and the
option chosen specifies what this new variable will contain. If we omit the option (as below) then
newvar will contain the default which is the fitted values (or the values of y predicted from our
model).
. predict plasfit
(option xb assumed; fitted values)
. twoway (scatter plasvol wt, sort msymbol(plus)) (connected plasfit wt, sort
> msymbol(none)), xtitle("Body weight (kg)") xlabel(55(5)75) ylabel(2.5(0.25)3.5)
> ti("Regression of plasma volume on body weight")
We can test the null hypothesis that there is no linear relationship between y and x (H0: β = 0)
by constructing the analysis of variance table for the regression model.
We subdivide the total variation in our sample of y values into the variation which can be
attributed to the regression line (the regression sum of squares) and the residual variation about
the fitted line, as shown on the graph below in which we have drawn in the mean of the x-values
and the mean of the y-values.
[Figure: scatter plot of plasma volume against body weight (kg) with the fitted line, showing
mean(x) and mean(y) and the partition of each deviation into regression and residual components]
In practice the analysis of variance is obtained by calculating the total sum of squares as the
sum of the squared deviations about the mean of the y-values, i.e. Syy, and the regression sum
of squares given by
Regression ss = (Sxy)² / Sxx
We can then calculate the residual sum of squares by subtraction
Residual ss = total ss – regression ss
Note that the residual sum of squares is the minimized value of the sum of squared deviations.
The partition of variation is summarized in the analysis of variance table. To construct this we
need the degrees of freedom (number of pieces of independent information that go into the
calculation of each sum of squares). The (n-1) degrees of freedom for the total sum of squares
are divided into 1 d.f. for the slope (regression) and (n-2) for the residual – we can understand
this as having n observations from which we have estimated 2 parameters (a and b) – hence we
have (n-2) d.f. for the residual.
From each sum of squares we calculate a mean square by dividing the sum of squares by the
associated degrees of freedom. The residual mean square s², found by dividing the residual sum
of squares by (n-2), provides an estimate of the residual variation in y having "adjusted for the
effect of x".
The regression sum of squares is given by (Sxy)² / Sxx = (8.9575)² / 205.375 = 0.39068.
Thus by subtraction the residual sum of squares is 0.67795 - 0.39068 = 0.28727.
We can then construct our ANOVA table noting that since we have 8 observations our total
degrees of freedom will be (n-1) = 7 and the residual degrees of freedom will be (n-2) = 6.

Source         d.f.      SS          MS         F
Regression       1     0.39068    0.39068     8.16
Residual         6     0.28727    0.04788
Total            7     0.67795
Under H0 we would expect the regression and residual mean squares to be comparable (i.e. the
variation due to the effect of body weight should be comparable to the residual or random
variation) and thus we would expect F to be about 1; in fact under H0 the F-ratio follows an F-
distribution with 1,6 degrees of freedom. Comparing our F-ratio of 8.16 with F-tables shows that
the probability of getting such an F-value if H0 is true is less than 0.05 (in fact P=0.0289) so we
can reject the null hypothesis and conclude that there is a relationship between y and x.
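The tail probability can be checked in Stata with the Ftail function, which returns the upper-tail probability of an F distribution with the given degrees of freedom:

. display Ftail(1, 6, 8.16)    // ≈ 0.029, the P-value for the regression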
The components of the ANOVA table are given as part of the output from the regress command
(and are reproduced below).
. reg plasvol wt

      Source |       SS       df       MS              Number of obs =       8
-------------+------------------------------           F(  1,     6) =    8.16
       Model |     .39068     1     .39068             Prob > F      =  0.0289
    Residual |     .28727     6     .04788             R-squared     =  0.5763
-------------+------------------------------           Adj R-squared =  0.5057
       Total |     .67795     7     .09685             Root MSE      =  .21881

------------------------------------------------------------------------------
     plasvol |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          wt |   .0436153   .0152684     2.86   0.029      .006255    .0809757
       _cons |   .0857244   1.023998     0.08   0.936    -2.419909    2.591358
------------------------------------------------------------------------------

Note:
1. The output from regress in Stata gives the standard errors for the regression estimates.
The standard errors can be used to test the hypothesis that the parameter is zero and also to
construct confidence intervals for the parameter.
In this case the t-statistic (on 6 d.f.) for testing the hypothesis that β = 0 is 2.86 and we reject
the null hypothesis (P = 0.029). Note that in the case of simple linear regression the t-test is
equivalent to the F-test (in fact t² = F, or 2.86² ≈ 8.16 to within rounding error). The 95%
confidence limits for β are 0.006255 to 0.08098, so we are 95% confident that plasma volume
increases on average by between 0.00626 and 0.08098 litres for every 1 kg increase in body
weight.
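These limits can be reproduced by hand: Stata's invttail(6, .025) returns the t critical value on 6 d.f. (about 2.447), so for example

. display .0436153 - invttail(6, .025)*.0152684    // lower 95% limit, ≈ 0.00626
. display .0436153 + invttail(6, .025)*.0152684    // upper 95% limit, ≈ 0.08098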
2. In this example the intercept is not meaningful. Literally it could be interpreted as the
estimated mean plasma volume when weight = 0 kg. The intercept can be made more
meaningful by centering the exposure variable, i.e. subtracting the mean of the exposure
variable from each observation, so that the new exposure variable has mean 0. The
intercept in a simple linear regression model with a centered exposure variable is equal to
the mean of the outcome variable (see the sketch after these notes).
3. Also note that in general the test of the null hypothesis that the intercept is zero is usually
not of interest.
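A sketch of the centering described in note 2 (wtcent is a variable name we have made up):

. quietly summarize wt
. generate wtcent = wt - r(mean)    // centered body weight, mean 0
. reg plasvol wtcent                // slope unchanged; intercept now estimates mean plasma volume

With the centered exposure, the intercept estimates the mean plasma volume (about 3.00 litres here) and its test and confidence interval become interpretable.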
Prediction
We can use our regression equation for prediction; the predicted value y* when x=x* is given by
y* = a + bx*
The standard error of prediction is given by
s.e.(y*) = s √{1 + 1/n + [x* - mean(x)]² / Sxx}
The standard error is smallest when x* is close to the mean of the x-values and increases as x*
moves away from the mean, thus prediction is most precise close to the mean of the exposure
variable.
Also note that in general we should avoid extrapolation i.e. using the regression line for
predicting values outside the range of x in the original data, as the linear relationship may not
hold true beyond the range over which it has been fitted.
We have seen how to use the predict command to find the predicted values corresponding to
each observation. We can use predict with the stdf option to find the standard errors of the
predicted values (stdf is short for the standard error of the forecast):
predict seyfit , stdf
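Continuing from the commands above (plasfit holds the fitted values and seyfit the forecast standard errors), 95% prediction limits can then be computed with the t critical value on (n-2) = 6 d.f.; lower and upper are variable names we have chosen:

. generate lower = plasfit - invttail(6, .025)*seyfit    // lower 95% prediction limit
. generate upper = plasfit + invttail(6, .025)*seyfit    // upper 95% prediction limit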
Ex: Since measuring plasma volume is time consuming, we may wish to predict it from body
weight. The predicted plasma volume for a man weighing 66 kg is
0.0857 + 0.0436 × 66 = 2.96 litres
s.e.(y*) = 0.2189 √{1 + 1/8 + [66 - 66.9]² / 205.375} = 0.23 litres
So 95% confidence limits for the plasma volume for a man of 66 kg (known as the prediction
interval) are
2.96 ± 2.45 * 0.23 or (2.40 ; 3.52).
Correlation
As well as estimating the best fitting straight line, we may wish to examine the strength of the
linear association between the outcome and exposure variable. This is measured by the
correlation coefficient r which is estimated as
r = Sxy / √{ Sxx Syy }
In our example on plasma volume
r = 8.9575 / √{205.375 × 0.67795} = 0.76
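In Stata the correlation coefficient is obtained with the correlate command (pwcorr with the sig option also reports a significance test):

. correlate plasvol wt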
The correlation coefficient always lies between -1 and 1 and equals zero if there is no linear
association between x and y. It is positive if high values of y are associated with high values of x
and low values of y are associated with low values of x; the higher its value, the stronger the
association. It is negative if high values of y are associated with low values of x and low values
of y are associated with high values of x. The correlation coefficient has the same sign as the
regression slope b.
A useful interpretation of the correlation coefficient is that it is the number of standard deviations
that the outcome y changes for each standard deviation change in the exposure x; equivalently,
the regression slope can be written b = r × (sy/sx), where sy and sx are the standard deviations
of y and x (in our example 0.7591 × 0.3112/5.4166 = 0.0436, the slope estimated above).
For larger samples (n > 100) the sampling distribution of r is approximately normal, which gives
a simple method of finding a confidence interval for the correlation coefficient.
For studies with smaller sample sizes, confidence intervals for the correlation coefficient can be
derived using Fisher’s transformation
zr = 0.5 loge {(1+r) / (1-r)}
s.e.(zr) ≈ 1 / √ (n-3)
We can then find a confidence interval for zr which can then be back-transformed to give a
confidence interval for r using the inverse of Fisher’s transformation
r = {exp(2 zr ) – 1} / {exp(2 zr ) + 1}
In our example
zr = 0.5 loge {(1.76) / (0.24)} = 0.994
s.e.(zr) ≈ 1 / √ (8-3) = 0.447
So our 95% c.i. for zr is
0.994 – 1.96*0.447 ; 0.994 + 1.96*0.447
or 0.1176 ; 1.8706
Applying the inverse of Fisher’s transformation gives a confidence interval for r of
0.1171 ; 0.9536
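These calculations can be carried out directly in Stata, since its atanh and tanh functions implement Fisher's transformation and its inverse; a sketch (the scalar names are ours):

. quietly correlate plasvol wt
. scalar zr = atanh(r(rho))               // Fisher's transformation of r
. scalar se = 1/sqrt(r(N) - 3)            // approximate standard error of zr
. display tanh(zr - 1.96*se) ", " tanh(zr + 1.96*se)    // back-transformed 95% limits for r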
Note that R², the proportion of variation in y explained by its relationship to x, is simply the
square of the correlation coefficient: here R² = 0.76² = 0.58, so body weight explains about 58%
of the variation in plasma volume.
Correlation is particularly useful when looking for association between variables when there is
no clear outcome variable and no clear exposure variable (e.g. if we look for an association
between the results of two blood chemistry variables such as sodium and potassium) and will be
important later on when we consider multi-collinearity between exposure variables in multiple
regression. For examining the relationship between an outcome variable and an exposure
variable, simple linear regression is generally preferred.