
Lecture 3: Simple Linear Regression


Objectives
• Interpret a scatter diagram that shows the relation between two quantitative variables
• Understand the assumptions that are made in a linear regression analysis
• Interpret a simple linear regression equation, describing the intercept and slope
• Learn how to test whether a regression relationship is statistically significant
• Understand why extrapolation may be inappropriate
• Understand the relationship between correlation and linear regression

We are often interested in the relationship between our response variable and one (or more) explanatory variables:
(a) we might wish to see whether rhknow increases with age in the Stepping Stones study;
(b) we could investigate whether the time taken for the TOP procedure depends on
gestational age in the misoprostol study.

We investigate relationships statistically using regression models. The simplest type of
regression model is the simple linear regression model, in which the response variable y is
related to the input variable x according to the equation y = α + βx.
So our line has two parameters, namely α and β:
α is the intercept, i.e. the value of y when x = 0
β is the slope, i.e. when x increases by 1 unit, y increases by β units
In practice we have random error in our observations, so we adopt the simple linear regression
model yi = α + βxi + εi
We do not know the true population parameters α and β, so we have to estimate them from
our data using sample statistics a and b.

We do this using the method of "least squares", which fits the line
y = a + bx
that passes as closely as possible to our observed data points.
If ŷi = a + bxi is the i-th fitted value, then a and b are chosen to minimize the sum of the
squared vertical deviations ∑(yi - ŷi)² between the observed points and the fitted line. In terms
of our simple linear regression model
yi = α + βxi + εi
we minimize ∑εi² with respect to α and β.
It turns out that the minimizing values are

b = Sxy / Sxx

where Sxy = ∑xiyi - (∑xi)(∑yi)/n is the corrected sum of products and
Sxx = ∑xi² - (∑xi)²/n is the corrected sum of squares for x,
and a = mean(y) - b*mean(x).

Our fitted line y = a + bx will always pass through the point (mean(x), mean(y)).

Ex: The table below shows plasma volume and body weight in eight healthy men (Kirkwood &
Sterne page 87). Our outcome variable is plasma volume and our explanatory (exposure)
variable is body weight; we wish to see whether plasma volume is related to body weight.

Subject   Body weight (kg)   Plasma volume (litres)
   1            58.0                 2.75
   2            70.0                 2.86
   3            74.0                 3.37
   4            63.5                 2.76
   5            62.0                 2.62
   6            70.5                 3.49
   7            71.0                 3.05
   8            66.0                 3.12
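
To reproduce the analysis below in Stata, the data can be entered with the input command; a minimal sketch, using the variable names wt and plasvol that appear in the regression output later in these notes:

. clear
. input wt plasvol
  58.0  2.75
  70.0  2.86
  74.0  3.37
  63.5  2.76
  62.0  2.62
  70.5  3.49
  71.0  3.05
  66.0  3.12
  end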

The plot below shows that as body weight increases, plasma volume also tends to increase, i.e.
there seems to be a relationship between plasma volume and body weight, so we can estimate
the regression equation.

[Figure: Regression of Plasma Volume on Body Weight. Scatter plot of plasma volume (litres) against body weight (kg), with the fitted values shown as a line.]


Assumptions of our regression model


1. The independent variable x is measured without error.
2. The true value of the response variable y is linearly related to x; y is subject to random error:
yi = α + βxi + εi
3. The deviations εi are assumed to be
(a) independent
(b) normally distributed with zero mean and constant variance σ²

Fitting the regression line


In order to fit the regression line we need to estimate the parameters, i.e. find a and b. We can
do this by using the formulae and carrying out the calculations as shown below:
Sxy = ∑xiyi - (∑xi)(∑yi)/n = 1615.295 - (535)*(24.02)/8 = 8.9575
Sxx = ∑xi² - (∑xi)²/n = 35983.5 - (535)²/8 = 205.375
b = Sxy / Sxx = 0.043615
a = mean(y) - b*mean(x) = 3.0025 - 0.043615*66.875 = 0.0857

Thus our fitted line has equation y = 0.0857 + 0.043615 x


(where y is the plasma volume and x is the body weight).
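
As a check, the sums above can be reproduced in Stata using summarize, which returns the sum, mean and count of a variable; a minimal sketch (the generated variables xy and xsq and the scalar names are illustrative, not part of the original notes), which should agree with the hand calculations above up to rounding:

. * sums and means of x (weight) and y (plasma volume)
. quietly summarize wt
. scalar n    = r(N)
. scalar sumx = r(sum)
. scalar xbar = r(mean)
. quietly summarize plasvol
. scalar sumy = r(sum)
. scalar ybar = r(mean)
. * corrected sum of products Sxy and corrected sum of squares Sxx
. gen xy = wt*plasvol
. quietly summarize xy
. scalar Sxy = r(sum) - sumx*sumy/n
. gen xsq = wt^2
. quietly summarize xsq
. scalar Sxx = r(sum) - sumx^2/n
. * least-squares estimates
. scalar b = Sxy/Sxx
. scalar a = ybar - b*xbar
. display "Sxy = " Sxy "   Sxx = " Sxx "   b = " b "   a = " a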
In order to plot the fitted line we need the co-ordinates of two points which can be found by
choosing two x values in the range of the data and using the fitted equation to find the
corresponding y values.
When x=60, y = 0.0857 + 0.043615*60 = 2.7
When x=70, y = 0.0857 + 0.043615*70 = 3.1
In addition the line should pass through the point given by the mean of x and the mean of y (in
this case where x = 66.875 and y = 3.0025).
In practice, fitting the regression line is most easily done using a statistical package such as
Stata, with the regress command as shown below.

. reg plasvol wt

      Source |       SS       df       MS              Number of obs =       8
-------------+------------------------------           F(  1,     6) =    8.16
       Model |  .390684335     1  .390684335           Prob > F      =  0.0289
    Residual |  .287265681     6  .047877614           R-squared     =  0.5763
-------------+------------------------------           Adj R-squared =  0.5057
       Total |  .677950016     7  .096850002           Root MSE      =  .21881

------------------------------------------------------------------------------
     plasvol |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          wt |   .0436153   .0152684     2.86   0.029     .006255    .0809757
       _cons |   .0857244   1.023998     0.08   0.936   -2.419909    2.591358
------------------------------------------------------------------------------

Thus our fitted line has the equation


Plasvol = 0.0857 + 0.043615 * bodyweight (as before)

Thus plasma volume increases on average by 0.0436 litres for every 1 kg increase in body
weight (or by 10*0.0436 = 0.436 litres for a 10 kg increase in weight).

To plot the observed points with the fitted line (as done above) we use the predict command to
save the fitted values and then the relevant graph command as shown below. The predict
command is a very powerful command that can be used to estimate or predict a number of
quantities after fitting a regression model and takes the form
predict newvar , options

where newvar is the name of the new variable that will contain the predicted values, and the
option chosen specifies what this new variable will contain. If we omit the option (as below),
newvar will contain the default, which is the fitted values (i.e. the values of y predicted from our
model).
. predict plasfit
(option xb assumed; fitted values)
. twoway (scatter plasvol wt, sort msymbol(plus)) (connected plasfit wt, sort
msymbol(none)) ,
> xtitle (body weight (kg)) xlabel (55 (5) 75) ylabel (2.5 (0.25) 3.5)
ti("Regression of plasma volume on body weight")

Assessing the regression line


We are interested in the question: "Is there a real relationship between y and x?"
That is, we ask whether the slope of the assumed linear regression genuinely differs from zero.

We can test this by constructing the analysis of variance table for the regression model:
We subdivide the total variation in our sample of y values into the variation which can be
attributed to the regression line (the regression sum of squares) and the residual variation about
the fitted line, as shown on the graph below in which we have drawn in the mean of the x-values
and the mean of the y-values.

[Figure: Regression of Plasma Volume on Body Weight. The same scatter plot with the fitted line and the means of the x- and y-values drawn in.]

In practice the analysis of variance is obtained by calculating the total sum of squares as the
sum of the squared deviations about the mean of the y-values, i.e. Syy, and the regression sum
of squares given by
Regression ss = (Sxy)² / Sxx
We can then calculate the residual sum of squares by subtraction:
Residual ss = total ss - regression ss
Note that the residual sum of squares is the minimized value of the sum of squared deviations.

The partition of variation is summarized in the analysis of variance table. To construct this we
need the degrees of freedom (number of pieces of independent information that go into the
calculation of each sum of squares). The (n-1) degrees of freedom for the total sum of squares
are divided into 1 d.f. for the slope (regression) and (n-2) for the residual – we can understand
this as having n observations from which we have estimated 2 parameters (a and b) – hence we
have (n-2) d.f. for the residual.
From each sum of squares we calculate a mean square by dividing the sum of squares by the
associated degrees of freedom. The residual mean square s², found by dividing the residual sum
of squares by (n-2), provides an estimate of the residual variation in y having "adjusted for the
effect of x".

In our example Syy = ∑yi² - (∑yi)²/n = 72.798 - (24.02)²/8 = 0.67795

The regression sum of squares is given by (Sxy)²/Sxx = (8.9575)²/205.375 = 0.39068
Thus by subtraction the residual sum of squares is 0.67795 - 0.39068 = 0.28727

We can then construct our ANOVA table noting that since we have 8 observations our total
degrees of freedom will be (n-1)=7 and the residual degrees of freedom will be
(n-2)=6.

Source of variation    d.f.    Sum of Squares    Mean Square    F ratio

Regression               1         0.39068         0.39068        8.16
Residual                 6         0.28727         0.047878
Total                    7         0.67795
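
The same partition can be computed in Stata; a hedged sketch reusing the scalars (n, sumy, Sxy, Sxx) defined in the earlier sketch (the variable ysq is illustrative):

. * total, regression and residual sums of squares
. gen ysq = plasvol^2
. quietly summarize ysq
. scalar Syy   = r(sum) - sumy^2/n
. scalar regSS = Sxy^2/Sxx
. scalar resSS = Syy - regSS
. display "Total SS = " Syy "   Regression SS = " regSS "   Residual SS = " resSS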

We can use the F-ratio (8.16) to test


H0: There is no relationship between y and x
versus
H1: There is a relationship between y and x

Under H0 we would expect the regression and residual mean squares to be comparable (i.e. the
variation due to the effect of body weight should be comparable to the residual or random
variation) and thus we would expect F to be about 1; in fact under H0 the F-ratio follows an
F-distribution with 1 and 6 degrees of freedom. Comparing our F-ratio of 8.16 with F-tables shows
that the probability of getting such an F-value if H0 is true is less than 0.05 (in fact P = 0.0289),
so we can reject the null hypothesis and conclude that there is a relationship between y and x.
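
The P-value can also be checked with Stata's Ftail() function, which gives the upper-tail probability of an F-distribution (this check is not part of the original notes):

. * upper tail of F(1,6) at the observed F-ratio; should give 0.0289 as above
. display Ftail(1, 6, .390684335/.047877614)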
The components of the ANOVA table are given as part of the output from the regress command
(and are reproduced below).

. reg plasvol wt

      Source |       SS       df       MS              Number of obs =       8
-------------+------------------------------           F(  1,     6) =    8.16
       Model |  .390684335     1  .390684335           Prob > F      =  0.0289
    Residual |  .287265681     6  .047877614           R-squared     =  0.5763
-------------+------------------------------           Adj R-squared =  0.5057
       Total |  .677950016     7  .096850002           Root MSE      =  .21881

Standard error of regression estimates


The calculated values of a and b are estimates of the population parameters α and β (the
intercept and slope) and are thus subject to sampling variation. As with other estimates, their
precision is measured by their standard errors.
It can be shown that
s.e.(a) = s √{1/n + mean(x)²/Sxx}
s.e.(b) = s / √Sxx
where s is the square root of the residual mean square from the ANOVA table and gives the
standard deviation of the data points about the regression line (on n-2 degrees of freedom).
In our example s = √0.047878 = 0.2188
s.e.(a) = 0.2188 * √{1/8 + (66.9)²/205.375} = 1.024
s.e.(b) = 0.2188 / √205.375 = 0.01528
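
A hedged sketch of these calculations in Stata, reusing the scalars (n, Sxx, xbar, resSS) defined in the earlier sketches:

. * residual standard deviation and standard errors of a and b
. scalar s = sqrt(resSS/(n - 2))
. display "s = " s
. display "se(a) = " s*sqrt(1/n + xbar^2/Sxx)
. display "se(b) = " s/sqrt(Sxx)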

Note:
1. The output from regress in Stata gives the standard errors for the regression estimates.
------------------------------------------------------------------------------
     plasvol |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          wt |   .0436153   .0152684     2.86   0.029     .006255    .0809757
       _cons |   .0857244   1.023998     0.08   0.936   -2.419909    2.591358
------------------------------------------------------------------------------

The standard errors can be used to test the hypothesis that the parameter is zero and also to
construct confidence intervals for the parameter.
In this case the t-statistic (on 6 d.f.) for testing the hypothesis that β = 0 is 2.86 and we reject
the null hypothesis (P = 0.029). Note that in the case of simple linear regression the t-test is
identical to the F-test (in fact T² = F, or 2.86² ≈ 8.16, to within rounding error). The 95%
confidence limits for β are 0.006255 to 0.08098 - so we are 95% confident that plasma volume
increases by between 0.00626 and 0.08098 litres for every 1 kg increase in body weight.
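
These limits can be reproduced from the coefficient and its standard error with Stata's invttail() function (a check, not part of the original notes):

. * 97.5% point of t on 6 d.f. is about 2.447
. display "lower = " .0436153 - invttail(6, 0.025)*.0152684
. display "upper = " .0436153 + invttail(6, 0.025)*.0152684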

2. In this example the intercept is not meaningful. Literally it could be interpreted as the
estimated mean plasma volume when weight = 0 kg. The intercept can be made more
meaningful by centering the exposure variable, i.e. subtracting the mean of the exposure
variable from each observation so that the new exposure variable has mean 0. The
intercept in a simple linear regression model with a centered exposure variable is equal to
the mean of the outcome variable (see the sketch after this list).
3. Also note that the test of the null hypothesis that the intercept is zero is usually not of
interest.
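
A minimal sketch of centering in Stata (the variable name wt_c is illustrative); the _cons estimate from this fit should equal the mean plasma volume, 3.0025 litres, while the slope is unchanged:

. quietly summarize wt
. gen wt_c = wt - r(mean)
. regress plasvol wt_c
. * re-run "regress plasvol wt" afterwards if predictions from the original model are wanted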

Prediction
We can use our regression equation for prediction; the predicted value y* when x=x* is given by
y* = a + bx*
The standard error of prediction is given by
s.e.(y*) = s √{1 + 1/n + [x* - mean(x)]²/Sxx}
The standard error is smallest when x* is close to the mean of the x-values and increases as x*
moves away from the mean; thus prediction is most precise close to the mean of the exposure
variable.
Also note that in general we should avoid extrapolation i.e. using the regression line for
predicting values outside the range of x in the original data, as the linear relationship may not
hold true beyond the range over which it has been fitted.
Note that we have seen how to use the predict command to find our predicted values
corresponding to each observation. We can use predict with the stdf option to find standard
errors for the predicted values (stdf is short for standard error of the forecast value)
predict seyfit , stdf

Ex: Since measuring plasma volume is time consuming, we may wish to predict it from body
weight. The predicted plasma volume for a man weighing 66 kg is
0.0857 + 0.0436*66 = 2.96 litres
s.e.(y*) = 0.2188 * √{1 + 1/8 + [66 - 66.9]²/205.375} = 0.23 litres

So 95% confidence limits for the plasma volume for a man of 66 kg (known as the prediction
interval) are
2.96 ± 2.45*0.23 or (2.40 ; 3.52).
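
A hedged sketch of this calculation in Stata, building on the fitted values plasfit and the forecast standard errors seyfit created earlier with predict (the names pi_lo and pi_hi are illustrative):

. * 95% prediction interval at each observed body weight
. gen pi_lo = plasfit - invttail(6, 0.025)*seyfit
. gen pi_hi = plasfit + invttail(6, 0.025)*seyfit
. list wt plasvol plasfit pi_lo pi_hi if wt == 66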

Proportion of variation explained by regression


The proportion of variation in y that is explained by the regression on x is given by
R² = (Regression sum of squares) / (Total sum of squares)
In this example R² = 0.576 (given as part of the Stata output).
Thus 57.6% of the variation in plasma volume is explained by its linear relationship with body
weight - so in this case 42.4% of the variation is still unexplained.

Correlation
As well as estimating the best fitting straight line, we may wish to examine the strength of the
linear association between the outcome and exposure variable. This is measured by the
correlation coefficient r which is estimated as
r = Sxy / √{ Sxx Syy }
In our example on plasma volume
r = 8.96 / √{ 205.375 * 0.6780 } = 0.76
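
As a check, r can be computed from the scalars defined earlier, or obtained directly with the correlate command (neither line is part of the original notes):

. display "r = " Sxy/sqrt(Sxx*Syy)
. correlate plasvol wt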
The correlation coefficient always lies between -1 and 1 and equals zero if there is no linear
association between x and y. It is positive if high values of y are associated with high values of x
and low values of y are associated with low values of x; the higher its value, the stronger the
association. It is negative if high values of y are associated with low values of x and low values
of y are associated with high values of x. The correlation coefficient has the same sign as the
regression slope b.
A useful interpretation of the correlation coefficient is that it is the number of standard deviations
that the outcome y changes for each standard deviation change in the exposure x.
For larger samples (n > 100), this gives a simple method of finding a confidence interval for the
correlation coefficient.

For studies with smaller sample sizes, confidence intervals for the correlation coefficient can be
derived using Fisher’s transformation
zr = 0.5 loge {(1+r) / (1-r)}
s.e.(zr) ≈ 1 / √ (n-3)
We can then find a confidence interval for zr which can then be back-transformed to give a
confidence interval for r using the inverse of Fisher’s transformation
r = {exp(2 zr ) – 1} / {exp(2 zr ) + 1}

In our example
zr = 0.5 loge {(1.76) / (0.24)} = 0.994
s.e.(zr) ≈ 1 / √ (8-3) = 0.447
So our 95% c.i. for zr is
0.994 – 1.96*0.447 ; 0.994 + 1.96*0.447
or 0.1176 ; 1.8706
Applying the inverse of Fisher’s transformation gives a confidence interval for r of
0.1171 ; 0.9536
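
A hedged Stata sketch of Fisher's transformation and the back-transformed interval, reusing the scalars n, Sxy, Sxx and Syy defined earlier (the scalar names rho, zr, sez, zlo and zhi are illustrative):

. scalar rho = Sxy/sqrt(Sxx*Syy)
. scalar zr  = 0.5*ln((1 + rho)/(1 - rho))
. scalar sez = 1/sqrt(n - 3)
. scalar zlo = zr - 1.96*sez
. scalar zhi = zr + 1.96*sez
. * back-transform the limits for zr to limits for r
. display "95% CI for r: " (exp(2*zlo) - 1)/(exp(2*zlo) + 1) " to " (exp(2*zhi) - 1)/(exp(2*zhi) + 1)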

Note that R², the proportion of variation in y explained by its relationship to x, is simply the
square of the correlation coefficient.
Correlation is particularly useful when looking for an association between variables where there
is no clear outcome variable and no clear exposure variable (e.g. if we look for an association
between the results of two blood chemistry variables such as sodium and potassium), and it will
be important later on when we consider multi-collinearity between exposure variables in multiple
regression. For examining the relationship between an outcome variable and an exposure
variable, simple linear regression is generally preferred.
