05 Class RegressionCorrelation
LINEAR REGRESSION AND
CORRELATION
Lecture Topics
Scatterplots
Review equation of a line
Estimates of regression coefficients
Inference on regression slope
Correlation vs. slope
Correlation not causation
Multiple linear regression
Regression and Correlation
Example
Association between body weight (x) and
plasma volume (y)
Subject   Body weight (kg)   Plasma volume (l)
1         58.0               2.75
2         70.0               2.86
3         74.0               3.37
4         63.5               2.76
5         62.0               2.62
6         70.5               3.49
7         71.0               3.05
8         66.0               3.12
Simple Linear Regression
y is the dependent variable
x is the independent variable
– Predictor
– Regressor
– Covariate
We try to predict y from x
Called “simple” because there is only one
independent variable x
If there are several independent variables,
it’s called multiple linear regression
Simple Linear Regression
[Figure: scatterplot of Y, plasma volume (liters), against body weight (kg) for the 8 subjects]
The Equation of a Line
b̂0 and b̂1 are called regression coefficients
These two quantities are estimated by the
“least squares” method
The intercept b̂0 is the estimated expected
value of y when x is 0
The slope b̂1 is the estimated expected
change in y corresponding to a unit increase
in x
The Equation of a Line
y = b̂0 + b̂1 x
[Figure: the line plotted as y against x, marking the intercept b̂0 at x = 0 and the slope b̂1]
The Slope, b̂1
The slope b1 is the expected change in y
corresponding to a unit increase in x
– b̂1 = 0: no association between y and x
– b̂1 > 0: positive association (as x increases, y tends to increase)
– b̂1 < 0: negative association (as x increases, y tends to decrease)
The Slope
[Figure: lines plotted as y against x with b̂1 > 0, b̂1 = 0, and b̂1 < 0]
Simple Linear Regression Model
Points don’t fall exactly on line, so to represent that
we add “ε”
– y = b̂0 + b̂1 x + ε
– Observed value = mean + “error”
ε is often called the “residual” or “error”
Model without “ε” sometimes written
ŷ = b̂0 + b̂1 x
where ŷ represents “predicted value” or point on
the line
Note:
εi = yi − ŷi
Simple Linear Regression Model
Assumptions about ε
– Random noise
– Sometimes positive, sometimes
negative but, on average, it’s 0
– Normally distributed about 0
Plasma Example
Suppose we used least squares regression
to estimate a line relating plasma volume to
body weight from 8 data points
Plasma volume = 0.0857 + .0436 * weight
y = 0.0857 + .0436 x
Estimate of the intercept = 0.0857
Estimate of the slope = 0.0436
– Interpretation: For each kilogram
increase in body weight, we expect
plasma volume to increase by .0436
liters
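The estimates on this slide can be reproduced from the eight data points in the example. The following is a minimal sketch in plain Python, assuming the usual closed-form least squares formulas (slope = Sxy / Sxx, intercept = mean(y) - slope * mean(x)), which the slides do not spell out. The function name `fit_line` is illustrative only.

```python
# Body weight (kg) and plasma volume (l) for the 8 subjects in the example.
weight = [58.0, 70.0, 74.0, 63.5, 62.0, 70.5, 71.0, 66.0]
plasma = [2.75, 2.86, 3.37, 2.76, 2.62, 3.49, 3.05, 3.12]


def fit_line(x, y):
    """Return (intercept, slope) of the least squares line for y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of squares of x and sum of cross-products, both about the means.
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


b0, b1 = fit_line(weight, plasma)
print(f"intercept = {b0:.4f}, slope = {b1:.4f}")
```

After rounding, the printed values should agree with the intercept and slope quoted on the slide.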
Plasma Example
Measuring plasma volume is very time
consuming
Body weight is easy to measure, so use the equation
and body weight to estimate plasma volume
– If we know an individual’s weight, we
can estimate his plasma volume
– Estimate the plasma volume for 60kg
man
– y = 0.0857 + .0436 x
– For this individual, x = 60
→ y = 0.0857 + .0436 * 60 = 2.7 liters
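The prediction step can be written as a small function. This is a sketch that hard-codes the intercept and slope quoted on the slide; the function name `predict_plasma_volume` is an assumption made for illustration.

```python
def predict_plasma_volume(weight_kg, intercept=0.0857, slope=0.0436):
    """Predicted plasma volume (liters) for a given body weight (kg)."""
    return intercept + slope * weight_kg


estimate = predict_plasma_volume(60)
print(f"{estimate:.1f} liters")
```

The observed weights in the example run from 58 to 74 kg, so a 60 kg individual lies within the range of the data used to fit the line.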
Estimated vs True Slope
Is the estimated slope (.0436) a good
estimate?
Is it close to the true population slope?
Standard error of estimated slope: Measures
precision of estimated slope
Standard error of estimated slope, b̂1 , is
influenced by sample size (n)
– The larger n, the smaller SE(b̂1 )
– Just like we’ve seen before, more
information means more precision!
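For the plasma data, the standard error of the slope can be computed directly. The sketch below assumes the standard textbook formula SE(b̂1) = sqrt((SSE / (n - 2)) / Sxx), which is not given on the slide; the variable names are illustrative.

```python
import math

# Body weight (kg) and plasma volume (l) for the 8 subjects in the example.
weight = [58.0, 70.0, 74.0, 63.5, 62.0, 70.5, 71.0, 66.0]
plasma = [2.75, 2.86, 3.37, 2.76, 2.62, 3.49, 3.05, 3.12]

n = len(weight)
mean_x = sum(weight) / n
mean_y = sum(plasma) / n
sxx = sum((x - mean_x) ** 2 for x in weight)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(weight, plasma))

# Least squares estimates of the slope and intercept.
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Residual sum of squares around the fitted line.
sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(weight, plasma))
# The residual variance uses n - 2 degrees of freedom (two estimated coefficients).
residual_variance = sse / (n - 2)
se_slope = math.sqrt(residual_variance / sxx)
print(f"SE(slope) = {se_slope:.4f}")
```

Sxx, the spread of the x values, sits in the denominator and generally grows as observations are added, which is one way to see why a larger n tends to give a smaller standard error.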
Precision of Estimated Slope
Plasma/Weight example
– 0.0436 ± 2.447 * .0153
– 0.0436 ± .0374
– (.0062, .0810)
With 95% confidence, we can say the
true value of b1 is between .0062 and
.0810 (i.e., a plausible range of values
for b1 is from .0062 to .0810)
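The interval arithmetic above can be checked in a few lines. This sketch takes the slope, its standard error, and the multiplier 2.447 directly from the slide; 2.447 is the 97.5th percentile of the t distribution with n - 2 = 6 degrees of freedom.

```python
slope = 0.0436      # estimated slope, from the slide
se_slope = 0.0153   # standard error of the slope, from the slide
t_crit = 2.447      # t multiplier for 95% confidence with 6 degrees of freedom

margin = t_crit * se_slope
lower, upper = slope - margin, slope + margin
print(f"{slope} +/- {margin:.4f} -> ({lower:.4f}, {upper:.4f})")
```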
Hypothesis Testing
H0 : b1 = 0 (slope is zero)
Ha : b1 ≠ 0
We need to calculate a test statistic based
on our sample and compare to appropriate
distribution to get a p-value
t = estimated_slope / SE(estimated_slope) = b̂1 / SE(b̂1) = 0.0436 / 0.0153 = 2.85
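To turn the test statistic into a p-value, it is compared with a t distribution with n - 2 = 6 degrees of freedom. The sketch below uses only the Python standard library and integrates the t density numerically with Simpson's rule. The helper names `t_pdf` and `two_sided_p` are illustrative, and a statistics package would normally be used instead.

```python
import math


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def two_sided_p(t_stat, df, steps=2000):
    """Two-sided p-value using Simpson's rule on [0, |t|]; steps must be even."""
    upper = abs(t_stat)
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_area = total * h / 3  # P(0 < T < |t|)
    return 1 - 2 * central_area


t_stat = 0.0436 / 0.0153  # estimated slope divided by its standard error
p_value = two_sided_p(t_stat, df=6)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
```

The p-value falls below 0.05, which is consistent with the 95% confidence interval for the slope excluding 0.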
The Correlation Coefficient, r
[Figure: four scatterplots of Y against X, with r = 0.01, r = 0.68, r = 0.98, and r = -0.9]
Correlation in the Plasma Example
[Figure: scatterplot of Y, plasma volume (liters), against body weight (kg); r = .76]
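The value r = .76 can be checked from the eight data points of the plasma example. This is a minimal sketch assuming the standard Pearson formula r = Sxy / sqrt(Sxx * Syy), which the slides do not state; `pearson_r` is an illustrative name.

```python
import math

# Body weight (kg) and plasma volume (l) for the 8 subjects in the example.
weight = [58.0, 70.0, 74.0, 63.5, 62.0, 70.5, 71.0, 66.0]
plasma = [2.75, 2.86, 3.37, 2.76, 2.62, 3.49, 3.05, 3.12]


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / math.sqrt(sxx * syy)


r = pearson_r(weight, plasma)
print(f"r = {r:.2f}")
```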
[Figure: scatterplot of Y against X showing a relationship between y and x that r may not catch]
Examples of the Correlation Coefficient
Outliers can really affect correlation coefficient
One extreme point can change r sizably
[Figure: scatterplot with one extreme point; r = .7]
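The effect of a single extreme point can be illustrated with the plasma data from earlier in the lecture. In the sketch below, the ninth point (95 kg, 2.0 l) is hypothetical and invented purely for illustration; it is not part of the lecture's data, and `pearson_r` is an illustrative helper.

```python
import math

# Body weight (kg) and plasma volume (l) for the 8 subjects in the example.
weight = [58.0, 70.0, 74.0, 63.5, 62.0, 70.5, 71.0, 66.0]
plasma = [2.75, 2.86, 3.37, 2.76, 2.62, 3.49, 3.05, 3.12]


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / math.sqrt(sxx * syy)


r_original = pearson_r(weight, plasma)

# HYPOTHETICAL outlier: a heavy subject with an unusually low plasma volume.
r_with_outlier = pearson_r(weight + [95.0], plasma + [2.0])

print(f"r without outlier = {r_original:.2f}")
print(f"r with outlier    = {r_with_outlier:.2f}")
```

With this made-up point, the sign of r flips from positive to negative even though the other eight observations are unchanged.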
The Moral
A picture is worth . . .
Always look at your scatterplot when
trying to interpret correlation coefficient
Slope (b1) vs. the Correlation Coefficient (r)
Both indicate direction of association (positive
or negative)
Slope b̂1 is the estimated expected change in
y per unit increase in x
Larger slopes do not necessarily mean a
stronger linear association
The correlation coefficient is scaled between
-1 and +1
r can be small, but you can still have a slope
b1 that is significantly different from 0
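One way to see that the size of the slope says nothing about the strength of the association is to change the units of x. The sketch below refits the plasma example with body weight in grams instead of kilograms; the slope shrinks by a factor of 1000 while r is unchanged. The helper `slope_and_r` is an illustrative name, and the formulas are the standard least squares and Pearson formulas, which the slides do not state.

```python
import math

# Body weight (kg) and plasma volume (l) for the 8 subjects in the example.
weight_kg = [58.0, 70.0, 74.0, 63.5, 62.0, 70.5, 71.0, 66.0]
plasma = [2.75, 2.86, 3.37, 2.76, 2.62, 3.49, 3.05, 3.12]


def slope_and_r(x, y):
    """Return (least squares slope, Pearson r) for y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx, sxy / math.sqrt(sxx * syy)


# Same subjects, with weight expressed in grams.
weight_g = [1000 * w for w in weight_kg]

slope_kg, r_kg = slope_and_r(weight_kg, plasma)
slope_g, r_g = slope_and_r(weight_g, plasma)

print(f"per kg: slope = {slope_kg:.4f}, r = {r_kg:.2f}")
print(f"per g:  slope = {slope_g:.7f}, r = {r_g:.2f}")
```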
Testing for an Association
R²: Coefficient of Determination
R² is just as it sounds (r²) but is
written with an uppercase R.
Fraction of observed variability in y
that can be explained by x
R² is a number between 0 and 1
When there is a straight line relation,
some of the variation in y is accounted
for by the fact that as x changes, it
pulls y along with it
Plasma/Weight Example
r = .76
R² = .76² = .58
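The "fraction of variability explained" reading of R² can be verified on the plasma data. This sketch computes R² as 1 - SSE / SST, where SST is the total variation of y about its mean and SSE is the variation left around the fitted line. That decomposition is the standard one and is assumed here rather than taken from the slides.

```python
# Body weight (kg) and plasma volume (l) for the 8 subjects in the example.
weight = [58.0, 70.0, 74.0, 63.5, 62.0, 70.5, 71.0, 66.0]
plasma = [2.75, 2.86, 3.37, 2.76, 2.62, 3.49, 3.05, 3.12]

n = len(weight)
mean_x = sum(weight) / n
mean_y = sum(plasma) / n
sxx = sum((x - mean_x) ** 2 for x in weight)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(weight, plasma))

# Least squares line.
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Total variation in y, and variation left over after fitting the line.
sst = sum((y - mean_y) ** 2 for y in plasma)
sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(weight, plasma))

r_squared = 1 - sse / sst
print(f"R^2 = {r_squared:.2f}")
```

In simple linear regression this quantity equals the square of the correlation coefficient, so it should match .76² up to rounding.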
Notes and Caveats on Correlation Coefficient
Measures only linear relationships
Other kinds of relationships are also
important
Look at and graph the data
Sensitive to outliers
Useful when x values are measured, but not
controlled by the experimental design
That is, x and y are random
What about ‘confounders’?
Multiple Linear Regression
Multiple linear regression (MLR) is a method
to estimate:
– an equation to predict a value of the
dependent variable y from multiple
independent variables
(the x’s)
– the size and statistical significance of
the relationship between a dependent
variable, y, and one predictor, x1, after
adjusting for confounders (x2 . . .)
Hemoglobin (Hb) and Packed Cell
Volume (PCV)
Suppose we have collected data on 21
women and want to investigate the
relationship between Hb levels and
PCV
Previous research suggests age is
related to both PCV and Hb level
In order to correctly estimate the
Hb/PCV relationship, we would need to
control for age
Hemoglobin (Hb) and Packed Cell
Volume (PCV)
One solution:
– Stratify women by age and perform
separate simple regressions of Hb on
PCV
Potential problems:
– We only have 21 women in our sample:
Strata could be quite small
– Estimates unstable [large SE (slopes)]
– This would be a pain
Hemoglobin (Hb) and Packed
Cell Volume (PCV)
Better solution:
– Perform multiple linear regression with Hb level
as outcome; PCV and age as predictors
– Here we would need a multiple linear
regression equation with two predictors
– Just keep adding x’s to the simple linear
regression equation
Multiple Regression Equation
MLR with 2 predictors
Hemoglobin vs. PCV(%)
[Figure: scatterplot of hemoglobin (g/dl) against PCV (%) with fitted line; b̂0 = 5.78, b̂1 = .203]
Interpretation: Hemoglobin increases by
.203 gm/dl for every one unit increase in PCV
Hemoglobin and PCV vs. Age(yrs)
[Figure: two scatterplots, hemoglobin (g/dl) against age (years) and PCV (%) against age (years)]
Hemoglobin and PCV
Main questions:
– Is the relationship between hemoglobin
and PCV only because both are related
to age?
– Can we control for age as a potential
confounder?
How does MLR estimate size of relationship
between Hb and PCV while controlling for
age?
– ‘statistical adjustment’
MLR results
y = b̂0 + b̂1 x1 + b̂2 x2
b̂0 = 5.52
b̂1 = 0.102
b̂2 = 0.101
Hemoglobin and PCV
The resulting equation:
– y = 5.5 + .102*PCV + .101*Age
– (It is a coincidence that coefficients for age and
PCV are nearly equal)
Interpretation
– For a given age, hemoglobin increases by .102
gm/dl for every one percent increase in PCV
– Recall, the coefficient of PCV from the simple
linear regression was .203
Conclusion
– Hemoglobin is positively associated with PCV
after accounting for age (p = .004)
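The "for a given age" interpretation can be made concrete by evaluating the fitted equation twice with age held fixed. The sketch below uses the coefficients reported in the MLR results; the input values (PCV of 40% and 41%, age 50) are hypothetical and chosen only for illustration, as is the function name `predict_hb`.

```python
# Coefficients from the MLR results slide.
B0 = 5.52      # intercept
B_PCV = 0.102  # coefficient for PCV (%)
B_AGE = 0.101  # coefficient for age (years)


def predict_hb(pcv_percent, age_years):
    """Predicted hemoglobin (g/dl) from PCV (%) and age (years)."""
    return B0 + B_PCV * pcv_percent + B_AGE * age_years


# HYPOTHETICAL inputs: two women of the same age whose PCV differs by one percent.
base = predict_hb(40, 50)
one_more_pcv = predict_hb(41, 50)

print(f"predicted Hb at PCV 40%, age 50: {base:.2f}")
print(f"change for +1% PCV at the same age: {one_more_pcv - base:.3f}")
```

Because age is the same in both calls, the difference between the two predictions is exactly the PCV coefficient, whatever age is chosen.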
Hemoglobin and Age
Interpretation
– For a given PCV level, hemoglobin
increases by .101 gm/dl for every one
year increase in age
Conclusion
– Hemoglobin is positively associated with
age after accounting for PCV (p < .001)
Regression Coefficients
General interpretation of regression
coefficients from equation:
Interpreting Coefficients
y = b̂0 + b̂1 x1 + b̂2 x2
– y = hemoglobin level
– x1 = age
– x2 = “menopausal status”
– b̂1 = expected change in hemoglobin
for a one year increase in age, after
adjusting for menopause
How can we interpret b̂2?
Interpreting Coefficients
Write out the equation for both “types”
of women with same age (say 50)
– Post: ŷ = b̂0 + b̂1(50) + b̂2
Interpretation
– After controlling for age, we would
expect post-menopausal women to
have hemoglobin level 1.9 units higher
than pre-menopausal women
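The interpretation above can be sketched by comparing predictions for two women of the same age. This example assumes menopausal status is coded 0 for pre-menopausal and 1 for post-menopausal, and that b̂2 = 1.9, as implied by the interpretation. The intercept and age coefficient below are hypothetical placeholders, since their values for this model are not given in the text.

```python
B2 = 1.9   # coefficient for menopausal status, implied by the interpretation

# HYPOTHETICAL placeholder values; the text does not report them for this model.
B0 = 10.0  # intercept (placeholder)
B1 = 0.05  # coefficient for age (placeholder)


def predict_hb(age_years, post_menopausal):
    """Predicted hemoglobin with status coded 0 (pre) or 1 (post); coding assumed."""
    x2 = 1 if post_menopausal else 0
    return B0 + B1 * age_years + B2 * x2


post = predict_hb(50, post_menopausal=True)
pre = predict_hb(50, post_menopausal=False)
print(f"post minus pre at age 50: {post - pre:.1f}")
```

The intercept and age terms cancel in the difference, so the gap between the two groups at any fixed age is b̂2 regardless of the placeholder values.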
Statistical Inference
95% confidence intervals can be calculated
for coefficients in MLR