STAT22209 - Chapter 03-Multiple Regression - 2022


Advanced Statistics II

(PST22209 / FST22209 / ESNRM22209)

R.M. KAPILA RATHNAYAKA


B.Sc. Special (Math. & Stat.) (Ruhuna), M.Sc. (Industrial Mathematics) (USJ),
M.Sc. (Stat.) (WHUT, China),
Ph.D. (Applied Statistics, WHUT)
Why do we need an alternative to linear regression?
Polynomial Regression
• In situations where the functional relationship between the response Y and the independent variable x cannot be adequately approximated by a linear relationship, it is sometimes possible to obtain a reasonable fit by considering a polynomial relationship

  Y = β0 + β1x + β2x² + … + βh x^h + ε

• where β0, β1, …, βh are regression coefficients that would have to be estimated.

• h is called the degree of the polynomial.


• To determine these estimators, we take partial derivatives of the sum of squares with respect to β0, β1, …, βh, and then set these equal to 0 so as to determine the minimizing values.

• On doing so, and then rearranging the resulting equations, we obtain that the least-squares estimators satisfy the following set of linear equations, called the normal equations:

  Σyi        = nβ̂0      + β̂1 Σxi        + … + β̂h Σxi^h
  Σxi yi     = β̂0 Σxi   + β̂1 Σxi²       + … + β̂h Σxi^(h+1)
  ⋮
  Σxi^h yi   = β̂0 Σxi^h + β̂1 Σxi^(h+1)  + … + β̂h Σxi^(2h)
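The normal equations above can be assembled and solved numerically. A minimal sketch in Python/numpy, with made-up data and degree h = 2 chosen purely for illustration:

```python
import numpy as np

# Sketch: solving the polynomial normal equations directly.
# The data points and the degree h are assumptions made for illustration.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.0, 0.5, 2.0, 5.5, 11.0, 20.5])
h = 2  # degree of the polynomial

# Design matrix with columns 1, x, x^2, ..., x^h
X = np.vander(x, h + 1, increasing=True)

# Normal equations: (X^T X) beta = X^T y
beta = np.linalg.solve(X.T @ X, X.T @ y)
print(beta)  # estimated coefficients beta0, beta1, ..., beta_h
```

The same coefficients are returned (in reverse order) by `np.polyfit(x, y, h)`, which solves the equivalent least-squares problem.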
Degree of the polynomial

• where h is called the degree of the polynomial. For lower degrees, the relationship has a specific name:
• h = 2 is called quadratic
• h = 3 is called cubic,
• h = 4 is called quartic, and so on.
Second-degree Polynomial – Quadratic Trend
• Practically, most real-world data patterns are best described by curves, not straight lines. In these instances, the linear trend model does not adequately describe the change in the variable as time changes.

• To overcome this problem, we often use a parabolic curve, which is described mathematically by a second-degree equation.

• The general form of an estimated second-degree equation is

  ŷ = a + bx + cx²

• where
  – ŷ = estimate of the dependent variable,
  – x = the independent (time) variable,
  – a, b, c = numerical constants.

• We can determine the values of the numerical constants from the following three equations:

  ΣY   = na    + bΣx   + cΣx²
  ΣxY  = aΣx   + bΣx²  + cΣx³
  Σx²Y = aΣx²  + bΣx³  + cΣx⁴
Second-degree Polynomial – Quadratic Trend: Application
• Fit a second-degree polynomial to the following data.

  X   0   1   2   3   4
  Y   0   0   2   6   12

• We can determine the values of the numerical constants a, b and c from the three normal equations.
Example
• Substituting the sums

  n = 5, Σx = 10, Σx² = 30, Σx³ = 100, Σx⁴ = 354, ΣY = 20, ΣxY = 70, Σx²Y = 254

  into the three normal equations gives

  20  = 5a  + 10b + 30c
  70  = 10a + 30b + 100c
  254 = 30a + 100b + 354c

• Solving, a = 0, b = −1 and c = 1, so the estimated quadratic regression equation is

  ŷ = x² − x

Matrix notation to solve the equation system
• The normal equations can be written as a single matrix equation Aβ = g, with

  A = [[5, 10, 30], [10, 30, 100], [30, 100, 354]],   g = (20, 70, 254)ᵀ,

  which has the solution (a, b, c)ᵀ = A⁻¹g = (0, −1, 1)ᵀ.


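As a check on this worked example, the same fit can be reproduced with numpy's polyfit on the five data points given above:

```python
import numpy as np

# Checking the worked example: fit a quadratic to the five data points.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.0, 0.0, 2.0, 6.0, 12.0])

# polyfit returns coefficients highest degree first: [c, b, a]
c, b, a = np.polyfit(x, y, 2)
print(a, b, c)  # → approximately 0, -1, 1, i.e. y-hat = x^2 - x
```

Because ŷ = x² − x passes through all five points exactly, the least-squares fit recovers it with zero residual error.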
Example 2
• You are studying the relationship between a particular
machine setting and the amount of energy consumed.
• A log transformation of the response variable will produce a more symmetric error distribution.
Multiple Linear Regression
• Multiple regression is an extension of simple linear regression.

• It is used when we want to predict the value of a variable based on the values of two or more other variables.

• Suppose that we have a linear model

  Y = β0 + β1x1 + β2x2 + … + βkxk + ε
Example
• You could use multiple regression to understand whether
exam performance can be predicted based on
– revision time,
– test anxiety,
– lecture attendance
– gender.
• Alternatively, you could use multiple regression to understand
whether daily cigarette consumption can be predicted based
on
– smoking duration,
– age when started smoking,
– smoker type,
– income
– gender.
Assumption #1:
• Your dependent variable should be measured on a continuous scale (i.e., it is either an interval or ratio variable).

• Example:
– revision time (measured in hours),
– intelligence (measured using IQ score),
– exam performance (measured from 0 to 100),
– weight (measured in kg)
Assumption #2:
• You should have two or more independent variables, which can be either continuous (i.e., an interval or ratio variable) or categorical (i.e., an ordinal or nominal variable).

• Examples of nominal variables include:


– gender (male and female),

– ethnicity (Caucasian, African American and Hispanic),

– physical activity level (sedentary, low, moderate and high),

– profession (surgeon, doctor, nurse, dentist, therapist),


Numerical Data (Data that is Numbers):
Continuous Random Variables
• A continuous variable is a variable whose value is obtained by measuring.
• Examples:
  – height of students in class
  – weight of students in class
  – time it takes to get to school
  – distance traveled between classes
Numerical Data (Data that is Numbers) :
Discrete Random Variables
• A discrete variable is a variable whose value is obtained by
counting.

• All continuous variables are numeric, but not all numeric


variables are continuous.

• Examples:
– number of students present
– number of red marbles in a jar
– number of heads when flipping three coins
– students’ grade level
Categorical Data (Data that is not
numbers) : Nominal Variable
• Sometimes there is no hierarchy in categorical data.
• If eye colour was coded
– 0-- “Blue”
– 1 --“Green”
– 2 --“Brown”

we have to randomly choose which option gets which


number.
• It doesn’t matter whether blue is coded as zero, one, or two, because there is no hierarchy in eye colour.
Categorical Data (Data that is not
numbers) : Ordinal Variable
• Annoying surveys often ask you to answer with the options
“Strongly Disagree”, “Disagree”, “Neutral”, “Agree” or
“Strongly agree”.
• This data has a special structure: the categories have a natural order, so they can be coded from 0 ("Strongly Disagree") to 4 ("Strongly agree"):
– 0 = Strongly Disagree
– 1 = Disagree
– 2 = Neutral
– 3 = Agree
– 4 = Strongly agree
Assumption #3:
• Your data needs to show homoscedasticity, which is where the
variances along the line of best fit remain similar as you move
along the line.

Assumption #4:
• Data must not show multicollinearity, which occurs when
you have two or more independent variables that are highly
correlated with each other.
What is Multicollinearity?
Consider the following data on 20 individuals with high blood pressure:
1. blood pressure (y = BP, in mm Hg)

2. age (x1 = Age, in years)

3. weight (x2 = Weight, in kg)

4. body surface area (x3 = BSA, in sq m)

5. duration of hypertension (x4 = Dur, in years)

6. basal pulse (x5 = Pulse, in beats per minute)

7. stress index (x6 = Stress)


          BP      Age     Weight  BSA     Dur     Pulse
  Age     0.659
  Weight  0.950   0.407
  BSA     0.866   0.378   0.875
  Dur     0.293   0.344   0.201   0.131
  Pulse   0.721   0.619   0.659   0.465   0.402
  Stress  0.164   0.368   0.034   0.018   0.312   0.506

• Cell Contents: Pearson correlation


• Blood pressure appears to be related fairly strongly to Weight (r = 0.950)
and BSA (r = 0.866), and hardly related at all to Stress level (r = 0.164).
• Weight and BSA appear to be strongly related (r = 0.875)

• The high correlation among some of the predictors suggests that data-
based multicollinearity exists.
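Multicollinearity like the Weight–BSA case above is often quantified with the variance inflation factor, VIF_j = 1 / (1 − R_j²), where R_j² comes from regressing predictor j on the remaining predictors. A sketch on simulated data (the blood-pressure data itself is not reproduced here):

```python
import numpy as np

# Illustration with invented data: x2 is nearly collinear with x1,
# x3 is an unrelated predictor.
rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)   # nearly collinear with x1
x3 = rng.normal(size=n)               # unrelated predictor
X = np.column_stack([x1, x2, x3])

def vif(X, j):
    """Variance inflation factor of column j of X."""
    y = X[:, j]
    Z = np.delete(X, j, axis=1)
    Z = np.column_stack([np.ones(len(Z)), Z])   # add intercept
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ beta
    r2 = 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))
    return 1 / (1 - r2)

print([round(vif(X, j), 1) for j in range(3)])  # VIFs for x1, x2 are large
```

A common rule of thumb flags VIF values above 10 as evidence of serious multicollinearity; here x1 and x2 are flagged while x3 is not.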
Assumption #5:
• There should be no
  – significant outliers,
  – high leverage points, or
  – highly influential points.
• These different classifications of unusual points reflect the different impact they have on the regression line.
What are outliers in the data?
• An outlier is an observation that lies an abnormal distance from other values
in a random sample from a population.

• The box plot is a useful graphical display for describing the behavior of the
data in the middle as well as at the ends of the distributions.

• The following quantities (called fences) are needed for identifying extreme values in the tails of the distribution (IQR = Q3 − Q1):
  – lower inner fence: Q1 − 1.5·IQR
  – upper inner fence: Q3 + 1.5·IQR
  – lower outer fence: Q1 − 3·IQR
  – upper outer fence: Q3 + 3·IQR

• A point beyond an inner fence on either side is considered a mild outlier. A point beyond an outer fence is considered an extreme outlier.
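The fence calculation can be sketched directly; the sample below is an invented illustration:

```python
import numpy as np

# Sketch: computing the inner and outer fences for outlier screening.
# The data sample is an assumption made for illustration.
data = np.array([2, 3, 4, 5, 5, 6, 7, 8, 9, 30])

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower_inner, upper_inner = q1 - 1.5 * iqr, q3 + 1.5 * iqr
lower_outer, upper_outer = q1 - 3.0 * iqr, q3 + 3.0 * iqr

# Points beyond an inner fence are flagged as outliers
outliers = data[(data < lower_inner) | (data > upper_inner)]
print(outliers)  # → [30]
```

In this sample the value 30 lies beyond the upper inner fence (and, in fact, beyond the outer fence as well, making it an extreme outlier).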
Assumption #6:
• You should have independence of observations (i.e., independence of residuals), which you can easily check using the Durbin–Watson statistic.

Assumption #7:
• There needs to be a linear relationship between
  – the dependent variable and each of your independent variables.

Assumption #8:
• Finally, you need to check that the residuals (errors) are approximately normally distributed.

• Two common methods to check this assumption include using:


– histogram (with a superimposed normal curve) and a
Normal P-P Plot;
– Normal Q-Q Plot of the studentized residuals.
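The Durbin–Watson check mentioned for the independence assumption can also be computed directly as DW = Σ(e_t − e_{t−1})² / Σe_t², with values near 2 suggesting no first-order autocorrelation. A sketch on simulated residuals (the two residual series are assumptions for illustration):

```python
import numpy as np

# Sketch: Durbin-Watson statistic on simulated residuals.
rng = np.random.default_rng(0)
e_indep = rng.normal(size=500)                   # independent residuals
e_auto = np.cumsum(rng.normal(size=500)) / 10.0  # strongly autocorrelated

def durbin_watson(e):
    """DW = sum of squared successive differences over sum of squares."""
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

print(durbin_watson(e_indep))  # close to 2
print(durbin_watson(e_auto))   # well below 2, signalling autocorrelation
```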
Multiple Linear Regression
• Suppose that we have the linear model

  Y = β0 + β1x1 + β2x2 + … + βkxk + ε,

  and we make n independent observations y1, y2, …, yn on Y.

• We can write the i-th observation as

  yi = β0 + β1xi1 + β2xi2 + … + βkxik + εi,   i = 1, 2, …, n,

• where xij is the setting of the j-th independent variable for the i-th observation, j = 1, 2, …, k.

• We now define the following matrices, with xi0 = 1:

  Y = (y1, y2, …, yn)ᵀ (n × 1),
  X = the n × (k + 1) matrix whose i-th row is (1, xi1, xi2, …, xik),
  β = (β0, β1, …, βk)ᵀ,
  ε = (ε1, ε2, …, εn)ᵀ.

• Thus, the n equations representing yi as a function of the x’s, β’s, and ε’s can be simultaneously written as

  Y = Xβ + ε.
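The matrix form Y = Xβ + ε leads directly to the least-squares solution β̂ = (XᵀX)⁻¹XᵀY. A sketch on simulated data (the coefficients and noise level are assumptions for illustration):

```python
import numpy as np

# Sketch: OLS in matrix form, beta-hat = (X^T X)^{-1} X^T Y.
# The true coefficients and data are simulated assumptions.
rng = np.random.default_rng(42)
n, k = 50, 2
x = rng.normal(size=(n, k))
X = np.column_stack([np.ones(n), x])        # first column of 1s for beta0
beta_true = np.array([1.0, 2.0, -3.0])
Y = X @ beta_true + 0.1 * rng.normal(size=n)

# Solve the normal equations rather than forming an explicit inverse
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
print(beta_hat)  # close to [1, 2, -3]
```

Solving the normal equations with `np.linalg.solve` (or `np.linalg.lstsq`) is numerically preferable to computing (XᵀX)⁻¹ explicitly.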
Regression with Two Independent Variables
• For n observations from a simple linear regression model of the form

  yi = β0 + β1xi + εi,   i = 1, 2, …, n,

• the least-squares equations for β̂0 and β̂1 were given in the previous section as

  Σyi    = nβ̂0 + β̂1 Σxi
  Σxi yi = β̂0 Σxi + β̂1 Σxi²
Regression with Two Independent Variables
• Assume the production-function model below,

  Q = β0 + β1L + β2K + ε,

• where Q is total production, L is labor input and K is total capital. The information about each factor is given below for the 15-year period from 2001 to 2015.

  Year  1   2   3   4   5   6   7   8   9   10   11   12   13   14   15
  Q     20  35  30  47  60  68  76  90  100 105  130  140  125  120  135
  L     10  15  21  26  40  37  42  33  30  38   60   65   50   35   42
  K     12  10  9   8   5   7   4   5   7   5    3    4    3    1    2

• By using the above data, estimate the β0, β1 and β2 parameters of the model by using the ordinary least squares (OLS) method.
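A sketch of the OLS estimation for this exercise (the row labels Q, L and K are assumed from the order in which the variables are introduced on the slide):

```python
import numpy as np

# OLS estimation of Q = b0 + b1*L + b2*K with the data from the slide.
Q = np.array([20, 35, 30, 47, 60, 68, 76, 90, 100, 105, 130, 140, 125, 120, 135.0])
L = np.array([10, 15, 21, 26, 40, 37, 42, 33, 30, 38, 60, 65, 50, 35, 42.0])
K = np.array([12, 10, 9, 8, 5, 7, 4, 5, 7, 5, 3, 4, 3, 1, 2.0])

# Design matrix with an intercept column
X = np.column_stack([np.ones(len(Q)), L, K])
beta, *_ = np.linalg.lstsq(X, Q, rcond=None)
print(beta)  # [b0-hat, b1-hat, b2-hat]
```

As a sanity check, with an intercept in the model the fitted values have the same mean as the observed Q (the residuals sum to zero).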
