Lecture 09_02.09.2024_Regression-01


Regression - 1

Prof. Sashikumaar Ganesan, IISc Bangalore


“ How can we predict the value
of a continuous variable in a
supervised setting?
Linear Regression

Introduction to linear regression

Assumptions in linear regression

Significance Testing

Residual Analysis
What is “Linear”?

• Remember this: Y = mX + B?
• A slope of two implies that every one-unit change in X yields a two-unit change in Y.
Assumptions

• The relationship between X and Y is linear


• Y is distributed normally at each value of X
• The variance of Y at every value of X is the same (homogeneity of variances)
• The observations are independent.
Linear Regression: Prediction

• If you know something about X, this knowledge helps you predict something about
Y. (Sound familiar?…sound like conditional probabilities?)
• The expected value of y for a given x lies exactly on the regression line; an individual observation differs from it by a random error with a normal distribution.
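The equation itself is not in the extracted text; in the usual simple-regression notation (the symbols β0, β1, ε, σ are assumptions here, not taken from the slide) it reads:

E(y \mid x) = \beta_0 + \beta_1 x, \qquad y = \beta_0 + \beta_1 x + \varepsilon, \quad \varepsilon \sim N(0, \sigma^2)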
Significance Testing

• How do we assess the nature of fit?


• The t-test helps to determine if this linear relationship is statistically significant.

• We can calculate the t-statistic using (s.e. denotes standard error)
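The formula is missing from the extract; the usual t-statistic for testing H0: β1 = 0 (notation assumed, not taken from the slide) is:

t = \frac{\hat{\beta}_1}{\mathrm{s.e.}(\hat{\beta}_1)}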


Significance Testing

• Formula for the standard error of beta. We use n-2 degrees of freedom since two are lost in estimating the slope and the y-intercept.
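A standard form of this expression for simple linear regression, assuming this is what the slide shows, is:

\mathrm{s.e.}(\hat{\beta}_1) = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 / (n-2)}{\sum_{i=1}^{n} (x_i - \bar{x})^2}}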

• Using the t-value statistic we can then calculate the p-value.


• P-values are used in hypothesis testing to help decide whether to reject the null hypothesis. The smaller the p-value, the more likely you are to reject the null hypothesis. We can use standard tables that relate p-values to the t-statistic.
P-Value Interpretation
Confidence Intervals

• Have a look at the following example:
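The worked example on the slide is not recoverable from the extract; for reference, the general form of a (1 - α) confidence interval for the slope is:

\hat{\beta}_1 \pm t_{\alpha/2,\, n-2} \cdot \mathrm{s.e.}(\hat{\beta}_1)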


Residual Analysis

• Residual: the residual for observation i, e_i, is the difference between its observed and its predicted value (the expression and a plotting sketch follow this list)
• Check the assumptions of regression by examining the residuals
• Examine for linearity assumption
• Examine for constant variance for all levels of X (homoscedasticity)
• Evaluate normal distribution assumption
• Evaluate independence assumption
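As referenced in the list above, the residual is e_i = y_i - \hat{y}_i. Below is a minimal sketch (an illustration, not taken from the slides; the synthetic data and variable names are made up) of how the residual-versus-fitted plot used in the following slides can be produced:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 1.0 + 2.0 * x + rng.normal(scale=1.0, size=200)   # synthetic linear data

slope, intercept = np.polyfit(x, y, deg=1)            # simple linear fit
y_hat = intercept + slope * x
residuals = y - y_hat                                 # e_i = y_i - y_hat_i

# A patternless cloud of constant spread around zero supports the linearity
# and homoscedasticity assumptions; curvature or funnel shapes suggest violations.
plt.scatter(y_hat, residuals, s=10)
plt.axhline(0.0, color="red")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()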
Analysis For Linearity
Analysis For Homoscedasticity
Analysis For Independence
Overall Picture
Summary
• Modelling the relation between a scalar response (target/dependent variable) and one or
more explanatory (features/independent) variables
• The case of one explanatory variable is called simple linear regression

• It is important to assess the quality of the fit using the statistical tools mentioned above
“ How can we solve multiple linear regression (more than one feature)?
Normal Equation

Set-up of our problem

Derivation of normal equation

Computational Aspects
Normal Equation

• Suppose we have n observations (x_i, y_i), i = 1, 2, …, n, where x_i ∈ ℝ^m, y_i ∈ ℝ
• In function notation, y_i = f(x_i), where f: ℝ^m → ℝ
Normal Equation

Linear Regression:

y_i = f(x_i) = b_0 + b_1 x_{i1} + b_2 x_{i2} + \cdots + b_m x_{im} + \epsilon_i = b_0 \cdot 1 + \sum_{j=1}^{m} x_{ij} b_j + \epsilon_i
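Stacking all n observations gives the matrix form of the model (the samples-as-rows convention for X is an assumption here; the slides may place samples as columns, which swaps X and X^T):

y = X b + \epsilon, \qquad X = \begin{pmatrix} 1 & x_{11} & \cdots & x_{1m} \\ \vdots & \vdots & & \vdots \\ 1 & x_{n1} & \cdots & x_{nm} \end{pmatrix}, \qquad b = (b_0, b_1, \ldots, b_m)^T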
Derivation

Does the inverse exist?


Derivation

• Find the optimal coefficients that best fit the model, instead of solving the equation exactly:
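The equations on this slide are missing from the extract; with the samples-as-rows convention assumed above, minimizing the squared error gives the normal equation:

\min_b \|y - Xb\|^2, \qquad \nabla_b \|y - Xb\|^2 = -2 X^T (y - Xb) = 0 \;\Rightarrow\; X^T X\, b = X^T y \;\Rightarrow\; b = (X^T X)^{-1} X^T y

(With the transposed convention of samples as columns of X, the Gram matrix appears as XX^T, matching the question on the next slide.)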
Derivation

• Does the inverse of XX^T exist?

• Can we use SVD or any other direct method to solve the normal equations?
Computational Complexity

• Recall the order of computational complexity


• The computational complexity of inverting a matrix is typically of order between 2.5 and 3 (in the matrix dimension)
• The order is about 2 when an SVD-based solver is used
• Alternatively, iterative methods are preferred when the size of the matrix is large
• Irrespective of the size of the matrix, the prediction cost is very small compared to the training cost
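As an illustration (not from the slides; the synthetic data and coefficient values are made up), a minimal NumPy sketch comparing a direct normal-equation solve with an SVD-based least-squares solve:

import numpy as np

rng = np.random.default_rng(0)
n, m = 1000, 3                         # n samples, m features
X = rng.normal(size=(n, m))
y = 4.0 + X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=n)

Xb = np.c_[np.ones(n), X]              # design matrix with a bias column

# Normal equation: solve (X^T X) b = X^T y; cost dominated by the (m+1) x (m+1) solve
b_normal = np.linalg.solve(Xb.T @ Xb, Xb.T @ y)

# SVD-based least squares, the kind of direct method library solvers typically use
b_svd, *_ = np.linalg.lstsq(Xb, y, rcond=None)

print(b_normal)                        # both should be close to [4, 2, -1, 0.5]
print(b_svd)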
Summary

• We can derive the normal equation by considering many features and many
training examples.
• Solving the normal equation is computationally expensive due to the calculation
of the inverse.
• Hence iterative methods are preferred to prevent large computation
overheads.

“ How do we perform linear regression
when the training data is large?
Gradient Descent

Intuition

Optimizing the cost function with GD

Choosing hyperparameters
Intuition

• How can we reach a minimum point of a function from an arbitrary initial guess?
Intuition

• Can we always reach the global minimum of a function?


Gradient Descent

• A generic optimization algorithm


• The idea is to tweak parameters iteratively to minimize a cost function
• We need the derivative of the cost function
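The update rule referred to here, in standard notation (η denotes the step size or learning rate and J the cost function; the slide's own symbols may differ):

\theta \leftarrow \theta - \eta \, \nabla_\theta J(\theta)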
Hyperparameter Selection

How should we choose the step size?


Hyperparameter Selection
Types of Gradient Descent

Batch Gradient Descent

Stochastic Gradient Descent

Mini-batch Gradient Descent


Batch Gradient Descent

• Recall the derivative calculation in the Gradient Descent

• Instead of calculating these derivatives individually, can we compute them in one go?
• This results in Batch Gradient Descent (BGD)
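For the MSE cost of linear regression, this single vectorized expression is the full (batch) gradient; the notation follows the design-matrix convention assumed earlier:

\nabla_b \, \mathrm{MSE}(b) = \frac{2}{n} X^T (X b - y)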
Stochastic Gradient Descent

• What happens if the data doesn’t fit in RAM?


• Also recall the memory bound!
• Can we apply the gradient update for a single random data point “p”, instead of all the data points?
• Let “p” be a random integer in {1, …, n}; then
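The resulting per-sample update, in the notation assumed earlier (with x_p including the bias entry), is:

b \leftarrow b - \eta \cdot 2 \, (x_p^T b - y_p) \, x_p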
Mini-batch Gradient Descent

• BGD - compute the full gradient


• SGD - compute the gradient one sample at a time
• Can we combine both?
• This results in Mini-batch Gradient Descent
• Computes the gradient on a small subset (mini-batch) of data points
• Cache optimization can be exploited
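A minimal sketch (an illustration, not from the slides; the learning rate, batch size, and epoch count are arbitrary choices) of mini-batch gradient descent for linear regression; setting batch_size = 1 or batch_size = n recovers SGD or BGD respectively:

import numpy as np

def minibatch_gd(X, y, lr=0.05, epochs=50, batch_size=32, seed=0):
    """Fit y ~ X b with an MSE loss using mini-batch gradient descent."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    Xb = np.c_[np.ones(n), X]                 # add a bias column
    b = np.zeros(Xb.shape[1])                 # initial guess for the coefficients
    for _ in range(epochs):
        idx = rng.permutation(n)              # reshuffle the samples each epoch
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            Xm, ym = Xb[batch], y[batch]
            # gradient of the MSE on the mini-batch: (2/|B|) * X^T (X b - y)
            grad = 2.0 / len(batch) * Xm.T @ (Xm @ b - ym)
            b -= lr * grad                    # gradient descent step
    return b

# Tiny usage example on synthetic data
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 2))
y = 3.0 + X @ np.array([1.5, -2.0]) + 0.1 * rng.normal(size=500)
print(minibatch_gd(X, y))                     # roughly [3, 1.5, -2]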
Summary

• Iterative algorithms like gradient descent, where we try to optimize a “cost function / loss function,” are used when the problem is high-dimensional.
• Choosing the right step size can help gradient descent converge to a minimum faster.
• Various techniques like SGD and BGD are used to calculate the gradient due to RAM limitations (memory bound) and computational limitations (compute bound).
“ How can we assess the
goodness of the fit?
Model Validation

Mean Squared Error

Mean Absolute Error

R2
Mean Squared Error

• If yi is the target value and fi is the predicted value for the linear
regression fit on n samples, then the mean squared error for the fit
is given by -
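In the slide's notation, the standard expression is:

\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - f_i)^2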

• A higher value of mean squared error indicates a worse fit.


Mean Absolute Error (MAE)

• If yi is the target value and fi is the predicted value for the linear
regression fit on n samples, then the mean absolute error for the fit
is given by -
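Again in the slide's notation, the standard expression is:

\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - f_i|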

• A higher value of mean absolute error indicates a worse fit.


R2 (Coefficient of Determination)

• If yi is the target value, then the mean target value for n samples is

• The residual sum of squares for the fit is proportional to the MSE

• The total sum of squares is proportional to the variance inherent in the data
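Written out (standard definitions; the slide's exact symbols are assumed), these quantities and the resulting R2 are:

\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i, \qquad SS_{res} = \sum_{i=1}^{n} (y_i - f_i)^2 = n \cdot \mathrm{MSE}, \qquad SS_{tot} = \sum_{i=1}^{n} (y_i - \bar{y})^2, \qquad R^2 = 1 - \frac{SS_{res}}{SS_{tot}}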
R2 (Coefficient of Determination)

• The coefficient of determination, also called R2, is a number between 0 and 1 that describes how much of the total variance in the data is captured by the linear regression fit

• What does R2 = 1 imply?

• What does R2 = 0 imply? Which one would you say is a better fit?
“ How can we fit non-linear data?
Non-Linear Regression

Polynomial Regression

How to fit a quadratic regression

Higher-order polynomial regression


Polynomial Regression

• How to fit nonlinear data, that is, when the relation between the
target value and the features is nonlinear?
Quadratic Regression

• Let us consider a simple linear regression

• How about

• Is it still linear?
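In the notation assumed earlier, the two models being contrasted on this slide are:

y = b_0 + b_1 x + \epsilon \qquad \text{versus} \qquad y = b_0 + b_1 x + b_2 x^2 + \epsilon

The second model is still linear in the coefficients b_0, b_1, b_2: treating x^2 as an additional feature reduces it to ordinary linear regression.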
Quadratic Regression

• How about sample data with two features? How will you write the
target y as a function of features x1 and x2 for a quadratic regression
problem?
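One way to write the full quadratic model in two features (the coefficient names are illustrative):

y = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_1^2 + b_4 x_2^2 + b_5 x_1 x_2 + \epsilon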
Higher-Order Polynomial Regression

• Can we increase the polynomial order arbitrarily? (Dimensionality)
o Suppose there are “n” features and the polynomial degree is “d”
o Then the total number of unknown parameters to be fitted in the polynomial regression is (n + d)! / (d! n!); a worked count follows this list
o This makes the computation very expensive for a high polynomial degree

• Can we increase the polynomial order arbitrarily? (Overfitting)


o Arbitrarily increasing the polynomial degree may lead to overfitting, which we will
discuss in the next lecture.
o Overfit models tend to not generalize well and hence perform poorly on test data.
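A worked count for the parameter formula above (an illustrative example): with n = 10 features and degree d = 3,

\binom{n+d}{d} = \binom{13}{3} = \frac{13 \cdot 12 \cdot 11}{3!} = 286 \text{ parameters}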
Higher-Order Polynomial Regression
Summary

• Linear regression can be extended by increasing the degree of the features to perform polynomial regression.
• Polynomial regression can fit non-linear data.
• Care must be taken while choosing the polynomial degree to avoid expensive
computations and overfitting.
“ How do we select the best
model?
Model Selection

Model Performance and Complexity

Akaike Information Criterion

Bayesian Information Criterion


Information Criteria

• Probabilistic model selection (or “information criteria”) provides an analytical technique for scoring and choosing among candidate models.
• Models are scored both on their performance on the training dataset and based on the complexity of the model:
1. Model Performance. How well a candidate model has performed on the training dataset.
2. Model Complexity. How complicated the trained candidate model is after training.
Akaike Information Criterion

• For a linear regression model fit on n samples with k parameters, the Akaike Information Criterion (AIC) is given by
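The slide's formula is not in the extract; one commonly used form for a least-squares fit (conventions differ by an additive constant) is:

\mathrm{AIC} = n \ln\!\left(\frac{SS_{res}}{n}\right) + 2k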

• The model with the lowest AIC is selected, i.e., AIC is to be minimized
• AIC puts more emphasis on model performance on the training dataset and penalizes complexity less heavily, so it may select more complex models.
Bayesian Information Criterion

• For a linear regression model fit on n samples with k parameters, the Bayesian Information Criterion (BIC) is given by
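Again in the commonly used least-squares form (the slide's exact convention is assumed):

\mathrm{BIC} = n \ln\!\left(\frac{SS_{res}}{n}\right) + k \ln(n)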

• The model with the lowest BIC is selected, i.e., BIC is to be minimized


• Unlike the AIC, BIC penalizes a model more for its complexity.
Hence more complex models are less likely to be selected.
Summary

• Model selection criteria can choose between models depending on their performance and complexity.
• AIC tends to choose models with better performance and higher complexity.
• BIC tends to choose models with lower complexity, possibly at the cost of some performance.
