Lecture 09_02.09.2024_Regression-01


Regression - 1

Prof. Sashikumaar Ganesan, IISc Bangalore


“ How can we predict the value
of a continuous variable in a
supervised setting?
Linear Regression

Introduction to linear regression

Assumptions in linear regression

Significance Testing

Residual Analysis
What is “Linear”?

• Remember this: Y = mX + B?
• A slope of two implies that every one-unit change in X yields a two-unit change in Y.
Assumptions

• The relationship between X and Y is linear


• Y is distributed normally at each value of X
• The variance of Y at every value of X is the same (homogeneity of variances)
• The observations are independent.
Linear Regression: Prediction

• If you know something about X, this knowledge helps you predict something about
Y. (Sound familiar?…sound like conditional probabilities?)
• The expected value of y for a given x lies exactly on the regression line; an individual observation differs from it by a random error with a normal distribution.
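The equation itself is not in the extracted text; in the usual simple-regression notation (the symbols β0, β1, ε, σ are assumptions here, not taken from the slide) it reads:

E(y \mid x) = \beta_0 + \beta_1 x, \qquad y = \beta_0 + \beta_1 x + \varepsilon, \quad \varepsilon \sim N(0, \sigma^2)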
Significance Testing

• How do we assess the nature of fit?


• The t-test helps to determine if this linear relationship is statistically significant.

• We can calculate the t-statistic using (s.e. denotes standard error)
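The formula is missing from the extract; the usual t-statistic for testing H0: β1 = 0 (notation assumed, not taken from the slide) is:

t = \frac{\hat{\beta}_1}{\mathrm{s.e.}(\hat{\beta}_1)}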


Significance Testing

• Formula for the standard error of beta. We use n-2 degrees of freedom since two are lost in estimating the slope and the y-intercept.
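A standard form of this expression for simple linear regression, assuming this is what the slide shows, is:

\mathrm{s.e.}(\hat{\beta}_1) = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 / (n-2)}{\sum_{i=1}^{n} (x_i - \bar{x})^2}}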

• Using the t-value statistic we can then calculate the p-value.


• P-values are used in hypothesis testing to help decide whether to reject the null hypothesis. The smaller the p-value, the more likely you are to reject the null hypothesis. We can use standard tables that relate p-values to the t-statistic.
P-Value Interpretation
Confidence Intervals

• Have a look at the following example:
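The worked example on the slide is not recoverable from the extract; for reference, the general form of a (1 - α) confidence interval for the slope is:

\hat{\beta}_1 \pm t_{\alpha/2,\, n-2} \cdot \mathrm{s.e.}(\hat{\beta}_1)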


Residual Analysis

• Residual: the residual for observation i, e_i, is the difference between its observed and its predicted value (the expression and a plotting sketch follow this list)
• Check the assumptions of regression by examining the residuals
• Examine for linearity assumption
• Examine for constant variance for all levels of X (homoscedasticity)
• Evaluate normal distribution assumption
• Evaluate independence assumption
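As referenced in the list above, the residual is e_i = y_i - \hat{y}_i. Below is a minimal sketch (an illustration, not taken from the slides; the synthetic data and variable names are made up) of how the residual-versus-fitted plot used in the following slides can be produced:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 1.0 + 2.0 * x + rng.normal(scale=1.0, size=200)   # synthetic linear data

slope, intercept = np.polyfit(x, y, deg=1)            # simple linear fit
y_hat = intercept + slope * x
residuals = y - y_hat                                 # e_i = y_i - y_hat_i

# A patternless cloud of constant spread around zero supports the linearity
# and homoscedasticity assumptions; curvature or funnel shapes suggest violations.
plt.scatter(y_hat, residuals, s=10)
plt.axhline(0.0, color="red")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()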
Analysis For Linearity
Analysis For Homoscedasticity
Analysis For Independence
Overall Picture
Summary
• Modelling the relation between a scalar response (target/dependent variable) and one or
more explanatory (features/independent) variables
• The case of one explanatory variable is called simple linear regression

• It is important to assess the quality of the fit using the statistical tools mentioned above
“ How can we solve multiple linear regression (more than one feature)?
Normal Equation

Set-up of our problem

Derivation of normal equation

Computational Aspects
Normal Equation

• Suppose we have n observations (x_i, y_i), i = 1, 2, …, n, where x_i ∈ ℝ^m, y_i ∈ ℝ
• In function notation, y_i = f(x_i), where f: ℝ^m → ℝ
Normal Equation

Linear Regression:

y_i = f(x_i) = b_0 + b_1 x_{i1} + b_2 x_{i2} + \cdots + b_m x_{im} + \epsilon_i = b_0 \cdot 1 + \sum_{j=1}^{m} x_{ij} b_j + \epsilon_i
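Stacking all n observations gives the matrix form of the model (the samples-as-rows convention for X is an assumption here; the slides may place samples as columns, which swaps X and X^T):

y = X b + \epsilon, \qquad X = \begin{pmatrix} 1 & x_{11} & \cdots & x_{1m} \\ \vdots & \vdots & & \vdots \\ 1 & x_{n1} & \cdots & x_{nm} \end{pmatrix}, \qquad b = (b_0, b_1, \ldots, b_m)^T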
Derivation

Does the inverse exist?


Derivation

• Find the optimal coefficients that best fit the model, instead of solving the equation exactly:
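The equations on this slide are missing from the extract; with the samples-as-rows convention assumed above, minimizing the squared error gives the normal equation:

\min_b \|y - Xb\|^2, \qquad \nabla_b \|y - Xb\|^2 = -2 X^T (y - Xb) = 0 \;\Rightarrow\; X^T X\, b = X^T y \;\Rightarrow\; b = (X^T X)^{-1} X^T y

(With the transposed convention of samples as columns of X, the Gram matrix appears as XX^T, matching the question on the next slide.)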
Derivation

• Does the inverse of XX^T exist?

• Can we use SVD or any other direct method to solve the normal equations?
Computational Complexity

• Recall the order of computational complexity


• The computational complexity of inverting a matrix is typically of order between 2.5 and 3 (in the matrix dimension)
• The order is about 2 when an SVD-based solver is used
• Alternatively, iterative methods are preferred when the size of the matrix is large
• Irrespective of the size of the matrix, the prediction cost is very small compared to the training cost
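As an illustration (not from the slides; the synthetic data and coefficient values are made up), a minimal NumPy sketch comparing a direct normal-equation solve with an SVD-based least-squares solve:

import numpy as np

rng = np.random.default_rng(0)
n, m = 1000, 3                         # n samples, m features
X = rng.normal(size=(n, m))
y = 4.0 + X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=n)

Xb = np.c_[np.ones(n), X]              # design matrix with a bias column

# Normal equation: solve (X^T X) b = X^T y; cost dominated by the (m+1) x (m+1) solve
b_normal = np.linalg.solve(Xb.T @ Xb, Xb.T @ y)

# SVD-based least squares, the kind of direct method library solvers typically use
b_svd, *_ = np.linalg.lstsq(Xb, y, rcond=None)

print(b_normal)                        # both should be close to [4, 2, -1, 0.5]
print(b_svd)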
Summary

• We can derive the normal equation by considering many features and many
training examples.
• Solving the normal equation is computationally expensive due to the calculation
of the inverse.
• Hence iterative methods are preferred to prevent large computation
overheads.

“ How do we perform linear regression
when the training data is large?
Gradient Descent

Intuition

Optimizing the cost function with GD

Choosing hyperparameters
Intuition

• How can we reach a minimum point of a function from an arbitrary initial guess?
Intuition

• Can we always reach the global minimum of a function?


Gradient Descent

• A generic optimization algorithm


• The idea is to tweak parameters iteratively to minimize a cost function
• We need the derivative of the cost function
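The update rule referred to here, in standard notation (η denotes the step size or learning rate and J the cost function; the slide's own symbols may differ):

\theta \leftarrow \theta - \eta \, \nabla_\theta J(\theta)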
Hyperparameter Selection

How should we choose the step size?


Hyperparameter Selection
Types of Gradient Descent

Batch Gradient Descent

Stochastic Gradient Descent

Mini-batch Gradient Descent


Batch Gradient Descent

• Recall the derivative calculation in the Gradient Descent

• Instead of calculating these derivatives individually, can we compute them in one go?
• This results in Batch Gradient Descent (BGD)
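For the MSE cost of linear regression, this single vectorized expression is the full (batch) gradient; the notation follows the design-matrix convention assumed earlier:

\nabla_b \, \mathrm{MSE}(b) = \frac{2}{n} X^T (X b - y)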
Stochastic Gradient Descent

• What happens if the data doesn’t fit in RAM?


• Also recall the memory bound!
• Can we apply the gradient update for a single random data point “p”, instead of all the data points?
• Let “p” be a random integer in {1, …, n}; then
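The resulting per-sample update, in the notation assumed earlier (with x_p including the bias entry), is:

b \leftarrow b - \eta \cdot 2 \, (x_p^T b - y_p) \, x_p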
Mini-batch Gradient Descent

• BGD - compute the full gradient


• SGD - compute the gradient one sample at a time
• Can we combine both?
• This results in Mini-batch Gradient Descent
• Computes the gradient on a small subset (mini-batch) of data points
• Cache optimization can be exploited
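A minimal sketch (an illustration, not from the slides; the learning rate, batch size, and epoch count are arbitrary choices) of mini-batch gradient descent for linear regression; setting batch_size = 1 or batch_size = n recovers SGD or BGD respectively:

import numpy as np

def minibatch_gd(X, y, lr=0.05, epochs=50, batch_size=32, seed=0):
    """Fit y ~ X b with an MSE loss using mini-batch gradient descent."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    Xb = np.c_[np.ones(n), X]                 # add a bias column
    b = np.zeros(Xb.shape[1])                 # initial guess for the coefficients
    for _ in range(epochs):
        idx = rng.permutation(n)              # reshuffle the samples each epoch
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            Xm, ym = Xb[batch], y[batch]
            # gradient of the MSE on the mini-batch: (2/|B|) * X^T (X b - y)
            grad = 2.0 / len(batch) * Xm.T @ (Xm @ b - ym)
            b -= lr * grad                    # gradient descent step
    return b

# Tiny usage example on synthetic data
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 2))
y = 3.0 + X @ np.array([1.5, -2.0]) + 0.1 * rng.normal(size=500)
print(minibatch_gd(X, y))                     # roughly [3, 1.5, -2]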
Summary

• Iterative algorithms like gradient descent, where we try to optimize a “cost function / loss function,” are used when the problem is high-dimensional.
• Choosing the right step size can help gradient descent converge to a minimum faster.
• Various techniques like SGD and BGD are used to calculate the gradient due to RAM limitations (memory bound) and computational limitations (compute bound).
“ How can we assess the
goodness of the fit?
Model Validation

Mean Squared Error

Mean Absolute Error

R2
Mean Squared Error

• If yi is the target value and fi is the predicted value for the linear
regression fit on n samples, then the mean squared error for the fit
is given by -
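In the slide's notation, the standard expression is:

\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - f_i)^2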

• A higher value of mean squared error indicates a worse fit.


Mean Absolute Error (MAE)

• If yi is the target value and fi is the predicted value for the linear
regression fit on n samples, then the mean absolute error for the fit
is given by -
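Again in the slide's notation, the standard expression is:

\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - f_i|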

• A higher value of mean absolute error indicates a worse fit.


R2 (Coefficient of Determination)

• If yi is the target value, then the mean target value for n samples is

• The residual sum of squares for the fit is proportional to the MSE

• The total sum of squares is proportional to the variance inherent in the data
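Written out (standard definitions; the slide's exact symbols are assumed), these quantities and the resulting R2 are:

\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i, \qquad SS_{res} = \sum_{i=1}^{n} (y_i - f_i)^2 = n \cdot \mathrm{MSE}, \qquad SS_{tot} = \sum_{i=1}^{n} (y_i - \bar{y})^2, \qquad R^2 = 1 - \frac{SS_{res}}{SS_{tot}}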
R2 (Coefficient of Determination)

• The coefficient of determination, also called R2, is a number between 0 and 1 that describes how much of the total variance in the data is captured by the linear regression fit

• What does R2 = 1 imply?

• What does R2 = 0 imply? Which one would you say is a better fit?
“ How can we fit non-linear data?
Non-Linear Regression

Polynomial Regression

How to fit a quadratic regression

Higher-order polynomial regression


Polynomial Regression

• How to fit nonlinear data, that is, when the relation between the
target value and the features is nonlinear?
Quadratic Regression

• Let us consider a simple linear regression

• How about

• Is it still linear?
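In the notation assumed earlier, the two models being contrasted on this slide are:

y = b_0 + b_1 x + \epsilon \qquad \text{versus} \qquad y = b_0 + b_1 x + b_2 x^2 + \epsilon

The second model is still linear in the coefficients b_0, b_1, b_2: treating x^2 as an additional feature reduces it to ordinary linear regression.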
Quadratic Regression

• How about sample data with two features? How will you write the
target y as a function of features x1 and x2 for a quadratic regression
problem?
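One way to write the full quadratic model in two features (the coefficient names are illustrative):

y = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_1^2 + b_4 x_2^2 + b_5 x_1 x_2 + \epsilon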
Higher-Order Polynomial Regression

• Can we increase the polynomial order arbitrarily? (Dimensionality)
o Suppose there are “n” features and the polynomial degree is “d”
o Then the total number of unknown parameters to be fitted in the polynomial regression is (n + d)! / (d! n!); a worked count follows this list
o This makes the computation very expensive for a high polynomial degree

• Can we increase the polynomial order arbitrarily? (Overfitting)


o Arbitrarily increasing the polynomial degree may lead to overfitting, which we will
discuss in the next lecture.
o Overfit models tend to not generalize well and hence perform poorly on test data.
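A worked count for the parameter formula above (an illustrative example): with n = 10 features and degree d = 3,

\binom{n+d}{d} = \binom{13}{3} = \frac{13 \cdot 12 \cdot 11}{3!} = 286 \text{ parameters}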
Higher-Order Polynomial Regression
Summary

• Linear regression can be extended by increasing the degree of the features to perform polynomial regression.
• Polynomial regression can fit non-linear data.
• Care must be taken while choosing the polynomial degree to avoid expensive
computations and overfitting.
“ How do we select the best
model?
Model Selection

Model Performance and Complexity

Akaike Information Criterion

Bayesian Information Criterion


Information Criteria

• Probabilistic model selection (or “information criteria”) provides an analytical technique for scoring and choosing among candidate models.
• Models are scored both on their performance on the training dataset and based on the complexity of the model:
1. Model Performance. How well a candidate model has performed on the training dataset.
2. Model Complexity. How complicated the trained candidate model is after training.
Akaike Information Criterion

• For a linear regression model fit on n samples with k parameters, the Akaike Information Criterion (AIC) is given by
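The slide's formula is not in the extract; one commonly used form for a least-squares fit (conventions differ by an additive constant) is:

\mathrm{AIC} = n \ln\!\left(\frac{SS_{res}}{n}\right) + 2k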

• The model with the lowest AIC is selected, i.e., AIC is to be minimized
• AIC puts more emphasis on model performance on the training dataset and penalizes complexity less heavily, so it may select more complex models.
Bayesian Information Criterion

• For a linear regression model fit on n samples with k parameters, the Bayesian Information Criterion (BIC) is given by
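Again in the commonly used least-squares form (the slide's exact convention is assumed):

\mathrm{BIC} = n \ln\!\left(\frac{SS_{res}}{n}\right) + k \ln(n)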

• The model with the lowest BIC is selected, i.e., BIC is to be minimized


• Unlike the AIC, BIC penalizes a model more for its complexity.
Hence more complex models are less likely to be selected.
Summary

• Model selection criteria can choose between models depending on their performance and complexity.
• AIC tends to choose models with better performance and higher complexity.
• BIC tends to choose models with lower complexity, possibly at the cost of some performance.
