
Regression Analysis

Dr. Tanu Shree


Explanation

 Regression analysis is a statistical method for modelling the
relationship between a dependent (target) variable and one or more
independent (predictor) variables.
 It predicts continuous/real values such as temperature,
age, salary, price, etc.
Example:
Suppose a marketing company A runs various advertisements every year
and earns sales from them. The list below shows the advertising done
by the company over the last 5 years and the corresponding sales:
Regression:

 is a supervised learning technique
 is used for prediction, forecasting, time series modeling, and
determining the cause-and-effect relationship between variables.
 "Regression shows a line or curve that passes through all the
datapoints on the target-predictor graph in such a way that the vertical
distance between the datapoints and the regression line is minimum."
Terminologies Related to the
Regression Analysis:
• Dependent Variable or target variable.
• Independent Variable or predictor.
• Outliers
• Multicollinearity: If the independent variables are highly correlated
with each other, the condition is called multicollinearity. It should not be
present in the dataset, because it creates problems when ranking the most
influential variables; a quick correlation check is sketched after this list.
• Underfitting and Overfitting: If our algorithm works well with the
training dataset but not with the test dataset, the problem is called
overfitting. If our algorithm does not perform well even on the training
dataset, the problem is called underfitting.
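As a rough illustration of spotting multicollinearity, one can inspect the pairwise correlations of the predictors. The variable names, data, and threshold intuition below are assumptions for illustration only, not part of the original slides:

```python
import numpy as np
import pandas as pd

# Hypothetical predictors; x2 is almost a copy of x1, so the two are collinear
rng = np.random.default_rng(42)
x1 = rng.normal(size=100)
df = pd.DataFrame({
    "x1": x1,
    "x2": x1 * 0.98 + rng.normal(scale=0.05, size=100),
    "x3": rng.normal(size=100),
})

# Correlations close to +/-1 between predictors signal multicollinearity
print(df.corr().round(2))
```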
Linear Regression
 Models the relationship between a dependent variable and one or more independent variables as a linear function.
 For example, suppose that height was the only determinant of body weight. If we were to
plot body weight (the dependent or 'outcome' variable) as a function of height (the
independent or 'predictor' variable), we might see a very linear relationship.

 We could also describe this relationship with the equation for a line, Y = a + b(x), where
'a' is the Y-intercept and 'b' is the slope of the line. We could use the equation to predict
weight if we knew an individual's height. In this example, if an individual were 70 inches
tall, we would predict his weight to be:
Weight = 80 + 2 × (70) = 220 lbs.
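A minimal sketch of fitting such a line in Python with NumPy; the height/weight values are hypothetical and chosen to match the Weight = 80 + 2 × Height relationship above:

```python
import numpy as np

# Hypothetical height (inches) and weight (lbs) observations
height = np.array([60, 62, 65, 68, 70, 72])
weight = np.array([200, 204, 210, 216, 220, 224])

# Least-squares fit of weight = a + b * height
b, a = np.polyfit(height, weight, deg=1)   # returns [slope, intercept]
print(f"intercept a = {a:.2f}, slope b = {b:.2f}")

# Predict the weight of a 70-inch-tall individual
print("predicted weight at 70 in:", a + b * 70)
```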
Question: Find the linear regression equation for the following
two sets of data:

 Solution:
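As a general sketch of how such an equation is computed, the standard least-squares formulas can be applied directly; the x and y values below are assumptions for illustration, not the data sets from the slide:

```python
import numpy as np

# Hypothetical data; replace with the two sets given in the question
x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([3.0, 7.0, 5.0, 10.0])

x_mean, y_mean = x.mean(), y.mean()

# Least-squares estimates:
#   b = sum((x - x_mean) * (y - y_mean)) / sum((x - x_mean)^2)
#   a = y_mean - b * x_mean
b = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
a = y_mean - b * x_mean

print(f"regression equation: y = {a:.3f} + {b:.3f} x")
```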
Multiple linear regression

 In real estate, we can predict the selling price of a house based on
various factors such as area, number of bedrooms, number of floors, and
location. This is where multiple linear regression comes into play.
Multiple Linear Regression

Y = β0 + β1X1 + β2X2 + ... + βPXP + ε

where:
• Y = dependent (response) variable
• X1, X2, ..., XP = independent (explanatory) variables
• β0 = population Y-intercept (a constant value)
• β1 = population slope of Y with variable X1, holding the effects of X2, X3, ..., XP constant
• βP = population slope of Y with variable XP, holding all other variables' effects constant
• ε = random error
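A minimal sketch of fitting such a model with scikit-learn (the library and the house-price numbers are assumptions for illustration, not part of the original slides):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical housing data: columns are [area (sq ft), bedrooms, floors]
X = np.array([
    [1200, 2, 1],
    [1500, 3, 1],
    [1800, 3, 2],
    [2100, 4, 2],
    [2500, 4, 2],
])
y = np.array([150_000, 190_000, 230_000, 275_000, 320_000])  # selling prices

model = LinearRegression().fit(X, y)
print("intercept (beta_0):", model.intercept_)
print("slopes (beta_1..beta_3):", model.coef_)

# Predict the price of a new 2000 sq ft, 3-bedroom, 2-floor house
print("predicted price:", model.predict([[2000, 3, 2]]))
```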
Logistic Regression
• In classification problems, the dependent variable is in a
binary or discrete format such as 0 or 1.
• The logistic regression algorithm works with categorical variables
such as 0 or 1, Yes or No, True or False, Spam or Not Spam, etc.
• It is a predictive analysis algorithm which works on the concept
of probability.
• Logistic regression is a type of regression, but it differs from
the linear regression algorithm in terms of how it is used.
 Logistic regression uses the sigmoid function (also called the logistic
function) to map the model's output to a probability. This sigmoid
function is used to model the data in logistic regression.
The function can be represented as:

f(x) = 1 / (1 + e^(-x))

• f(x) = output between the 0 and 1 value
• x = input to the function
• e = base of the natural logarithm
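A small sketch of this function in Python, showing how any real-valued input is squashed into the (0, 1) range:

```python
import numpy as np

def sigmoid(x):
    """Logistic (sigmoid) function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

# Large negative inputs approach 0, large positive inputs approach 1
print(sigmoid(np.array([-5.0, 0.0, 5.0])))   # ~[0.0067, 0.5, 0.9933]
```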
 There are three types of logistic regression:
• Binary (0/1, pass/fail)
• Multinomial (cats, dogs, lions)
• Ordinal (low, medium, high)
 Logistic regression is often used
in healthcare to estimate binary
outcomes, like whether a patient
will develop a particular disease.
For example, we could use
logistic regression to predict the
likelihood of a patient having
diabetes based on factors like
age, BMI, family history, and
blood sugar levels.
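A minimal sketch of this idea with scikit-learn's LogisticRegression; the patient features and outcome labels below are invented purely for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical patients: [age, BMI, family history (0/1), blood sugar]
X = np.array([
    [25, 22.0, 0, 85],
    [40, 27.5, 1, 105],
    [52, 31.0, 1, 140],
    [61, 29.0, 0, 130],
    [33, 24.0, 0, 90],
    [58, 33.5, 1, 155],
])
y = np.array([0, 0, 1, 1, 0, 1])   # 1 = patient developed diabetes

clf = LogisticRegression(max_iter=1000).fit(X, y)

# Predicted probability of diabetes for a new patient
new_patient = [[45, 28.0, 1, 120]]
print("P(diabetes):", clf.predict_proba(new_patient)[0, 1])
print("predicted class:", clf.predict(new_patient)[0])
```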
Polynomial Regression
• Polynomial Regression is a type of regression which models a non-linear
dataset using a linear model.
• It is similar to multiple linear regression, but it fits a non-linear curve
between the values of x and the corresponding conditional values of y.
• Suppose there is a dataset whose datapoints are arranged in a non-linear
fashion; in such a case, linear regression will not fit those datapoints well.
To cover such datapoints, we need polynomial regression.
• In polynomial regression, the original features are transformed into
polynomial features of a given degree and then modelled using a linear
model, which means the datapoints are best fitted using a polynomial curve.
• The equation for polynomial regression is also derived from the linear
regression equation: the linear equation Y = b0 + b1x is transformed into
the polynomial regression equation
• Y = b0 + b1x + b2x² + b3x³ + ... + bnxⁿ
• Here Y is the predicted/target output, b0, b1, ..., bn are the regression
coefficients, and x is our independent/input variable.
• The model is still linear because it is linear in the coefficients
b0, ..., bn; only the features x, x², x³, ... are non-linear in x.
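A minimal sketch of this feature transformation with scikit-learn, using an invented quadratic dataset (library choice and data are assumptions, not part of the original slides):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Hypothetical non-linear data: y roughly follows 2 + 3x + 0.5x^2
x = np.arange(0, 10, dtype=float).reshape(-1, 1)
y = 2 + 3 * x.ravel() + 0.5 * x.ravel() ** 2 + np.random.normal(0, 1, 10)

# Transform x into [x, x^2] and fit an ordinary linear model on top
model = make_pipeline(PolynomialFeatures(degree=2, include_bias=False),
                      LinearRegression())
model.fit(x, y)

print("coefficients b1, b2:", model.named_steps["linearregression"].coef_)
print("intercept b0:", model.named_steps["linearregression"].intercept_)
print("prediction at x=12:", model.predict([[12.0]]))
```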
Difference between Polynomial Regression
and Multiple Linear Regression
 Let's start with an example:
 Let's take a little imaginative trip to Central Perk, the beloved coffee shop from
"Friends".
Picture this: Ross, the paleontologist, is trying to figure out how the number of
coffees he drinks and the hours he spends at the museum affect his mood. He's been
keeping track of these three variables for a month.
 Ross decides to use Multiple Linear Regression. The independent variables are
“coffees drunk” and “hours at the museum”, and the dependent variable is “mood
score”, a rating Ross gives himself at the end of the day. Ross uses his stats software
(let’s assume Python) and finds that both the number of coffees and hours at the
museum significantly predict his mood score. Surprisingly, more coffee and less
time at the museum lead to a better mood score (Ross wonders if he should consider
a career change).
 Meanwhile, Chandler has been trying to understand his
relationship with Janice.
 He decides to analyze the pattern of their breakups and
makeups over the years. Because their relationship isn’t
exactly linear (it’s more of a roller-coaster), Chandler
decides to use Polynomial Regression. He finds that the
pattern of their relationship can be modeled quite well
with a cubic function. Chandler is amused — he always
knew his relationship was a little “cubed”.
Polynomial Regression
 Polynomial Regression is a form of regression analysis in which the
relationship between the independent variable x and the dependent
variable y is modelled as an nth degree polynomial.
 The real world is complex, and the relationship between variables in
real-world data can often be non-linear. When Linear Regression
fails to accurately capture the relationship between variables due to
its linearity, Polynomial Regression can be a good alternative
because it can model more complex relationships.
 For instance, a cubic regression uses three variables, X, X², and X³
as predictors.
 Pros:
1. Can model more complex, non-linear relationships between
variables.
2. Provides a more flexible curve that can fit the data better in many
cases.
 Cons:
1. Choosing the correct polynomial degree can be challenging. Too
low and the model underfits; too high and it overfits.
2. Polynomial regression models can be sensitive to the scale of the
dataset, and sometimes require scaling of the features.
Multiple Linear Regression
 Pros:
1. Allows for the analysis of the effects of multiple predictors on the
outcome.
2. Useful for real-world scenarios where multiple factors influence the
outcome.
 Cons:
1. Assumes a linear relationship between predictors and outcome, which
might not always hold true.
2. Multicollinearity, where predictors are correlated with each other, can
be a problem and make the model interpretation difficult.
3. Like other regression models, outliers can heavily influence the model.
Support Vector Regression
Below are some keywords which are used in Support Vector Regression:
• Kernel: It is a function used to map lower-dimensional data into a
higher-dimensional space.
• Hyperplane: In a general SVM, it is a separation line between two classes, but
in SVR, it is the line which helps to predict the continuous variable and covers
most of the datapoints.
• Boundary lines: Boundary lines are the two lines drawn on either side of the
hyperplane, which create a margin for the datapoints.
• Support vectors: Support vectors are the datapoints which are nearest to the
hyperplane (in SVM, of the opposite class) and which determine its position.
 The main goal of SVR is to keep the
maximum number of datapoints within the
boundary lines, so that the hyperplane
(best-fit line) covers as many
datapoints as possible.
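A minimal sketch of support vector regression with scikit-learn's SVR, on invented one-dimensional data; kernel and parameter choices here are assumptions for illustration:

```python
import numpy as np
from sklearn.svm import SVR

# Hypothetical 1-D data with a non-linear trend
X = np.linspace(0, 6, 30).reshape(-1, 1)
y = np.sin(X).ravel() + np.random.normal(0, 0.1, 30)

# The RBF kernel maps the data into a higher-dimensional space;
# epsilon controls the width of the margin (boundary lines) around the fit
svr = SVR(kernel="rbf", C=10.0, epsilon=0.1)
svr.fit(X, y)

print("number of support vectors:", len(svr.support_))
print("prediction at x=3:", svr.predict([[3.0]]))
```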
Decision Tree Regression
 Decision trees can solve both classification and
regression problems.
 Decision tree regression builds a tree-
like structure in which each internal
node represents a "test" on an
attribute, each branch represents an
outcome of the test, and each leaf node
represents the final decision or result.
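A minimal sketch with scikit-learn's DecisionTreeRegressor on invented data (the data and depth setting are assumptions, not part of the original slides):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical 1-D regression data
X = np.array([[1], [2], [3], [4], [5], [6], [7], [8]], dtype=float)
y = np.array([1.2, 1.9, 3.1, 3.9, 5.2, 5.8, 7.1, 8.0])

# max_depth limits how many "tests" can be stacked, which controls overfitting
tree = DecisionTreeRegressor(max_depth=3).fit(X, y)

# Each prediction is the mean target value of the leaf the sample falls into
print("prediction at x=4.5:", tree.predict([[4.5]]))
```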
Ridge Regression
 Ridge regression is a statistical regularization technique.
 It corrects for overfitting on training data in machine learning models.
 Ridge regression reduces the mean squared error by shrinking the
coefficients of a model towards zero: it accepts a small amount of bias in
the coefficient estimates in exchange for a large reduction in their
variance, in order to solve the problems of overfitting or multicollinearity
that are normally associated with ordinary least squares regression.
 Ridge regression specifically corrects for multicollinearity in
regression analysis.
• The amount of bias added to the model is known as the ridge
regression penalty. We can compute this penalty term by
multiplying lambda by the squared weight of each individual feature.
• The cost function for ridge regression will be:

Cost = Σ (yi − ŷi)² + λ Σ (bj)²

• A general linear or polynomial regression will fail if there is high
collinearity between the independent variables, so to solve such
problems, ridge regression can be used.
• Ridge regression is a regularization technique which is used to
reduce the complexity of the model. It is also called L2
regularization.
• It helps to solve problems when we have more parameters than
samples.
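A minimal sketch of ridge regression with scikit-learn on invented collinear data; in scikit-learn the alpha parameter plays the role of lambda (these names and values are assumptions for illustration):

```python
import numpy as np
from sklearn.linear_model import Ridge

# Hypothetical data with two highly correlated (collinear) features
rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
x2 = x1 + rng.normal(scale=0.01, size=50)   # nearly identical to x1
X = np.column_stack([x1, x2])
y = 3 * x1 + rng.normal(scale=0.1, size=50)

# Larger alpha (lambda) shrinks the coefficients more strongly towards zero
ridge = Ridge(alpha=1.0).fit(X, y)
print("ridge coefficients:", ridge.coef_)
```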
Lasso Regression:

• Lasso regression is another regularization technique to reduce the
complexity of the model.
• It is similar to ridge regression, except that the penalty term contains
only the absolute values of the weights instead of their squares.
• Since it takes absolute values, it can shrink a coefficient exactly to 0,
whereas ridge regression can only shrink it close to 0.
• It is also called L1 regularization. The cost function for lasso
regression will be:

Cost = Σ (yi − ŷi)² + λ Σ |bj|
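A minimal sketch of lasso regression with scikit-learn on invented data, showing how the L1 penalty can zero out irrelevant coefficients (library, alpha value, and data are assumptions, not part of the original slides):

```python
import numpy as np
from sklearn.linear_model import Lasso

# Hypothetical data: only the first of three features actually matters
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
y = 4 * X[:, 0] + rng.normal(scale=0.1, size=50)

# The L1 penalty can drive irrelevant coefficients exactly to zero
lasso = Lasso(alpha=0.1).fit(X, y)
print("lasso coefficients:", lasso.coef_)   # expect roughly [4, 0, 0]
```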
