
UNIT # 2

Linear Regression
• It’s a supervised learning technique.
• Regression is a technique to model the predictive relationship between one or more
independent variables and one dependent variable.
• The objective of regression is to find the best-fitting curve for the dependent variable
as a function of one or more independent variables. If there is one independent
variable, it is called one-dimensional (simple) regression analysis; if there are more
than one, it is called multidimensional (multiple) regression analysis.
• The curve could be a straight line or a non-linear curve.
• The quality of the fit of the curve is measured by calculating the coefficient of
correlation (r). Depending on the value of r, the quality of the regression is judged.
• The coefficient of correlation is the square root of the proportion of variance
explained by the curve.
Possible Cases
• For the given cases, two variables x and y are shown. In the first case, as x increases y also
increases, and a straight line can be drawn that fits the data.
• In the second case, as x increases y decreases. Here also a straight line can be drawn
that fits the data set well.
• In the third case a straight line cannot be drawn, but a curve can be drawn to fit the given
data set. This is called curvilinear regression analysis. The fourth case is also an example
of curvilinear regression analysis.
• In the last two cases neither a straight line nor a curve can be drawn, so no relationship
can be established.
Correlation
• For two numeric variables x and y, correlation can be calculated as

  r = Σ(xᵢ − x̄)(yᵢ − ȳ) / √( Σ(xᵢ − x̄)² · Σ(yᵢ − ȳ)² )

• where x̄ is the mean of x and ȳ is the mean of y.


• Correlation defines the kind and strength of the relationship developed between two
variables.
• It is a quantitative measure. As used here (the proportion of shared variance, r²),
it is measured in the range from 0 to 1.
• A correlation of 1 indicates a perfect relationship and a correlation of 0 indicates no
relationship.
Coefficient of correlation
• The relationship can be positive or inverse, i.e. the variables can move in the same
direction or in opposite directions.
• So a better measure is the correlation coefficient instead of the correlation. The
correlation coefficient is the (signed) square root of the correlation. It varies in the
range −1 to +1.
• +1 shows a perfect relationship in the same direction, −1 shows a perfect relationship
in the opposite direction, and 0 shows no relationship.
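As a concrete illustration, here is a minimal Python sketch of the correlation coefficient formula above (the function name and plain-list inputs are our own choices):

# Pearson correlation coefficient r for two equal-length lists
def correlation_coefficient(x, y):
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Numerator: co-variation of x and y around their means
    num = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    # Denominator: product of the spreads of x and y
    den = (sum((xi - mean_x) ** 2 for xi in x)
           * sum((yi - mean_y) ** 2 for yi in y)) ** 0.5
    return num / den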
Steps of Regression Analysis

• List all the variables available for making the model.
• Establish a dependent variable of interest.
• Examine the relationships between the variables of interest.
• Find a way to predict the dependent variable using the other variables.
Linear Regression Model
Y = b0 + b1X

Multiple Regression Model
Y = b0 + b1X1 + b2X2 + … + bnXn
Example 1: Regression equation to predict glucose level

S.No.   Age (x)   Glucose Level (y)
1       43        99
2       21        65
3       25        79
4       42        75
5       57        87
6       59        81
7       55        ?
Linear Regression Equation: Yi = b0 + b1Xi,
where b0 and b1 can be computed as follows:
Step 1: Calculate the values of XY and X²

S.No.   Age (X)   Glucose Level (Y)   XY      X²
1       43        99                  4257    1849
2       21        65                  1365    441
3       25        79                  1975    625
4       42        75                  3150    1764
5       57        87                  4959    3249
6       59        81                  4779    3481
Sum     247       486                 20485   11409
Step 2: Find the values of b0 and b1

b1 = (n·ΣXY − ΣX·ΣY) / (n·ΣX² − (ΣX)²) = (6·20485 − 247·486) / (6·11409 − 247²) = 0.385225
b0 = (ΣY − b1·ΣX) / n = (486 − 0.385225·247) / 6 = 65.14
Step 3: Insert the values into the equation
Yi = b0 + b1Xi
Y = 65.14 + 0.385225X

Step 4: Predict y for the given value x = 55

Y = 65.14 + 0.385225 · 55 = 86.327

So the glucose level predicted for age 55 is 86.327.
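The same computation can be written as a short Python sketch (the function name and list inputs are our own illustrative choices):

# Least-squares fit of Yi = b0 + b1*Xi using the formulas from Step 2
def fit_line(x, y):
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi ** 2 for xi in x)
    b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    b0 = (sum_y - b1 * sum_x) / n
    return b0, b1

age = [43, 21, 25, 42, 57, 59]
glucose = [99, 65, 79, 75, 87, 81]
b0, b1 = fit_line(age, glucose)     # roughly b0 = 65.14, b1 = 0.385
print(b0 + b1 * 55)                 # predicted glucose at age 55, about 86.3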
2. Find the linear regression equation for the following data set

S.No.   x    y
1       2    3
2       4    7
3       6    5
4       8    10
Sum     20   25

Extending the table with x² and xy:

S.No.   x    y    x²    xy
1       2    3    4     6
2       4    7    16    28
3       6    5    36    30
4       8    10   64    80
Sum     20   25   120   144

• b1 (slope) = (4·144 − 20·25) / (4·120 − 20·20) = 76/80 = 0.95
• b0 = (25·120 − 20·144) / (4·120 − 20·20) = 120/80 = 1.5

• y = 0.95x + 1.5
Coefficient of determination (R²)
• The determination coefficient tells us the goodness of fit of the model. The value of R²
lies in the range 0 to 1 and is computed as

  R² = 1 − Σ(y − yp)² / Σ(y − ȳ)²

  where y is the observed output, yp is the predicted output, and ȳ is the mean of the
observed output.
• In the given figure, a model with R² = 0.9 is a good fit, as the observed and predicted
values are almost equal, i.e. the errors are small.
• A model with R² = 0.2 is not a good fit, as the observed and predicted values are far from
each other, i.e. the errors are large; the model is not able to fit the data points.
For the data set given in Example 2, determine the value of R²
• y = 0.95x + 1.5 (regression model)
• ȳ = 6.25 (mean of y)

S.No.   x    y    yp    y − yp   (y − yp)²   y − ȳ    (y − ȳ)²
1       2    3    3.4   −0.4     0.16        −3.25    10.562
2       4    7    5.3   1.7      2.89        0.75     0.562
3       6    5    7.2   −2.2     4.84        −1.25    1.562
4       8    10   9.1   0.9      0.81        3.75     14.062
Sum     20   25   25             8.7                  26.748

R² = 1 − 8.7/26.748 = 0.67

As the determination coefficient is reasonably close to 1, the predicted model is a good fit.
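The same R² calculation as a short Python sketch (the function name is our own):

# R² for the fitted line y = 0.95x + 1.5 from Example 2
def r_squared(x, y, b0, b1):
    y_pred = [b0 + b1 * xi for xi in x]
    y_mean = sum(y) / len(y)
    ss_res = sum((yi - ypi) ** 2 for yi, ypi in zip(y, y_pred))  # Σ(y − yp)²
    ss_tot = sum((yi - y_mean) ** 2 for yi in y)                 # Σ(y − ȳ)²
    return 1 - ss_res / ss_tot

print(r_squared([2, 4, 6, 8], [3, 7, 5, 10], 1.5, 0.95))  # ≈ 0.67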
3. The values of x and their corresponding y are given in the table

(i) Find the regression line for the given data points
(ii) Check whether it is a best-fit line or not

S.No.   x   y
1       1   3
2       2   4
3       3   2
4       4   4
5       5   5
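Using the hypothetical helpers fit_line and r_squared from the sketches above, Exercise 3 can be worked out as follows:

x = [1, 2, 3, 4, 5]
y = [3, 4, 2, 4, 5]
b0, b1 = fit_line(x, y)          # gives b0 = 2.4, b1 = 0.4, i.e. y = 0.4x + 2.4
print(r_squared(x, y, b0, b1))   # ≈ 0.31, so the line is not a particularly good fit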
Finding the best fit line:

• When working with linear regression, our main goal is to find the best fit line,
which means the error between the predicted values and the actual values should be
minimized. The best fit line has the least error.
• Different values for the weights or coefficients of the line (b0, b1) give different
regression lines, so we need to calculate the best values for b0 and b1 to find the
best fit line. To do this, we use a cost function.
Cost function
• Different values of the weights or coefficients of the line (b0, b1) give different
regression lines, and the cost function is used to estimate the values of the coefficients
for the best fit line.
• The cost function optimizes the regression coefficients or weights. It measures how well a
linear regression model is performing.
• We can use the cost function to find the accuracy of the mapping function, which
maps the input variable to the output variable. This mapping function is also known
as the hypothesis function.
• For linear regression we use the Mean Squared Error (MSE) cost function, which
is the average of the squared errors between the predicted and actual values. For the
linear equation Yi = b0 + b1Xi above, the MSE can be calculated as:

  MSE = (1/n) · Σ (yi − (b0 + b1·xi))²
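A minimal Python sketch of this cost function (the function name is our own):

# MSE cost for a candidate line y = b0 + b1*x
def mse_cost(x, y, b0, b1):
    n = len(x)
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y)) / n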
Gradient Descent

• Gradient descent is an optimization algorithm used to find the values of the
parameters of a function that minimize a cost function. It is an iterative algorithm;
we use gradient descent to update the parameters of the model. Parameters refer to
coefficients in linear regression and weights in neural networks.
• Gradient descent is used to minimize the MSE by calculating the gradient of the
cost function.
• A regression model uses gradient descent to update the coefficients of the line by
reducing the cost function.
• This is done by selecting random initial values for the coefficients and then
iteratively updating them to reach the minimum of the cost function.
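A minimal sketch of this procedure for simple linear regression (the learning rate and iteration count are arbitrary illustrative choices):

# Batch gradient descent on the MSE cost for y = b0 + b1*x
def gradient_descent(x, y, lr=0.01, iters=10000):
    b0, b1 = 0.0, 0.0                 # initial coefficient values
    n = len(x)
    for _ in range(iters):
        # Partial derivatives of the MSE with respect to b0 and b1
        grad_b0 = (-2 / n) * sum(yi - (b0 + b1 * xi) for xi, yi in zip(x, y))
        grad_b1 = (-2 / n) * sum(xi * (yi - (b0 + b1 * xi)) for xi, yi in zip(x, y))
        b0 -= lr * grad_b0            # step opposite to the gradient
        b1 -= lr * grad_b1
    return b0, b1

print(gradient_descent([2, 4, 6, 8], [3, 7, 5, 10]))  # approaches (1.5, 0.95) from Example 2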
Multiple Linear Regression
• In the previous topic we learned about simple linear regression, where a
single independent/predictor variable (X) is used to model the response variable
(Y). But there may be cases in which the response variable is affected by
more than one predictor variable; for such cases, the Multiple Linear Regression
algorithm is used.
• Multiple Linear Regression is an extension of simple linear regression:
it uses more than one predictor variable to predict the response variable.
• For MLR, the dependent or target variable (Y) must be continuous/real, but the
predictor or independent variables may be continuous or categorical.
• Each feature variable must have a linear relationship with the dependent
variable.
• MLR tries to fit a regression line (hyperplane) through a multidimensional space of data points.
Assumptions for Multiple Linear Regression:
• A linear relationship should exist between the target and predictor variables.
• The regression residuals must be normally distributed.
• MLR assumes little or no multicollinearity (correlation between the
independent variables) in the data.
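A minimal sketch of fitting an MLR model with NumPy's least-squares solver (the data here is made up purely for illustration):

import numpy as np

# Hypothetical data: two predictor columns and one continuous response
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
y = np.array([6.0, 5.0, 12.0, 11.0, 16.0])

# Prepend a column of ones so the first coefficient acts as the intercept b0
X1 = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
print(coef)  # [b0, b1, b2]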
Logistic Regression
• Logistic regression is one of the most popular machine learning algorithms; it comes
under the supervised learning technique. It is used for predicting a categorical dependent
variable using a given set of independent variables.
• Logistic regression predicts the output of a categorical dependent variable. Therefore the
outcome must be a categorical or discrete value: Yes or No, 0 or 1, True or False, etc.
But instead of giving an exact value of 0 or 1, it gives probabilistic values that lie
between 0 and 1.
• Logistic regression is quite similar to linear regression except in how it is used.
Linear regression is used for solving regression problems, whereas logistic regression
is used for solving classification problems.
• In logistic regression, instead of fitting a straight regression line, we fit an "S"-shaped
logistic function, which saturates at the two extreme values (0 and 1).
• The curve from the logistic function indicates the likelihood of something, such as whether
cells are cancerous or not, or whether a mouse is obese or not based on its weight.
• Logistic regression is a significant machine learning algorithm because it can
provide probabilities and classify new data using both continuous and discrete datasets.
Representation of S (sigmoid function)
Logistic regression is a machine learning algorithm used for classification problems;
it is a predictive analysis algorithm based on the concept of probability. Here we use
the sigmoid function to map predictions to probabilities, as given below:

P = 1 / (1 + e^(−y))

where:
y = β0 + β1x (in case of univariate logistic regression)
y = β0 + β1x1 + β2x2 + … + βnxn (in case of multivariate logistic regression)

Univariate logistic regression means the output variable is predicted using only one
predictor variable, while multivariate logistic regression means the output variable is
predicted using multiple predictor variables.

The logistic function converts the values of logits (also called log-odds), which range
from −∞ to +∞, to a range between 0 and 1.
Now let us try to simplify what we said. Let P be the probability of occurrence of an event,
so the probability that the event will not occur is 1 − P.

Odds is defined as the ratio of the probability of occurrence of a particular event to the
probability of the event not occurring:

Odds = P / (1 − P)

We know that the logistic regression function gives us a probability value, so we can write:

P = 1 / (1 + e^(−y)),  and hence  1 − P = e^(−y) / (1 + e^(−y))

Now, since we mentioned log odds, let us take the natural log of both sides of
the odds equation and substitute the value of P:

ln(Odds) = ln(P / (1 − P)) = y = β0 + β1x

Thus we get a more simplified form of the logistic regression equation, and we
can say that the log odds has a linear relationship with the predictor variable x.
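A short Python sketch of this relationship (the coefficient values are made up):

import math

def sigmoid(y):
    # Maps a log-odds value y to a probability P between 0 and 1
    return 1.0 / (1.0 + math.exp(-y))

def logit(p):
    # The inverse mapping: log-odds ln(P / (1 - P))
    return math.log(p / (1.0 - p))

beta0, beta1 = -3.0, 0.5          # hypothetical coefficients
x = 4.0
p = sigmoid(beta0 + beta1 * x)    # probability, here about 0.27
print(logit(p))                   # recovers beta0 + beta1*x = -1.0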
Maximum Likelihood Estimation

In order for our model to predict the output variable as 0 or 1, we need to find the
best-fit sigmoid curve, i.e. the optimal values of the beta coefficients. That is, we
need to create an efficient boundary between the 0 and 1 values.

A cost function tells you how close your predicted values are to the actual ones. So here
we need a cost function that maximizes the likelihood of getting the desired output values.
Such a cost function is called the Maximum Likelihood Estimation (MLE) function.

For points labelled 0, we need the probabilities P1, P2 and P4 to be as low as possible, and
for points labelled 1, we need the probabilities P3, P5, P6 and P7 to be as high as possible,
for correct classification.

We can also say that (1 − P1), (1 − P2), P3, (1 − P4), P5, P6 and P7 should all be as high as possible.

The joint probability is the product of these probabilities, so the product
[(1 − P1)·(1 − P2)·P3·(1 − P4)·P5·P6·P7] should be maximum.

This joint probability function is our cost function, which should be maximized in order to
get the best-fit sigmoid curve; equivalently, the predicted values should be close to the
actual values.
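In practice one maximizes the log of this product, since sums are more numerically stable than long products. A minimal sketch with made-up probabilities P1 to P7:

import math

def log_likelihood(probs, labels):
    # Correct "1" points contribute log(p); correct "0" points contribute log(1 - p)
    return sum(math.log(p) if t == 1 else math.log(1.0 - p)
               for p, t in zip(probs, labels))

probs  = [0.2, 0.1, 0.9, 0.3, 0.8, 0.7, 0.95]  # P1..P7 (hypothetical)
labels = [0,   0,   1,   0,   1,   1,   1]
print(log_likelihood(probs, labels))  # higher (closer to 0) means a better fit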
Sigmoid Function
• The sigmoid function is a mathematical function used to map predicted values to
probabilities.
• It maps any real value to a value in the range 0 to 1.
• The output of logistic regression must be between 0 and 1 and cannot go beyond this
limit, so it forms an "S"-shaped curve. The S-form curve is called the sigmoid
function or the logistic function.
• In logistic regression we use the concept of a threshold value, which decides
between 0 and 1: values above the threshold tend to 1, and values below the
threshold tend to 0.
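A tiny sketch of this thresholding step (0.5 is the common default, assumed here):

def classify(p, threshold=0.5):
    # Map a predicted probability to a class label
    return 1 if p >= threshold else 0

print(classify(0.73))  # 1
print(classify(0.12))  # 0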
Assumptions for Logistic Regression:
• The dependent variable must be categorical in nature.
Type of Logistic Regression:
• On the basis of the categories, Logistic Regression can be classified into
three types:
• Binomial: In binomial logistic regression, there can be only two possible
types of the dependent variable, such as 0 or 1, Pass or Fail, etc.
• Multinomial: In multinomial logistic regression, there can be 3 or more
possible unordered types of the dependent variable, such as "cat", "dog", or
"sheep".
• Ordinal: In ordinal logistic regression, there can be 3 or more possible
ordered types of the dependent variable, such as "low", "medium", or "high".
Advantages & Disadvantages of Logistic Regression

• The main advantage of logistic regression is that it is much easier to
set up and train than many other machine learning and AI methods.
• Another advantage is that it is one of the most efficient algorithms
when the different outcomes or classes represented by the data are linearly
separable, i.e. when a straight line (or hyperplane) can separate the classes.
• One of the biggest attractions of logistic regression for statisticians
is that it can help reveal the interrelationships between different
variables and their impact on outcomes.
Logistic regression vs. linear regression

• The main difference between logistic and linear regression is that logistic regression
provides a categorical (discrete) output, while linear regression provides a continuous output.
• In logistic regression, the outcome, or dependent variable, has only a limited number of
possible values (two, in the binary case). In linear regression, the outcome is continuous,
which means it can take any one of an infinite number of possible values.
• Logistic regression is used when the response variable is categorical, such as yes/no,
true/false and pass/fail. Linear regression is used when the response variable is
continuous, such as hours, height and weight.
• For example, given data on the time a student spent studying and that student's exam
scores, logistic regression and linear regression predict different things.
• With logistic regression, only specific values or categories are allowed as predictions;
therefore, logistic regression predicts whether the student passed or failed. Since
linear regression predictions are continuous, such as numbers in a range, it can predict
the student's test score on a scale of 0 to 100.
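A minimal sketch of this contrast with scikit-learn (the study-hours data below is made up):

import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

hours  = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])  # hours studied
scores = np.array([20, 30, 35, 45, 55, 60, 75, 85])          # exam scores
passed = (scores >= 40).astype(int)                          # pass/fail labels

lin = LinearRegression().fit(hours, scores)
log = LogisticRegression().fit(hours, passed)

print(lin.predict([[4.5]]))        # a continuous score estimate
print(log.predict([[4.5]]))        # a discrete pass/fail class
print(log.predict_proba([[4.5]]))  # probabilities for fail/pass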
