UNIT-2 ML
Linear Regression
• It’s a supervised learning technique.
• Regression is a technique to model the predictive relationship between one or more
independent variables and one dependent variable.
• The objective of regression is to find the best-fitting curve for the dependent variable as a function of one or more independent variables. With a single independent variable it is called one-dimensional (simple) regression analysis; with more than one it is called multidimensional (multiple) regression analysis.
• The curve could be a straight line or a non-linear curve.
• The quality of the fit is measured by calculating the coefficient of correlation (r). Depending on the value of r, the quality of the regression is judged.
• The coefficient of correlation is the square root of the proportion of variance explained by the curve (i.e. of the coefficient of determination, R²).
Possible Cases
• For the given cases two variables x and y are shown. In the first case, as x increases y also increases, and a straight line can be drawn that fits the data.
• In the second case, as x increases y decreases. Here too a straight line can be drawn that fits the data set well.
• In the third case a straight line cannot be drawn, but a curve can be fitted to the given data set. This is called curvilinear regression analysis. The fourth case is also an example of curvilinear regression analysis.
• In the last two cases neither a straight line nor a curve can be drawn, so no relationship can be established.
Correlation
• For two numeric variables x and y, the correlation coefficient can be calculated as

r = (n Σxy − Σx Σy) / √[(n Σx² − (Σx)²) · (n Σy² − (Σy)²)]
Example 1: find the regression line for the following data.

Step 1: Tabulate x, y, xy and x²

S.No.   x     y     xy      x²
1       43    99    4257    1849
2       21    65    1365    441
3       25    79    1975    625
4       42    75    3150    1764
5       57    87    4959    3249
6       59    81    4779    3481
Sum     247   486   20485   11409
Step 2: Find the values of b0 and b1
b1 = (n Σxy − Σx Σy) / (n Σx² − (Σx)²) = (6×20485 − 247×486) / (6×11409 − 247²) = 2868/7445 = 0.385225
b0 = ȳ − b1·x̄ = 81 − 0.385225 × 41.167 = 65.14
Step 3: Insert the values in the equation
Yi = b0+b1Xi
Y = 65.14+0.385225X
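As a quick check, a small Python sketch (assuming the same six data points from the table) that recomputes b0 and b1 from the summary sums:

```python
# Least-squares coefficients for Example 1, computed from the tabulated sums.
x = [43, 21, 25, 42, 57, 59]
y = [99, 65, 79, 75, 87, 81]

n = len(x)
sum_x = sum(x)                                   # 247
sum_y = sum(y)                                   # 486
sum_xy = sum(xi * yi for xi, yi in zip(x, y))    # 20485
sum_x2 = sum(xi ** 2 for xi in x)                # 11409

# b1 = (n*Σxy − Σx*Σy) / (n*Σx² − (Σx)²)
b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
# b0 = ȳ − b1*x̄
b0 = sum_y / n - b1 * sum_x / n

print(round(b0, 2), round(b1, 6))   # 65.14 0.385225
```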
Example 2 (n = 4, Σx = 20, Σy = 25, Σxy = 144, Σx² = 120):
• b1 (slope) = (4×144 − 20×25) / (4×120 − 20²) = 76/80 = 0.95
• b0 (intercept) = (25×120 − 20×144) / (4×120 − 20²) = 120/80 = 1.5
• y = 0.95x + 1.5
Coefficient of determination (R²)
• The determination coefficient tells the goodness of fit of the model. The value of R² lies in the range 0 to 1. It is computed as

R² = 1 − Σ(yᵢ − yₚᵢ)² / Σ(yᵢ − ȳ)²

• where yᵢ is the observed output, yₚᵢ is the predicted output and ȳ is the mean of the observed output.
• In the given figure, R² = 0.9 indicates a good-fit model, as the observed and predicted values are almost equal (small error).
• A model with R² = 0.2 is not a good fit: the observed and predicted values are far apart (large error), and the model is unable to fit the data points.
For the dataset given in Example 2, determine the value of R²:
• y = 0.95x + 1.5 (regression model)
• ȳ = 6.25 (mean of y)
• R² = 1 − 8.7/26.748 = 0.67
As determination coefficient is close to 1, so predicted model is good.
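The R² computation can be sketched as a small helper (a hedged sketch; the function name `r_squared` is my own):

```python
def r_squared(x, y, b0, b1):
    """Coefficient of determination R² = 1 − SSE/SST for a fitted line y = b0 + b1*x."""
    y_pred = [b0 + b1 * xi for xi in x]
    y_mean = sum(y) / len(y)
    sse = sum((yi - yp) ** 2 for yi, yp in zip(y, y_pred))   # residual sum of squares
    sst = sum((yi - y_mean) ** 2 for yi in y)                # total sum of squares
    return 1 - sse / sst

# A perfectly fitted line gives R² = 1:
print(r_squared([1, 2, 3], [2, 4, 6], 0, 2))   # 1.0
```

For the worked example above, the residual sums 8.7 and 26.748 give 1 − 8.7/26.748 ≈ 0.67.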
3. The values of x and their corresponding y are given in the table.
(i) Find the regression line for the given data points
(ii) Check whether it is a best-fit line or not
S.No. x y
1 1 3
2 2 4
3 3 2
4 4 4
5 5 5
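One way to work this exercise is to reuse the same formulas in code (a sketch; the rounding choices are mine):

```python
# Exercise 3: fit y = b0 + b1*x to the table, then judge the fit with R².
x = [1, 2, 3, 4, 5]
y = [3, 4, 2, 4, 5]

n = len(x)
sxy = sum(a * b for a, b in zip(x, y))   # Σxy = 58
sx2 = sum(a * a for a in x)              # Σx² = 55

b1 = (n * sxy - sum(x) * sum(y)) / (n * sx2 - sum(x) ** 2)   # slope
b0 = (sum(y) - b1 * sum(x)) / n                              # intercept

# R² = 1 − SSE/SST to judge the fit
y_hat = [b0 + b1 * a for a in x]
y_bar = sum(y) / n
sse = sum((a - p) ** 2 for a, p in zip(y, y_hat))
sst = sum((a - y_bar) ** 2 for a in y)
r2 = 1 - sse / sst

print(round(b1, 2), round(b0, 2), round(r2, 2))   # 0.4 2.4 0.31
```

The low R² (≈ 0.31) suggests the line is not an especially good fit for these points.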
Finding the best fit line:
• When working with linear regression, our main goal is to find the best-fit line, i.e. the line for which the error between the predicted values and the actual values is minimized. The best-fit line has the least error.
• Different values of the weights or line coefficients (b0, b1) give different regression lines, so we need to calculate the best values of b0 and b1 to find the best-fit line. To do this we use a cost function.
Cost function
• Different values of the weights or line coefficients (b0, b1) give different regression lines; the cost function is used to estimate the coefficient values for the best-fit line.
• The cost function optimizes the regression coefficients or weights and measures how well a linear regression model is performing.
• We can use the cost function to find the accuracy of the mapping function, which
maps the input variable to the output variable. This mapping function is also known
as Hypothesis function.
• For linear regression, we use the Mean Squared Error (MSE) cost function, which is the average of the squared errors between the predicted values and the actual values. For the linear model yₚᵢ = b0 + b1xᵢ it can be written as

MSE = (1/N) Σᵢ (yᵢ − (b0 + b1xᵢ))²
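A minimal sketch of the MSE cost (the data points here are hypothetical, chosen only for illustration):

```python
# MSE cost for a line y = b0 + b1*x on a small hypothetical dataset.
def mse(x, y, b0, b1):
    n = len(x)
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y)) / n

x = [1, 2, 3, 4]
y = [2, 4, 3, 6]
# Cost of the line y = 0.95x + 1.5 from Example 2 on this data:
print(mse(x, y, 1.5, 0.95))
```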
Gradient Descent
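The notes introduce gradient descent as the way to minimize this cost. A minimal sketch, assuming the MSE cost and a simple line model (the data, learning rate and iteration count are illustrative choices, not from the notes):

```python
# Gradient descent on MSE = (1/n) Σ (yi − (b0 + b1*xi))² for a line y = b0 + b1*x.
x = [1, 2, 3, 4, 5]
y = [3, 4, 2, 4, 5]
n = len(x)

b0, b1 = 0.0, 0.0   # start from zero coefficients
lr = 0.02           # learning rate (illustrative choice)
for _ in range(20000):
    # Partial derivatives of MSE with respect to b0 and b1
    err = [(b0 + b1 * xi) - yi for xi, yi in zip(x, y)]
    g0 = 2 / n * sum(err)
    g1 = 2 / n * sum(e * xi for e, xi in zip(err, x))
    # Step against the gradient
    b0 -= lr * g0
    b1 -= lr * g1

print(round(b0, 3), round(b1, 3))   # converges to ≈ 2.4 and 0.4 (the least-squares solution)
```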
Logistic Regression
The logistic (sigmoid) function squashes a linear combination of the inputs into a probability:

P = 1 / (1 + e^(−y))

Where:
y = β0 + β1x (in case of univariate logistic regression)
y = β0 + β1x1 + β2x2 + … + βnxn (in case of multivariate logistic regression)
Univariate logistic regression means the output variable is predicted using only one predictor variable, while multivariate logistic regression means the output variable is predicted using multiple predictor variables.
The logistic regression function converts the values of logits (also called log-odds), which range from −∞ to +∞, to a range between 0 and 1.
Now let us try to simplify what we said. Let P be the probability of occurrence of an event. Then the probability that the event will not occur is 1 − P.
Odds is defined as the ratio of the probability of occurrence of a particular event to the
probability of the event not occurring.
We know that the logistic regression function gives us a probability value, so we can write:

Odds = P / (1 − P)
Now, since we mentioned log odds, let us take the natural log of both sides of the odds equation and substitute P = 1 / (1 + e^(−(β0 + β1x))):

ln(Odds) = ln(P / (1 − P)) = β0 + β1x

Thus we get a more simplified form of the logistic regression equation, and we can say that the log odds have a linear relationship with the predictor variable x.
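A small sketch showing that the sigmoid maps any logit to (0, 1) and that the log-odds of the result recovers the linear part (the function names are my own):

```python
import math

def sigmoid(z):
    """Map a logit z in (−∞, ∞) to a probability in (0, 1)."""
    return 1 / (1 + math.exp(-z))

def log_odds(p):
    """Natural log of the odds P/(1−P); inverts the sigmoid."""
    return math.log(p / (1 - p))

# The round trip logit → probability → log-odds returns the original logit.
for logit in (-3.0, 0.0, 2.5):
    p = sigmoid(logit)
    assert 0 < p < 1
    assert abs(log_odds(p) - logit) < 1e-9

print(sigmoid(0.0))   # 0.5
```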
Maximum Likelihood Estimation
• The coefficients of logistic regression are estimated by maximum likelihood estimation: the β values are chosen to maximize the probability of the observed class labels, rather than by minimizing a squared error as in linear regression.
Logistic Regression vs Linear Regression
• The main difference between logistic and linear regression is that logistic regression provides a discrete (categorical) output, while linear regression provides a continuous output.
• In logistic regression, the outcome, or dependent variable, has only two possible
values. However, in linear regression, the outcome is continuous, which means that it
can have any one of an infinite number of possible values.
• Logistic regression is used when the response variable is categorical, such as yes/no,
true/false and pass/fail. Linear regression is used when the response variable is
continuous, such as hours, height and weight.
• For example, given data on the time a student spent studying and that student's exam
scores, logistic regression and linear regression can predict different things.
• With logistic regression predictions, only specific values or categories are allowed.
Therefore, logistic regression predicts whether the student passed or failed. Since
linear regression predictions are continuous, such as numbers in a range, it can predict
the student's test score on a scale of 0 to 100.
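A hedged sketch of this contrast, with made-up coefficients (not fitted to any real data):

```python
import math

# The same input (hours studied) feeds a linear model, which predicts a
# continuous score, and a logistic model, which predicts a pass probability
# that is then thresholded into pass/fail. All coefficients are illustrative.
def linear_score(hours, b0=20.0, b1=7.5):
    return b0 + b1 * hours          # continuous output, e.g. a score

def pass_probability(hours, b0=-4.0, b1=1.2):
    z = b0 + b1 * hours             # logit (linear part)
    return 1 / (1 + math.exp(-z))   # squashed to (0, 1)

hours = 6
score = linear_score(hours)                  # 65.0 (a value on the 0-100 scale)
p = pass_probability(hours)                  # ≈ 0.96
label = "pass" if p >= 0.5 else "fail"       # categorical prediction
print(score, round(p, 2), label)
```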