02 Regression and Classification Problems

The document discusses regression and classification problems in data analysis, defining regression as supervised learning for continuous output and classification for discrete output. It explains linear regression types, evaluation metrics like Mean Squared Error and Root Mean Squared Error, and introduces logistic regression for predicting categorical outcomes. The document also provides examples of parameter estimation and application of logistic regression in predicting outcomes based on specific data.


Regression and Classification Problems

The expression multivariate analysis is used to describe analyses of data that are
multivariate in the sense that numerous observations or variables are obtained for each
individual or unit studied.
Regression Problems:
• Supervised learning problems where the output is a continuous value are called regression problems.
• The regression technique is used for predicting a continuous value.
• For example, predicting the price of a house based on its characteristics, or estimating the CO2 emission from a car's engine.

Regression Analysis

In statistical modeling, regression analysis is a set of statistical processes for estimating the relationships among variables.
Regression analysis is a predictive modelling technique. It estimates the relationship between the input variables (x) and the output variable (y). Regression is the problem of predicting the value Y (or response) given the values of the input variables x1, x2, ..., xp (or predictors).
• In linear regression, we assume that the function f(X) corresponding to the relationship Y = f(x1, x2, ..., xp) is linear.
• The task is to find coefficients for the linear model (parameter estimation).
There are two types of Linear Regression models:
Simple Linear Regression:
• When there is a single input variable (x), the method is referred to as simple
linear regression.
• Predict Co2emission using EngineSize of all cars
• Independent variable (x): EngineSize
• Dependent variable (y): Co2emission
Multiple Linear Regression:
• When there are multiple input variables, literature from statistics often refers to
the method as multiple linear regression.
• Predict Co2emission using EngineSize and Cylinders of all cars
• Independent variables (x): EngineSize, Cylinders
• Dependent variable (y): Co2emission
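To make the difference between the two model types concrete, here is a minimal Python sketch; the coefficient values (b0, b1, b2) are made up for illustration, not fitted from any real CO2 dataset:

```python
def predict_simple(engine_size, b0=125.0, b1=39.0):
    """Simple linear regression: one predictor (EngineSize)."""
    return b0 + b1 * engine_size

def predict_multiple(engine_size, cylinders, b0=110.0, b1=30.0, b2=7.5):
    """Multiple linear regression: two predictors (EngineSize, Cylinders)."""
    return b0 + b1 * engine_size + b2 * cylinders

# Estimated CO2 emission for a 2.0 L engine, alone and with cylinder count
print(predict_simple(2.0))       # 125 + 39*2.0 = 203.0
print(predict_multiple(2.0, 4))  # 110 + 30*2.0 + 7.5*4 = 200.0
```

Both models are linear in their parameters; the multiple model simply adds one term per extra predictor.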

Simple Linear Regression


• The simplest mathematical relationship between two variables x and y is a linear
relationship:
y = β0 + β1x
• x: the input, or independent, or predictor, or explanatory variable (usually
known).
• y: the output, or dependent, or response, or study variable.
• Objective: to find out the parameters.
• The points (x1, y1), …, (xn, yn) resulting from n independent observations will then be scattered about the true regression line.

• The simple linear regression model is:

y = β0 + β1x + ε

where:
β0 and β1 are called the parameters of the model,
ε is a random variable called the error term.
Evaluation Metrics in Regression Models:
Evaluation metrics are used to explain the performance of a model. As mentioned,
basically, we can compare the actual values and predicted values, to calculate the
accuracy of our regression model.
A residual is a measure of how far away a point is from the regression line. Simply, it is
the error between a predicted value and the observed actual value.
Mean Squared Error (MSE) is the mean of the squared residuals. It is more popular than mean absolute error because it focuses on large errors: squaring penalizes large errors much more heavily than small ones.

Root Mean Squared Error (RMSE) is the square root of the mean squared error. This
is one of the most popular of the evaluation metrics because root mean squared error is
interpretable in the same units as the response vector or y units, making it easy to relate
its information.
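Both metrics can be computed directly from the residuals; a small Python sketch follows, with made-up actual and predicted values for illustration:

```python
import math

def mse(actual, predicted):
    """Mean Squared Error: mean of the squared residuals."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root Mean Squared Error: same units as the response y."""
    return math.sqrt(mse(actual, predicted))

y_true = [3.0, 5.0, 2.5, 7.0]   # observed values (illustrative)
y_pred = [2.5, 5.0, 4.0, 8.0]   # model predictions (illustrative)

print(mse(y_true, y_pred))   # 0.875
print(rmse(y_true, y_pred))  # ≈ 0.935
```

Because RMSE is the square root of MSE, it is always reported in the same units as y, which is what makes it easy to interpret.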

Estimation of Parameters in Simple Linear Regression using Ordinary Least Squares:

Ordinary Least Squares (OLS) works by minimizing the sum of the squares of the differences between the observed dependent variable in the given dataset and those predicted by the linear function.
This method finds estimators β̂0 and β̂1 for the parameters β0 and β1 that minimize the sum of squared errors ε(β0, β1) over the n observed experiments.
In other words, we minimize the function

ε(β0, β1) = Σ (yi − β0 − β1xi)², summed over i = 1, ..., n,

and find the arguments minimizing the function.


To solve the minimization problem, we can use the following theorem.
Theorem: The minimum of the function ε(β0, β1) is unique and attained when

β̂1 = Σ (xi − X̄)(yi − Ȳ) / Σ (xi − X̄)²
β̂0 = Ȳ − β̂1X̄

where X̄ is the mean of the x values, and Ȳ is the mean of the y values.
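The closed-form estimators above translate directly into code; a minimal Python implementation (the helper name `ols_fit` is illustrative) might look like:

```python
def ols_fit(xs, ys):
    """Closed-form OLS estimates (b0, b1) for simple linear regression."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx            # slope estimate
    b0 = y_bar - b1 * x_bar   # intercept estimate
    return b0, b1

# On data lying exactly on y = 1 + 2x, OLS recovers the line:
print(ols_fit([1.0, 2.0, 3.0], [3.0, 5.0, 7.0]))  # (1.0, 2.0)
```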

Example – Dataset of patients' ages and their blood pressures

Our aim is to find the regression line:

X̄ = 491/10 = 49.1, and Ȳ = 1410/10 = 141


The slope (β1) can be calculated as: β1 = 2335 / 2048.9 ≈ 1.14
The intercept (β0) is calculated as: β0 = 141 − 1.14 × 49.1 = 85.026
• Now substitute the regression coefficients into the regression equation.
• Estimated blood pressure:
Ŷ = 85.026 + 1.14 × age
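The arithmetic in this example can be checked in Python using the summary quantities quoted above (the raw 10-patient table is not reproduced here). Note that the text rounds the slope to 1.14 before computing the intercept, which gives 85.026; keeping full precision gives β0 ≈ 85.04:

```python
sxy = 2335.0       # sum of (x − x̄)(y − ȳ), from the example
sxx = 2048.9       # sum of (x − x̄)², from the example
x_bar = 491 / 10   # mean age = 49.1
y_bar = 1410 / 10  # mean blood pressure = 141

b1 = sxy / sxx            # slope, ≈ 1.14
b0 = y_bar - b1 * x_bar   # intercept, ≈ 85.04 (85.026 if b1 is rounded first)

print(round(b1, 2), round(b0, 2))
print(b0 + b1 * 50)  # estimated blood pressure for a 50-year-old
```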

Classification Problems:
• Problems where the output is a discrete value are called classification problems.
• Classification is the process of predicting a discrete class label, or category.
• For example, whether a cell is benign or malignant, or whether an email is spam or not.
• A classification problem does not necessarily have only two outcomes, i.e., it isn't limited to two classes. For example, handwritten digit recognition (a classification problem) has ten outcomes.

Logistic Regression
Logistic regression is a classification algorithm designed to predict categorical target
labels based on historical feature data. It allows us to predict the probability of a
dependent variable given an input, and a model. Logistic regression can be used for both
binary classification and multi-class classification.
Sigmoid Function
Logistic Regression uses the sigmoid function, also known as the logistic function, to perform classification. The sigmoid function takes any real value and maps it into a value between 0 and 1. The key thing to notice is that no matter what value z you put into the logistic (sigmoid) function, you always get a value between 0 and 1. This means we can take our linear regression output and pass it through the sigmoid function:

σ(z) = 1 / (1 + e^(−z))
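A minimal Python sketch of the sigmoid and its squashing behavior:

```python
import math

def sigmoid(z):
    """Logistic (sigmoid) function: maps any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

print(sigmoid(0))    # 0.5: the decision boundary
print(sigmoid(10))   # close to 1
print(sigmoid(-10))  # close to 0
```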

• Once the coefficients β0, β1, ..., βp are found, we can formulate the algorithm for predicting the class of a new object x with predictors (x1, x2, ..., xp):
1. Calculate the value z = β0 + β1x1 + β2x2 + ... + βpxp
2. Calculate the probability P = 1 / (1 + e^(−z))
3. If P ≥ 0.5, the object x falls into class 1; otherwise, class 0.
(In practice, the choice of the probability cut-off is up to the researcher.)

Let’s apply the logistic regression algorithm to specific data.


• Our data is football statistics. It has three predictors: shots on target (X1), possession (X2), and shots (X3).
• The response Y takes only two values. The value 1 corresponds to a win (class 1), and the value 0 to a loss or draw (class 0).
• The training data provides the following values of the model parameters:
β0 = −0.046, β1 = 0.541, β2 = −0.014, β3 = −0.132.
• We classify the new object z:
z = (1, 40, 3).
• It's a team that had 1 shot on target, 40 percent possession, and 3 shots.
According to the described algorithm, the probability that the team wins equals:

P = 1 / (1 + e^(−(β0 + β1x1 + β2x2 + β3x3)))
  = 1 / (1 + e^(−(−0.046 + 0.541×1 − 0.014×40 − 0.132×3)))
  ≈ 0.39

• Since P < 0.5, the team is predicted not to win (class 0).
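The calculation above can be reproduced in Python with the coefficients given in the example (the function name `win_probability` is illustrative):

```python
import math

# Model parameters from the example
b0, b1, b2, b3 = -0.046, 0.541, -0.014, -0.132

def win_probability(shots_on_target, possession, shots):
    """Probability of class 1 (win) via the logistic model."""
    z = b0 + b1 * shots_on_target + b2 * possession + b3 * shots
    return 1.0 / (1.0 + math.exp(-z))

p = win_probability(1, 40, 3)
print(round(p, 2))  # below the 0.5 cut-off, so predicted class 0 (not a win)
```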
