Regression and Introduction To Bayesian Network

These notes cover the Regression and Introduction to Bayesian Network topic from the AKTU Data Analytics syllabus.


Linear regression is used to find a linear relationship between a target and one or more

predictors. There are two types of linear regression: Simple and Multiple.

Simple Linear Regression

Simple linear regression is useful for finding the relationship between two continuous variables:
one is the predictor or independent variable and the other is the response or dependent variable.
It captures a statistical relationship, not a deterministic one. The relationship between two
variables is said to be deterministic if one variable can be expressed exactly in terms of the other;
for example, a temperature in degrees Celsius can be converted exactly to Fahrenheit. A statistical
relationship does not determine one variable exactly from the other; for example, the relationship
between height and weight.

The core idea is to obtain a line that best fits the data. The best-fit line is the one for which the
total prediction error (over all data points) is as small as possible. The error for a point is its
distance from the regression line.

Simple Regression Calculation

To calculate the best-fit line, linear regression uses the traditional slope-intercept form given
below,

Yi = β0 + β1Xi

where Yi = dependent variable, β0 = constant/intercept, β1 = slope coefficient, Xi = independent

variable.

This algorithm models the linear relationship between the dependent (output) variable Y and the
independent (predictor) variable X using the straight line Y = B0 + B1X.
But how does linear regression find the best-fit line?

Various methods can be used:

 Closed form equation

 Matrix inversion
 Gradient descent
 Stochastic gradient descent
 Hill climbing
 Matrix decomposition

Closed form

For simple linear regression, the least-squares slope and intercept can be computed directly from the data:

β1 = Σ(Xi − X̄)(Yi − Ȳ) / Σ(Xi − X̄)² ,  β0 = Ȳ − β1X̄
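As a minimal sketch (added here for illustration, not part of the original notes), the following Python snippet computes these closed-form estimates; the arrays x and y are hypothetical sample data standing in for your own observations.

import numpy as np

# Hypothetical sample data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Closed-form (least squares) estimates of the slope and intercept
beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()
y_pred = beta0 + beta1 * x     # predictions from Yi = beta0 + beta1 * Xi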
Evaluation Metrics for Linear Regression

The strength of any linear regression model can be assessed using various evaluation metrics.
These evaluation metrics usually provide a measure of how well the observed outputs are being
generated by the model.

The most used metrics are,

 Coefficient of Determination or R-Squared (R2)

 Root Mean Squared Error (RMSE) and Residual Standard Error (RSE)

Coefficient of Determination or R-Squared (R2)

R-Squared is a number that explains the amount of variation that is explained/captured by the
developed model. It always ranges between 0 and 1. Overall, the higher the value of R-squared,
the better the model fits the data.

Mathematically it can be represented as,

R2 = 1 – ( RSS/TSS )

Residual Sum of Squares (RSS) is defined as the sum of the squares of the residuals over all data
points in the plot/data. It is the measure of the difference between the expected and the actual
observed output:

RSS = Σ (yi − ŷi)²

Total Sum of Squares (TSS) is defined as the sum of the squared deviations of the data points
from the mean of the response variable:

TSS = Σ (yi − ȳ)²

where ȳ is the mean of the sample data points and ŷi is the model's prediction for the i-th data point.


Root Mean Squared Error

The Root Mean Squared Error is the square root of the mean of the squared residuals (equivalently,
of the variance of the residuals when their mean is zero). It specifies the absolute fit of the model
to the data, i.e. how close the observed data points are to the predicted values. Mathematically it
can be represented as,

RMSE = √( Σ (yi − ŷi)² / n )
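As a small illustration (added here, not in the original notes), both metrics can be computed directly from the observed values and the model's predictions; the numbers below are hypothetical.

import numpy as np

def r_squared(y, y_pred):
    rss = np.sum((y - y_pred) ** 2)          # residual sum of squares
    tss = np.sum((y - np.mean(y)) ** 2)      # total sum of squares
    return 1 - rss / tss

def rmse(y, y_pred):
    return np.sqrt(np.mean((y - y_pred) ** 2))

# Example with hypothetical values
y = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.8, 5.1, 7.3, 8.9])
print(r_squared(y, y_pred), rmse(y, y_pred))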

Assumptions of Linear Regression

Regression is a parametric approach, which means that it makes assumptions about the data for
the purpose of analysis. For successful regression analysis, it’s essential to validate the following
assumptions.

Linearity of residuals: There needs to be a linear relationship between the dependent variable
and independent variable(s).

Independence of residuals: The error terms should not be dependent on one another; there
should be no correlation between the residual terms. The presence of such correlation is known
as autocorrelation.
Normal distribution of residuals: The residuals should follow a normal distribution with a mean
equal to zero or close to zero. This is checked in order to confirm whether the selected line is
actually the line of best fit. If the error terms are non-normally distributed, it suggests that there
are a few unusual data points that must be studied closely to build a better model.

The equal variance of residuals: The error terms must have constant variance. This
phenomenon is known as Homoscedasticity.

The presence of non-constant variance in the error terms is referred to as Heteroscedasticity.

Generally, non-constant variance arises in the presence of outliers or extreme leverage values.
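For illustration (this snippet is not from the original notes), the last two assumptions can be checked visually. The Python sketch below, on made-up data, plots residuals against fitted values (the spread should be roughly constant if the errors are homoscedastic) and a residual histogram (which should look roughly normal and centred near zero).

import numpy as np
import matplotlib.pyplot as plt

# Hypothetical data and a simple least-squares fit (for illustration only)
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 100)
y = 2.0 + 3.0 * x + rng.normal(scale=2.0, size=x.size)
beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()
fitted = beta0 + beta1 * x
residuals = y - fitted

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(fitted, residuals)     # no pattern and constant spread -> homoscedastic
ax1.axhline(0, color="red")
ax1.set_xlabel("Fitted values")
ax1.set_ylabel("Residuals")
ax2.hist(residuals, bins=20)       # roughly bell-shaped, centred near zero -> normal residuals
ax2.set_xlabel("Residual")
plt.show()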

Hypothesis in Linear Regression

Once you have fitted a straight line to the data, you need to ask, “Is this straight line a significant
fit for the data?” or “Does the beta coefficient explain the variance in the data?” And here
comes the idea of hypothesis testing on the beta coefficient. The null and alternate hypotheses
in this case are:
H0: B1 = 0

HA: B1 ≠ 0

To test this hypothesis we use a t-test; the test statistic for the beta coefficient is given by,

t = β1 / SE(β1)

where β1 here is the estimated slope and SE(β1) is its standard error.

Assessing the model fit

Some other parameters to assess a model are:

t statistic: It is used to determine the p-value and hence helps in determining whether the
coefficient is significant or not.

F statistic: It is used to assess whether the overall model fit is significant or not. Generally, the
higher the value of the F-statistic, the more significant a model turns out to be.
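As a hedged illustration (not part of the original notes), the following sketch uses the statsmodels library to fit a simple linear regression on made-up data and report these statistics; all numbers are hypothetical.

import numpy as np
import statsmodels.api as sm

# Hypothetical sample data
rng = np.random.default_rng(1)
x = np.linspace(0, 10, 50)
y = 1.5 + 0.8 * x + rng.normal(scale=1.0, size=x.size)

X = sm.add_constant(x)          # adds the intercept (B0) column
model = sm.OLS(y, X).fit()

print(model.tvalues)            # t statistics for B0 and B1
print(model.pvalues)            # p-values: is B1 significantly different from 0?
print(model.fvalue)             # F statistic for the overall model fit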

Multiple Linear Regression

Multiple linear regression is a technique to understand the relationship between a single

dependent variable and multiple independent variables.

The formulation for multiple linear regression is also similar to simple linear regression with

the small change that instead of having one beta variable, you will now have betas for all the
variables used. The formula is given as:

Y = B0 + B1X1 + B2X2 + … + BpXp + ε
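The sketch below (an illustration added here, with hypothetical data) fits such a multiple regression using plain NumPy least squares; x1, x2 and the coefficient values are made-up placeholders.

import numpy as np

# Hypothetical data with two predictors
rng = np.random.default_rng(2)
x1 = rng.normal(size=100)
x2 = rng.normal(size=100)
y = 4.0 + 2.0 * x1 - 1.5 * x2 + rng.normal(scale=0.5, size=100)

X = np.column_stack([np.ones(100), x1, x2])     # intercept column plus the two predictors
beta, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
b0, b1, b2 = beta                               # estimates of B0, B1, B2
y_pred = X @ beta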

Considerations of Multiple Linear Regression

All four assumptions made for Simple Linear Regression still hold true for Multiple Linear
Regression, along with a few additional considerations.
Overfitting: When more and more variables are added to a model, the model may become far
too complex and usually ends up memorizing all the data points in the training set. This
phenomenon is known as the overfitting of a model. This usually leads to high training accuracy
and very low test accuracy.

Multicollinearity: It is the phenomenon where, in a model with several independent variables,
some of those variables are interrelated.

Feature Selection: With more variables present, selecting the optimal set of predictors from the
pool of given features (many of which might be redundant) becomes an important task for
building a relevant and better model.

Multicollinearity

Because multicollinearity makes it difficult to find out which variable is actually contributing
towards the prediction of the response variable, it can lead one to incorrect conclusions about the
effects of a variable on the target variable. Though it does not affect the precision of the
predictions, it is essential to properly detect and deal with the multicollinearity present in the
model, as the removal of any of these correlated variables from the model can cause the coefficient
values to swing wildly and even change signs.

Multicollinearity can be detected using the following methods.

Pairwise Correlations: Checking the pairwise correlations between different pairs of
independent variables can provide useful insights for detecting multicollinearity.

Variance Inflation Factor (VIF): Pairwise correlations may not always be useful, as it is possible
that no single variable can completely explain another variable, yet several variables taken
together could do so. Thus, to check these sorts of relations between variables, one can use VIF.
VIF basically explains the relationship of one independent variable with all the other independent
variables. VIF is given by,

VIFi = 1 / (1 − Ri²)

where Ri² is the R-squared obtained by regressing the i-th variable on the rest of the independent
variables (i.e. representing it as a linear combination of them). The common heuristic for VIF
values is: if VIF > 10 the value is definitely high and the variable should be dropped; if VIF is
around 5 it may be acceptable but should be inspected first; if VIF < 5 it is considered a good value.
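As a hedged sketch (not from the original notes), VIF values can be computed with the statsmodels helper variance_inflation_factor; the columns x1, x2 and x3 below are hypothetical, with x3 deliberately made collinear with x1 so that the large VIF shows up.

import pandas as pd
import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# Hypothetical predictors; x3 is deliberately correlated with x1
rng = np.random.default_rng(3)
df = pd.DataFrame({"x1": rng.normal(size=200), "x2": rng.normal(size=200)})
df["x3"] = df["x1"] * 0.9 + rng.normal(scale=0.1, size=200)

X = add_constant(df)
vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vif)   # x1 and x3 get large VIF values; the constant column can be ignored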

Logistic regression

Logistic regression is the appropriate regression analysis to conduct when the dependent
variable is dichotomous (binary). Like all regression analyses, logistic regression is a
predictive analysis. It is used to describe data and to explain the relationship between one
dependent binary variable and one or more nominal, ordinal, interval or ratio-level
independent variables.

For example, whether a person will survive an accident or not, or whether a student will pass an
exam or not. The outcome can only be yes or no (2 outputs). This regression technique is similar
to linear regression and can be used to predict the probabilities for classification
problems.

Why do we use Logistic Regression rather than Linear Regression?

First, logistic regression is used only when the dependent variable is binary, whereas in linear
regression the dependent variable is continuous.

The second problem is that if we add an outlier to our dataset, the best-fit line in linear
regression shifts to fit that point.

Now, if we use linear regression to find the best-fit line, which aims at minimizing the
distance between the predicted and actual values, we get a straight line through the data.
Here the threshold value is 0.5, which means that if the value of h(x) is greater than 0.5
we predict a malignant tumor (1), and if it is less than 0.5 we predict a benign tumor (0).
Everything seems okay here, but now let's change it a bit: if we add some outliers to the
dataset, the best-fit line shifts towards those points.

Do you see the problem here? The old threshold of 0.5 no longer separates the classes; the
new threshold might have to be around 0.2. To keep our predictions right we had to lower
the threshold value. Hence we can say that linear regression is prone to outliers: only if
h(x) greater than 0.2 is used as the cut-off does this regression give correct outputs.

Another problem with linear regression is that the predicted values may be out of range.
We know that probability can be between 0 and 1, but if we use linear regression this
probability may exceed 1 or go below 0.

To overcome these problems we use Logistic Regression, which converts this straight best
fit line in linear regression to an S-curve using the sigmoid function, which will always
give values between 0 and 1.
Logistic regression works in the following steps:

1. Prepare the data: The data should be in a format where each row represents a
single observation and each column represents a different variable. The target
variable (the variable you want to predict) should be binary (yes/no, true/false, 0/1).
2. Train the model: We teach the model by showing it the training data. This involves
finding the values of the model parameters that minimize the error in the training
data.
3. Evaluate the model: The model is evaluated on the held-out test data to assess its
performance on unseen data.
4. Use the model to make predictions: After the model has been trained and
assessed, it can be used to forecast outcomes on new data.
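Putting the four steps together, here is a minimal, hedged sketch using scikit-learn on made-up data; the single feature (tumor size) and all numbers are hypothetical, chosen only to mirror the tumor example above.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Step 1: prepare hypothetical data (one feature, binary target)
rng = np.random.default_rng(4)
size = rng.uniform(0, 10, 300).reshape(-1, 1)
y = (size.ravel() + rng.normal(scale=1.5, size=300) > 5).astype(int)

# Steps 2-4: train, evaluate, predict
X_train, X_test, y_train, y_test = train_test_split(size, y, test_size=0.25, random_state=0)
clf = LogisticRegression().fit(X_train, y_train)
probs = clf.predict_proba(X_test)[:, 1]     # sigmoid output, always between 0 and 1
preds = (probs >= 0.5).astype(int)          # 0.5 threshold, as in the text
print(accuracy_score(y_test, preds))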

Bayesian network

A Bayesian network is a probabilistic graphical model which represents a set of variables and
their conditional dependencies using a directed acyclic graph.
It is also called a Bayes network, belief network, decision network, or Bayesian model.
Bayesian networks are probabilistic, because these networks are built from a probability
distribution, and also use probability theory for prediction and anomaly detection.
Real world applications are probabilistic in nature, and to represent the relationship between
multiple events, we need a Bayesian network. It can also be used in various tasks including
prediction, anomaly detection, diagnostics, automated insight, reasoning, time series
prediction, and decision making under uncertainty.
Bayesian Network can be used for building models from data and experts’ opinions, and it
consists of two parts:

 Directed Acyclic Graph


 Table of conditional probabilities.

Each node corresponds to a random variable, and a variable can be continuous or discrete.
Arcs, or directed arrows, represent the causal relationships or conditional dependencies between
random variables. These directed links or arrows connect pairs of nodes in the graph.
A link indicates that one node directly influences the other; if there is no
directed link between two nodes, they are independent of each other.
 In the above diagram, A, B, C, and D are random variables represented by the nodes
of the network graph.
 If we are considering node B, which is connected with node A by a directed arrow,
then node A is called the parent of Node B.
 Node C is independent of node A.
Each node in the Bayesian network has a conditional probability distribution P(Xi | Parents(Xi)),
which quantifies the effect of the parents on that node.
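As a minimal sketch (added for illustration, not from the original notes), the snippet below encodes a toy two-node network A → B with hypothetical conditional probability tables and uses the factorization P(A, B) = P(A) · P(B | A) to compute a marginal probability.

# A -> B with hypothetical conditional probability tables (CPTs)
p_a = {True: 0.3, False: 0.7}                      # P(A)
p_b_given_a = {                                    # P(B | A)
    True:  {True: 0.9, False: 0.1},
    False: {True: 0.2, False: 0.8},
}

def joint(a, b):
    # Factorization used by the network: P(A, B) = P(A) * P(B | A)
    return p_a[a] * p_b_given_a[a][b]

# Marginal P(B = True), obtained by summing the joint over A
p_b_true = sum(joint(a, True) for a in (True, False))
print(p_b_true)    # 0.3*0.9 + 0.7*0.2 = 0.41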
