Regression and Introduction To Bayesian Network

These notes cover the Regression and Introduction to Bayesian Network topic from the AKTU Data Analytics syllabus.


Linear regression is used to find a linear relationship between a target and one or more

predictors. There are two types of linear regression: Simple and Multiple.

Simple Linear Regression

Simple linear regression is useful for finding the relationship between two continuous variables:
one is the predictor or independent variable and the other is the response or dependent variable.
It captures a statistical relationship, not a deterministic one. The relationship between two
variables is said to be deterministic if one variable can be expressed exactly in terms of the other;
for example, a temperature in degrees Celsius can be converted exactly to Fahrenheit. A statistical
relationship does not determine one variable exactly from the other; for example, the relationship
between height and weight.

The core idea is to obtain a line that best fits the data. The best-fit line is the one for which the
total prediction error (over all data points) is as small as possible. The error for a point is its
distance from the regression line.

Simple Regression Calculation

To calculate the best-fit line, linear regression uses the traditional slope-intercept form given
below,

Yi = β0 + β1Xi

where Yi = dependent variable, β0 = constant/intercept, β1 = slope coefficient, Xi = independent

variable.

This algorithm models the linear relationship between the dependent (output) variable Y and the
independent (predictor) variable X using the straight line Y = B0 + B1X.
But how does linear regression find the best-fit line?

Various methods can be used:

 Closed form equation

 Matrix inversion
 Gradient descent
 Stochastic gradient descent
 Hill climbing
 Matrix decomposition

Closed form

For simple linear regression, the least-squares slope and intercept can be computed directly from the data:

β1 = Σ(Xi − X̄)(Yi − Ȳ) / Σ(Xi − X̄)² ,  β0 = Ȳ − β1X̄
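As a minimal sketch (added here for illustration, not part of the original notes), the following Python snippet computes these closed-form estimates; the arrays x and y are hypothetical sample data standing in for your own observations.

import numpy as np

# Hypothetical sample data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Closed-form (least squares) estimates of the slope and intercept
beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()
y_pred = beta0 + beta1 * x     # predictions from Yi = beta0 + beta1 * Xi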
Evaluation Metrics for Linear Regression

The strength of any linear regression model can be assessed using various evaluation metrics.
These evaluation metrics usually provide a measure of how well the observed outputs are being
generated by the model.

The most used metrics are,

 Coefficient of Determination or R-Squared (R2)

 Root Mean Squared Error (RMSE) and Residual Standard Error (RSE)

Coefficient of Determination or R-Squared (R2)

R-Squared is a number that explains the amount of variation that is explained/captured by the
developed model. It always ranges between 0 and 1. Overall, the higher the value of R-squared,
the better the model fits the data.

Mathematically it can be represented as,

R2 = 1 – ( RSS/TSS )

Residual Sum of Squares (RSS) is defined as the sum of the squares of the residuals over all data
points in the plot/data. It is the measure of the difference between the expected and the actual
observed output:

RSS = Σ (yi − ŷi)²

Total Sum of Squares (TSS) is defined as the sum of the squared deviations of the data points
from the mean of the response variable:

TSS = Σ (yi − ȳ)²

where ȳ is the mean of the sample data points and ŷi is the model's prediction for the i-th data point.


Root Mean Squared Error

The Root Mean Squared Error is the square root of the mean of the squared residuals (equivalently,
of the variance of the residuals when their mean is zero). It specifies the absolute fit of the model
to the data, i.e. how close the observed data points are to the predicted values. Mathematically it
can be represented as,

RMSE = √( Σ (yi − ŷi)² / n )
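As a small illustration (added here, not in the original notes), both metrics can be computed directly from the observed values and the model's predictions; the numbers below are hypothetical.

import numpy as np

def r_squared(y, y_pred):
    rss = np.sum((y - y_pred) ** 2)          # residual sum of squares
    tss = np.sum((y - np.mean(y)) ** 2)      # total sum of squares
    return 1 - rss / tss

def rmse(y, y_pred):
    return np.sqrt(np.mean((y - y_pred) ** 2))

# Example with hypothetical values
y = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.8, 5.1, 7.3, 8.9])
print(r_squared(y, y_pred), rmse(y, y_pred))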

Assumptions of Linear Regression

Regression is a parametric approach, which means that it makes assumptions about the data for
the purpose of analysis. For successful regression analysis, it’s essential to validate the following
assumptions.

Linearity of residuals: There needs to be a linear relationship between the dependent variable
and independent variable(s).

Independence of residuals: The error terms should not be dependent on one another; there
should be no correlation between the residual terms. The presence of such correlation is known
as autocorrelation.
Normal distribution of residuals: The residuals should follow a normal distribution with a mean
equal to zero or close to zero. This is checked in order to confirm whether the selected line is
actually the line of best fit. If the error terms are non-normally distributed, it suggests that there
are a few unusual data points that must be studied closely to build a better model.

The equal variance of residuals: The error terms must have constant variance. This
phenomenon is known as Homoscedasticity.

The presence of non-constant variance in the error terms is referred to as Heteroscedasticity.

Generally, non-constant variance arises in the presence of outliers or extreme leverage values.
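For illustration (this snippet is not from the original notes), the last two assumptions can be checked visually. The Python sketch below, on made-up data, plots residuals against fitted values (the spread should be roughly constant if the errors are homoscedastic) and a residual histogram (which should look roughly normal and centred near zero).

import numpy as np
import matplotlib.pyplot as plt

# Hypothetical data and a simple least-squares fit (for illustration only)
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 100)
y = 2.0 + 3.0 * x + rng.normal(scale=2.0, size=x.size)
beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()
fitted = beta0 + beta1 * x
residuals = y - fitted

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(fitted, residuals)     # no pattern and constant spread -> homoscedastic
ax1.axhline(0, color="red")
ax1.set_xlabel("Fitted values")
ax1.set_ylabel("Residuals")
ax2.hist(residuals, bins=20)       # roughly bell-shaped, centred near zero -> normal residuals
ax2.set_xlabel("Residual")
plt.show()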

Hypothesis in Linear Regression

Once you have fitted a straight line to the data, you need to ask, “Is this straight line a significant
fit for the data?” or “Does the beta coefficient explain the variance in the data?” And here
comes the idea of hypothesis testing on the beta coefficient. The null and alternate hypotheses
in this case are:
H0: B1 = 0

HA: B1 ≠ 0

To test this hypothesis we use a t-test; the test statistic for the beta coefficient is given by,

t = β1 / SE(β1)

where β1 here is the estimated slope and SE(β1) is its standard error.

Assessing the model fit

Some other parameters to assess a model are:

t statistic: It is used to determine the p-value and hence helps in determining whether the
coefficient is significant or not.

F statistic: It is used to assess whether the overall model fit is significant or not. Generally, the
higher the value of the F-statistic, the more significant a model turns out to be.
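As a hedged illustration (not part of the original notes), the following sketch uses the statsmodels library to fit a simple linear regression on made-up data and report these statistics; all numbers are hypothetical.

import numpy as np
import statsmodels.api as sm

# Hypothetical sample data
rng = np.random.default_rng(1)
x = np.linspace(0, 10, 50)
y = 1.5 + 0.8 * x + rng.normal(scale=1.0, size=x.size)

X = sm.add_constant(x)          # adds the intercept (B0) column
model = sm.OLS(y, X).fit()

print(model.tvalues)            # t statistics for B0 and B1
print(model.pvalues)            # p-values: is B1 significantly different from 0?
print(model.fvalue)             # F statistic for the overall model fit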

Multiple Linear Regression

Multiple linear regression is a technique to understand the relationship between a single

dependent variable and multiple independent variables.

The formulation for multiple linear regression is also similar to simple linear regression with

the small change that instead of having one beta variable, you will now have betas for all the
variables used. The formula is given as:

Y = B0 + B1X1 + B2X2 + … + BpXp + ε
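The sketch below (an illustration added here, with hypothetical data) fits such a multiple regression using plain NumPy least squares; x1, x2 and the coefficient values are made-up placeholders.

import numpy as np

# Hypothetical data with two predictors
rng = np.random.default_rng(2)
x1 = rng.normal(size=100)
x2 = rng.normal(size=100)
y = 4.0 + 2.0 * x1 - 1.5 * x2 + rng.normal(scale=0.5, size=100)

X = np.column_stack([np.ones(100), x1, x2])     # intercept column plus the two predictors
beta, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
b0, b1, b2 = beta                               # estimates of B0, B1, B2
y_pred = X @ beta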

Considerations of Multiple Linear Regression

All four assumptions made for Simple Linear Regression still hold true for Multiple Linear
Regression, along with a few additional considerations.
Overfitting: When more and more variables are added to a model, the model may become far
too complex and usually ends up memorizing all the data points in the training set. This
phenomenon is known as the overfitting of a model. This usually leads to high training accuracy
and very low test accuracy.

Multicollinearity: It is the phenomenon where, in a model with several independent variables,
some of those variables are interrelated.

Feature Selection: With more variables present, selecting the optimal set of predictors from the
pool of given features (many of which might be redundant) becomes an important task for
building a relevant and better model.

Multicollinearity

Because multicollinearity makes it difficult to find out which variable is actually contributing
towards the prediction of the response variable, it can lead one to incorrect conclusions about the
effects of a variable on the target variable. Though it does not affect the precision of the
predictions, it is essential to properly detect and deal with the multicollinearity present in the
model, as the removal of any of these correlated variables from the model can cause the coefficient
values to swing wildly and even change signs.

Multicollinearity can be detected using the following methods.

Pairwise Correlations: Checking the pairwise correlations between different pairs of
independent variables can provide useful insights for detecting multicollinearity.

Variance Inflation Factor (VIF): Pairwise correlations may not always be useful, as it is possible
that no single variable can completely explain another variable, yet several variables taken
together could do so. Thus, to check these sorts of relations between variables, one can use VIF.
VIF basically explains the relationship of one independent variable with all the other independent
variables. VIF is given by,

VIFi = 1 / (1 − Ri²)

where Ri² is the R-squared obtained by regressing the i-th variable on the rest of the independent
variables (i.e. representing it as a linear combination of them). The common heuristic for VIF
values is: if VIF > 10 the value is definitely high and the variable should be dropped; if VIF is
around 5 it may be acceptable but should be inspected first; if VIF < 5 it is considered a good value.
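As a hedged sketch (not from the original notes), VIF values can be computed with the statsmodels helper variance_inflation_factor; the columns x1, x2 and x3 below are hypothetical, with x3 deliberately made collinear with x1 so that the large VIF shows up.

import pandas as pd
import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# Hypothetical predictors; x3 is deliberately correlated with x1
rng = np.random.default_rng(3)
df = pd.DataFrame({"x1": rng.normal(size=200), "x2": rng.normal(size=200)})
df["x3"] = df["x1"] * 0.9 + rng.normal(scale=0.1, size=200)

X = add_constant(df)
vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vif)   # x1 and x3 get large VIF values; the constant column can be ignored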

Logistic regression

Logistic regression is the appropriate regression analysis to conduct when the dependent
variable is dichotomous (binary). Like all regression analyses, logistic regression is a
predictive analysis. It is used to describe data and to explain the relationship between one
dependent binary variable and one or more nominal, ordinal, interval or ratio-level
independent variables.

For example, whether a person will survive an accident or not, or whether a student will pass an
exam or not. The outcome can only be yes or no (2 outputs). This regression technique is similar
to linear regression and can be used to predict the probabilities for classification
problems.

Why do we use Logistic Regression rather than Linear Regression?

First, logistic regression is used only when the dependent variable is binary, whereas in linear
regression the dependent variable is continuous.

The second problem is that if we add an outlier to our dataset, the best-fit line in linear
regression shifts to fit that point.

Now, if we use linear regression to find the best-fit line, which aims at minimizing the
distance between the predicted and actual values, we get a straight line through the data.
Here the threshold value is 0.5, which means that if the value of h(x) is greater than 0.5
we predict a malignant tumor (1), and if it is less than 0.5 we predict a benign tumor (0).
Everything seems okay here, but now let's change it a bit: if we add some outliers to the
dataset, the best-fit line shifts towards those points.

Do you see the problem here? The old threshold of 0.5 no longer separates the classes; the
new threshold might have to be around 0.2. To keep our predictions right we had to lower
the threshold value. Hence we can say that linear regression is prone to outliers: only if
h(x) greater than 0.2 is used as the cut-off does this regression give correct outputs.

Another problem with linear regression is that the predicted values may be out of range.
We know that probability can be between 0 and 1, but if we use linear regression this
probability may exceed 1 or go below 0.

To overcome these problems we use Logistic Regression, which converts this straight best
fit line in linear regression to an S-curve using the sigmoid function, which will always
give values between 0 and 1.
Logistic regression works in the following steps:

1. Prepare the data: The data should be in a format where each row represents a
single observation and each column represents a different variable. The target
variable (the variable you want to predict) should be binary (yes/no, true/false, 0/1).
2. Train the model: We teach the model by showing it the training data. This involves
finding the values of the model parameters that minimize the error in the training
data.
3. Evaluate the model: The model is evaluated on the held-out test data to assess its
performance on unseen data.
4. Use the model to make predictions: After the model has been trained and
assessed, it can be used to forecast outcomes on new data.
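Putting the four steps together, here is a minimal, hedged sketch using scikit-learn on made-up data; the single feature (tumor size) and all numbers are hypothetical, chosen only to mirror the tumor example above.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Step 1: prepare hypothetical data (one feature, binary target)
rng = np.random.default_rng(4)
size = rng.uniform(0, 10, 300).reshape(-1, 1)
y = (size.ravel() + rng.normal(scale=1.5, size=300) > 5).astype(int)

# Steps 2-4: train, evaluate, predict
X_train, X_test, y_train, y_test = train_test_split(size, y, test_size=0.25, random_state=0)
clf = LogisticRegression().fit(X_train, y_train)
probs = clf.predict_proba(X_test)[:, 1]     # sigmoid output, always between 0 and 1
preds = (probs >= 0.5).astype(int)          # 0.5 threshold, as in the text
print(accuracy_score(y_test, preds))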

Bayesian network

A Bayesian network is a probabilistic graphical model which represents a set of variables and
their conditional dependencies using a directed acyclic graph.
It is also called a Bayes network, belief network, decision network, or Bayesian model.
Bayesian networks are probabilistic, because these networks are built from a probability
distribution, and also use probability theory for prediction and anomaly detection.
Real world applications are probabilistic in nature, and to represent the relationship between
multiple events, we need a Bayesian network. It can also be used in various tasks including
prediction, anomaly detection, diagnostics, automated insight, reasoning, time series
prediction, and decision making under uncertainty.
Bayesian Network can be used for building models from data and experts’ opinions, and it
consists of two parts:

 Directed Acyclic Graph


 Table of conditional probabilities.

Each node corresponds to a random variable, and a variable can be continuous or discrete.
Arcs, or directed arrows, represent the causal relationships or conditional dependencies between
random variables. These directed links or arrows connect pairs of nodes in the graph.
A link indicates that one node directly influences the other; if there is no
directed link between two nodes, they are independent of each other.
 In the above diagram, A, B, C, and D are random variables represented by the nodes
of the network graph.
 If we are considering node B, which is connected with node A by a directed arrow,
then node A is called the parent of Node B.
 Node C is independent of node A.
Each node in the Bayesian network has a conditional probability distribution P(Xi | Parents(Xi)),
which quantifies the effect of the parents on that node.
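As a minimal sketch (added for illustration, not from the original notes), the snippet below encodes a toy two-node network A → B with hypothetical conditional probability tables and uses the factorization P(A, B) = P(A) · P(B | A) to compute a marginal probability.

# A -> B with hypothetical conditional probability tables (CPTs)
p_a = {True: 0.3, False: 0.7}                      # P(A)
p_b_given_a = {                                    # P(B | A)
    True:  {True: 0.9, False: 0.1},
    False: {True: 0.2, False: 0.8},
}

def joint(a, b):
    # Factorization used by the network: P(A, B) = P(A) * P(B | A)
    return p_a[a] * p_b_given_a[a][b]

# Marginal P(B = True), obtained by summing the joint over A
p_b_true = sum(joint(a, True) for a in (True, False))
print(p_b_true)    # 0.3*0.9 + 0.7*0.2 = 0.41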
