Linear Regression Experiment
Theory: Linear regression is a regression model that uses a straight line to describe the
relationship between variables. It finds the line of best fit through given data by searching for
the value of the regression coefficient(s) that minimizes the total error of the model.
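Concretely, for a single predictor the fitted line has the form y = b0 + b1·x, and the "total error" being minimized is the sum of squared residuals, Σ(y_i − (b0 + b1·x_i))²; this is the ordinary least squares criterion that R's lm() function uses.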
First, install the packages needed for the analysis (this only has to be done once):
install.packages("ggplot2")
install.packages("dplyr")
install.packages("broom")
install.packages("ggpubr")
Next, load the packages into the R environment by running this code (you need to do this every
time you restart R):
library(ggplot2)
library(dplyr)
library(broom)
library(ggpubr)
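The two example datasets then need to be read into R. A minimal sketch, assuming they have been saved as CSV files in the working directory (the file names here are assumptions; point them at the actual files):
# File names are assumed; adjust the paths to where the CSV files are saved
income.data <- read.csv("income.data.csv")
heart.data <- read.csv("heart.data.csv")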
Once the data is loaded, check that it has been read in correctly using summary().
1. Simple regression
summary(income.data)
Because both variables are quantitative, executing this function produces a table with a
numeric summary of the data. It shows the minimum, median, mean, and maximum values of the
independent variable (income) and the dependent variable (happiness):
2. Multiple regression
summary(heart.data)
Again, because the variables are quantitative, running the code produces a numeric
summary of the data for the independent variables (smoking and biking) and the
dependent variable (heart disease):
Step 3: To check whether the dependent variable follows a normal distribution, use
the hist() function.
hist(income.data$happiness)
Step 4: The relationship between the independent and dependent variable must be
linear. Test this visually with a scatter plot and check whether the distribution of data points
could be described with a straight line (a sketch of this plot is given below).
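One way to draw this scatter plot, using the ggplot2 package loaded above (a sketch; base R's plot() would work just as well):
# Scatter plot of happiness against income to check for a roughly linear pattern
ggplot(income.data, aes(x = income, y = happiness)) +
  geom_point()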
Step 5: Use the cor() function to test the relationship between independent variables and
make sure they aren’t too highly correlated.
cor(heart.data$biking, heart.data$smoking)
When this code is executed, the output is 0.015. The correlation between biking and smoking
is small (0.015 is only a 1.5% correlation), so both parameters can be included in the model.
Step 6: Use the hist() function to test whether your dependent variable follows a normal
distribution.
hist(heart.data$heart.disease)
Step 7: The linearity assumption is checked using two scatterplots: one for biking and heart
disease, and one for smoking and heart disease (sketched below).
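A sketch of these two plots, again using ggplot2 (the column names are the same ones used with cor() and hist() above):
# Heart disease against biking
ggplot(heart.data, aes(x = biking, y = heart.disease)) +
  geom_point()
# Heart disease against smoking
ggplot(heart.data, aes(x = smoking, y = heart.disease)) +
  geom_point()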
When the data meet the assumptions, perform a linear regression analysis to evaluate the
relationship between the independent and dependent variables.
Check if there’s a linear relationship between income and happiness in a survey of 500
people with incomes ranging from $15k to $75k, where happiness is measured on a scale of 1
to 10.
To perform a simple linear regression analysis and check the results, run the two lines
of code below. The first line creates the linear model, and the second prints out the
summary of the model:
income.happiness.lm <- lm(happiness ~ income, data = income.data)
summary(income.happiness.lm)
This output table first presents the model equation, then summarizes the model residuals (see
step 4).
The Coefficients section shows:
The estimates (Estimate) for the model parameters – the value of the y-intercept (in this case
0.204) and the estimated effect of income on happiness (0.713).
The final three lines are model diagnostics – the most important thing to note is the p-
value (here it is 2.2e-16, or almost zero), which indicates whether the model fits the data
well.
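The quantities described above can also be pulled out of the fitted model programmatically, for example with the broom package loaded earlier (a sketch):
tidy(income.happiness.lm)    # coefficient estimates, standard errors, t statistics, p-values
glance(income.happiness.lm)  # model-level statistics such as R-squared and the overall p-value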
Conclusion: The above result shows that there is a significant positive relationship between
income and happiness (p-value < 0.001), with a 0.713-unit (+/- 0.01) increase in happiness
for every unit increase in income.
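The same workflow applies to the multiple regression data prepared in Steps 5 to 7; a minimal sketch of the corresponding fit (the model object name is an assumption):
# Fit heart disease as a function of both predictors, then inspect the summary
heart.disease.lm <- lm(heart.disease ~ biking + smoking, data = heart.data)
summary(heart.disease.lm)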