Linear Regression Experiment

Linear Regression

Uploaded by

rekha
Copyright
© All Rights Reserved

Aim: Implement linear regression using R.

Theory: Linear regression is a regression model that uses a straight line to describe the
relationship between variables. It finds the line of best fit through given data by searching for
the value of the regression coefficient(s) that minimizes the total error of the model.
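
As a minimal sketch of what "line of best fit" means for one independent variable, the slope and intercept that minimize the sum of squared errors can be computed directly and compared with lm(). The x and y values below are made-up toy numbers, not taken from either dataset.

```r
# Toy data (hypothetical values, chosen only for illustration)
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Closed-form least-squares estimates for a single independent variable
slope     <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
intercept <- mean(y) - slope * mean(x)

# lm() should arrive at the same line of best fit
fit <- lm(y ~ x)
print(c(intercept = intercept, slope = slope))
print(coef(fit))
```

Both print statements should report the same two numbers, which is a quick way to confirm that lm() is performing an ordinary least-squares fit.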

There are two main types of linear regression:

• Simple linear regression uses only one independent variable.
• Multiple linear regression uses two or more independent variables.
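
In R the two types differ only in the model formula passed to lm(). The sketch below uses simulated data; the names x1, x2 and y and the numbers used to generate them are assumptions made for illustration only.

```r
set.seed(1)                       # make the simulated numbers reproducible
n  <- 100
x1 <- runif(n)
x2 <- runif(n)
y  <- 1 + 2 * x1 - 3 * x2 + rnorm(n, sd = 0.1)   # made-up relationship

simple.lm   <- lm(y ~ x1)         # simple: one independent variable
multiple.lm <- lm(y ~ x1 + x2)    # multiple: two independent variables

print(coef(simple.lm))            # intercept plus one slope
print(coef(multiple.lm))          # intercept plus two slopes
```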

Consider two datasets for implementing Linear Regression in R.


Simple Linear Regression: The first dataset contains observations about income (in a range
of $15k to $75k) and happiness (rated on a scale of 1 to 10) in an imaginary sample of 500
people. The income values are divided by 10,000 to make the income data match the scale of
the happiness scores (so a value of $2 represents $20,000, $3 is $30,000, etc.).
https://cdn.scribbr.com/wp-content/uploads//2020/02/income.data_.zip
Multiple Linear Regression: The second dataset contains observations on the percentage of
people biking to work each day, the percentage of people smoking, and the percentage of
people with heart disease in an imaginary sample of 500 towns.
https://cdn.scribbr.com/wp-content/uploads//2020/02/heart.data_.zip
Steps to implement Linear Regression in R:
Step 1: Open RStudio and click on File > New File > R Script. Then install the packages
needed for the analysis using the following code (you only need to do this once):

install.packages("ggplot2")
install.packages("dplyr")
install.packages("broom")
install.packages("ggpubr")
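
If the script is likely to be re-run, one possible variation is to install only the packages that are not yet present. This is a sketch rather than a required step; the variable names pkgs and missing.pkgs are chosen here for illustration.

```r
# Packages used in this experiment
pkgs <- c("ggplot2", "dplyr", "broom", "ggpubr")

# Compare against what is already installed on this machine
missing.pkgs <- setdiff(pkgs, rownames(installed.packages()))
print(missing.pkgs)

# Install only what is missing, and only from an interactive session
# (a non-interactive run may not have a CRAN mirror configured)
if (length(missing.pkgs) > 0 && interactive()) {
  install.packages(missing.pkgs)
}
```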

Next, load the packages into the R environment by running this code (you need to do this
every time you restart R):

library(ggplot2)
library(dplyr)
library(broom)
library(ggpubr)

Step 2: Load the data into R

Follow these four steps for each dataset:

• In RStudio, go to File > Import Dataset > From Text (base).
• Choose the data file you have downloaded (income.data or heart.data), and an Import
Dataset window pops up.
• In the Data Frame window, you should see an X (index) column and columns listing the
data for each of the variables (income and happiness, or biking, smoking, and heart.disease).
• Click on the Import button, and the file should appear in the Environment tab on the
upper right side of the RStudio screen.
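
The same import can also be scripted with read.csv(). Because the exact name of the file inside each downloaded archive is not stated here, the sketch below writes a tiny stand-in CSV (three made-up rows) to a temporary file and reads it back. It also shows where the X (index) column comes from.

```r
# Hypothetical stand-in for the downloaded file: three made-up rows
demo <- data.frame(income = c(2.0, 4.5, 6.1), happiness = c(1.8, 3.4, 4.6))
tmp  <- tempfile(fileext = ".csv")
write.csv(demo, tmp)              # row names are saved as an unnamed first column

# read.csv() labels that unnamed first column "X"
income.demo <- read.csv(tmp)
str(income.demo)
```

For the real data, replace tmp with the path to the extracted file.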

Once the data is loaded, check that it has been read in correctly using summary().

1. Simple regression

summary(income.data)

Because both variables are quantitative, running this function produces a table with a
numeric summary of the data. It reports the minimum, median, mean, and maximum values
of the independent variable (income) and the dependent variable (happiness).

2. Multiple regression
summary(heart.data)
Again, because the variables are quantitative, running the code produces a numeric
summary of the data for the independent variables (smoking and biking) and the
dependent variable (heart disease).
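
To see the kind of table summary() returns without loading either file, the sketch below applies it to a small stand-in data frame. The four rows are made-up values, not observations from income.data.

```r
# Hypothetical stand-in data frame (made-up values)
demo <- data.frame(income    = c(1.5, 3.2, 4.8, 7.5),
                   happiness = c(1.2, 2.9, 3.6, 5.8))

print(summary(demo))              # one column of statistics per variable

# summary() on a single column gives the same statistics as a named vector
inc <- summary(demo$income)
print(inc)
```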

Step 3: To check whether the dependent variable follows a normal distribution, use
the hist() function.

hist(income.data$happiness)
Step 4: The relationship between the independent and dependent variable must be
linear. Test this visually with a scatter plot to see whether the distribution of data points
could be described with a straight line.

plot(happiness ~ income, data = income.data)

Step 5: Use the cor() function to test the relationship between independent variables and
make sure they aren’t too highly correlated.

cor(heart.data$biking, heart.data$smoking)

When this code is executed, the output is 0.015. The correlation between biking and smoking
is small (0.015 is only a 1.5% correlation), so both parameters can be included in the model.
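
When there are more than two independent variables, it can be convenient to look at all pairwise correlations at once by passing a data frame to cor(). The sketch below uses simulated stand-in data; heart.sim and its value ranges are assumptions, not the real heart.data.

```r
set.seed(42)                      # reproducible simulated data
heart.sim <- data.frame(biking  = runif(500, min = 1,   max = 75),
                        smoking = runif(500, min = 0.5, max = 30))

# Single pairwise correlation, as in the step above
r <- cor(heart.sim$biking, heart.sim$smoking)
print(r)

# Full correlation matrix: the off-diagonal entries are the pairwise values
print(cor(heart.sim))
```

Because the two simulated columns are generated independently, the correlation should be close to zero, which is the situation in which both variables can safely enter the model.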
Step 6: Use the hist() function to test whether your dependent variable follows a normal
distribution.

hist(heart.data$heart.disease)

Step 7: Check linearity using two scatterplots: one for biking and heart
disease, and one for smoking and heart disease.

plot(heart.disease ~ biking, data=heart.data)

plot(heart.disease ~ smoking, data=heart.data)


Step 3: Perform the linear regression analysis

When the data meet the assumptions, perform a linear regression analysis to evaluate the
relationship between the independent and dependent variables.

A. Simple regression: income and happiness

Check if there’s a linear relationship between income and happiness in a survey of 500
people with incomes ranging from $15k to $75k, where happiness is measured on a scale of 1
to 10.

To perform a simple linear regression analysis and check the results, run the below two lines
of code. The first line of code makes the linear model, and the second line prints out the
summary of the model:

income.happiness.lm <- lm(happiness ~ income, data = income.data)

summary(income.happiness.lm)

The output looks like this:

This output table first presents the model equation, then summarizes the model residuals (see
step 4).

The Coefficients section shows:

The estimates (Estimate) for the model parameters – the value of the y-intercept (in this case
0.204) and the estimated effect of income on happiness (0.713).

The standard error of the estimated values (Std. Error).

The test statistic (t value, in this case the t-statistic).


The p-value (Pr(>|t|)), that is, the probability of observing a t-statistic at least this
extreme if the null hypothesis of no relationship were true.
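
The columns of the Coefficients section are related by simple arithmetic, which the sketch below reproduces. It uses simulated stand-in data: income.sim, the 0.2 and 0.7 used to generate it, and the noise level are assumptions for illustration and will not reproduce the estimates quoted above.

```r
set.seed(123)                     # reproducible simulated data
income.sim <- data.frame(income = runif(500, min = 1.5, max = 7.5))
income.sim$happiness <- 0.2 + 0.7 * income.sim$income + rnorm(500, sd = 0.7)

sim.lm     <- lm(happiness ~ income, data = income.sim)
coef.table <- summary(sim.lm)$coefficients   # matrix, one row per parameter

# t value = Estimate / Std. Error
t.manual <- coef.table[, "Estimate"] / coef.table[, "Std. Error"]

# Two-sided p-value from the t distribution with the residual degrees of freedom
p.manual <- 2 * pt(abs(t.manual), df = df.residual(sim.lm), lower.tail = FALSE)

print(coef.table)
print(cbind(t.manual, p.manual))
```

The hand-computed columns should match the "t value" and "Pr(>|t|)" columns of the table.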

The final three lines are model diagnostics. The most important thing to note is the
p-value (reported here as 2.2e-16, or almost zero), which indicates whether the model fits
the data better than a model with no independent variables.
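
The model p-value comes from the F-statistic stored in the summary object. The sketch below recomputes it on simulated stand-in data (income.sim and its generating numbers are assumptions, not the real dataset).

```r
set.seed(123)                     # reproducible simulated data
income.sim <- data.frame(income = runif(500, min = 1.5, max = 7.5))
income.sim$happiness <- 0.2 + 0.7 * income.sim$income + rnorm(500, sd = 0.7)

sim.lm <- lm(happiness ~ income, data = income.sim)
s      <- summary(sim.lm)

# fstatistic holds the F value and its two degrees of freedom
f       <- s$fstatistic
p.model <- pf(f[["value"]], f[["numdf"]], f[["dendf"]], lower.tail = FALSE)

print(c(F = f[["value"]], p.value = p.model, r.squared = s$r.squared))

# Very small p-values are printed relative to this machine precision value
print(.Machine$double.eps)
```

The value 2.2e-16 is the machine precision shown by the last line, so a p-value reported at that level should be read as "smaller than R displays" rather than as an exact figure.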

Conclusion: The above result shows that there is a significant positive relationship between
income and happiness (p-value < 0.001), with a 0.713-unit (+/- 0.01) increase in happiness
for every unit increase in income (one unit being $10,000).
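
One way to attach an uncertainty range to the slope and to read it as "change in happiness per unit of income" is with confint() and predict(). This is a sketch on simulated stand-in data, so income.sim and the printed numbers are assumptions and will differ from the result reported above.

```r
set.seed(123)                     # reproducible simulated data
income.sim <- data.frame(income = runif(500, min = 1.5, max = 7.5))
income.sim$happiness <- 0.2 + 0.7 * income.sim$income + rnorm(500, sd = 0.7)

sim.lm <- lm(happiness ~ income, data = income.sim)

# 95% confidence intervals for the intercept and the slope
ci <- confint(sim.lm, level = 0.95)
print(ci)

# Predicted happiness at two incomes that are one unit apart
pred <- predict(sim.lm, newdata = data.frame(income = c(3, 4)))
print(pred)
print(diff(pred))                 # equals the estimated slope
```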
