
Univariate and Multivariate Analysis - Jupyter Notebook

This document discusses univariate and multivariate analysis of diabetes patient data. It performs summary statistics on the data, calculates correlations between variables, builds a linear regression model to predict diabetes outcome, and generates box and scatter plots. The analysis finds that higher BMI and glucose levels are associated with increased diabetes risk, while age also positively correlates with outcome.

Uploaded by

AnuvidyaKarthi

3/3/22, 6:59 PM Univariate and Multivariate Analysis - Jupyter Notebook

In [18]:

options(repr.plot.width=8, repr.plot.height = 4, repr.plot.res = 200) #setting R environme

In [19]:

library(tidyverse) #for data manipulation and visualization

In [20]:

data <- read.csv(url("https://datalifex.in/dataml/diabetes.csv"))

In [21]:

str(data)

'data.frame': 768 obs. of 9 variables:

$ Pregnancies : int 6 1 8 1 0 5 3 10 2 8 ...

$ Glucose : int 148 85 183 89 137 116 78 115 197 125 ...

$ BloodPressure : int 72 66 64 66 40 74 50 0 70 96 ...

$ SkinThickness : int 35 29 0 23 35 0 32 0 45 0 ...

$ Insulin : int 0 0 0 94 168 0 88 0 543 0 ...

$ BMI : num 33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...

$ DiabetesPedigreeFunction: num 0.627 0.351 0.672 0.167 2.288 ...

$ Age : int 50 31 32 21 33 30 26 29 53 54 ...

$ Outcome : int 1 0 1 0 1 0 1 0 1 1 ...

In [22]:

head(data)

A data.frame: 6 × 9

Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFuncti

<int> <int> <int> <int> <int> <dbl> <db

1 6 148 72 35 0 33.6 0.6

2 1 85 66 29 0 26.6 0.3

3 8 183 64 0 0 23.3 0.6

4 1 89 66 23 94 28.1 0.1

5 0 137 40 35 168 43.1 2.2

6 5 116 74 0 0 25.6 0.2

Checking for missing (NA) values

localhost:8888/notebooks/Univariate and Multivariate Analysis.ipynb# 1/5



In [23]:

colSums(is.na(data))

Pregnancies: 0 Glucose: 0 BloodPressure: 0 SkinThickness: 0 Insulin: 0 BMI: 0 DiabetesPedigreeFunction: 0 Age: 0 Outcome: 0
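A count of zero NA values does not by itself mean the data is complete. The str() output above already shows zeros in BloodPressure, SkinThickness, Insulin and BMI, which are not plausible measurements. The sketch below assumes (this is an assumption, not something the notebook states) that such zeros stand in for missing readings; it counts them and recodes them as NA. It uses only the ten values per column that str() printed, and the names sample_rows, suspect_cols and cleaned are illustrative.

```r
# First ten values of each column, copied from the str() output above.
sample_rows <- data.frame(
  Glucose       = c(148, 85, 183, 89, 137, 116, 78, 115, 197, 125),
  BloodPressure = c(72, 66, 64, 66, 40, 74, 50, 0, 70, 96),
  SkinThickness = c(35, 29, 0, 23, 35, 0, 32, 0, 45, 0),
  Insulin       = c(0, 0, 0, 94, 168, 0, 88, 0, 543, 0),
  BMI           = c(33.6, 26.6, 23.3, 28.1, 43.1, 25.6, 31, 35.3, 30.5, 0)
)

# Assumption: a zero in these columns is a placeholder for a missing reading.
suspect_cols <- c("Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI")

# Count the zeros in each suspect column.
zero_counts <- colSums(sample_rows[suspect_cols] == 0)
print(zero_counts)

# Recode zeros as NA so that is.na() and na.rm = TRUE treat them as missing.
cleaned <- sample_rows
cleaned[suspect_cols] <- lapply(cleaned[suspect_cols],
                                function(x) replace(x, x == 0, NA))
na_counts <- colSums(is.na(cleaned))
print(na_counts)
```

On the full data frame the same two steps would apply to data[suspect_cols].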

Univariate Analysis
In [24]:

data$Outcome<-as.factor(data$Outcome)
summary(data)

Pregnancies Glucose BloodPressure SkinThickness

Min. : 0.000 Min. : 0.0 Min. : 0.00 Min. : 0.00

1st Qu.: 1.000 1st Qu.: 99.0 1st Qu.: 62.00 1st Qu.: 0.00

Median : 3.000 Median :117.0 Median : 72.00 Median :23.00

Mean : 3.845 Mean :120.9 Mean : 69.11 Mean :20.54

3rd Qu.: 6.000 3rd Qu.:140.2 3rd Qu.: 80.00 3rd Qu.:32.00

Max. :17.000 Max. :199.0 Max. :122.00 Max. :99.00

Insulin BMI DiabetesPedigreeFunction Age

Min. : 0.0 Min. : 0.00 Min. :0.0780 Min. :21.00

1st Qu.: 0.0 1st Qu.:27.30 1st Qu.:0.2437 1st Qu.:24.00

Median : 30.5 Median :32.00 Median :0.3725 Median :29.00

Mean : 79.8 Mean :31.99 Mean :0.4719 Mean :33.24

3rd Qu.:127.2 3rd Qu.:36.60 3rd Qu.:0.6262 3rd Qu.:41.00

Max. :846.0 Max. :67.10 Max. :2.4200 Max. :81.00

Outcome

0:500

1:268
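The Outcome counts show that the two classes are not balanced. A small sketch of the arithmetic, using only the counts printed by summary(data) above (the variable names are illustrative):

```r
# Class counts as reported by summary(data) above.
outcome_counts <- c("0" = 500, "1" = 268)

# Share of each class in percent, rounded to one decimal place.
outcome_share <- round(100 * prop.table(outcome_counts), 1)
print(outcome_share)

# With the data frame loaded, the same figures would come from:
# round(100 * prop.table(table(data$Outcome)), 1)
```

Roughly two thirds of the rows are non-diabetic, which is worth keeping in mind when reading any accuracy figure for a model fitted to this data.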

Bivariate Analysis
In [34]:

data2 <- read.csv(url("https://datalifex.in/dataml/diabetes.csv"))


In [36]:

cor(data2)

A matrix: 9 × 9 of type dbl

Pregnancies Glucose BloodPressure SkinThickness Insulin

Pregnancies 1.00000000 0.12945867 0.14128198 -0.08167177 -0.07353461

Glucose 0.12945867 1.00000000 0.15258959 0.05732789 0.33135711

BloodPressure 0.14128198 0.15258959 1.00000000 0.20737054 0.08893338

SkinThickness -0.08167177 0.05732789 0.20737054 1.00000000 0.43678257

Insulin -0.07353461 0.33135711 0.08893338 0.43678257 1.00000000

BMI 0.01768309 0.22107107 0.28180529 0.39257320 0.19785906

DiabetesPedigreeFunction -0.03352267 0.13733730 0.04126495 0.18392757 0.18507093

Age 0.54434123 0.26351432 0.23952795 -0.11397026 -0.04216295

Outcome 0.22189815 0.46658140 0.06506836 0.07475223 0.13054795
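The Outcome row of the matrix is the part most relevant to the later modelling step. As a sketch, the five coefficients visible in that row can be ranked by absolute size; the remaining columns were cut off in the printed output and are not reproduced here. The name outcome_cor is illustrative.

```r
# Correlations with Outcome, copied from the last row of the matrix above.
outcome_cor <- c(
  Pregnancies   = 0.22189815,
  Glucose       = 0.46658140,
  BloodPressure = 0.06506836,
  SkinThickness = 0.07475223,
  Insulin       = 0.13054795
)

# Rank by absolute value, strongest association first.
ranked <- sort(abs(outcome_cor), decreasing = TRUE)
print(round(ranked, 3))

# With data2 loaded, the full ranking would come from the matrix itself;
# [-1] drops Outcome's correlation with itself:
# sort(abs(cor(data2)[, "Outcome"]), decreasing = TRUE)[-1]
```

Among the columns shown, Glucose has the largest correlation with Outcome and BloodPressure the smallest.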

Correlation indicates:
-> The direction of the relationship between the 2 variables
-> The strength of the relationship between the 2 variables

Regarding the direction of the relationship: a negative correlation implies that the two variables under consideration tend to vary in opposite directions, that is, as one variable increases the other tends to decrease, and vice versa. A positive correlation implies that the two variables tend to vary in the same direction, that is, as one variable increases the other tends to increase, and as one decreases the other tends to decrease as well.

Regarding the strength of the relationship: the more extreme the correlation coefficient (the closer to -1 or 1), the stronger the linear relationship. A correlation close to 0 indicates little or no linear relationship between the two variables, that is, as one variable increases, there is no consistent tendency for the other to either decrease or increase. It does not by itself show that the variables are independent, because a nonlinear relationship can also produce a coefficient near 0.
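Direction and strength can be made concrete with a few synthetic vectors (these values are made up for illustration and do not come from the diabetes data). The last case is a caveat worth seeing once: the coefficient measures linear association only, so a perfectly predictable but curved relationship can still score 0.

```r
# Synthetic, symmetric x values used only for illustration.
x <- -5:5

y_up    <- 2 * x + 1   # rises with x: perfect positive correlation
y_down  <- -3 * x      # falls as x rises: perfect negative correlation
y_curve <- x^2         # fully determined by x, but not linearly

r_up    <- cor(x, y_up)
r_down  <- cor(x, y_down)
r_curve <- cor(x, y_curve)

print(c(up = r_up, down = r_down, curve = r_curve))
```

For two real columns, cor.test(data2$Glucose, data2$Outcome) would additionally report a confidence interval and p-value for the coefficient.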


In [41]:

lm(Outcome ~ Pregnancies+Glucose+BloodPressure+SkinThickness+Insulin+BMI+DiabetesPedigreeFu

Warning message in model.response(mf, "numeric"):

"using type = "numeric" with a factor response will be ignored"

Warning message in Ops.factor(y, z$residuals):

"'-' not meaningful for factors"

Call:

lm(formula = Outcome ~ Pregnancies + Glucose + BloodPressure +

SkinThickness + Insulin + BMI + DiabetesPedigreeFunction +

Age, data = data)

Coefficients:

(Intercept)    Pregnancies               Glucose
0.1461057      0.0205919                 0.0059203

BloodPressure  SkinThickness             Insulin
-0.0023319     0.0001545                 -0.0001805

BMI            DiabetesPedigreeFunction  Age
0.0132440      0.1472374                 0.0026214
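The two warnings appear because Outcome was converted to a factor earlier in the notebook, while lm() expects a numeric response. R stores factor levels as integer codes starting at 1 rather than as 0 and 1, so coefficients from a fit on the factor should be read with care. The sketch below shows two possible ways around this, neither of which is what the notebook ran: converting the response back to 0/1 before lm(), or fitting a logistic regression with glm(), a different technique that is designed for a two-level response. For brevity it uses only the ten (Glucose, Outcome) pairs printed by str() and a single predictor; the names rows, OutcomeNum, linear_fit and logit_fit are illustrative.

```r
# Ten (Glucose, Outcome) pairs copied from the str() output above.
rows <- data.frame(
  Glucose = c(148, 85, 183, 89, 137, 116, 78, 115, 197, 125),
  Outcome = factor(c(1, 0, 1, 0, 1, 0, 1, 0, 1, 1))
)

# A factor stores its levels as integer codes starting at 1, not as 0/1.
codes <- as.integer(rows$Outcome)
print(table(codes))

# Option 1: recover the numeric 0/1 response before calling lm().
rows$OutcomeNum <- as.numeric(as.character(rows$Outcome))
linear_fit <- lm(OutcomeNum ~ Glucose, data = rows)
print(coef(linear_fit))

# Option 2 (swapped-in technique): logistic regression takes the factor
# directly and models the probability of the second level, "1".
logit_fit <- glm(Outcome ~ Glucose, data = rows, family = binomial)
print(coef(logit_fit))

# Fitted values from the logistic model are probabilities between 0 and 1.
probs <- fitted(logit_fit)
print(round(probs, 2))
```

With the full data, the same pattern would extend to all eight predictors, for example glm(Outcome ~ ., data = data, family = binomial).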


Box Plot
In [25]:

p1 <- ggplot(data, aes(x=Outcome, y=BMI, fill=Outcome)) + geom_boxplot()


print(p1)

INFERENCE: Here 0 indicates that diabetes is not present and 1 indicates that diabetes is present. The plot shows that BMI tends to be higher in the diabetic group than in the non-diabetic group, that is, people with a higher BMI appear more prone to diabetes.
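A box plot is drawn from per-group quartiles, so the visual comparison can be backed with numbers. The sketch below computes the median BMI for each outcome group on the ten rows printed by str(); it is only a check on a small sample, not a result for the full data set, and the names rows and bmi_median are illustrative.

```r
# Ten (BMI, Outcome) pairs copied from the str() output above.
rows <- data.frame(
  BMI     = c(33.6, 26.6, 23.3, 28.1, 43.1, 25.6, 31, 35.3, 30.5, 0),
  Outcome = factor(c(1, 0, 1, 0, 1, 0, 1, 0, 1, 1))
)

# Median BMI within each outcome group (the centre line of each box).
bmi_median <- sapply(split(rows$BMI, rows$Outcome), median)
print(bmi_median)

# On the full data frame the equivalent call would be:
# sapply(split(data$BMI, data$Outcome), median)
```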


Scatter Plot
In [29]:

p2 <- ggplot(data, aes(x=Age, y=Glucose, col=Outcome)) + geom_point()

#loess method:local regression fitting

p2 + geom_smooth(method="loess")

`geom_smooth()` using formula 'y ~ x'

INFERENCE: The blue line is the smoothed trend for patients with diabetes (Outcome = 1). It lies above the trend for patients without diabetes, so Glucose levels are higher in the diabetic group.
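The gap between the two smoothed lines can also be summarised as a difference in group means. A minimal sketch on the ten rows printed by str(), using base R's aggregate(); the names rows and glucose_by_outcome are illustrative, and the small sample is for demonstration only.

```r
# Ten (Age, Glucose, Outcome) rows copied from the str() output above.
rows <- data.frame(
  Age     = c(50, 31, 32, 21, 33, 30, 26, 29, 53, 54),
  Glucose = c(148, 85, 183, 89, 137, 116, 78, 115, 197, 125),
  Outcome = factor(c(1, 0, 1, 0, 1, 0, 1, 0, 1, 1))
)

# Mean Glucose within each outcome group.
glucose_by_outcome <- aggregate(Glucose ~ Outcome, data = rows, FUN = mean)
print(glucose_by_outcome)
```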
