Univariate and Multivariate Analysis - Jupyter Notebook
In [21]:
str(data)
$ Glucose : int 148 85 183 89 137 116 78 115 197 125 ...
$ BMI : num 33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...
In [22]:
head(data)
A data.frame: 6 × 9
2 1 85 66 29 0 26.6 0.3
4 1 89 66 23 94 28.1 0.1
In [23]:
colSums(is.na(data))
Univariate Analysis
In [24]:
data$Outcome <- as.factor(data$Outcome)
summary(data)
1st Qu.: 1.000 1st Qu.: 99.0 1st Qu.: 62.00 1st Qu.: 0.00
3rd Qu.: 6.000 3rd Qu.:140.2 3rd Qu.: 80.00 3rd Qu.:32.00
Outcome
0:500
1:268
Bivariate Analysis
In [34]:
In [36]:
cor(data2)
Correlation indicates:
-> The direction of the relationship between the two variables
-> The strength of the relationship between the two variables
Regarding the direction of the relationship: a negative correlation implies that the two variables tend to vary
in opposite directions, that is, when one variable increases the other tends to decrease, and vice versa. A
positive correlation implies that the two variables tend to vary in the same direction, i.e., when one variable
increases the other tends to increase, and when one decreases the other tends to decrease as well.
Regarding the strength of the relationship: The more extreme the correlation coefficient (the closer to -1 or 1),
the stronger the linear relationship. A correlation close to 0 indicates that there is no linear relationship
between the two variables, that is, as one variable increases, there is no linear tendency in the other variable
to either decrease or increase. It does not by itself show that the variables are independent, because a
nonlinear relationship can still produce a correlation near 0.
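The points above can be checked on a small made-up vector. This is only a sketch: `x` and the derived vectors are hypothetical toy values, not columns of the diabetes data.

```r
# Toy vector, symmetric around 0 (hypothetical values, not from the dataset)
x <- -5:5

# Perfect positive linear relationship: coefficient of 1
r_pos <- cor(x, 2 * x + 1)

# Perfect negative linear relationship: coefficient of -1
r_neg <- cor(x, -3 * x)

# y depends entirely on x, yet the linear correlation is essentially 0,
# because the relationship is a symmetric curve rather than a line
r_curve <- cor(x, x^2)

print(c(positive = r_pos, negative = r_neg, curved = r_curve))
```

The last case is why a coefficient near 0 should be read as "no linear relationship" rather than "independent".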
In [41]:
lm(Outcome ~ Pregnancies+Glucose+BloodPressure+SkinThickness+Insulin+BMI+DiabetesPedigreeFu
Call:
Coefficients:
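lm() fits an ordinary least-squares model, which is designed for a numeric response. Since Outcome is a 0/1 variable, a logistic regression with glm() is a commonly used alternative. The sketch below is not the notebook's model: it uses simulated data, and the names `toy`, `glucose` and `outcome` are hypothetical stand-ins.

```r
# Simulated stand-in data (NOT the diabetes dataset)
set.seed(1)
n <- 500
glucose <- rnorm(n, mean = 120, sd = 30)

# Probability of outcome 1 rises with glucose (assumed relationship)
outcome <- factor(rbinom(n, size = 1, prob = plogis(-6 + 0.05 * glucose)))
toy <- data.frame(glucose = glucose, outcome = outcome)

# Logistic regression: models the log-odds of outcome == "1"
fit <- glm(outcome ~ glucose, data = toy, family = binomial)

# Coefficients are on the log-odds scale
print(coef(fit))
```

A positive glucose coefficient here means higher glucose is associated with higher odds of outcome 1 in the simulated data.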
Box Plot
In [25]:
INFERENCE: Here 0 indicates that diabetes is not present and 1 indicates that diabetes is present. The plot
shows that people with a lower BMI are less prone to diabetes; that is, diabetes is more common among people
who are obese.
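A box plot comparison of this kind can be sketched in base R as follows. The values are a small hypothetical sample chosen for illustration, not rows from the dataset, and the notebook's own plotting code may have differed.

```r
# Hypothetical BMI values for two groups (0 = no diabetes, 1 = diabetes)
bmi <- c(22, 25, 27, 30, 31, 33, 36, 40)
outcome <- factor(c(0, 0, 0, 0, 1, 1, 1, 1))

# Median BMI per group: the thick line inside each box
med <- tapply(bmi, outcome, median)
print(med)  # 26 for group 0, 34.5 for group 1

# One box per outcome level
boxplot(bmi ~ outcome, xlab = "Outcome", ylab = "BMI")
```

Comparing the medians and the box positions between the two levels is what supports an inference like the one above.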
Scatter Plot
In [29]:
p2 + geom_smooth(method="loess")
INFERENCE: The blue line represents the cases where diabetes is present, and the Glucose level for this group
is clearly higher than for the non-diabetic group.
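geom_smooth(method="loess") draws a locally weighted regression curve. The same kind of fit can be sketched with base R's loess(), one curve per outcome group. The data below is simulated, and using `age` on the x-axis is an assumption, since the definition of p2 is not shown.

```r
# Simulated stand-in data (NOT the diabetes dataset)
set.seed(42)
n <- 100
toy <- data.frame(
  age     = runif(2 * n, min = 21, max = 70),
  outcome = factor(rep(c(0, 1), each = n))
)
# Assumed group means: 110 for outcome 0, 140 for outcome 1
toy$glucose <- rnorm(2 * n, mean = ifelse(toy$outcome == "1", 140, 110), sd = 20)

# One loess curve per group
fit0 <- loess(glucose ~ age, data = toy[toy$outcome == "0", ])
fit1 <- loess(glucose ~ age, data = toy[toy$outcome == "1", ])

# Evaluate both curves on a grid well inside the observed age range
grid <- data.frame(age = 30:60)
pred0 <- predict(fit0, newdata = grid)
pred1 <- predict(fit1, newdata = grid)

# Average height of each smoothed curve
print(c(no_diabetes = mean(pred0), diabetes = mean(pred1)))
```

If the curve for group 1 sits above the curve for group 0 across the grid, that matches the pattern described in the inference.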