Analysis of Internet Data
www.datadb.com
The goals of this project are:
Analyze each variable through data summarization to get a basic understanding of the dataset and to prepare for further analysis.
Identify the probable factors in the dataset that could affect exits.
Identify the variables that possibly have an effect on time on page.
# To analyze each variable of the data collected through data summarization to get a basic
# understanding of the dataset and to prepare for further analysis.
library(readxl)
InternetDataset <- read_excel("C:/Users/Tanya/Desktop/simplilearn test/Project/Projects for Submission/Internet/Internet/Internet_Dataset.xlsm")
View(InternetDataset)
str(InternetDataset)
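# str() shows the structure of InternetDataset: Continent and Sourcegroup are character columns.
# Convert the character columns to factors. stringsAsFactors = TRUE is set explicitly because
# as.data.frame() no longer converts strings to factors by default from R 4.0.0 onward.
internet <- as.data.frame(unclass(InternetDataset), stringsAsFactors = TRUE)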
str(internet)
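The conversion step above can also be written so that only the character columns are touched. This is a minimal sketch on a made-up data frame (`toy` and its values are assumptions for illustration, not the project data); it shows one common way to turn character columns into factors without relying on the default of `as.data.frame()`.

```r
# Toy stand-in for the real data: column names mirror the dataset,
# the values are invented for illustration only.
toy <- data.frame(
  Continent   = c("EU", "N.America", "EU", "AS"),
  Sourcegroup = c("google", "(direct)", "google", "Others"),
  Visits      = c(1, 0, 2, 1),
  stringsAsFactors = FALSE
)

# Find the character columns and convert only those to factors.
is_chr <- vapply(toy, is.character, logical(1))
toy[is_chr] <- lapply(toy[is_chr], factor)

str(toy)               # Continent and Sourcegroup are now factors
levels(toy$Continent)  # levels are the sorted unique values
```

Numeric columns such as Visits are left unchanged, so the later models still treat them as continuous predictors.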
summary(internet)
# summary() reports the minimum, quartiles, mean, and maximum of each numeric column and
# the level counts of each factor.
# The maximum number of bounces recorded in a single row is 30.
# Visitors from N.America account for the most rows (20043 of 32109).
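The per-level counts that summary() prints for a factor can also be obtained directly, which makes the "most frequent continent" reading explicit. The sketch below uses a small invented factor (`continent` is a hypothetical vector, not the project column) and only base R functions.

```r
# Invented example values; the real column is internet$Continent.
continent <- factor(c("EU", "N.America", "N.America", "AS", "N.America", "EU"))

# Count rows per level and sort so the most frequent level comes first.
counts <- sort(table(continent), decreasing = TRUE)
counts

# Name of the most frequent level and the share of each level.
top_level <- names(counts)[1]
shares <- round(prop.table(counts), 2)
top_level
shares
```

Applied to the real data, the same calls on internet$Continent would give the counts shown in the summary output.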
# Check whether unique page views depend on visits.
cor(internet$Uniquepageviews,internet$Visits)
annova1<-aov(Uniquepageviews~Visits, data=internet)
summary(annova1)
# Visits is strongly associated with Uniquepageviews (correlation about 0.81, p < 2e-16),
# so the team can conclude that unique page views depend on visits.
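With a single numeric predictor, the correlation and the one-term ANOVA describe the same linear relationship: the squared correlation equals the share of the total sum of squares explained by the predictor, and the ANOVA F value equals the squared t value of the slope. In the reported output this checks out, since 0.8144457 squared is about 0.663 and 8052 / (8052 + 4087) is also about 0.663. The sketch below illustrates the identity on simulated data; the data frame `toy` and its values are assumptions, not the project data.

```r
set.seed(1)

# Simulated stand-in: page views rise with visits, plus some noise.
visits <- rpois(200, lambda = 1)
pageviews <- 1 + visits + rpois(200, lambda = 0.2)
toy <- data.frame(Visits = visits, Uniquepageviews = pageviews)

r <- cor(toy$Uniquepageviews, toy$Visits)
fit <- lm(Uniquepageviews ~ Visits, data = toy)

# R squared from the regression and F from the one-term ANOVA.
r_squared <- summary(fit)$r.squared
f_from_aov <- summary(aov(Uniquepageviews ~ Visits, data = toy))[[1]][["F value"]][1]
t_from_lm <- coef(summary(fit))["Visits", "t value"]

c(r_squared_from_cor = r^2, r_squared_from_lm = r_squared)
c(f_from_aov = f_from_aov, t_squared = t_from_lm^2)
```

Both pairs of numbers agree, which is why cor() and aov() lead to the same conclusion here.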
#Find out the probable factors from the dataset, which could affect the exits.
annova2<-aov(Exits~.,data = internet)
summary(annova2)
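# In the ANOVA table, Bounces, Uniquepageviews, and Timeinpage have by far the largest
# F values; Continent and Sourcegroup are also significant (p < 0.001), while Visits is only
# weakly significant (p = 0.0251). aov() reports sequential sums of squares, so each term is
# tested after the terms listed above it. Exits therefore appear to be driven mainly by
# Bounces and Uniquepageviews, with smaller contributions from Timeinpage, Sourcegroup,
# and Continent.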
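Because aov() uses sequential sums of squares, the row for a predictor can change when the order of terms in the formula changes, which matters when predictors are correlated (as Visits and Uniquepageviews are). The sketch below is an illustration on simulated data, with `x1`, `x2`, and `y` as hypothetical variables; it contrasts the order-dependent aov() table with drop1(), which tests each term after all the others. It is offered as a cross-check, not as the method used in this project.

```r
set.seed(42)
n <- 300
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.5)   # deliberately correlated with x1
y  <- 2 * x1 + rnorm(n)
toy <- data.frame(y, x1, x2)

# Same model, two term orders: the sequential sum of squares for x1 differs.
seq_a <- summary(aov(y ~ x1 + x2, data = toy))[[1]]
seq_b <- summary(aov(y ~ x2 + x1, data = toy))[[1]]
x1_first <- seq_a[["Sum Sq"]][1]   # x1 entered first
x1_last  <- seq_b[["Sum Sq"]][2]   # x1 entered after x2
c(x1_first = x1_first, x1_last = x1_last)

# drop1() tests each term given all others, so term order does not matter.
marginal_a <- drop1(lm(y ~ x1 + x2, data = toy), test = "F")
marginal_b <- drop1(lm(y ~ x2 + x1, data = toy), test = "F")
marginal_a
```

If the ranking of predictors changes between the two views, the sequential table alone should not be used to decide which variables matter most.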
#Find the variables which possibly have an effect on the time on page.
annova3<-aov(Timeinpage~.,data = internet)
summary(annova3)
# From annova3, every variable except Sourcegroup (p = 0.202) has a significant effect on time on page.
# Determine the factors that affect bounces.
# A binomial model needs a response between 0 and 1, so Bounces is scaled by 0.01 first.
# glm() warns about non-integer successes because the scaled values are fractions, not 0/1 outcomes.
internet$Bounces <- internet$Bounces * 0.01
impactfactors <- glm(Bounces ~ Timeinpage + Continent + Exits + Sourcegroup + Uniquepageviews + Visits,
                     data = internet, family = "binomial")
summary(impactfactors)
# In the fitted model, Exits, Uniquepageviews, and Visits are the significant predictors of
# Bounces (p < 0.001). Timeinpage is only marginal (p = 0.0746), and no Continent or
# Sourcegroup level is significant.
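The "non-integer #successes" warning appears because the binomial family expects 0/1 outcomes or counts of successes, while the scaled Bounces values are fractions. One commonly used alternative for a fractional response is the quasibinomial family, which fits the same logit mean model, estimates a dispersion parameter, and does not raise that warning. The sketch below swaps in that technique on simulated data; `bounce_rate`, `exits`, and the coefficients used to simulate them are assumptions for illustration, and this is not the model fitted in the project.

```r
set.seed(7)
n <- 500

# Simulated stand-in: a fractional response whose logit rises with exits.
exits <- rpois(n, lambda = 1)
p <- plogis(-3 + 0.8 * exits)
bounce_rate <- rbinom(n, size = 100, prob = p) / 100   # fractions in [0, 1]
toy <- data.frame(bounce_rate, exits)

# Quasibinomial: same link and mean structure as binomial, with the
# dispersion estimated from the data instead of fixed at 1.
fit_q <- glm(bounce_rate ~ exits, data = toy, family = quasibinomial)

summary(fit_q)$coefficients
summary(fit_q)$dispersion
```

The coefficient estimates match those of a binomial fit to the same fractions, but the standard errors are rescaled by the estimated dispersion, so significance statements can differ.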
Output of Codes
> library(readxl)
> InternetDataset <- read_excel("C:/Users/Tanya/Desktop/simplilearn test/Project/Projects for Submission/Internet/Internet/Internet_Dataset.xlsm")
> View(InternetDataset)
> str(InternetDataset)
tibble [32,109 x 8] (S3: tbl_df/tbl/data.frame)
$ Bounces : num [1:32109] 0 0 0 0 0 0 0 0 0 0 ...
$ Exits : num [1:32109] 0 0 0 0 0 0 0 0 0 0 ...
$ Continent : chr [1:32109] "OC" "N.America" "N.America" "N.America" ...
$ Sourcegroup : chr [1:32109] "(direct)" "(direct)" "Others" "public.tableausoftware.com"
$ Timeinpage : num [1:32109] 18 4 35 70 81 75 186 710 712 344 ...
$ Uniquepageviews: num [1:32109] 1 1 1 1 1 1 1 1 1 1 ...
$ Visits : num [1:32109] 0 0 0 0 0 0 0 0 1 1 ...
$ BouncesNew : num [1:32109] 0 0 0 0 0 0 0 0 0 0 ...
> internet<-as.data.frame(unclass(InternetDataset))
> str(internet)
'data.frame': 32109 obs. of 8 variables:
$ Bounces : num 0 0 0 0 0 0 0 0 0 0 ...
$ Exits : num 0 0 0 0 0 0 0 0 0 0 ...
$ Continent : Factor w/ 6 levels "AF","AS","EU",..: 5 4 4 4 4 4 4 4 5 2 ...
$ Sourcegroup : Factor w/ 9 levels "(direct)","facebook",..: 1 1 4 5 5 5 5 1 1 4 ...
$ Timeinpage : num 18 4 35 70 81 75 186 710 712 344 ...
$ Uniquepageviews: num 1 1 1 1 1 1 1 1 1 1 ...
$ Visits : num 0 0 0 0 0 0 0 0 1 1 ...
$ BouncesNew : num 0 0 0 0 0 0 0 0 0 0 ...
> summary(internet)
    Bounces           Exits            Continent                        Sourcegroup
 Min.   : 0.000   Min.   : 0.000   AF       :  321   google                    :11542
 1st Qu.: 0.000   1st Qu.: 1.000   AS       : 3171   (direct)                  : 7532
 Median : 1.000   Median : 1.000   EU       : 6470   Others                    : 5360
 Mean   : 0.713   Mean   : 0.906   N.America:20043   tableausoftware.com       : 2388
 3rd Qu.: 1.000   3rd Qu.: 1.000   OC       : 1356   t.co                      : 2249
 Max.   :30.000   Max.   :36.000   SA       :  748   public.tableausoftware.com: 1354
                                                     (Other)                   : 1684
   Timeinpage       Uniquepageviews      Visits          BouncesNew
 Min.   :    0.00   Min.   : 1.000   Min.   : 0.000   Min.   :0.00000
 1st Qu.:    0.00   1st Qu.: 1.000   1st Qu.: 1.000   1st Qu.:0.00000
 Median :    0.00   Median : 1.000   Median : 1.000   Median :0.01000
 Mean   :   73.18   Mean   : 1.114   Mean   : 0.906   Mean   :0.00713
 3rd Qu.:   10.00   3rd Qu.: 1.000   3rd Qu.: 1.000   3rd Qu.:0.01000
 Max.   :46745.00   Max.   :45.000   Max.   :45.000   Max.   :0.30000
> # summary() reports the minimum, quartiles, mean, and maximum of each numeric column and
> # the level counts of each factor.
> # The maximum number of bounces recorded in a single row is 30.
> # Visitors from N.America account for the most rows (20043 of 32109).
> # Check whether unique page views depend on visits.
> cor(internet$Uniquepageviews,internet$Visits)
[1] 0.8144457
> annova1<-aov(Uniquepageviews~Visits, data=internet)
> summary(annova1)
               Df Sum Sq Mean Sq F value Pr(>F)
Visits          1   8052    8052   63257 <2e-16 ***
Residuals   32107   4087       0
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> # Visits is strongly associated with Uniquepageviews (correlation about 0.81, p < 2e-16),
> # so the team can conclude that unique page views depend on visits.
> #Find out the probable factors from the dataset, which could affect the exits.
> annova2<-aov(Exits~.,data = internet)
> summary(annova2)
                   Df Sum Sq Mean Sq   F value   Pr(>F)
Bounces             1  10578   10578 1.043e+05  < 2e-16 ***
Continent           5      3       1 5.960e+00 1.62e-05 ***
Sourcegroup         8      7       1 8.760e+00 4.89e-12 ***
Timeinpage          1    130     130 1.279e+03  < 2e-16 ***
Uniquepageviews     1   1573    1573 1.552e+04  < 2e-16 ***
Visits              1      1       1 5.014e+00   0.0251 *
Residuals       32091   3254       0
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
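> # In the ANOVA table, Bounces, Uniquepageviews, and Timeinpage have by far the largest
> # F values; Continent and Sourcegroup are also significant (p < 0.001), while Visits is only
> # weakly significant (p = 0.0251). aov() reports sequential sums of squares, so each term is
> # tested after the terms listed above it. Exits therefore appear to be driven mainly by
> # Bounces and Uniquepageviews, with smaller contributions from Timeinpage, Sourcegroup,
> # and Continent.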
> #Find the variables which possibly have an effect on the time on page.
> annova3<-aov(Timeinpage~.,data = internet)
> summary(annova3)
                   Df    Sum Sq   Mean Sq  F value   Pr(>F)
Bounces             1 5.947e+07  59466495  422.868  < 2e-16 ***
Exits               1 1.304e+08 130400662  927.283  < 2e-16 ***
Continent           5 4.767e+06    953431    6.780 2.51e-06 ***
Sourcegroup         8 1.545e+06    193153    1.374    0.202
Uniquepageviews     1 1.791e+08 179133934 1273.826  < 2e-16 ***
Visits              1 1.073e+08 107321113  763.163  < 2e-16 ***
Residuals       32091 4.513e+09    140627
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> internet$Bounces=internet$Bounces*0.01
> impactfactors<-glm(Bounces~Timeinpage+Continent+Exits+Sourcegroup+Uniquepageviews+Visits,data = internet,family = "binomial")
Warning messages:
1: In eval(family$initialize) : non-integer #successes in a binomial glm!
2: glm.fit: fitted probabilities numerically 0 or 1 occurred
> summary(impactfactors)
Call:
glm(formula = Bounces ~ Timeinpage + Continent + Exits + Sourcegroup +
Uniquepageviews + Visits, family = "binomial", data = internet)
Deviance Residuals:
     Min        1Q    Median        3Q       Max
-2.26149  -0.02406   0.00206   0.00895   1.81288
Coefficients:
                                        Estimate Std. Error z value Pr(>|z|)
(Intercept)                           -4.9667681  0.6784678  -7.321 2.47e-13 ***
Timeinpage                            -0.0010294  0.0005774  -1.783   0.0746 .
ContinentAS                            0.0022768  0.6932044   0.003   0.9974
ContinentEU                           -0.0069240  0.6786600  -0.010   0.9919
ContinentN.America                     0.0101334  0.6674188   0.015   0.9879
ContinentOC                            0.0201123  0.7333671   0.027   0.9781
ContinentSA                            0.0237507  0.7914250   0.030   0.9761
Exits                                  1.3907608  0.3356504   4.143 3.42e-05 ***
Sourcegroupfacebook                   -0.0241949  1.1045171  -0.022   0.9825
Sourcegroupgoogle                     -0.0783631  0.1720157  -0.456   0.6487
SourcegroupOthers                     -0.0767919  0.2182692  -0.352   0.7250
Sourcegrouppublic.tableausoftware.com -0.2528285  0.4923123  -0.514   0.6076
Sourcegroupreddit.com                 -0.0092792  0.4709304  -0.020   0.9843
Sourcegroupt.co                        0.0148690  0.2760157   0.054   0.9570
Sourcegrouptableausoftware.com        -0.1129305  0.3190762  -0.354   0.7234
Sourcegroupvisualisingdata.com        -0.0822525  0.4614866  -0.178   0.8585
Uniquepageviews                       -3.2363108  0.5791664  -5.588 2.30e-08 ***
Visits                                 2.1941121  0.5202216   4.218 2.47e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> # In the fitted model, Exits, Uniquepageviews, and Visits are the significant predictors of
> # Bounces (p < 0.001). Timeinpage is only marginal (p = 0.0746), and no Continent or
> # Sourcegroup level is significant.
>