Descriptive Analytics I: Nature of Data, Statistical Modeling, and Visualization
• Data reduction
1. Variables
– Dimensionality reduction
– Variable selection
2. Cases/samples
– Sampling
– Balancing / stratification
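The case-reduction ideas above (sampling and stratification) can be sketched in Python. This is a minimal illustration, not a prescribed method: the function name `stratified_sample`, the 20% fraction, and the retention-style labels are all hypothetical.

```python
import random
from collections import defaultdict

def stratified_sample(rows, key, fraction, seed=0):
    """Draw the same fraction of cases from every stratum defined by `key`."""
    rng = random.Random(seed)          # fixed seed so the sketch is repeatable
    strata = defaultdict(list)
    for row in rows:
        strata[key(row)].append(row)   # group cases by their stratum label
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))  # keep at least one case per stratum
        sample.extend(rng.sample(members, k))       # sample without replacement
    return sample

# Hypothetical imbalanced data: 10 "leave" cases and 90 "stay" cases
rows = [{"id": i, "label": "leave" if i < 10 else "stay"} for i in range(100)]
sample = stratified_sample(rows, key=lambda r: r["label"], fraction=0.2)
print(len(sample), sum(r["label"] == "leave" for r in sample))
```

Because every stratum is sampled at the same rate, the rare class keeps its share of the reduced data set, which plain random sampling does not guarantee.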
Data Preprocessing Tasks and Methods (1 of 3)
• Student retention
– Freshmen class
• Why is it important?
• What are the common techniques to deal with student attrition?
• Analytics versus theoretical approaches to the student retention problem
Application Case 2.2 (3 of 6)
• Statistics
– A collection of mathematical techniques to
characterize and interpret data
• Descriptive statistics
– Describing the data (as it is)
• Inferential statistics
– Drawing inferences about the population based on
sample data
• Descriptive statistics for descriptive analytics
Descriptive Statistics Measures of Centrality Tendency
• Arithmetic mean
$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n}$$
• Median
– The middle value when the observations are sorted
• Mode
– The most frequent observation
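The three measures can be computed with Python's standard library. This is a small sketch on a hypothetical sample; the values are made up for illustration.

```python
import statistics

data = [2, 3, 3, 5, 7, 10]         # hypothetical sample, n = 6

mean = sum(data) / len(data)       # (2 + 3 + 3 + 5 + 7 + 10) / 6 = 5.0
median = statistics.median(data)   # even n: average of the two middle values, (3 + 5) / 2 = 4.0
mode = statistics.mode(data)       # 3 occurs twice, more than any other value

print(mean, median, mode)
```

The mean (5.0) sits above the median (4.0) here because the single large value 10 pulls the mean upward, while the median only depends on rank order.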
Descriptive Statistics Measures of Dispersion (1 of 2)
• Dispersion
– Degree of variation in a given variable
• Range
– Max - Min
• Variance
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$
• Standard deviation
$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$
• Mean Absolute Deviation (MAD)
– Average absolute deviation from the mean
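The dispersion measures above can be worked through on a small hypothetical sample. The sketch below follows the slide definitions directly: variance and standard deviation divide by n − 1, while MAD averages the absolute deviations over n.

```python
import math

data = [2, 3, 3, 5, 7, 10]   # hypothetical sample
n = len(data)
mean = sum(data) / n         # 5.0

value_range = max(data) - min(data)                      # Max - Min
variance = sum((x - mean) ** 2 for x in data) / (n - 1)  # s^2, sample variance
std_dev = math.sqrt(variance)                            # s, same units as the data
mad = sum(abs(x - mean) for x in data) / n               # mean absolute deviation from the mean

print(value_range, variance, round(std_dev, 3), round(mad, 3))
```

The squared deviations sum to 46, so the variance is 46 / 5 = 9.2; the absolute deviations sum to 14, so MAD is 14 / 6.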
Descriptive Statistics Measures of Dispersion (2 of 2)
• Quartiles
• Box-and-Whiskers Plot
– a.k.a. box-plot
– Versatile / informative
– Can show variation within a data set
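A box-and-whiskers plot is built from the quartiles. The sketch below computes the numbers behind such a plot for a hypothetical sample. Two assumptions are made here that the slide does not state: outliers are flagged with the common 1.5 × IQR fence rule, and quartiles use the default "exclusive" interpolation of `statistics.quantiles` (other tools, including spreadsheets, may give slightly different quartile values).

```python
import statistics

data = [2, 3, 3, 5, 7, 10, 30]   # hypothetical sample with one extreme value

q1, q2, q3 = statistics.quantiles(data, n=4)   # three cut points: Q1, median, Q3
iqr = q3 - q1                                  # interquartile range (height of the box)

# Assumed convention: points beyond 1.5 * IQR from the box are drawn as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(q1, q2, q3, iqr, outliers)
```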
Descriptive Statistics Shape of a Distribution
$$\text{Skewness} = S = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^3}{(n - 1)\, s^3}$$
• Kurtosis
– Peak/tall/skinny nature of the distribution
$$\text{Kurtosis} = K = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^4}{n\, s^4} - 3$$
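The two shape formulas can be sketched in Python as written on the slide: the cubed deviations are divided by (n − 1)s³ for skewness, and the fourth-power deviations by ns⁴, minus 3, for kurtosis. The sketch assumes s is the sample standard deviation (dividing by n − 1), as on the dispersion slide; the function name and the data are hypothetical.

```python
import math

def shape_measures(xs):
    """Return (skewness, kurtosis) using the slide formulas."""
    n = len(xs)
    mean = sum(xs) / n
    # Assumption: s is the sample standard deviation (n - 1 in the denominator)
    s = math.sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))
    skewness = sum((x - mean) ** 3 for x in xs) / ((n - 1) * s ** 3)
    kurtosis = sum((x - mean) ** 4 for x in xs) / (n * s ** 4) - 3
    return skewness, kurtosis

symmetric = [1, 2, 3, 4, 5]        # hypothetical symmetric sample
right_skewed = [1, 1, 1, 2, 10]    # hypothetical sample with a long right tail

print(shape_measures(symmetric))
print(shape_measures(right_skewed))
```

A symmetric sample gives zero skewness because the positive and negative cubed deviations cancel; a long right tail gives a positive value.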
Relationship Between Dispersion and Shape Properties
Technology Insights 2.1 (1 of 2)
Descriptive Statistics in Excel
Technology Insights 2.1 (2 of 2)
Descriptive Statistics in Excel Creating box-plot in Microsoft Excel
Application Case 2.3
• x: input, y: output
• Simple Linear Regression
$$y = \beta_0 + \beta_1 x$$
• Multiple Linear Regression
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \cdots + \beta_n x_n$$
• The meaning of Beta (β) coefficients
– Sign (+ or -) and magnitude
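To make the sign and magnitude of the coefficients concrete, here is a minimal sketch of simple linear regression. It assumes the coefficients are estimated by ordinary least squares, the usual choice, though the slide does not name the estimation method. The function name `fit_line` and the data are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares estimates (beta0, beta1) for y = beta0 + beta1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))  # co-variation of x and y
    s_xx = sum((x - x_bar) ** 2 for x in xs)                       # variation of x
    beta1 = s_xy / s_xx              # slope: change in y per one-unit change in x
    beta0 = y_bar - beta1 * x_bar    # intercept: predicted y when x = 0
    return beta0, beta1

hours = [1, 2, 3, 4, 5]          # hypothetical input x
scores = [52, 55, 61, 64, 68]    # hypothetical output y
beta0, beta1 = fit_line(hours, scores)
print(round(beta0, 2), round(beta1, 2))
```

For these numbers the slope is positive (about 4.1), so y rises with x; a negative slope would mean y falls as x grows, and a larger magnitude means a steeper change per unit of x.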
Example: Linear Regression (1 of 2)
– R² (R-squared)
– p-values
– Error measures (for prediction problems)
▪ MSE, MAD, RMSE
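The fit and error measures listed above can be computed from actual and predicted values. This sketch makes two assumptions: R² is computed as 1 − SSE/SST, and MAD here means the mean absolute deviation of the prediction errors (often called MAE elsewhere). The function name and the data are hypothetical; p-values are omitted because they require a statistical distribution, not just arithmetic.

```python
import math

def fit_metrics(actual, predicted):
    """Return R-squared, MSE, MAD, and RMSE for a set of predictions."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    sse = sum(e * e for e in errors)                 # sum of squared errors
    mean_actual = sum(actual) / n
    sst = sum((a - mean_actual) ** 2 for a in actual)  # total variation in the output
    mse = sse / n
    return {
        "r2": 1 - sse / sst,                         # share of variation explained
        "mse": mse,                                  # mean squared error
        "mad": sum(abs(e) for e in errors) / n,      # mean absolute deviation of errors
        "rmse": math.sqrt(mse),                      # error in the units of y
    }

actual = [3, 5, 7, 9]        # hypothetical observed outputs
predicted = [2, 5, 8, 9]     # hypothetical model predictions
metrics = fit_metrics(actual, predicted)
print(metrics)
```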
Example: Linear Regression (2 of 2)
• R-square = 0.98
Regression Modeling Assumptions
$$f(y) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}}$$
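The logistic function, 1 divided by 1 + e raised to −(β0 + β1x), squeezes any linear score into the interval (0, 1), which is why it suits yes/no outcomes. A minimal Python sketch follows; the function name and the coefficient values are hypothetical, chosen only to show the S-shaped behavior.

```python
import math

def logistic(x, beta0, beta1):
    """Map the linear score beta0 + beta1 * x to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-(beta0 + beta1 * x)))

# Hypothetical coefficients: beta0 = 0, beta1 = 1
for x in (-4, -1, 0, 1, 4):
    print(x, round(logistic(x, 0.0, 1.0), 3))
```

When the linear score is 0 the output is exactly 0.5; large positive scores approach 1 and large negative scores approach 0.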
Application Case 2.4 (1 of 6)
Predicting NCAA Bowl Game Outcomes
Application Case 2.4 (2 of 6)
• Dashboard design
– The fundamental challenge of dashboard design is to
display all the required information on a single screen,
clearly and without distraction, in a manner that can
be assimilated quickly
• Three layers of information
– Monitoring
– Analysis
– Management
Performance Dashboards (4 of 4)