Module 5 Programming Foundation and Exploratory Data Analysis
easier_to_read_list_of_lists = [ [1, 2, 3],
[4, 5, 6],
[7, 8, 9] ]
Alternatively:
long_winded_computation = 1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + \
9 + 10 + 11 + 12 + 13 + 14 + \
15 + 16 + 17 + 18 + 19 + 20
Modules
• Certain features of Python are not loaded by
default
• In order to use these features, you’ll need to
import the modules that contain them.
• E.g.
import matplotlib.pyplot as plt
import numpy as np
Variables and objects
• A variable is created the first time it is assigned a value
– No need to declare type
– Types are associated with objects, not variables
• X = 5
• X = [1, 3, 5]
• X = 'python'
– Assignment creates references, not copies
X = [1, 3, 5]
Y = X
X[0] = 2
print (Y) # Y is [2, 3, 5]
Assignment
• You can assign to multiple names at the same
time
x, y = 2, 3
• To swap values
x, y = y, x
• Assignments can be chained
x=y=z=3
• Accessing a name before it’s been created (by
assignment), raises an error
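A minimal sketch of the last point, assuming a hypothetical name `undefined_name` that is never assigned anywhere:

```python
# Reading a name that has never been assigned raises NameError.
try:
    undefined_name  # hypothetical name, never assigned above
    outcome = "no error"
except NameError:
    outcome = "NameError"

print(outcome)
```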
Arithmetic
• a = 5 + 2 # a is 7
• b = 9 - 3. # b is 6.0
• c = 5 * 2 # c is 10
• d = 5 ** 2 # d is 25
• e = 5 % 2 # e is 1
• Strings can be concatenated (glued together) with the + operator, and repeated
with *
s = 3 * 'un' + 'ium' # s is 'unununium'
• Two or more string literals (i.e. the ones enclosed between quotes) next to
each other are automatically concatenated
s1 = 'Py' 'thon'
s2 = s1 + '2.7'
real_long_string = ('this is a really long string. '
'It has multiple parts, '
'but all in one line.')
List - 1
integer_list = [1, 2, 3]
heterogeneous_list = ["string", 0.1, True]
list_of_lists = [ integer_list, heterogeneous_list, [] ]
list_length = len(integer_list) # equals 3
list_sum = sum(integer_list) # equals 6
• Get the i-th element of a list
x = [i for i in range(10)] # is the list [0, 1, ..., 9]
zero = x[0] # equals 0, lists are 0-indexed
one = x[1] # equals 1
nine = x[-1] # equals 9, 'Pythonic' for last element
eight = x[-2] # equals 8, 'Pythonic' for next-to-last element
• Get a slice of a list
one_to_four = x[1:5] # [1, 2, 3, 4]
first_three = x[:3] # [0, 1, 2]
last_three = x[-3:] # [7, 8, 9]
three_to_end = x[3:] # [3, 4, ..., 9]
without_first_and_last = x[1:-1] # [1, 2, ..., 8]
copy_of_x = x[:] # [0, 1, 2, ..., 9]
another_copy_of_x = x[:3] + x[3:] # [0, 1, 2, ..., 9]
List - 2
• Check for memberships
1 in [1, 2, 3] # True
0 in [1, 2, 3] # False
• Concatenate lists
x = [1, 2, 3]
y = [4, 5, 6]
x.extend(y) # x is now [1,2,3,4,5,6]
x = [1, 2, 3]
y = [4, 5, 6]
z = x + y # z is [1,2,3,4,5,6]; x is unchanged.
• List unpacking (multiple assignment)
x, y = [1, 2] # x is 1 and y is 2
[x, y] = 1, 2 # same as above
x, y = [1, 2] # same as above
x, y = 1, 2 # same as above
_, y = [1, 2] # y is 2, didn't care about the first element
List - 3
• Modify content of list
x = [0, 1, 2, 3, 4, 5, 6, 7, 8]
x[2] = x[2] * 2 # x is [0, 1, 4, 3, 4, 5, 6, 7, 8]
x[-1] = 0 # x is [0, 1, 4, 3, 4, 5, 6, 7, 0]
x[3:5] = [v * 3 for v in x[3:5]] # x is [0, 1, 4, 9, 12, 5, 6, 7, 0]
x[5:6] = [] # x is [0, 1, 4, 9, 12, 6, 7, 0]
del x[:2] # x is [4, 9, 12, 6, 7, 0]
del x[:] # x is []
del x # referencing to x hereafter is a NameError
• Strings can also be sliced, but they cannot be modified (they are immutable)
s = 'abcdefg'
a = s[0] # 'a'
x = s[:2] # 'ab'
y = s[-3:] # 'efg'
s[:2] = 'AB' # this will cause an error
s = 'AB' + s[2:] # s is now 'ABcdefg'
The range() function
for i in range(5):
print (i) # will print 0, 1, 2, 3, 4 (in separate lines)
for i in range(2, 5):
print (i) # will print 2, 3, 4
for i in range(0, 10, 2):
print (i) # will print 0, 2, 4, 6, 8
for i in range(10, 2, -2):
print (i) # will print 10, 8, 6, 4
>>> a = ['Mary', 'had', 'a', 'little', 'lamb']
>>> for i in range(len(a)):
... print(i, a[i])
...
0 Mary
1 had
2 a
3 little
4 lamb
range() in Python 2 and 3
• In Python 2, range(5) returns the list [0, 1, 2, 3, 4]
• In Python 3, range(5) is a lazy sequence object: it can be iterated and
indexed, but it is not a list and its values are computed on demand
print (range(3)) # in python 3, will see "range(0, 3)"
print (range(3)) # in python 2, will see "[0, 1, 2]"
print (list(range(3))) # will print [0, 1, 2] in python 3
x = range(5)
print (x[2]) # in python 2, will print "2"
print (x[2]) # in python 3, will also print "2"
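A short sketch of what "lazy" means for a Python 3 range object, using made-up bounds. No list is built until one is asked for explicitly:

```python
r = range(0, 10, 2)      # no list is built here

length = len(r)          # 5
third = r[2]             # 4, computed on demand
has_six = 6 in r         # True
sub = r[1:3]             # slicing a range gives another range
as_list = list(r)        # [0, 2, 4, 6, 8], materialized only now
```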
a = list(range(10))
b = a
b[0] = 100
print(a) # [100, 1, 2, 3, 4, 5, 6, 7, 8, 9]
a = list(range(10))
b = a[:]
b[0] = 100
print(a) # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
Tuples
• Similar to lists, but are immutable
• a_tuple = (0, 1, 2, 3, 4)
• other_tuple = 3, 4
Note: a tuple is defined by the comma, not the parentheses, which are only
used for convenience. So a = (1) is not a tuple, but a = (1,) is.
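The comma rule and immutability can be checked directly. This is a small sketch with made-up values:

```python
a = (1)      # just the integer 1 in parentheses
b = (1,)     # a one-element tuple
c = 3, 4     # a tuple; the parentheses are optional

kinds = (type(a).__name__, type(b).__name__, type(c).__name__)

try:
    c[0] = 99        # tuples do not support item assignment
    mutated = True
except TypeError:
    mutated = False
```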
try:
kates_grade = grades["Kate"]
except KeyError:
print ("no grade for Kate!")
Dictionaries - 2
• Check for existence of key
joel_has_grade = "Joel" in grades # True
kate_has_grade = "Kate" in grades # False
• Get all keys, values, and items. In Python 3, the following return iterable
view objects, not lists; wrap them in list() if a list is needed
all_keys = grades.keys() # a view of all keys
all_values = grades.values() # a view of all values
all_pairs = grades.items() # a view of (key, value) tuples
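A brief sketch of how these objects behave in Python 3. The contents of `grades` here are hypothetical, chosen only for illustration:

```python
grades = {"Joel": 80, "Tim": 95}     # hypothetical contents

all_keys = grades.keys()             # a dict_keys view, not a list
keys_before = list(all_keys)         # ['Joel', 'Tim']

grades["Kate"] = 100                 # the view reflects later changes
keys_after = list(all_keys)          # ['Joel', 'Tim', 'Kate']

anns_grade = grades.get("Ann", 0)    # .get avoids KeyError for a missing key
```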
Difference between python 2 and python 3:
Iterable objects vs lists
• In Python 3, range() returns a lazy iterable object.
– Values are created when needed
x = range(10000000) # fast
– Can be accessed by index
x[10000] # allowed, fast
• all all(a)
Out[136]: False
Comparison
Operation   Meaning
<           strictly less than
<=          less than or equal
>           strictly greater than
>=          greater than or equal
==          equal
!=          not equal
is          object identity
is not      negated object identity
a = [0, 1, 2, 3, 4]
b = a
c = a[:]
a == b
Out[129]: True
a is b
Out[130]: True
a == c
Out[132]: True
a is c
Out[133]: False
Bitwise operators: & (AND), | (OR), ^ (XOR), ~(NOT), << (Left Shift), >> (Right Shift)
Control flow - 2
• loops
x = 0
while x < 10:
print (x, "is less than 10")
x += 1
for x in range(10):
if x == 3:
continue # go immediately to the next iteration
if x == 5:
break # quit the loop entirely
print (x)
Exceptions
try:
print (0 / 0)
except ZeroDivisionError:
print ("cannot divide by zero")
https://docs.python.org/3/tutorial/errors.html
Functions - 1
• Functions are defined using def
def double(x):
"""this is where you put an optional docstring
that explains what the function does.
for example, this function multiplies its
input by 2"""
return x * 2
• You can call a function after it is defined
z = double(10) # z is 20
• You can give default values to parameters
def my_print(message="my default message"):
print (message)
subtract(10, 5) # returns 5
subtract(0, 5) # returns -5
subtract(b = 5) # same as above
subtract(b = 5, a = 0) # same as above
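The definition of subtract is not shown above. A definition consistent with the four calls might look like the following sketch; the default values of 0 for both parameters are an assumption:

```python
def subtract(a=0, b=0):
    """Return a - b; both defaults are assumptions made for this sketch."""
    return a - b

results = [subtract(10, 5), subtract(0, 5), subtract(b=5), subtract(b=5, a=0)]
```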
Functions - 3
• Functions are objects too
In [12]: def double(x): return x * 2
...: DD = double;
...: DD(2)
...:
Out[12]: 4
In [16]: def apply_to_one(f):
...: return f(1)
...: x=apply_to_one(DD)
...: x
...:
Out[16]: 2
Functions – lambda expression
• Small anonymous functions can be created
with the lambda keyword.
In [18]: y=apply_to_one(lambda x: x+4)
In [19]: y
Out[19]: 5
In [68]: even_numbers = []
In [69]: for x in range(5):
...: if x % 2 == 0:
...: even_numbers.append(x)
...: even_numbers
Out[69]: [0, 2, 4]
List comprehension - 3
• More complex examples:
# create 100 pairs (0,0) (0,1) ... (9,8), (9,9)
pairs = [(x, y)
for x in range(10)
for y in range(10)]
In [204]: double(b)
Traceback (most recent call last):
…
TypeError: unsupported operand type(s) for *: 'int' and 'range'
Functional tools: map, filter, functools.reduce
• Do not confuse with MapReduce in big data
• Convenient tools in python to apply function
to sequences of data
In [208]: def is_even(x): return x%2==0
...: a=[0, 1, 2, 3]
...: list(filter(is_even, a))
...:
Out[208]: [0, 2]
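For comparison, here are all three tools on the same made-up list. In Python 3, map and filter are built-ins that return lazy iterators, while reduce has to be imported from functools:

```python
from functools import reduce

def is_even(x):
    return x % 2 == 0

a = [0, 1, 2, 3]
doubled = list(map(lambda x: x * 2, a))      # [0, 2, 4, 6]
evens = list(filter(is_even, a))             # [0, 2]
total = reduce(lambda acc, x: acc + x, a)    # 6
```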
Files - output
https://docs.python.org/3/tutorial/inputoutput.html
Module math
Function name        Description
math.ceil(value)     rounds up
math.cos(value)      cosine, in radians
math.floor(value)    rounds down
math.log(value)      logarithm, base e
math.log10(value)    logarithm, base 10
math.sin(value)      sine, in radians
math.sqrt(value)     square root
Constant             Description
math.e               2.7182818...
math.pi              3.1415926...
Built-in functions (no import needed)
abs(value)           absolute value
max(value1, value2)  larger of two values
min(value1, value2)  smaller of two values
round(value)         nearest whole number
# preferred.
import math
math.fabs(-0.5) # math has no abs(); use the built-in abs() or math.fabs()
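A quick sketch that separates the functions living in the math module from the built-ins. Note that the math module has no abs function, so the built-in abs or math.fabs is used instead:

```python
import math

root = math.sqrt(16)               # 4.0
up = math.ceil(2.1)                # 3
down = math.floor(2.9)             # 2
natural_log = math.log(math.e)     # logarithm base e
magnitude = abs(-0.5)              # built-in, no import needed
float_magnitude = math.fabs(-0.5)  # math's floating-point absolute value
largest = max(3, 7)                # built-in
```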
• NumPy
– Key module for scientific computing
– Convenient and efficient ways to handle multidimensional arrays
• pandas
– DataFrame
– Flexible data structure for labeled tabular data
• Matplotlib: for plotting
• SciPy: solutions to common scientific computing problems
such as linear algebra, optimization, statistics, and sparse
matrices
Module paths
• To find a module called myscripts (stored in the file myscripts.py), the
interpreter scans the directory names in the list sys.path.
• The module file must be in one of those directories.
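A minimal sketch of inspecting and extending the search path. The directory name below is a hypothetical placeholder, not a real location:

```python
import sys

# sys.path is an ordinary list of directory names, searched in order.
search_dirs = list(sys.path)

# Appending a directory makes modules stored there importable.
extra_dir = "/path/to/my/modules"   # hypothetical placeholder
if extra_dir not in sys.path:
    sys.path.append(extra_dir)
```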
>>> [1, 2, 3] * 3
[1, 2, 3, 1, 2, 3, 1, 2, 3]
>>> "Hello" * 3
'HelloHelloHello'
Mutability:
Tuples vs. Lists
Lists are mutable
>>> li = ['abc', 23, 4.34, 23]
>>> li[1] = 45
>>> li
['abc', 45, 4.34, 23]
• We can change lists in place.
• Name li still points to the same memory
reference when we’re done.
Tuples are immutable
>>> t = (23, 'abc', 4.56, (2,3), 'def')
>>> t[2] = 3.14
Traceback (most recent call last):
File "<pyshell#75>", line 1, in -toplevel-
t[2] = 3.14
TypeError: object doesn't support item assignment
>>> li.sort(key=some_function)
# sort in place using a user-defined key function
# (Python 2 also accepted a comparison function: li.sort(some_function))
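A sketch of user-defined sorting in Python 3, with made-up data. The comparison function `by_length` is hypothetical; functools.cmp_to_key adapts an old-style comparison function into a key:

```python
from functools import cmp_to_key

li = ["banana", "Apple", "cherry"]

# Python 3: pass a key function that maps each item to its sort value.
li.sort(key=str.lower)                # ['Apple', 'banana', 'cherry']

def by_length(x, y):                  # hypothetical comparison function
    return len(x) - len(y)

# sorted() is stable, so equal-length items keep their current order.
by_len = sorted(li, key=cmp_to_key(by_length))
```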
Tuple details
• The comma is the tuple creation operator, not parens
>>> 1,
(1,)
Descriptive statistics
Inferential statistics
Python Libraries for Data Science
Many popular Python toolboxes/libraries:
– NumPy
– SciPy
– Pandas
– SciKit-Learn
– TensorFlow
– Keras
– PyTorch
Visualization libraries
– matplotlib
– Seaborn
NumPy:
introduces objects for multidimensional arrays and
matrices, as well as functions that make it easy to
perform advanced mathematical and statistical
operations on those objects
SciPy:
collection of algorithms for linear algebra,
differential equations, numerical integration,
optimization, statistics and more
built on NumPy
Link: https://www.scipy.org/scipylib/
Pandas:
adds data structures and tools designed to work
with table-like data (similar to Series and Data
Frames in R)
allows handling missing data
Link: http://pandas.pydata.org/
SciKit-Learn:
provides machine learning algorithms:
classification, regression, clustering, model
validation etc.
Link: http://scikit-learn.org/
TensorFlow
Features:
• Better computational graph visualizations
• Reduces error by 50 to 60 percent in neural machine learning
• Parallel computing to execute complex models
• Seamless library management backed by Google
• Quicker updates and frequent new releases to provide you with the
latest features
• TensorFlow is particularly useful for the following applications:
• Speech and image recognition
• Text-based applications
• Time-series analysis
• Video detection
• Link : https://www.tensorflow.org
Keras
• Features:
• Keras provides a large collection of prelabeled datasets that can be
imported and loaded directly.
• It contains various implemented layers and parameters that can be used
for construction, configuration, training, and evaluation of neural networks
• Applications:
• One of the most significant applications of Keras is the set of
deep learning models that are available with pretrained weights. You
can use these models directly to make predictions or extract features
without creating or training your own model.
• Link: https://keras.io
matplotlib:
python 2D plotting library which produces
publication quality figures in a variety of hardcopy
formats
Seaborn:
based on matplotlib
Link: https://seaborn.pydata.org/
Start Jupyter notebook
# On the Shared Computing Cluster
[scc1 ~] jupyter notebook
Loading Python Libraries
In [ ]: #Import Python Libraries
import numpy as np
import scipy as sp
import pandas as pd
import matplotlib as mpl
import seaborn as sns
Reading data using pandas
#Read csv file
In [ ]:
df = pd.read_csv("http://rcs.bu.edu/examples/python/data_analysis/Salaries.csv")
Note: The above command has many optional arguments to fine-tune the
data import process.
pd.read_excel('myfile.xlsx',sheet_name='Sheet1',
index_col=None, na_values=['NA'])
pd.read_stata('myfile.dta')
pd.read_sas('myfile.sas7bdat')
pd.read_hdf('myfile.h5','df')
Exploring data frames
#List first 5 records
In [3]:
df.head()
Out[3]:
Hands-on exercises
Data Frame data types
Pandas Type                Native Python Type    Description
object                     string                The most general dtype. Assigned to a column that has mixed types (numbers and strings).
int64                      int                   Integer values. 64 refers to the number of bits used to store each value.
float64                    float                 Numbers with decimals. A numeric column that contains NaNs (see below) defaults to float64, because NaN is itself a floating-point value.
datetime64, timedelta[ns]  N/A                   Values meant to hold time data (see also the datetime module in Python's standard library). Look into these for time series experiments.
Data Frame data types
#Check a particular column type
In [4]:
df['salary'].dtype
Out[4]: dtype('int64')
#Check types for all the columns
In [ ]:
df.dtypes
Out[ ]:
rank object
discipline object
phd int64
service int64
sex object
salary int64
dtype: object
Data Frames attributes
Python objects have attributes and methods.
df.attribute description
dtypes list the types of the columns
columns list the column names
axes list the row labels and column names
ndim number of dimensions
Hands-on exercises
Data Frames methods
Unlike attributes, Python methods are called with parentheses.
All attributes and methods can be listed with a dir()
function: dir(df)
df.method()            description
head([n]), tail([n])   first/last n rows
describe()             generate descriptive statistics (for numeric columns only)
max(), min()           return max/min values for all numeric columns
mean(), median()       return mean/median values for all numeric columns
std()                  standard deviation
Selecting a column in a Data Frame
Method 1: Subset the data frame using column
name:
df['sex']
Data Frames groupby method
Note: If single brackets are used to specify the column (e.g. salary), then the
output is a Pandas Series object. When double brackets are used the output is a
Data Frame.
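The note about single versus double brackets can be checked on a tiny made-up table. The column names below stand in for the Salaries data and the values are invented:

```python
import pandas as pd

# Tiny made-up table standing in for the Salaries data.
toy = pd.DataFrame({
    "rank": ["Prof", "Prof", "AsstProf", "AsstProf"],
    "salary": [120000, 130000, 80000, 90000],
})

as_series = toy.groupby("rank")["salary"].mean()    # single brackets: Series
as_frame = toy.groupby("rank")[["salary"]].mean()   # double brackets: DataFrame
```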
Data Frame: filtering
Data Frames: Slicing
Data Frames: Selecting rows
Notice that the first row has position 0 and the last value in the range is
omitted. So for the range 0:10 the first 10 rows are returned, with
positions starting at 0 and ending at 9.
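A small sketch of this half-open range rule, using a made-up 20-row data frame with the default integer index:

```python
import pandas as pd

toy = pd.DataFrame({"value": range(100, 120)})   # made-up data, 20 rows

first_ten = toy[0:10]                  # rows at positions 0 through 9
positions = list(first_ten.index)
```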
Data Frames: method loc
Data Frames: method iloc
Data Frames: method iloc
(summary)
df.iloc[0] # First row of a data frame
df.iloc[i] #(i+1)th row
df.iloc[-1] # Last row
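The iloc summary can be tried on a small made-up data frame. The column names here are hypothetical:

```python
import pandas as pd

toy = pd.DataFrame({"name": ["a", "b", "c", "d"],
                    "score": [10, 20, 30, 40]})   # made-up data

first = toy.iloc[0]       # first row, returned as a Series
third = toy.iloc[2]       # (2+1)th row
last = toy.iloc[-1]       # last row
```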
Data Frames: Sorting
Missing Values
Missing values are marked as NaN
# Read a dataset with missing values
In [ ]:
flights = pd.read_csv("http://rcs.bu.edu/examples/python/data_analysis/flights.csv")
# Select the rows that have at least one missing value
In [ ]:
flights[flights.isnull().any(axis=1)].head()
Missing Values
There are a number of methods to deal with missing values in
the data frame:
df.method()                 description
dropna()                    Drop missing observations
dropna(how='all')           Drop observations where all cells are NA
dropna(axis=1, how='all')   Drop a column if all its values are missing
dropna(thresh=5)            Drop rows that contain fewer than 5 non-missing values
fillna(0)                   Replace missing values with zeros
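To see how these options differ, here is a sketch on a small made-up data frame. The threshold is 2 rather than 5 only because the toy table has three columns:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [np.nan, np.nan, 6.0, np.nan],
    "c": [np.nan, np.nan, np.nan, np.nan],
})

no_missing = toy.dropna()                     # every row has a NaN: empty result
not_all_missing = toy.dropna(how="all")       # drops rows 1 and 3
non_empty_cols = toy.dropna(axis=1, how="all")  # drops column "c"
at_least_two = toy.dropna(thresh=2)           # keeps only row 2
zero_filled = toy.fillna(0)                   # NaN replaced by 0
```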
Aggregation Functions in Pandas
Aggregation - computing a summary statistic about each
group, e.g.
• compute group sums or means
• compute group sizes/counts
Common aggregation functions:
min, max
count, sum, prod
mean, median, mode, mad (mad was removed in pandas 2.0)
std, var
The agg() method is useful when multiple statistics are
computed per column:
In [ ]:
flights[['dep_delay','arr_delay']].agg(['min','mean','max'])
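The shape of the agg() result is easier to see on a tiny made-up table. The values below are invented and only the column names follow the flights data:

```python
import pandas as pd

toy_flights = pd.DataFrame({
    "dep_delay": [2.0, 4.0, 9.0],
    "arr_delay": [1.0, 5.0, 12.0],
})   # made-up values

# One row per statistic, one column per input column.
summary = toy_flights[["dep_delay", "arr_delay"]].agg(["min", "mean", "max"])
```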
Basic Descriptive Statistics
df.method() description
describe Basic statistics (count, mean, std, min,
quantiles, max)
min, max Minimum and maximum values
kurt kurtosis
Graphics to explore the data
Graphics
seaborn function   description
distplot           histogram (deprecated in recent seaborn releases in favor of histplot/displot)
barplot            estimate of central tendency for a numeric variable
violinplot         similar to boxplot, also shows the probability density of the data
jointplot          scatterplot
regplot            regression plot
pairplot           pairplot
boxplot            boxplot
swarmplot          categorical scatterplot
factorplot         general categorical plot (renamed catplot in newer seaborn releases)
Basic statistical Analysis
statsmodels and scikit-learn both have a number of functions for
statistical analysis.
The first is mostly used for regular analysis using R-style
formulas, while scikit-learn is more tailored to machine learning.
statsmodels:
• linear regressions
• ANOVA tests
• hypothesis tests
• many more ...
scikit-learn:
• kmeans
• support vector machines
• random forests
• many more ...
See examples in the Tutorial Notebook
Descriptive vs. Inferential Statistics
• Descriptive: e.g., Median; describes data you have
but can't be generalized beyond that
– We’ll talk about Exploratory Data Analysis
• Inferential: e.g., t-test, that enable inferences about
the population beyond our data
– These are the techniques we’ll leverage for
Machine Learning and Prediction
EDA Tools
• Python and R are the two languages most commonly used in data
science to perform EDA
• Perform k-means clustering. It is an unsupervised learning
algorithm where the data points are assigned to clusters, also
known as k-groups. K-means clustering is commonly used in market
segmentation, image compression, and pattern recognition.
• EDA can be used in predictive models such as linear regression,
where it is used to predict outcomes.
• It is also used in univariate, bivariate, and multivariate visualization
for summary statistics, establishing relationships between each
variable, and for understanding how different fields in the data
interact with each other.
Outline
• Exploratory Data Analysis
– Chart types
– Some important distributions
– Hypothesis Testing
Examples of Business Questions
• Simple (descriptive) Stats
– “Who are the most profitable customers?”
• Hypothesis Testing
– “Is there a difference in value to the company of these
customers?”
• Segmentation/Classification
– What are the common characteristics of these customers?
• Prediction
– Will this new customer become a profitable customer? If
so, how profitable?
The “R” Language
• An evolution of the “S” language developed at Bell
labs for EDA.
• Idea was to allow interactive exploration and
visualization of data.
• The preferred language for statisticians, used by
many other data scientists.
• Features:
– Probably the most comprehensive collection of statistical
models and distributions.
– CRAN: a very large resource of open source statistical
models.
• Dot plot
Chart types
• Jitter plot
• Noise added to the y-axis to spread the points
• Error bars: usually based on confidence intervals (CI).
A 95% CI is built so that, under repeated sampling, about 95% of such
intervals contain the true value (e.g. the mean); for an equal-tailed
interval, about 2.5% miss on each side.
• Not necessarily symmetric:
• Box-and-whisker plot : a graphical form of 5-number
summary (Tukey)
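The five numbers behind a box-and-whisker plot can be computed with the standard library. This is a sketch on made-up data; it uses the "inclusive" quartile method, which is one of several conventions and may differ slightly from what a given plotting library draws:

```python
import statistics

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for at least two numbers."""
    data = sorted(values)
    q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return data[0], q1, q2, q3, data[-1]

summary = five_number_summary([7, 1, 9, 3, 5, 2, 8, 4, 6])   # made-up data
```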
• Histogram
• Kernel density estimate
• Histogram and Kernel Density Estimates
– Histogram
• Proper selection of bin width is important
• Outliers should be discarded
– KDE (like a smooth histogram)
• Kernel function
– Box, Epanechnikov, Gaussian
• Kernel bandwidth
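To make the roles of the kernel function and the bandwidth concrete, here is a minimal Gaussian KDE written from scratch. The sample values and the bandwidth of 0.5 are arbitrary choices for illustration, not recommendations:

```python
import math

def gaussian_kde(samples, bandwidth):
    """Return a function estimating the density at x with a Gaussian kernel."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        # Sum one Gaussian bump per sample, each centered on that sample.
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density

samples = [1.0, 1.5, 2.0, 2.2, 5.0]               # made-up data
density = gaussian_kde(samples, bandwidth=0.5)    # bandwidth chosen arbitrarily

# Riemann sum over a wide grid; a density should integrate to about 1.
step = 0.01
area = sum(density(-5.0 + i * step) * step for i in range(1500))
```

A larger bandwidth gives a smoother, flatter curve; a smaller one follows individual points more closely.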
• Cumulative distribution function
• Integral of the histogram – simpler to build than KDE
(don’t need smoothing)
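An empirical CDF needs only sorting and counting, which is why no smoothing parameter appears. Here is a small sketch on made-up data:

```python
from bisect import bisect_right

def make_ecdf(samples):
    """Return F, where F(x) is the fraction of samples that are <= x."""
    xs = sorted(samples)
    n = len(xs)

    def ecdf(x):
        return bisect_right(xs, x) / n

    return ecdf

ecdf = make_ecdf([3, 1, 4, 1, 5])              # made-up data
values = [ecdf(x) for x in (0, 1, 3, 5)]       # [0.0, 0.4, 0.6, 1.0]
```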
• Two variables
– Bar chart
– Scatter plot
– Line plot
– Log-log plot
• Bar plot: one variable is discrete
• Scatter plot
• Line plot
• Log-log plot: Very useful for power law data
[Figure: frequency of words in tweets; slope ~ -1]
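On a log-log plot a power law y = C * x^k appears as a straight line with slope k. The sketch below estimates that slope by least squares on synthetic data generated with exponent -1; the ranks and frequencies are made up and are not the tweet data:

```python
import math

def loglog_slope(xs, ys):
    """Least-squares slope of log(y) against log(x); inputs must be positive."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mean_x = sum(lx) / len(lx)
    mean_y = sum(ly) / len(ly)
    num = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    den = sum((a - mean_x) ** 2 for a in lx)
    return num / den

ranks = [1, 2, 3, 4, 5, 10, 20, 50, 100]     # made-up word ranks
freqs = [1000.0 / r for r in ranks]          # exact power law, exponent -1
slope = loglog_slope(ranks, freqs)           # close to -1
```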
• Stacked plot: stack variable is discrete:
• Parallel coordinate plot: one discrete variable, an
arbitrary number of other variables:
References
• The material is prepared by taking inputs from
various text books and other internet sources.