
CS3361 DATA SCIENCE LABORATORY

LIST OF EXPERIMENTS:

1. Download, install and explore the features of NumPy, SciPy, Jupyter, Statsmodels and Pandas
packages.

2. Working with Numpy arrays

3. Working with Pandas data frames

4. Reading data from text files, Excel and the web and exploring various commands for doing
descriptive analytics on the Iris data set.

5. Use the diabetes data set from UCI and Pima Indians Diabetes data set for performing the
following:

a. Univariate analysis: Frequency, Mean, Median, Mode, Variance, Standard Deviation,


Skewness and Kurtosis.

b. Bivariate analysis: Linear and logistic regression modeling

c. Multiple Regression analysis

d. Also compare the results of the above analysis for the two data sets.

6. Apply and explore various plotting functions on UCI data sets.

a. Normal curves

b. Density and contour plots

c. Correlation and scatter plots

d. Histograms

e. Three dimensional plotting

7. Visualizing Geographic Data with Basemap


Ex.No.1. Download, install and explore the features of NumPy, SciPy, Jupyter, Statsmodels and
Pandas packages.

Install Python Data Science Packages

Python is a high-level, general-purpose programming language with a rich ecosystem
of data science and machine learning packages. As a first step, install Python for
Windows, macOS, or Linux.
Install Python Packages
The power of Python is in the packages that are available through the pip or
conda package managers. This section gives an overview of some of the best packages
for machine learning and data science and how to install them.

pip install <package_name>

1. Pandas

Pandas is an open-source Python package that provides high-performance, easy-to-use
data structures and data analysis tools for labeled data in the Python programming
language. The name Pandas comes from "Python Data Analysis Library".

Pandas is a perfect tool for data wrangling or munging. It is designed for quick and
easy data manipulation, reading, aggregation, and visualization.

Pandas takes data from a CSV or TSV file or a SQL database and creates a Python object
with rows and columns called a data frame. The data frame is very similar to a table
in statistical software, say Excel or SPSS.

Pandas is used for:

1. Indexing, manipulating, renaming, sorting, and merging data frames
2. Updating, adding, and deleting columns in a data frame
3. Imputing missing values and handling missing data or NaNs
4. Plotting data with histograms or box plots

This makes Pandas a foundation library in learning Python for Data Science.
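To make this concrete, here is a minimal, hedged sketch (using made-up data rather than a real CSV file) of a few of the operations listed above:

import pandas as pd

# Made-up candidate data standing in for a CSV/TSV/SQL source
df = pd.DataFrame({'name': ['Asha', 'Ben', 'Carl'],
                   'score': [88.0, None, 72.5]})

df = df.rename(columns={'score': 'py_score'})                  # renaming a column
df['py_score'] = df['py_score'].fillna(df['py_score'].mean())  # imputing a missing value
df = df.sort_values('py_score', ascending=False)               # sorting rows
print(df)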

2. NumPy

One of the most fundamental packages in Python, NumPy is a general-purpose


array-processing package. It provides high-performance multidimensional array
objects and tools to work with the arrays. NumPy is an efficient container of generic
multi-dimensional data.

NumPy’s main object is the homogeneous multidimensional array. It is a table of
elements (usually numbers) of the same datatype, indexed by a tuple of non-negative
integers. In NumPy, dimensions are called axes and the number of axes is called the rank.

NumPy is used to process arrays that store values of the same datatype. NumPy
facilitates math operations on arrays and their vectorization. This significantly
enhances performance and speeds up the execution time correspondingly.

NumPy is used for:

1. Basic array operations: add, multiply, slice, flatten, reshape, index arrays
2. Advanced array operations: stack arrays, split into sections, broadcast arrays
3. Working with DateTime or linear algebra
4. Basic slicing and advanced indexing
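A minimal sketch of the basic operations listed above (vectorized arithmetic, broadcasting, reshaping and flattening):

import numpy as np

a = np.arange(6).reshape(2, 3)   # reshape a flat array into a 2x3 array
b = np.array([10, 20, 30])

print(a + b)        # broadcasting: b is stretched across each row of a
print(a * 2)        # vectorized multiplication, no explicit Python loop
print(a.flatten())  # flatten back into one dimension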
3. SciPy

The SciPy library is one of the core packages that make up the SciPy stack. Note that
there is a difference between the SciPy stack and SciPy the library. SciPy builds on the
NumPy array object and is part of the stack, which also includes tools like Matplotlib,
Pandas, and SymPy.

The SciPy library contains modules for efficient mathematical routines such as linear
algebra, interpolation, optimization, integration, and statistics. The main functionality
of the SciPy library is built upon NumPy and its arrays; SciPy makes significant use of
NumPy.

SciPy uses arrays as its basic data structure. It has various modules to perform
common scientific programming tasks such as linear algebra, integration, calculus,
ordinary differential equations, and signal processing.
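As a small illustration (a sketch only, assuming SciPy is installed), here are the linear algebra and integration modules in action:

import numpy as np
from scipy import linalg, integrate

# Solve the linear system A x = b
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([9.0, 8.0])
print(linalg.solve(A, b))          # [2. 3.]

# Numerically integrate sin(t) from 0 to pi
value, error = integrate.quad(np.sin, 0, np.pi)
print(value)                       # approximately 2.0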

4. Matplotlib

This is undoubtedly my favourite and a quintessential Python library. You can tell
stories with data visualized using Matplotlib. Another library from the SciPy stack,
Matplotlib plots 2D figures.

Matplotlib is the plotting library for Python that provides an object-oriented API for
embedding plots into applications. It is a close resemblance to MATLAB embedded in
Python programming language.

Matplotlib is used for:

From histograms, bar plots, scatter plots, and area plots to pie plots, Matplotlib can
depict a wide range of visualizations. With a bit of effort, you can create just about
any visualization with Matplotlib:

1. Line plots
2. Scatter plots
3. Area plots
4. Bar charts and Histograms
5. Pie charts
6. Stem plots
7. Contour plots
8. Quiver plots
9. Spectrograms

Matplotlib also facilitates labels, grids, legends, and some more formatting entities.
Basically, everything that can be drawn!
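A short sketch showing a few of the plot types listed above with randomly generated data:

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 2 * np.pi, 100)

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].plot(x, np.sin(x))                                # line plot
axes[1].scatter(np.random.rand(50), np.random.rand(50))   # scatter plot
axes[2].hist(np.random.randn(1000), bins=30)              # histogram
axes[0].set_title('Line')
axes[1].set_title('Scatter')
axes[2].set_title('Histogram')
plt.tight_layout()
plt.show()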

5. Seaborn

The official documentation defines Seaborn as a data visualization library based on
Matplotlib that provides a high-level interface for drawing attractive and informative
statistical graphics. Put simply, Seaborn is an extension of Matplotlib with advanced
features.

So, what is the difference between Matplotlib and Seaborn? Matplotlib is used for
basic plotting such as bars, pies, lines, and scatter plots, whereas Seaborn provides a
variety of visualization patterns with simpler, more concise syntax.

Seaborn is used for:

1. Determining relationships between multiple variables (correlation)
2. Observing categorical variables for aggregate statistics
3. Analyzing univariate or bivariate distributions and comparing them between
different data subsets
4. Plotting linear regression models for dependent variables
5. Providing high-level abstractions and multi-plot grids

Seaborn is a great counterpart to R visualization libraries like corrplot and ggplot.
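As a quick sketch (sns.load_dataset fetches a small sample dataset and may require an internet connection on first use):

import seaborn as sns
import matplotlib.pyplot as plt

iris = sns.load_dataset('iris')

# Multi-plot grid of pairwise relationships, colored by species
sns.pairplot(iris, hue='species')
plt.show()

# Linear regression fit of petal width on petal length, per species
sns.lmplot(data=iris, x='petal_length', y='petal_width', hue='species')
plt.show()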
Ex.No 2. Working with Numpy arrays.

Basic array manipulations here:

 Attributes of arrays: Determining the size, shape, memory consumption, and data
types of arrays
 Indexing of arrays: Getting and setting the value of individual array elements
 Slicing of arrays: Getting and setting smaller subarrays within a larger array
 Reshaping of arrays: Changing the shape of a given array
 Joining and splitting of arrays: Combining multiple arrays into one, and splitting
one array into many

First let's discuss some useful array attributes. We'll start by defining three random arrays,
a one-dimensional, two-dimensional, and three-dimensional array. We'll use NumPy's
random number generator, which we will seed with a set value in order to ensure that the
same random arrays are generated each time this code is run:

In [1]:
import numpy as np
np.random.seed(0) # seed for reproducibility

x1 = np.random.randint(10, size=6) # One-dimensional array


x2 = np.random.randint(10, size=(3, 4)) # Two-dimensional array
x3 = np.random.randint(10, size=(3, 4, 5)) # Three-dimensional array

Each array has attributes ndim (the number of dimensions), shape (the size of each
dimension), and size (the total size of the array):

In [2]:
print("x3 ndim: ", x3.ndim)
print("x3 shape:", x3.shape)
print("x3 size: ", x3.size)
x3 ndim: 3
x3 shape: (3, 4, 5)
x3 size: 60
Another useful attribute is the dtype, the data type of the array (which we discussed
previously in Understanding Data Types in Python):

In [3]:
print("dtype:", x3.dtype)
dtype: int64
Other attributes include itemsize, which lists the size (in bytes) of each array element,
and nbytes, which lists the total size (in bytes) of the array:

In [4]:
print("itemsize:", x3.itemsize, "bytes")
print("nbytes:", x3.nbytes, "bytes")
itemsize: 8 bytes
nbytes: 480 bytes
In general, we expect that nbytes is equal to itemsize times size.
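A quick check of that relation on the array defined above:

print(x3.nbytes == x3.itemsize * x3.size)   # True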

Array Indexing: Accessing Single Elements


If you are familiar with Python's standard list indexing, indexing in NumPy will feel quite
familiar. In a one-dimensional array, the i-th value (counting from zero) can be accessed
by specifying the desired index in square brackets, just as with Python lists:
In [5]:
x1
Out[5]:
array([5, 0, 3, 3, 7, 9])
In [6]:
x1[0]
Out[6]:
5
In [7]:
x1[4]
Out[7]:
7
To index from the end of the array, you can use negative indices:

In [8]:
x1[-1]
Out[8]:
9
In [9]:
x1[-2]
Out[9]:
7
In a multi-dimensional array, items can be accessed using a comma-separated tuple of
indices:

In [10]:
x2
Out[10]:
array([[3, 5, 2, 4],
[7, 6, 8, 8],
[1, 6, 7, 7]])
In [11]:
x2[0, 0]
Out[11]:
3
In [12]:
x2[2, 0]
Out[12]:
1
In [13]:
x2[2, -1]
Out[13]:
7
Values can also be modified using any of the above index notation:

In [14]:
x2[0, 0] = 12
x2
Out[14]:
array([[12, 5, 2, 4],
[ 7, 6, 8, 8],
[ 1, 6, 7, 7]])
Keep in mind that, unlike Python lists, NumPy arrays have a fixed type. This means, for
example, that if you attempt to insert a floating-point value to an integer array, the value
will be silently truncated. Don't be caught unaware by this behavior!

In [15]:
x1[0] = 3.14159 # this will be truncated!
x1
Out[15]:
array([3, 0, 3, 3, 7, 9])

Array Slicing: Accessing Subarrays


Just as we can use square brackets to access individual array elements, we can also use
them to access subarrays with the slice notation, marked by the colon (:) character. The
NumPy slicing syntax follows that of the standard Python list; to access a slice of an
array x, use this:

x[start:stop:step]

If any of these are unspecified, they default to the values start=0, stop=size of
dimension, step=1. We'll take a look at accessing sub-arrays in one dimension and in
multiple dimensions.

One-dimensional subarrays
In [16]:
x = np.arange(10)
x
Out[16]:
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
In [17]:
x[:5] # first five elements
Out[17]:
array([0, 1, 2, 3, 4])
In [18]:
x[5:] # elements after index 5
Out[18]:
array([5, 6, 7, 8, 9])
In [19]:
x[4:7] # middle sub-array
Out[19]:
array([4, 5, 6])
In [20]:
x[::2] # every other element
Out[20]:
array([0, 2, 4, 6, 8])
In [21]:
x[1::2] # every other element, starting at index 1
Out[21]:
array([1, 3, 5, 7, 9])
A potentially confusing case is when the step value is negative. In this case, the defaults
for start and stop are swapped. This becomes a convenient way to reverse an array:

In [22]:
x[::-1] # all elements, reversed
Out[22]:
array([9, 8, 7, 6, 5, 4, 3, 2, 1, 0])
In [23]:
x[5::-2] # reversed every other from index 5
Out[23]:
array([5, 3, 1])

Multi-dimensional subarrays
Multi-dimensional slices work in the same way, with multiple slices separated by commas.
For example:

In [24]:
x2
Out[24]:
array([[12, 5, 2, 4],
[ 7, 6, 8, 8],
[ 1, 6, 7, 7]])
In [25]:
x2[:2, :3] # two rows, three columns
Out[25]:
array([[12, 5, 2],
[ 7, 6, 8]])
In [26]:
x2[:3, ::2] # all rows, every other column
Out[26]:
array([[12, 2],
[ 7, 8],
[ 1, 7]])
Finally, subarray dimensions can even be reversed together:

In [27]:
x2[::-1, ::-1]
Out[27]:
array([[ 7, 7, 6, 1],
[ 8, 8, 6, 7],
[ 4, 2, 5, 12]])

Accessing array rows and columns

One commonly needed routine is accessing of single rows or columns of an array. This can
be done by combining indexing and slicing, using an empty slice marked by a single colon
(:):

In [28]:
print(x2[:, 0]) # first column of x2
[12 7 1]
In [29]:
print(x2[0, :]) # first row of x2
[12 5 2 4]
In the case of row access, the empty slice can be omitted for a more compact syntax:

In [30]:
print(x2[0]) # equivalent to x2[0, :]
[12 5 2 4]

Subarrays as no-copy views


One important–and extremely useful–thing to know about array slices is that they
return views rather than copies of the array data. This is one area in which NumPy array
slicing differs from Python list slicing: in lists, slices will be copies. Consider our two-
dimensional array from before:
In [31]:
print(x2)
[[12 5 2 4]
[ 7 6 8 8]
[ 1 6 7 7]]
Let's extract a 2×2 subarray from this:
In [32]:
x2_sub = x2[:2, :2]
print(x2_sub)
[[12 5]
[ 7 6]]
Now if we modify this subarray, we'll see that the original array is changed! Observe:

In [33]:
x2_sub[0, 0] = 99
print(x2_sub)
[[99 5]
[ 7 6]]
In [34]:
print(x2)
[[99 5 2 4]
[ 7 6 8 8]
[ 1 6 7 7]]
This default behavior is actually quite useful: it means that when we work with large
datasets, we can access and process pieces of these datasets without the need to copy the
underlying data buffer.

Creating copies of arrays


Despite the nice features of array views, it is sometimes useful to instead explicitly copy
the data within an array or a subarray. This can be most easily done with
the copy() method:

In [35]:
x2_sub_copy = x2[:2, :2].copy()
print(x2_sub_copy)
[[99 5]
[ 7 6]]
If we now modify this subarray, the original array is not touched:

In [36]:
x2_sub_copy[0, 0] = 42
print(x2_sub_copy)
[[42 5]
[ 7 6]]
In [37]:
print(x2)
[[99 5 2 4]
[ 7 6 8 8]
[ 1 6 7 7]]

Reshaping of Arrays
Another useful type of operation is reshaping of arrays. The most flexible way of doing this
is with the reshape method. For example, if you want to put the numbers 1 through 9 in
a 3×3 grid, you can do the following:
In [38]:
grid = np.arange(1, 10).reshape((3, 3))
print(grid)
[[1 2 3]
[4 5 6]
[7 8 9]]
Note that for this to work, the size of the initial array must match the size of the reshaped
array. Where possible, the reshape method will use a no-copy view of the initial array, but
with non-contiguous memory buffers this is not always the case.
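A small sketch that makes this visible, continuing the session above and using np.shares_memory to test whether reshape returned a view or a copy:

a = np.arange(6)
b = a.reshape((2, 3))
print(np.shares_memory(a, b))   # True: reshape returned a view

c = b.T                          # the transpose is non-contiguous
d = c.reshape(6)
print(np.shares_memory(c, d))   # False: reshape had to make a copy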

Another common reshaping pattern is the conversion of a one-dimensional array into a


two-dimensional row or column matrix. This can be done with the reshape method, or
more easily done by making use of the newaxis keyword within a slice operation:

In [39]:
x = np.array([1, 2, 3])

# row vector via reshape


x.reshape((1, 3))
Out[39]:
array([[1, 2, 3]])
In [40]:
# row vector via newaxis
x[np.newaxis, :]
Out[40]:
array([[1, 2, 3]])
In [41]:
# column vector via reshape
x.reshape((3, 1))
Out[41]:
array([[1],
[2],
[3]])
In [42]:
# column vector via newaxis
x[:, np.newaxis]
Out[42]:
array([[1],
[2],
[3]])
We will see this type of transformation often throughout the remainder of the book.

Array Concatenation and Splitting


All of the preceding routines worked on single arrays. It's also possible to combine
multiple arrays into one, and to conversely split a single array into multiple arrays. We'll
take a look at those operations here.

Concatenation of arrays
Concatenation, or joining of two arrays in NumPy, is primarily accomplished using the
routines np.concatenate, np.vstack, and np.hstack. np.concatenate takes a tuple or list of
arrays as its first argument, as we can see here:

In [43]:
x = np.array([1, 2, 3])
y = np.array([3, 2, 1])
np.concatenate([x, y])
Out[43]:
array([1, 2, 3, 3, 2, 1])
You can also concatenate more than two arrays at once:

In [44]:
z = [99, 99, 99]
print(np.concatenate([x, y, z]))
[ 1 2 3 3 2 1 99 99 99]
It can also be used for two-dimensional arrays:

In [45]:
grid = np.array([[1, 2, 3],
[4, 5, 6]])
In [46]:
# concatenate along the first axis
np.concatenate([grid, grid])
Out[46]:
array([[1, 2, 3],
[4, 5, 6],
[1, 2, 3],
[4, 5, 6]])
In [47]:
# concatenate along the second axis (zero-indexed)
np.concatenate([grid, grid], axis=1)
Out[47]:
array([[1, 2, 3, 1, 2, 3],
[4, 5, 6, 4, 5, 6]])
For working with arrays of mixed dimensions, it can be clearer to use
the np.vstack (vertical stack) and np.hstack (horizontal stack) functions:

In [48]:
x = np.array([1, 2, 3])
grid = np.array([[9, 8, 7],
[6, 5, 4]])

# vertically stack the arrays


np.vstack([x, grid])
Out[48]:
array([[1, 2, 3],
[9, 8, 7],
[6, 5, 4]])
In [49]:
# horizontally stack the arrays
y = np.array([[99],
[99]])
np.hstack([grid, y])
Out[49]:
array([[ 9, 8, 7, 99],
[ 6, 5, 4, 99]])
Similarly, np.dstack will stack arrays along the third axis.

Splitting of arrays
The opposite of concatenation is splitting, which is implemented by the
functions np.split, np.hsplit, and np.vsplit. For each of these, we can pass a list of
indices giving the split points:

In [50]:
x = [1, 2, 3, 99, 99, 3, 2, 1]
x1, x2, x3 = np.split(x, [3, 5])
print(x1, x2, x3)
[1 2 3] [99 99] [3 2 1]
Notice that N split points lead to N + 1 subarrays. The related
functions np.hsplit and np.vsplit are similar:
In [51]:
grid = np.arange(16).reshape((4, 4))
grid
Out[51]:
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11],
[12, 13, 14, 15]])
In [52]:
upper, lower = np.vsplit(grid, [2])
print(upper)
print(lower)
[[0 1 2 3]
[4 5 6 7]]
[[ 8 9 10 11]
[12 13 14 15]]
In [53]:
left, right = np.hsplit(grid, [2])
print(left)
print(right)
[[ 0 1]
[ 4 5]
[ 8 9]
[12 13]]
[[ 2 3]
[ 6 7]
[10 11]
[14 15]]
Similarly, np.dsplit will split arrays along the third axis.
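A short sketch of np.dstack and np.dsplit, which work along the third axis:

a = np.zeros((2, 2))
b = np.ones((2, 2))

stacked = np.dstack([a, b])          # shape (2, 2, 2)
print(stacked.shape)

left, right = np.dsplit(stacked, 2)  # two arrays of shape (2, 2, 1)
print(left.shape, right.shape)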

Create a NumPy array:

import numpy as np

arr = np.array([1, 2, 3, 4, 5])

print(arr)

print(type(arr))
3. Working with Pandas data frames

import pandas as pd

a = [1, 7, 2]

myvar = pd.Series(a)

print(myvar)

0 1
1 7
2 2
dtype: int64

Introducing the Pandas DataFrame


Pandas DataFrames are data structures that contain:

 Data organized in two dimensions, rows and columns


 Labels that correspond to the rows and columns

You can start working with DataFrames by importing Pandas:


>>> import pandas as pd


Say you’re interested in the candidates’ names, cities, ages, and scores on a
Python programming test, or py-score:

name city age py-score

101 Xavier Mexico City 41 88.0

102 Ann Toronto 28 79.0

103 Jana Prague 33 81.0

104 Yi Shanghai 34 80.0

105 Robin Manchester 38 68.0

106 Amal Cairo 31 61.0

107 Nori Osaka 37 84.0

In this table, the first row contains the column labels (name, city, age, and py-
score). The first column holds the row labels (101, 102, and so on). All other cells
are filled with the data values.

Now you have everything you need to create a Pandas DataFrame.

There are several ways to create a Pandas DataFrame. In most cases, you’ll use
the DataFrame constructor and provide the data, labels, and other information.
You can pass the data as a two-dimensional list, tuple, or NumPy array. You can
also pass it as a dictionary or Pandas Series instance, or as one of several other
data types not covered in this tutorial.

For this example, assume you’re using a dictionary to pass the data:


>>> data = {
... 'name': ['Xavier', 'Ann', 'Jana', 'Yi', 'Robin', 'Amal', 'Nori'],
... 'city': ['Mexico City', 'Toronto', 'Prague', 'Shanghai',
... 'Manchester', 'Cairo', 'Osaka'],
... 'age': [41, 28, 33, 34, 38, 31, 37],
... 'py-score': [88.0, 79.0, 81.0, 80.0, 68.0, 61.0, 84.0]
... }

>>> row_labels = [101, 102, 103, 104, 105, 106, 107]


data is a Python variable that refers to the dictionary that holds your candidate
data. It also contains the labels of the columns:
 'name'
 'city'
 'age'
 'py-score'
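With data and row_labels defined, the DataFrame itself is created with the constructor. The excerpt then switches to a smaller dictionary called d whose definition is not shown here; judging from the output below, d is presumably something like the following (an assumption, not part of the original):

>>> df = pd.DataFrame(data=data, index=row_labels)

>>> d = {'x': [1, 2, 3], 'y': [2, 4, 8], 'z': 100}  # assumed definition of d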

>>> pd.DataFrame(d, index=[100, 200, 300], columns=['z', 'y', 'x'])
       z  y  x
100  100  2  1
200  100  4  2
300  100  8  3

As you can see, you’ve specified the row labels 100, 200, and 300. You’ve also forced the
order of columns: z, y, x.

Creating a Pandas DataFrame With Lists


Another way to create a Pandas DataFrame is to use a list of dictionaries:


>>> l = [{'x': 1, 'y': 2, 'z': 100},
...      {'x': 2, 'y': 4, 'z': 100},
...      {'x': 3, 'y': 8, 'z': 100}]

>>> pd.DataFrame(l)
   x  y    z
0  1  2  100
1  2  4  100
2  3  8  100

Again, the dictionary keys are the column labels, and the dictionary values are the data values
in the DataFrame.

You can also use a nested list, or a list of lists, as the data values. If you do, then it’s wise to
explicitly specify the labels of columns, rows, or both when you create the DataFrame:


>>> l = [[1, 2, 100],
...      [2, 4, 100],
...      [3, 8, 100]]

>>> pd.DataFrame(l, columns=['x', 'y', 'z'])
   x  y    z
0  1  2  100
1  2  4  100
2  3  8  100

Creating a Pandas DataFrame With NumPy Arrays


You can pass a two-dimensional NumPy array to the DataFrame constructor the
same way you do with a list:


>>> import numpy as np
>>> arr = np.array([[1, 2, 100],
...                 [2, 4, 100],
...                 [3, 8, 100]])

>>> df_ = pd.DataFrame(arr, columns=['x', 'y', 'z'])
>>> df_
   x  y    z
0  1  2  100
1  2  4  100
2  3  8  100

Read CSV Files


A simple way to store big data sets is to use CSV files (comma-separated values
files).

CSV files contain plain text and are a well-known format that can be read by
everyone, including Pandas.

In our examples we will be using a CSV file called 'data.csv'.

Example
Load the CSV into a DataFrame:
import pandas as pd

df = pd.read_csv('data.csv')

print(df.to_string())
Duration Pulse Maxpulse Calories
0 60 110 130 409.1
1 60 117 145 479.0
2 60 103 135 340.0
3 45 109 175 282.4
4 45 117 148 406.0
5 60 102 127 300.5
6 60 110 136 374.0
7 45 104 134 253.3
8 30 109 133 195.1
9 60 98 124 269.0
10 60 103 147 329.3
11 60 100 120 250.7
12 60 106 128 345.3
13 60 104 132 379.3
14 60 98 123 275.0
15 60 98 120 215.2
16 60 100 120 300.0
17 45 90 112 NaN
18 60 103 123 323.0
19 45 97 125 243.0
20 60 108 131 364.2
21 45 100 119 282.0
22 60 130 101 300.0
23 45 105 132 246.0
24 60 102 126 334.5
25 60 100 120 250.0
26 60 92 118 241.0
27 60 103 132 NaN
28 60 100 132 280.0
29 60 102 129 380.3
30 60 92 115 243.0
31 45 90 112 180.1
32 60 101 124 299.0
33 60 93 113 223.0
34 60 107 136 361.0
35 60 114 140 415.0
36 60 102 127 300.5
37 60 100 120 300.1
38 60 100 120 300.0
39 45 104 129 266.0
40 45 90 112 180.1
41 60 98 126 286.0
42 60 100 122 329.4
43 60 111 138 400.0
44 60 111 131 397.0
45 60 99 119 273.0
46 60 109 153 387.6
47 45 111 136 300.0
48 45 108 129 298.0
49 60 111 139 397.6
50 60 107 136 380.2
51 80 123 146 643.1
52 60 106 130 263.0
53 60 118 151 486.0
54 30 136 175 238.0
55 60 121 146 450.7
56 60 118 121 413.0
57 45 115 144 305.0
58 20 153 172 226.4
59 45 123 152 321.0
60 210 108 160 1376.0
61 160 110 137 1034.4
62 160 109 135 853.0
63 45 118 141 341.0
64 20 110 130 131.4
65 180 90 130 800.4
66 150 105 135 873.4
67 150 107 130 816.0
68 20 106 136 110.4
69 300 108 143 1500.2
70 150 97 129 1115.0
71 60 109 153 387.6
72 90 100 127 700.0
73 150 97 127 953.2
74 45 114 146 304.0
75 90 98 125 563.2
76 45 105 134 251.0
77 45 110 141 300.0
78 120 100 130 500.4
79 270 100 131 1729.0
80 30 159 182 319.2
81 45 149 169 344.0
82 30 103 139 151.1
83 120 100 130 500.0
84 45 100 120 225.3
85 30 151 170 300.1
86 45 102 136 234.0
87 120 100 157 1000.1
88 45 129 103 242.0
89 20 83 107 50.3
90 180 101 127 600.1
91 45 107 137 NaN
92 30 90 107 105.3
93 15 80 100 50.5
94 20 150 171 127.4
95 20 151 168 229.4
96 30 95 128 128.2
97 25 152 168 244.2
98 30 109 131 188.2
99 90 93 124 604.1
100 20 95 112 77.7
101 90 90 110 500.0
102 90 90 100 500.0
103 90 90 100 500.4
104 30 92 108 92.7
105 30 93 128 124.0
106 180 90 120 800.3
107 30 90 120 86.2
108 90 90 120 500.3

4. Reading data from text files, Excel and the web and exploring various commands for doing
descriptive analytics on the Iris data set.

Exploratory Data Analysis : Iris Dataset


The Iris flower data set, or Fisher’s Iris data set, is one of the most
famous multivariate data sets used for testing various machine
learning algorithms.

Data insights are given for each visualization step.

Importing relevant libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import metrics
sns.set()

Source Of Data

Data has been stored inside a csv file namely ‘iris.csv’

Loading data

iris_data = pd.read_csv('iris.csv')
iris_data
Complete Iris Dataset
Visual description of various features of Iris Species

Gaining information from data

iris_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 sepal_length 150 non-null float64
1 sepal_width 150 non-null float64
2 petal_length 150 non-null float64
3 petal_width 150 non-null float64
4 species 150 non-null object
dtypes: float64(4), object(1)
memory usage: 6.0+ KB

Data Insights:

1. No column has any null entries
2. Four columns are of numerical type
3. Only a single column is of categorical type

Statistical Insight

iris_data.describe()
Data Insights:

1. Mean values
2. Standard deviation
3. Minimum values
4. Maximum values

Checking For Duplicate Entries

iris_data[iris_data.duplicated()]

Duplicate Entries

There are 3 duplicates, therefore we must check whether each
species subset of the data is balanced in count or not.
Checking the balance

iris_data['species'].value_counts()
setosa 50
versicolor 50
virginica 50
Name: species, dtype: int64

Therefore we shouldn’t delete the entries, as doing so might imbalance the
data set and hence make it less useful for valuable insights.

Data Visualization

Species count

plt.title('Species Count')
sns.countplot(iris_data['species'])

Data Insight:

1. This further visualizes that the species are well balanced
2. Each species (Iris virginica, setosa, versicolor) has a count of 50

Iris Flower Species

Uni-variate Analysis

Comparison between various species based on sepal


length and width

plt.figure(figsize=(17,9))
plt.title('Comparison between various species based on sepal length and width')
sns.scatterplot(iris_data['sepal_length'], iris_data['sepal_width'],
                hue=iris_data['species'], s=50)
Data Insights:

1. Iris Setosa species has smaller sepal length but higher


width.
2. Versicolor lies in almost middle for length as well as width
3. Virginica has larger sepal lengths and smaller sepal widths

Comparison between various species based on petal


length and width

plt.figure(figsize=(16,9))
plt.title('Comparison between various species based on petal length and width')
sns.scatterplot(iris_data['petal_length'], iris_data['petal_width'],
                hue=iris_data['species'], s=50)
Data Insights

1. Setosa species have the smallest petal length as well as


petal width
2. Versicolor species have average petal length and petal
width
3. Virginica species have the highest petal length as well as
petal width

Bi-variate Analysis

sns.pairplot(iris_data, hue="species", height=4)
Data Insights:

1. High correlation between the petal length and petal width columns.


2. Setosa has both low petal length and width
3. Versicolor has both average petal length and width
4. Virginica has both high petal length and width.
5. Sepal width for Setosa is high and length is low.
6. Versicolor has average values for sepal dimensions.
7. Virginica has small sepal width but large sepal length

Checking Correlation
plt.figure(figsize=(10,11))
sns.heatmap(iris_data.corr(),annot=True)
plt.plot()

Heatmap

Data Insights:

1. Sepal length and sepal width features are slightly correlated with each other

Checking Mean & Median Values for each species

iris_data.groupby('species').agg(['mean', 'median'])

mean and median outputs

Visualizing the distribution, mean and median using box plots &
violin plots

Box plots to know about distribution

A boxplot to see how the categorical feature "species" is distributed
with respect to all four other input variables

fig, axes = plt.subplots(2, 2, figsize=(16,9))

sns.boxplot(y="petal_width", x="species", data=iris_data, orient='v', ax=axes[0, 0])
sns.boxplot(y="petal_length", x="species", data=iris_data, orient='v', ax=axes[0, 1])
sns.boxplot(y="sepal_length", x="species", data=iris_data, orient='v', ax=axes[1, 0])
sns.boxplot(y="sepal_width", x="species", data=iris_data, orient='v', ax=axes[1, 1])
plt.show()

Box Plots

Data Insights:

1. Setosa has smaller features and is less spread out
2. Versicolor is distributed in an average manner and has average features
3. Virginica is highly distributed, with a large number of values and larger features
4. Clearly, the mean/median values are shown by each plot for the various
features (sepal length & width, petal length & width)

Violin Plot for checking distribution

The violin plot shows density of the length and width in the species.
The thinner part denotes that there is less density whereas the fatter
part conveys higher density
fig, axes = plt.subplots(2, 2, figsize=(16,10))
sns.violinplot(y="petal_width", x="species", data=iris_data, orient='v', ax=axes[0, 0], inner='quartile')
sns.violinplot(y="petal_length", x="species", data=iris_data, orient='v', ax=axes[0, 1], inner='quartile')
sns.violinplot(y="sepal_length", x="species", data=iris_data, orient='v', ax=axes[1, 0], inner='quartile')
sns.violinplot(y="sepal_width", x="species", data=iris_data, orient='v', ax=axes[1, 1], inner='quartile')
plt.show()

Violin Plot

Data Insights:

1. Setosa has less spread and density in the case of petal length & width
2. Versicolor is distributed in an average manner with average features in the case of
petal length & width
3. Virginica is highly distributed with a large number of values and features in the
case of sepal length & width
4. High density values depict the mean/median values; for example, Iris Setosa has its
highest density at 5.0 cm (sepal length feature), which is also the median value (5.0)
as per the table

Mean / Median Table for reference

Plotting the Histogram & Probability Density Function (PDF)

Plotting the probability density function (PDF) with each feature as a
variable on the X-axis and its histogram and corresponding kernel
density plot on the Y-axis.

sns.FacetGrid(iris_data, hue="species", height=5) \
   .map(sns.distplot, "sepal_length") \
   .add_legend()

sns.FacetGrid(iris_data, hue="species", height=5) \
   .map(sns.distplot, "sepal_width") \
   .add_legend()

sns.FacetGrid(iris_data, hue="species", height=5) \
   .map(sns.distplot, "petal_length") \
   .add_legend()

sns.FacetGrid(iris_data, hue="species", height=5) \
   .map(sns.distplot, "petal_width") \
   .add_legend()
plt.show()

Plot 1 | Classification feature : Sepal Length


Plot 2 | Classification feature : Sepal Width

Plot 3 | Classification feature : Petal Length


Plot 4 | Classification feature : Petal Width

Data Insights:

1. Plot 1 shows that there is a significant amount of overlap between the species on
sepal length, so it is not an effective classification feature
2. Plot 2 shows that there is even higher overlap between the species on sepal width,
so it is not an effective classification feature
3. Plot 3 shows that petal length is a good classification feature, as it clearly
separates the species. The overlap is extremely small (between Versicolor and
Virginica), and Setosa is well separated from the other two
4. Just like Plot 3, Plot 4 also shows that petal width is a good classification
feature. The overlap is significantly smaller (between Versicolor and Virginica), and
Setosa is well separated from the other two
Choosing Plot 3 (classification feature: Petal Length) to distinguish among the species

Plot 3 | Classification feature : Petal Length

The pdf curve of Iris Setosa ends roughly at 2.1

Data Insights:

1. The pdf curve of Iris Setosa ends roughly at 2.1


2. If petal length < 2.1, then species is Iris Setosa
3. The point of intersection between pdf curves of Versicolor
and Virginica is roughly at 4.8
4. If petal length > 2.1 and petal length < 4.8 then species is
Iris Versicolor
5. If petal length > 4.8 then species is Iris Virginica
5. Use the diabetes data set from UCI and Pima Indians Diabetes data set for performing the
following:

a. Univariate analysis: Frequency, Mean, Median, Mode, Variance, Standard Deviation,


Skewness and Kurtosis.

b. Bivariate analysis: Linear and logistic regression modeling

c. Multiple Regression analysis

d. Also compare the results of the above analysis for the two data sets.

In [1]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('darkgrid')
%matplotlib inline
from matplotlib.ticker import FormatStrFormatter
import warnings
warnings.filterwarnings('ignore')
In [2]:

df = pd.read_csv('../input/pima-indians-diabetes-database/diabetes.csv')
df.head()
Out[2]:

   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  DiabetesPedigreeFunction  Age  Outcome
0            6      148             72             35        0  33.6                     0.627   50        1
1            1       85             66             29        0  26.6                     0.351   31        0
2            8      183             64              0        0  23.3                     0.672   32        1
3            1       89             66             23       94  28.1                     0.167   21        0
4            0      137             40             35      168  43.1                     2.288   33        1

Introduction :
In this notebook I am going to do some univariate statistical analysis. I will avoid cleaning
the data and dealing with missing values, so that I can see real insights from the data.

A third analysis on this dataset (making inferences about the population) is available
separately.

I will be using only the first 3 or 4 variables to show the workflow of univariate statistical
analysis.

In [3]:

df.shape
Out[3]:

(768, 9)
In [4]:

df.dtypes
Out[4]:

Pregnancies int64
Glucose int64
BloodPressure int64
SkinThickness int64
Insulin int64
BMI float64
DiabetesPedigreeFunction float64
Age int64
Outcome int64
dtype: object
Datatypes are perfect, but we can change Outcome to a boolean datatype, which will save
space.

In [5]:

df['Outcome']=df['Outcome'].astype('bool')
In [6]:

df.dtypes['Outcome']
Out[6]:

dtype('bool')
In [7]:

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
Pregnancies 768 non-null int64
Glucose 768 non-null int64
BloodPressure 768 non-null int64
SkinThickness 768 non-null int64
Insulin 768 non-null int64
BMI 768 non-null float64
DiabetesPedigreeFunction 768 non-null float64
Age 768 non-null int64
Outcome 768 non-null bool
dtypes: bool(1), float64(2), int64(6)
memory usage: 48.9 KB
In [8]:

df.describe().T
Out[8]:

                           count        mean         std     min       25%       50%        75%     max
Pregnancies                768.0    3.845052    3.369578   0.000   1.00000    3.0000    6.00000   17.00
Glucose                    768.0  120.894531   31.972618   0.000  99.00000  117.0000  140.25000  199.00
BloodPressure              768.0   69.105469   19.355807   0.000  62.00000   72.0000   80.00000  122.00
SkinThickness              768.0   20.536458   15.952218   0.000   0.00000   23.0000   32.00000   99.00
Insulin                    768.0   79.799479  115.244002   0.000   0.00000   30.5000  127.25000  846.00
BMI                        768.0   31.992578    7.884160   0.000  27.30000   32.0000   36.60000   67.10
DiabetesPedigreeFunction   768.0    0.471876    0.331329   0.078   0.24375    0.3725    0.62625    2.42
Age                        768.0   33.240885   11.760232  21.000  24.00000   29.0000   41.00000   81.00

Basic Summary :
The data is related to the healthcare industry, with 768 observations and 9 variables. The target
variable is Outcome. It looks like there are no missing values, and boolean, float and integer
datatypes are present. The descriptive analysis shows that the variables Glucose,
BloodPressure, SkinThickness, Insulin and BMI have a minimum value of 0, which does not make any
sense; these values are either missing or outliers, and I will be treating them later. In the
Pregnancies column the minimum is 0 (maybe this is a sign of no pregnancy), which is reasonable,
but the maximum value of 17 seems questionable; I will deal with it later. Variance among the
different predictor variables varies on a large scale, so scaling the data will be helpful for
predictive modelling.

Pregnancy Variable
In [9]:

preg_proportion = np.array(df['Pregnancies'].value_counts())
preg_month = np.array(df['Pregnancies'].value_counts().index)
preg_proportion_perc = np.array(np.round(preg_proportion/sum(preg_proportion), 3)*100, dtype=int)
In [10]:

preg = pd.DataFrame({'month': preg_month,
                     'count_of_preg_prop': preg_proportion,
                     'percentage_proportion': preg_proportion_perc})
preg.set_index(['month'], inplace=True)
preg.head(10)
Out[10]:

       count_of_preg_prop  percentage_proportion
month
1                     135                     17
0                     111                     14
2                     103                     13
3                      75                      9
4                      68                      8
5                      57                      7
6                      50                      6
7                      45                      5
8                      38                      4
9                      28                      3


Understanding Distribution
The distribution of Pregnancies in the data is unimodal and skewed to the right, centered at
about 1, with most of the data between 0 and 15, a range of roughly 15, and outliers
present on the higher end.

Note :- The boxplot of both categories shows that people with a higher number of pregnancies
have a higher risk of diabetes (there is no statistical evidence; maybe I will test a
hypothesis in my 3rd notebook on this data).
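The plotting code for these figures is not included in this excerpt; a minimal sketch of the kind of figures being described (a distribution plot plus a boxplot split by Outcome) might look like this:

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Distribution of the Pregnancies variable
sns.histplot(df['Pregnancies'], kde=True, ax=axes[0])
axes[0].set_title('Distribution of Pregnancies')

# Boxplot of Pregnancies split by diabetic vs non-diabetic Outcome
sns.boxplot(x='Outcome', y='Pregnancies', data=df, ax=axes[1])
axes[1].set_title('Pregnancies by Outcome')
plt.show()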

Glucose Variable
In [12]:

df.Glucose.describe()
Out[12]:

count 768.000000
mean 120.894531
std 31.972618
min 0.000000
25% 99.000000
50% 117.000000
75% 140.250000
max 199.000000
Name: Glucose, dtype: float64
Understanding Distribution
The distribution of Glucose level among patients is unimodal and roughly bell shaped,
centered at about 115, with most of the data between 90 and 140, a range of roughly 150,
and outliers present on the lower end (Glucose == 0).

Note :- The boxplot of both categories shows that people with higher Glucose levels have a
higher risk of diabetes (there is no statistical evidence; maybe I will test a hypothesis in my
3rd notebook on this data). We can also see that some outliers are present in the non-diabetic
patient observations.

Let us verify the distribution by keeping only the non-zero entries of Glucose.



Blood Pressure variable
In [15]:

df.BloodPressure.describe()
Out[15]:

count 768.000000
mean 69.105469
std 19.355807
min 0.000000
25% 62.000000
50% 72.000000
75% 80.000000
max 122.000000
Name: BloodPressure, dtype: float64
Understanding Distribution
The distribution of BloodPressure among patients is unimodal (this is not bimodal,
because BP = 0 does not make any sense and is an outlier) and bell shaped, centered at
about 65, with most of the data between 60 and 90, a range of roughly 100, and outliers
present on the lower end (BP == 0).

Note :- The boxplot of both categories shows that there is little association of BP with
diabetic vs non-diabetic patients (there is no statistical evidence; maybe I will test a
hypothesis in my 3rd notebook on this data). We can also see that some outliers are present.

Let us verify the distribution by keeping only the non-zero entries of BloodPressure.

Please notice how outliers can change your whole understanding of the data.



SkinThickness Variable
In [19]:

df.SkinThickness.describe()
Out[19]:

count 768.000000
mean 20.536458
std 15.952218
min 0.000000
25% 0.000000
50% 23.000000
75% 32.000000
max 99.000000
Name: SkinThickness, dtype: float64
Understanding Distribution
The distribution of SkinThickness among patients looks bimodal (but I think it is not
truly bimodal, because ST = 0 does not make any sense and may affect the distribution;
I will verify it) and is bell shaped, centered at about 20, with most of the data
between 15 and 45, a range of roughly 60, and outliers present on the lower end (ST == 0).

Note :- The boxplot of both categories shows that there is little association of ST with
diabetic vs non-diabetic patients (there is no statistical evidence; maybe I will test a
hypothesis in my 3rd notebook on this data).

Let us verify the distribution by keeping only non-zero SkinThickness entries.



PIMA Indian Diabetes Prediction
Predicting the onset of diabetes

Diabetes is a chronic condition in which the body develops a resistance to insulin, a
hormone which converts food into glucose. Diabetes affects many people worldwide and is
normally divided into Type 1 and Type 2 diabetes; both have different characteristics. This
article intends to analyze and create a model on the PIMA Indian Diabetes dataset to predict
whether a particular observation is at risk of developing diabetes, given the independent
factors. This article contains the methods followed to create a suitable model, including
EDA, along with the model.

Dataset

The dataset can be found on the Kaggle website. This dataset is


originally from the National Institute of Diabetes and Digestive and
Kidney Diseases and can be used to predict whether a patient has
diabetes based on certain diagnostic factors. Starting off, I use
Python 3.3 to implement the model. It is important to perform some
basic analysis to get an overall idea of the dataset.
#Importing basic packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pandas_profiling import ProfileReport

#Importing the Dataset
diabetes = pd.read_csv("diabetes.csv")
dataset = diabetes

#EDA using Pandas Profiling
file = ProfileReport(dataset)
file.to_file(output_file='output.html')

Pandas profiling is an efficient way to get overall as well as in-depth information
about the dataset and the variables in it. However, caution must be exercised if the
dataset is very large, as Pandas Profiling is time-consuming. Since the dataset has only
768 observations and 9 columns, we use this function. The output gets saved as an HTML
report in the working directory.

Overview of the Dataset


We can see the basic information about the dataset such as the size, missing values,
etc. On the top right, we see 8 numerical columns and 1 Boolean column (which is our
dependent variable). In the lower panel, the percentage of zeros is given for every
column, which will be useful information for us later. We do not have any categorical
variable as an independent variable.

Exploratory Data Analysis (EDA)

Having observed the basic characteristics of the dataset, we now


move on to observe characteristics of the variables involved in the
study. Again, Pandas Profiling comes to our rescue. The same
HTML report gives information on the variables.

Independent Variable: Age

Let’s have a look at Age. By having a quick look at the histogram on the top right, we
can observe much of the characteristics. The variable does not follow a normal
distribution, as it is skewed to the right (or positively skewed). The average age is 33
whereas the median age is 29. This further confirms our analysis (as in the case of a
normal distribution, the mean should be approximately equal to the median). Most
importantly, none of the values seem abnormal; that is, the minimum age of 21 and the
maximum age of 81 are possible for the study. Let’s have a look at other variables.

Independent Variable: Blood Pressure

Doing the same for Blood Pressure, we can see that the variable can
approximate to a normal distribution. However, we can not confirm
that visually. We hence perform the Shapiro-Wilk test of normality.
The null hypothesis (Ho) is that the data is normal.
from scipy.stats import shapiro

stat, p = shapiro(dataset['BloodPressure'])
print('Statistics=%.3f, p=%.3f' % (stat, p))

Statistics=0.819, p=0.000

The p-value is less than 0.001, implying that at 5% Level of


Significance (LOS) we reject our null hypothesis (Ho). Therefore, the
variable does not follow a normal distribution.
Secondly, if we observe the minimum value for Blood Pressure, it is
given as 0 (which is not possible). Therefore, there is incorrect
information given. We have two choices now. One is to drop such
observations (which results in 4.6% loss of data) or we can replace
such values with median (imputing). I prefer imputing as we have a
small dataset (768 observation only). Hence, every information is
important.

We can simply replace the zero values by median or we can classify


the variable based on outcome and then replace the two separately.
The latter seems more efficient. We draw a violin graph to have a
look at the behavior.
#Classifying the Blood Pressure based on class
ax = sns.violinplot(x="Outcome", y="BloodPressure", data=dataset,
                    palette="muted", split=True)
We get the above graph. If we observe minutely, we can see that the
box plot for 1 (Diabetic) inside the violin is a little more away from
the horizontal axis than the box plot for 0 (Non Diabetic). It can be
implied that diabetics seem to have a higher blood pressure than the
non-diabetics. The bottom tail of the violins indicates the zero
values we need to replace. We will replace the zeros for 1 with
median of 1 and same for 0.
#Replacing the zero-values for Blood Pressure
df1 = dataset.loc[dataset['Outcome'] == 1]
df2 = dataset.loc[dataset['Outcome'] == 0]

df1 = df1.replace({'BloodPressure': 0}, np.median(df1['BloodPressure']))
df2 = df2.replace({'BloodPressure': 0}, np.median(df2['BloodPressure']))

dataframe = [df1, df2]
dataset = pd.concat(dataframe)

There won’t be any zero-values in the BloodPressure column after this.
Let’s move to the next variable.

Independent Variable: BMI

The variable seems to be closely following the normal distribution as


the mean and median are approximately equal. However, it faces the
same problem as before, that is, the existence of zero-values. Let’s
draw a violin plot for this variable.

We can observe that the 1 follows normal distribution,


while 0 doesn’t. Also, BMI for diabetics is more than BMI for non-
diabetics (can be observed using box plot). We do the same
treatment for zero-values in BMI that we followed for zero-values for
Blood Pressure. In this case, we can also replace by mean, however, I
will stick with median. We run the same code after changing the
column name. Moving on to the next variable.
Independent Variable: DiabetesPedigreeFunction

Diabetes Pedigree Function is a positively skewed variable with no


zero values. We use the same violin plot to observe the
characteristics.
The same hypothesis can be formed: diabetics seem to have a higher pedigree function
than the non-diabetics. Moving on to the 5th independent variable.

Independent Variable: Glucose

Analyzing Glucose, we observe the variable not following the normal


distribution. We encounter zero-values in this instance as well.
There are 5 such values for which treatment is required. We perform
the same treatment as before, replacing by median (class-wise).
Observing the violin plot, we see a massive vertical distance between
the box-plot for Diabetics and Non-Diabetics. This indicates that
Glucose can be a very important variable in model-building. After
treating the zero values, we move to the next variable.

Independent Variable: Insulin


As mentioned in the data dictionary available with the
dataset, Insulin is the 2-Hour serum insulin (mu U/ml). The
variable is positively skewed. However, the occurrence of zero-
values is high in this case, making up 48.7% of the data. This data
has to be imputed. We first treat these.

After treating the zero-values, we derive at this graph. We can still


see 0 as the median for Insulin for Diabetics. However, for Non-
Diabetics, Insulin is a little higher. It can be roughly hypothesized
that Insulin for Diabetics is lower than Non-Diabetics. Moving
forward to Pregnancies.
Independent Variable: Pregnancies

The variable is positively skewed with 14.5% zero values. We need not treat the zero
values, as a zero is not an abnormal occurrence here. However, given the fact that 85.5%
of the values are non-zero, we can infer that this study is done for females only.
Looking at the behavior using violin plots, we observe that diabetic women had more
pregnancies than non-diabetic women.
Independent Variable: SkinThickness

Moving on to our last independent variable, we observe the same patterns as for most of
the previous variables. The data is positively skewed with 29.6% zero values. After
treating them, we use the violin plot to observe a crude relationship.

Skin Thickness for Diabetics is more than that of Non-Diabetics.

Our Dependent Variable ‘Outcome’ takes Boolean values 0 and 1. 0


indicates non-diabetic and 1 indicates diabetic. To examine the
occurrence, we use a simple bar plot

Frequency of 0 and 1 in Outcome

The imbalance in the data can be clearly seen with 0 (Non-


Diabetic) being the modal class. We will treat this imbalance later in
the process.
Checking for Multi collinearity

The correlation matrix below uses Pearson’s correlation coefficient


to illustrate the relationship between variables. From the figure, a
significant correlation can be observed
between Pregnancies and Age. To further confirm, we calculate the
correlation coefficient.

from scipy.stats import pearsonr

corr, _ = pearsonr(dataset['Age'], dataset['Pregnancies'])
print('Pearsons correlation: %.3f' % corr)

Pearsons correlation: 0.544

The correlation coefficient (r) is 0.544. As a rule of thumb, multicollinearity is
expected in the case of an r above 0.70. Hence, no significant case of multicollinearity
is observed.
Treating Outliers and Non-Normality

Outliers are extreme values existing in the dataset. It is necessary to treat outliers
if a distance-based algorithm (logistic regression, SVM, etc.) is applied on the
dataset. Outliers do not affect a tree-based algorithm. Since we will use both distance
and tree-based algorithms, we will scale our data to treat outliers. We use Standard
Scaler for the process. Standard Scaler transforms the feature by subtracting the mean
and dividing by the standard deviation. This way the feature also gets close to a
standard normal distribution with mean 0.
#Splitting the data into dependent and independent variables
Y = dataset.Outcome
x = dataset.drop('Outcome', axis=1)
columns = x.columns

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X = scaler.fit_transform(x)
data_x = pd.DataFrame(X, columns=columns)

We have scaled our X values.

Splitting the dataset into Training and Test data

We now split our processed dataset into Training and Test data. The Test data size is
taken to be 15% of the entire data (which means 115 observations), and the model will be
trained on 653 observations.
#Splitting the data into training and test
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(data_x, Y, test_size=0.15, random_state=45)

Plot showing frequency of 0 and 1 in y_train

A huge imbalance can be observed in the y_train values. To


overcome this problem, we use the SMOTE technique.

Synthetic Minority Oversampling Technique (SMOTE) is used to remove the imbalance in the
training data by creating samples using the current data. It does not create duplicates.
Remember, it is always done on the Training Data and not on the original data, as the
Test Data should only contain real-life values and not synthetic samples.
from imblearn.over_sampling import SMOTE

smt = SMOTE()
x_train, y_train = smt.fit_sample(x_train, y_train)
np.bincount(y_train)

Out[74]: array([430, 430])

We now have a balanced Training data.


Our data is now prepared to fit a model

Model Fitting: Logistic Regression

The first model we fit on the training data is the Logistic Regression.
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression()
logreg.fit(x_train, y_train)
y_pred = logreg.predict(x_test)
print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(logreg.score(x_test, y_test)))

Out[76]: Accuracy of logistic regression classifier on test set: 0.73

We get a 73% accuracy score on the test data.


from sklearn.metrics import f1_score, precision_score, recall_score

print(f1_score(y_test, y_pred, average="macro"))
print(precision_score(y_test, y_pred, average="macro"))
print(recall_score(y_test, y_pred, average="macro"))
0.723703419131771
0.7220530003045994
0.7263975155279503

Our Precision for the model stands at 0.722. This indicates that, of the patients our
model classified as high risk, 72% actually had a high risk of getting diabetes.

The Recall/Sensitivity is 0.726, implying that 72% of the people who actually had a high
risk were classified correctly by our model.

Model Fitting: Support Vector Machine (Kernel: rbf)

The next model we fit on the training data is the Support Vector Machine (SVM). SVM can
use many kernels to classify the data. We use the rbf/Gaussian kernel to fit the first
SVM model.
from sklearn.svm import SVC

classifier_rbf = SVC(kernel='rbf')
classifier_rbf.fit(x_train, y_train)
y_pred = classifier_rbf.predict(x_test)
print('Accuracy of SVC (RBF) classifier on test set: {:.2f}'.format(classifier_rbf.score(x_test, y_test)))

Out[76]: Accuracy of SVC (RBF) classifier on test set: 0.75

print(f1_score(y_test, y_pred, average="macro"))
print(precision_score(y_test, y_pred, average="macro"))
print(recall_score(y_test, y_pred, average="macro"))
0.7431080565101182
0.7410256410256411
0.7481366459627329

We have an improved accuracy using SVM with rbf kernel. The


model accuracy comes to 75%, with improved Precision and Recall
values compared to Logistic Regression.

Model Fitting: Random Forest

We use the Random Forest Classifier, with 300 trees (arrived at after tuning the model),
to fit a model on the data.
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=300, bootstrap=True, max_features='sqrt')
model.fit(x_train, y_train)
y_pred = model.predict(x_test)
print('Accuracy of Random Forest on test set: {:.2f}'.format(model.score(x_test, y_test)))

Out[95]: Accuracy of Random Forest on test set: 0.88

print(f1_score(y_test, y_pred, average="macro"))
print(precision_score(y_test, y_pred, average="macro"))
print(recall_score(y_test, y_pred, average="macro"))
0.8729264475743349
0.8762626262626263
0.8701863354037267

We get the highest accuracy for Random Forest, with the score reaching 88%. This implies
our model classified observations correctly 88% of the time.
The Precision score stood at 0.876, implying that observations our model placed in the
high-risk category actually belonged there 87.6% of the time. The Recall stood at 0.870.

We also have an F1 score of 0.872. The F1 score is the harmonic mean of precision and
recall. It assigns equal weight to both metrics. However, for our analysis it is
relatively more important for the model to have few false negative cases (as it would be
dangerous to classify high-risk patients in the low-risk category). Therefore, we look
at Precision and Recall individually.
The figure above shows the relative importance of features and their contribution to the
model. Since it was a small dataset with few columns, I didn’t use a feature selection
technique such as PCA.

Conclusion

We thus select the Random Forest Classifier as the right model due to its high accuracy,
precision and recall scores. One reason why the Random Forest Classifier showed improved
performance was the presence of outliers. As mentioned before, since Random Forest is
not a distance-based algorithm, it is not much affected by outliers, whereas
distance-based algorithms such as Logistic Regression and Support Vector Machines showed
lower performance.

Based on the feature importance:

1. Glucose is the most important factor in determining the


onset of diabetes followed by BMI and Age.
2. Other factors such as Diabetes Pedigree Function,
Pregnancies, Blood Pressure, Skin Thickness and Insulin
also contribute to the prediction.

As we can see, the results derived from feature importance make sense, as one of the
first things actually monitored in high-risk patients is the glucose level. An increased
BMI might also indicate a risk of developing Type II diabetes. Normally, especially in
the case of Type II diabetes, the risk of developing it increases as the age of a person
increases (given other factors).
6. Apply and explore various plotting functions on UCI data sets.
a. Normal curves

b. Density and contour plots

c. Correlation and scatter plots

d. Histograms

e. Three dimensional plotting

Exploratory Data Analysis on the Heart Disease UCI data set

Motivation

 Exploratory Data Analysis (EDA) is a pre-processing step to understand the data. There are numerous methods and steps in performing EDA; however, most of them are specific, focusing on either visualization or distribution, and are incomplete. Therefore, here I will walk through, step by step, how to understand, explore, and extract information from the data to answer questions or check assumptions. There are no rigidly structured steps or methods to follow; however, this project will provide an insight on EDA for you and my future self.

Introduction

Cardiovascular diseases (CVDs), or heart disease, are the number one cause of death globally, with about 17.9 million deaths each year. CVDs are driven by a combination of hypertension, diabetes, overweight and unhealthy lifestyles. You can read more on heart disease statistics and causes for self-understanding. This project covers manual exploratory data analysis and the use of pandas profiling in a Jupyter Notebook, on Google Colab. The dataset used in this project is the UCI Heart Disease dataset, and both the data and the code for this project are available on my GitHub repository.

Data Set Explanations

Initially, the dataset contains 76 features or attributes from 303 patients; however, published studies chose only 14 features that are relevant in predicting heart disease. Hence, here we will be using the dataset consisting of 303 patients with a 14-feature set.

The outline for EDA are as follows;

1. Import and get to know the data


2. Data Cleaning

a) Check the data type

b) Check for the data characters mistakes

c) Check for missing values and replace them

d) Check for duplicate rows

e) Statistics summary

f) Outliers and how to remove them

3. Distributions and Relationship

a) Categorical variable distribution


b) Continuous variable distribution

c) Relationship between categorical and continuous variables

4. Automated EDA using pandas profiling report

Variables or features explanations:

1. age (Age in years)


2. sex : (1 = male, 0 = female)
3. cp (Chest Pain Type): [ 0: asymptomatic, 1: atypical angina,
2: non-anginal pain, 3: typical angina]
4. trestbps (Resting Blood Pressure in mm/hg )
5. chol (Serum Cholesterol in mg/dl)
6. fbs (Fasting Blood Sugar > 120 mg/dl): [0 = no, 1 = yes]
7. restecg (Resting ECG): [0: showing probable or definite left
ventricular hypertrophy by Estes’ criteria, 1: normal, 2:
having ST-T wave abnormality]
8. thalach (maximum heart rate achieved)
9. exang (Exercise Induced Angina): [1 = yes, 0 = no]
10. oldpeak (ST depression induced by exercise relative to
rest)
11.slope (the slope of the peak exercise ST segment): [0:
downsloping; 1: flat; 2: upsloping]
12. ca [number of major vessels (0–3)]
13. thal : [1 = normal, 2 = fixed defect, 3 = reversible defect]
14. target: [0 = disease, 1 = no disease]

The dataset description is provided here.

Let’s begin…!!
1. Import and get to know the data

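The loading code is not reproduced above; a minimal sketch is given below, assuming the data set has been saved locally as heart.csv (the file name and path are assumptions).

import pandas as pd

df = pd.read_csv('heart.csv')   # assumed local copy of the UCI Heart Disease data
print(df.shape)                 # expected: (303, 14)
df.head()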

Here we have 303 rows with 14 variables.

2. Data Cleaning

a) Check the data type.


The variable types are:

 Binary: sex, fbs, exang, target


 Categorical: cp, restecg, slope, ca, thal
 Continuous: age, trestbps, chol, thalac, oldpeak

Is the type of each variable correctly classified by Python? Let's check the data types.



Note here that the binary and categorical variables are classified as integer types by Python. We will need to change them to the 'object' type.
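A minimal sketch of this step, assuming the column names follow the variable list above:

# Inspect the current dtypes
print(df.dtypes)

# Re-cast the binary and categorical columns to the 'object' type
categorical_cols = ['sex', 'fbs', 'exang', 'cp', 'restecg', 'slope', 'ca', 'thal', 'target']
df[categorical_cols] = df[categorical_cols].astype('object')
print(df.dtypes)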

b) Check for the data characters mistakes

1. Feature 'ca' ranges from 0–3; however, df.nunique() lists 0–4. So let's find the '4' values and change them to NaN.

2. Feature 'thal' ranges from 1–3; however, df.nunique() lists 0–3. There are two '0' values, so let's change them to NaN as well.
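A minimal sketch of how these impossible codes can be turned into missing values (one possible approach, not necessarily the author's exact code):

import numpy as np

df.loc[df['ca'] == 4, 'ca'] = np.nan      # 'ca' should only take values 0-3
df.loc[df['thal'] == 0, 'thal'] = np.nan  # 'thal' should only take values 1-3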

c) Check for missing values and replace them

We can visualize the missing values using the missingno library. The missing values are represented by horizontal lines. This library provides an informative way of visualizing the missing values in each column, and of checking whether there is any correlation between the missing values of different columns. Here's a shout out to a great article on Missingno.

import missingno as msno
msno.matrix(df)


Replace the NaN values with the column median.


df = df.fillna(df.median())
df.isnull().sum()

d) Check for duplicate rows

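A minimal sketch for this check:

# Count duplicate rows and drop them if any are found
print(df.duplicated().sum())
df = df.drop_duplicates()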

e) Statistics summary
Basically, with df.describe(), we should check the min and max values of the categorical variables: sex (0–1), cp (0–3), fbs (0–1), restecg (0–2), exang (0–1), slope (0–2), ca (0–3), thal (0–3). We should also observe the mean, std, 25% and 75% values of the continuous variables.

Before we plot the outliers, let's change the labeling for better visualization and interpretation.

f) Outliers and how to remove them

There are also several other ways of plotting a boxplot, for example with Plotly Express:

import plotly.express as px
fig = px.box(df, x="target", y="chol")
fig.show()

or using seaborn:

import seaborn as sns
sns.boxplot(x='target', y='oldpeak', data=df)


Now, let’s define and list out the outliers..!!


and here are the listed outliers.


Let’s drop the outliers.

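The notebook code for defining and dropping the outliers is not shown above; the sketch below uses the common IQR rule (values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] on the continuous columns) as one possible definition.

continuous_cols = ['age', 'trestbps', 'chol', 'thalach', 'oldpeak']

Q1 = df[continuous_cols].quantile(0.25)
Q3 = df[continuous_cols].quantile(0.75)
IQR = Q3 - Q1

# Rows with at least one value outside the IQR fences are treated as outliers
outlier_mask = ((df[continuous_cols] < (Q1 - 1.5 * IQR)) |
                (df[continuous_cols] > (Q3 + 1.5 * IQR))).any(axis=1)
print(df[outlier_mask])   # list the outliers
df = df[~outlier_mask]    # drop them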

3. Distributions and relationship.

a) target variable distribution



There are more diseased than healthy patients.
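A minimal sketch of how this bar chart can be drawn with seaborn (assuming df is the cleaned dataframe):

import seaborn as sns
import matplotlib.pyplot as plt

sns.countplot(x='target', data=df)
plt.title('Target distribution')
plt.show()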

b) Age variable distribution


# print(df.age.value_counts())
df['age'].hist()
plt.title('Age Distribution')
plt.show()

Age is approximately normally distributed.


# Analyze the 10 most frequent ages
print(df.age.value_counts()[:10])
sns.barplot(x=df.age.value_counts()[:10].index,
            y=df.age.value_counts()[:10].values,
            palette='Set2')
plt.xlabel('Age')
plt.ylabel('Age distribution')

Most of the patients are aged between their 50s and 60s. Let's take a quick look at the basic statistics: the mean age is about 54 years with a standard deviation of ±9.08; the youngest patient is 29 and the oldest is 77.


c) Gender distribution according to target variable

From the bar graph, we can observe that among diseased patients, males outnumber females.
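A sketch of how such a grouped bar chart can be produced:

sns.countplot(x='sex', hue='target', data=df)
plt.title('Gender distribution by target')
plt.show()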

d) Chest pain distribution according to target variable


Chest pain (cp), or angina, is a type of discomfort caused when the heart muscle doesn't receive enough oxygen-rich blood, which triggers discomfort in the arms, shoulders, neck, etc.

However, looking at the bar graph above, it raises the question of why a higher number of healthy subjects have typical angina; in other words, most of the healthy subjects report chest pain, which is also discussed here. Chest pain can be subjective due to stress, physical activity and many other factors, and it varies between genders. Women and elderly patients usually have atypical symptoms with a history of disease. This article provides an analysis comparing typical angina vs non-typical angina patients in a clinical trial.
e) Fasting blood sugar distribution according to target
variable

Fasting blood sugar (fbs) is a diabetes indicator, with fbs > 120 mg/dl considered diabetic (the True class). Here, we observe that the count for the True class is lower than for the False class. However, if we look closely, there is a higher number of heart disease patients without diabetes. This indicates that fbs might not be a strong feature for differentiating between heart disease and non-disease patients.

f) Slope distribution according to target variable



g) Distribution plot on continuous variables.

 approximately normal distribution for age, trestbps and, to a lesser extent, chol
 oldpeak is left-skewed
 thalach is right-skewed
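A minimal sketch of how these distribution plots can be produced (histplot requires seaborn 0.11 or later; older versions use distplot):

for col in ['age', 'trestbps', 'chol', 'thalach', 'oldpeak']:
    sns.histplot(df[col], kde=True)
    plt.title('Distribution of ' + col)
    plt.show()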

h) Seaborn pairplot to visualize the distributions.

 oldpeak shows a fairly linear separation between disease and non-disease.
 thalach shows a mild separation between disease and non-disease.
 The other features don't form any clear separation.
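The pairplot itself can be produced with a one-liner such as:

sns.pairplot(df, hue='target')
plt.show()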
i) correlation

 'cp', 'thalach' and 'slope' show a good positive correlation with the target
 'oldpeak', 'exang', 'ca', 'thal', 'sex' and 'age' show a good negative correlation with the target
 'fbs', 'chol', 'trestbps' and 'restecg' have a low correlation with the target
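A sketch of the correlation heatmap behind these observations:

corr = df.corr()   # assumes the columns are numeric; re-cast 'object' columns back to numeric first if needed
plt.figure(figsize=(10, 8))
sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm')
plt.title('Correlation matrix')
plt.show()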

4. Automated EDA using a pandas profiling report (on Jupyter Notebook / Google Colab)

1. pip install the pandas-profiling package:

! pip install https://github.com/pandas-profiling/pandas-profiling/archive/master.zip
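A minimal usage sketch, assuming the package installed above (newer releases of the same tool are published as ydata-profiling):

from pandas_profiling import ProfileReport

profile = ProfileReport(df, title='Heart Disease EDA report')
profile.to_notebook_iframe()               # render inline in Jupyter/Colab
# profile.to_file('heart_eda_report.html') # or save as an HTML file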

and here is a snapshot of the automated EDA


7. Visualizing Geographic Data with Basemap

Mapping Geographical Data with the Basemap Python Package

Basemap is a matplotlib extension used to visualize and create geographical maps in Python. The main purpose of this tutorial is to provide basic information on how to plot and visualize geographical data with the help of the Basemap package. If you need further information on Basemap, please visit the Basemap documentation page.

 Installing Basemap package

 Adding vector layers to a map

 Projection, bounding box, & resolution

 Plotting a specific region

 Background relief maps

 Plotting netCDF data using Basemap

Installing Basemap package

To install Basemap, the conda package manager will be used. If you haven't already installed the Anaconda/conda package manager on your PC, install it first (there are guides on how to install Anaconda/conda). After you've installed the conda package manager, follow the steps below to install the basemap package.

1) Start an Ubuntu terminal or an Anaconda prompt

2) Create a new conda environment named basemap_stable:

conda create --name basemap_stable

3) Activate the basemap_stable environment:


conda activate basemap_stable

4) Install the basemap package and its dependencies:


conda install -c anaconda basemap
5) View a list of python dependencies by typing conda list
conda list

6) Now open your favorite Python Notebook or IDE in the active conda
environment, In my case, I used jupyter notebook

7) Finally, import the Basemap and Matplotlib libraries:


from mpl_toolkits.basemap import Basemap
import matplotlib.pyplot as plt

Adding vector layers to a map

Coastlines

 First, let’s initialize a map with Basemap() function

 Then, use the drawcoastlines() function to add coastlines


on the map
fig = plt.figure(figsize = (12,12))
m = Basemap()
m.drawcoastlines()
plt.title("Coastlines", fontsize=20)
plt.show()
The drawcoastlines function has the following main arguments:

 linewidth: 1.0, 2.0, 3.0…

 linestyle: solid, dashed…

 color: black, red…

Let’s apply some changes to the coastlines


fig = plt.figure(figsize = (12,12))
m = Basemap()
m.drawcoastlines(linewidth=1.0, linestyle='dashed', color='red')
plt.title("Coastlines", fontsize=20)
plt.show()

Countries

Use the drawcountries() function to add countries on the map


fig = plt.figure(figsize = (12,12))
m = Basemap()
m.drawcoastlines(linewidth=1.0, linestyle='solid',
color='black')
m.drawcountries()
plt.title("Country boundaries", fontsize=20)
plt.show()
The drawcountries() function uses similar arguments to drawcoastlines(), as shown below:
fig = plt.figure(figsize = (12,12))
m = Basemap()
m.drawcoastlines(linewidth=1.0, linestyle='solid',
color='black')
m.drawcountries(linewidth=1.0, linestyle='solid', color='k')
plt.title("Country boundaries", fontsize=20)
plt.show()
Draw major rivers

 Use the drawrivers() function to add major rivers on the


map

 The drawrivers() function can


take linewidth, linestyle, color arguments
fig = plt.figure(figsize = (12,12))
m = Basemap()
m.drawcoastlines(linewidth=1.0, linestyle='solid',
color='black')
m.drawcountries(linewidth=1.0, linestyle='solid', color='k')
m.drawrivers(linewidth=0.5, linestyle='solid', color='#0000ff')
plt.title("Major rivers", fontsize=20)
plt.show()

Fill continents

 This function is used to draw color filled continents

 Use fillcontinents() function to fill continents


fig = plt.figure(figsize = (12,12))
m = Basemap()
m.drawcoastlines(linewidth=1.0, linestyle='solid',
color='black')
m.drawcountries(linewidth=1.0, linestyle='solid', color='k')
m.fillcontinents()
plt.title("Color filled continents", fontsize=20)
plt.show()

The fillcontinents() function can take the following arguments:

 color: fills continents (default gray)

 lake_color: fills inland lakes

 alpha: sets transparency for continents


fig = plt.figure(figsize = (12,12))
m = Basemap()
m.drawcoastlines(linewidth=1.0, linestyle='solid',
color='black')
m.drawcountries(linewidth=1.0, linestyle='solid', color='k')
m.fillcontinents(color='coral',lake_color='aqua', alpha=0.9)
plt.title("Color filled continents", fontsize=20)
plt.show()
Draw map boundary

 The drawmapboundary() function is used to draw the earth


boundary on the map

 The drawmapboundary() function can take the following


arguments:

 linewidth: sets line width for boundary line (default: 1)

 color: sets the color of the boundary line (default: black)

 fill_color: fills the map background region


fig = plt.figure(figsize = (12,12))
m = Basemap()
m.drawcoastlines(linewidth=1.0, linestyle='solid',
color='black')
m.drawcountries(linewidth=1.0, linestyle='solid', color='k')
m.fillcontinents(color='coral',lake_color='aqua')
m.drawmapboundary(color='b', linewidth=2.0, fill_color='aqua')
plt.title("Filled map boundary", fontsize=20)
plt.show()
Draw and label longitude lines

 The drawmeridians() function is used to draw & label


meridians/longitude lines

 The drawmeridians()function can take the following


arguments:

 List of longitude values created with range() for integer


values & np.arange() for floats values

 color: sets the color of the longitude lines (default: black)

 textcolor: sets the color of labels (default: black)

 linewidth: sets the line width for the longitude lines

 dashes: sets the dash pattern for the longitude lines


(default: [1,1])
 labels:sets the label’s position with four values [0,0,0,0]
representing left, right, top, & bottom. Change these values
to 1 where you want the labels to appear
fig = plt.figure(figsize = (12,12))
m = Basemap()
m.drawcoastlines(linewidth=1.0, linestyle='solid',
color='black')
m.drawcountries(linewidth=1.0, linestyle='solid', color='k')
m.fillcontinents(color='coral',lake_color='aqua')
m.drawmeridians(range(0, 360, 20), color='k', linewidth=1.0,
dashes=[4, 4], labels=[0, 0, 1, 1])
plt.title("Longitude lines", fontsize=20, pad=30)
plt.show()

Draw and label latitude lines

 The drawparallels() function is used to draw & label


parallels/latitude lines

 The drawparallels()function uses similar arguments


like drawmeridians()
fig = plt.figure(figsize = (12,12))
m = Basemap()
m.drawcoastlines(linewidth=1.0, linestyle='solid',
color='black')
m.drawcountries(linewidth=1.0, linestyle='solid', color='k')
m.fillcontinents(color='coral',lake_color='aqua')
m.drawparallels(range(-90, 100, 10), color='k', linewidth=1.0,
dashes=[4, 4], labels=[1, 1, 0, 0])
plt.title("Latitude lines", fontsize=20)
plt.show()

Let’s put the drawmeridians and drawparallels functions together


fig = plt.figure(figsize = (12,12))
m = Basemap()
m.drawcoastlines(linewidth=1.0, linestyle='solid',
color='black')
m.drawcountries(linewidth=1.0, linestyle='solid', color='k')
m.fillcontinents(color='coral',lake_color='aqua')
m.drawmeridians(range(0, 360, 20), color='k', linewidth=1.0,
dashes=[4, 4], labels=[0, 0, 0, 1])
m.drawparallels(range(-90, 100, 10), color='k', linewidth=1.0,
dashes=[4, 4], labels=[1, 0, 0, 0])
plt.ylabel("Latitude", fontsize=15, labelpad=35)
plt.xlabel("Longitude", fontsize=15, labelpad=20)
plt.show()
Projection, bounding box, & resolution

The Basemap() function is used to set projection, bounding box, & resolution of
a map

Map projection:

 Inside the Basemap() function, the projection= argument can take several pre-defined projections listed in the table below, or visit this site to get more information.

 To specify the desired projection, use the general syntax


shown below:
m = Basemap(projection='aeqd')
m = Basemap(projection='cyl')
Some projections require setting bounding box, map center, & map size of
the map using the following arguments:

a) Bounding box/map corners:

 llcrnrlat: lower-left corner geographical latitude

 urcrnrlat: upper-right corner geographical latitude

 llcrnrlon: lower-left corner geographical longitude

 urcrnrlon: upper-right corner geographical longitude


Example:
m = Basemap(projection='cyl', llcrnrlat=-80, urcrnrlat=80, llcrnrlon=-180, urcrnrlon=180)

b) Map center:

 lon_0: central longitude

 lat_0: central latitude

Example:
m = Basemap(projection='ortho', lon_0 = 25, lat_0 = 10)

c) Map resolution: The map resolution argument determines the


quality of vector layers such as coastlines, lakes, & rivers etc. The
available options are:

 c: crude

 l: low

 i: intermediate

 h: high

 f: full

Let’s see some examples on how the map projection, bounding

box, map center, & map resolution arguments used to create and
modify maps:

Create a global map with a Mercator Projection


fig = plt.figure(figsize = (10,8))
m = Basemap(projection='merc',llcrnrlat=-
80,urcrnrlat=80,llcrnrlon=-180,urcrnrlon=180)
m.drawcoastlines()
m.fillcontinents(color='tan',lake_color='lightblue')
m.drawcountries(linewidth=1, linestyle='solid', color='k' )
m.drawmapboundary(fill_color='lightblue')
plt.title("Mercator Projection", fontsize=20)

Create a global map with a Cylindrical Equidistant


Projection
fig = plt.figure(figsize = (10,8))
m = Basemap(projection='cyl',llcrnrlat=-
80,urcrnrlat=80,llcrnrlon=-180,urcrnrlon=180)
m.drawcoastlines()
m.fillcontinents(color='tan',lake_color='lightblue')
m.drawcountries(linewidth=1, linestyle='solid', color='k' )
m.drawmapboundary(fill_color='lightblue')
plt.title(" Cylindrical Equidistant Projection", fontsize=20)
Create a global map with Orthographic Projection
fig = plt.figure(figsize = (10,8))
m = Basemap(projection='ortho', lon_0 = 25, lat_0 = 10)
m.drawcoastlines()
m.fillcontinents(color='tan',lake_color='lightblue')
m.drawcountries(linewidth=1, linestyle='solid', color='k' )
m.drawmapboundary(fill_color='lightblue')
plt.title("Orthographic Projection", fontsize=18)
Create a global map with a Robinson Projection
fig = plt.figure(figsize = (10,8))
m = Basemap(projection='robin',llcrnrlat=-
80,urcrnrlat=80,llcrnrlon=-180,urcrnrlon=180, lon_0 = 0, lat_0 =
0)
m.drawcoastlines()
m.fillcontinents(color='tan',lake_color='lightblue')
m.drawcountries(linewidth=1, linestyle='solid', color='k' )
m.drawmapboundary(fill_color='lightblue')
plt.title(" Robinson Projection", fontsize=20)
Plotting a specific region

a) By passing bounding box information (llcrnrlon,


llcrnrlat, urcrnrlon,urcrnrlat)
fig = plt.figure(figsize = (10,8))
m =
Basemap(projection='cyl',llcrnrlon=32.5,llcrnrlat=3,urcrnrlon=49
,urcrnrlat=15, resolution = 'h')
m.drawcoastlines()
m.fillcontinents(color='tan',lake_color='lightblue')
m.drawcountries(linewidth=1, linestyle='solid', color='k' )
m.drawmapboundary(fill_color='lightblue')
m.drawcoastlines()
plt.show()
b) By passing the central longitude & central latitude, as well as
the width, & height values.
Here, the width and height are given in meters.
fig = plt.figure(num=None, figsize=(12, 8) )
m = Basemap(projection='aea', width=1700000, height=1500000,
resolution='h',lon_0= 40.5,lat_0=8.7)
m.fillcontinents(color='tan',lake_color='aqua')
m.drawmapboundary(fill_color='lightblue')
m.drawcountries(linewidth=1, linestyle='solid', color='k' )
m.drawcoastlines()
plt.show()
c) Make a regional map using the EPSG projection code

To find out more about the EPSG projection code,


visit spatialreference.org
fig = plt.figure(figsize = (10,8))
# To use an EPSG code directly, pass it via the epsg argument of Basemap() (e.g. epsg=4326 for WGS84)
m = Basemap(llcrnrlon=32.5,llcrnrlat=3,urcrnrlon=48.5,
urcrnrlat=15, lat_0=7, lon_0 =37, resolution = 'l')
m.drawcoastlines()
m.fillcontinents(color='tan',lake_color='lightblue')
m.drawcountries(linewidth=1, linestyle='solid', color='k' )
m.drawmapboundary(fill_color='lightblue')
plt.title("Make a regional map using the EPSG projection code",
fontsize=18)
Let’s see how the resolution argument affects coastlines, lakes, country
boundary lines

The available options are:

 c: crude

 l: low

 i: intermediate

 h: high

 f: full
fig = plt.figure(figsize = (16,16))

ax1 = plt.subplot2grid((3,2), (0,0))
ax2 = plt.subplot2grid((3,2), (0,1))
ax3 = plt.subplot2grid((3,2), (1,0))
ax4 = plt.subplot2grid((3,2), (1,1))

m1 = Basemap(llcrnrlon=32.5, llcrnrlat=3, urcrnrlon=48.5, urcrnrlat=15,
             ax=ax1, resolution = 'c')
m1.drawcoastlines()
m1.fillcontinents(color='tan', lake_color='lightblue')
m1.drawcountries(linewidth=1, linestyle='solid', color='k')
m1.drawmapboundary(fill_color='lightblue')
ax1.set_title("Crude resolution", fontsize=18)

m2 = Basemap(llcrnrlon=32.5, llcrnrlat=3, urcrnrlon=48.5, urcrnrlat=15,
             ax=ax2, resolution = 'l')
m2.drawcoastlines()
m2.fillcontinents(color='tan', lake_color='lightblue')
m2.drawcountries(linewidth=1, linestyle='solid', color='k')
m2.drawmapboundary(fill_color='lightblue')
ax2.set_title("Low resolution", fontsize=18)

m3 = Basemap(llcrnrlon=32.5, llcrnrlat=3, urcrnrlon=48.5, urcrnrlat=15,
             ax=ax3, resolution = 'i')
m3.drawcoastlines()
m3.fillcontinents(color='tan', lake_color='lightblue')
m3.drawcountries(linewidth=1, linestyle='solid', color='k')
m3.drawmapboundary(fill_color='lightblue')
ax3.set_title("Intermediate resolution", fontsize=18)

m4 = Basemap(llcrnrlon=32.5, llcrnrlat=3, urcrnrlon=48.5, urcrnrlat=15,
             ax=ax4, resolution = 'h')
m4.drawcoastlines()
m4.fillcontinents(color='tan', lake_color='lightblue')
m4.drawcountries(linewidth=1, linestyle='solid', color='k')
m4.drawmapboundary(fill_color='lightblue')
ax4.set_title("High resolution", fontsize=18)

plt.show()
Background relief maps

Basemap provides several background relief images including:

a) Land-sea mask image

b) NASA blue marble

c) NOAA etopo

d) Shaded relief

a) Draw land-sea mask image

 The drawlsmask() function used to draw lakes, land,


& oceans at once

 It avoids fillcontinents & drawmapboundary functions


 The drawlsmask() function can take the following
arguments:

 land_color: sets the color of the land (by default it’s gray)

 ocean_color: sets the color of the oceans (by default it’s


white)

 resolution:sets the coastline resolution (by default it’s l,


you can change it to: c, l, i, h, or f )

 lakes: plots lakes & ponds (by default it's True)

 grid:sets the land/sea mask grid spacing in minutes (by


default it’s 5; You can change it to: 10, 2.5, & 1.25)
fig = plt.figure(figsize = (10,8))
m =
Basemap(projection='cyl',llcrnrlon=32.5,llcrnrlat=3,urcrnrlon=49
,urcrnrlat=15, resolution = 'i')
m.drawlsmask(land_color = "#ddaa66", ocean_color="#7777ff",
resolution = 'i', lakes=True, grid=1.25)
m.drawcountries(linewidth=1, linestyle='solid', color='k' )
plt.title("Land-sea mask image", fontsize=20)
plt.show()
b) NASA Blue Marble image

 Use the bluemarble() argument to apply the NASA blue


marble image as a background map

 To downgrade or upgrade image resolution use the scale


argument
fig = plt.figure(figsize = (10,8))
m =
Basemap(projection='cyl',llcrnrlon=32.5,llcrnrlat=3,urcrnrlon=49
,urcrnrlat=15, resolution='i')
m.bluemarble(scale=1.0)
m.drawcoastlines()
m.drawcountries(linewidth=1, linestyle='solid', color='k' )
plt.title("NASA Blue Marble image as background map",
fontsize=18)
c) Shaded relief image

Use shadedrelief() argument to apply shaded relief image as a


background map
fig = plt.figure(figsize = (10,8))
m =
Basemap(projection='cyl',llcrnrlon=32.5,llcrnrlat=3,urcrnrlon=49
,urcrnrlat=15, resolution='i')
m.shadedrelief()
m.drawcoastlines()
m.drawcountries(linewidth=1, linestyle='solid', color='k' )
plt.title("Shaded relief image as background map", fontsize=18)
d) NOAA etopo relief image

Use the etopo() function to apply the NOAA etopo relief image as a background map
fig = plt.figure(figsize = (10,8))
m =
Basemap(projection='cyl',llcrnrlon=32.5,llcrnrlat=3,urcrnrlon=49
,urcrnrlat=15, resolution='i')
m.etopo(scale=1.2)
m.drawcoastlines()
m.drawcountries(linewidth=1, linestyle='solid', color='k' )
plt.title("Etopo relief image as background map", fontsize=18)
Plotting netCDF data using Basemap

Reading ECMWF temperature data

To read the netCDF data, we’ll use the Dataset class from
the netcdf4-python library
from netCDF4 import Dataset as dataset

The netCDF data used in this tutorial have been saved at


this URL. The netCDF file has 2-meter air temperature reanalysis
data for the entire globe.
nc = dataset('.../ECMWF_temp2m.nc')

Let’s read the coordinate variables (latitude, longitude, & time) &
data variable (2-meter temperature) of the netCDF file.
lat = nc.variables['latitude'][:]
lon = nc.variables['longitude'][:]
time = nc.variables['time'][:]
t2 = nc.variables['p2t'][:]

Plotting filled contour maps:

 Use the contourf() function to draw filled contour maps

 Create a 2D meshgrid matrix from the 1D longitude & latitude arrays

 The contourf() function mainly takes:

 longitude & latitude arrays

 data

 colormap

 levels
import numpy as np   # needed for meshgrid and squeeze

fig = plt.figure(num=None, figsize=(7, 7))

m = Basemap(projection='cyl', llcrnrlon=32.5, llcrnrlat=3,
            urcrnrlon=49, urcrnrlat=15, resolution='i')

x, y = m(*np.meshgrid(lon, lat))
cs = m.contourf(x, y, np.squeeze(t2[4,:,:]), levels=100, cmap=plt.cm.jet)

m.drawcoastlines()
m.drawmapboundary()
m.drawcountries(linewidth=1, linestyle='solid', color='k')
m.drawmeridians(range(33, 48, 2), color='k', linewidth=1.0,
                dashes=[4, 4], labels=[0, 0, 0, 1])
m.drawparallels(range(3, 15, 2), color='k', linewidth=1.0,
                dashes=[4, 4], labels=[1, 0, 0, 0])

plt.ylabel("Latitude", fontsize=15, labelpad=35)
plt.xlabel("Longitude", fontsize=15, labelpad=20)

cbar = m.colorbar(cs, location='right', pad="3%")
cbar.set_label('Temperature (K)', fontsize=13)

plt.title('2-meter temperature filled contour map', fontsize=15)
plt.show()
Plotting contour maps:

 Use the contour() function to draw contour maps

 The contour() function uses similar arguments


like contourf()
fig = plt.figure(num=None, figsize=(7, 7))

m = Basemap(projection='cyl', llcrnrlon=32.5, llcrnrlat=3,
            urcrnrlon=49, urcrnrlat=15, resolution='i')

x, y = m(*np.meshgrid(lon, lat))
cs = m.contour(x, y, np.squeeze(t2[4,:,:]), levels=25, cmap=plt.cm.jet)

m.drawcoastlines()
m.drawmapboundary()
m.drawcountries(linewidth=1, linestyle='solid', color='k')
m.drawmeridians(range(33, 48, 2), color='k', linewidth=1.0,
                dashes=[4, 4], labels=[0, 0, 0, 1])
m.drawparallels(range(3, 15, 2), color='k', linewidth=1.0,
                dashes=[4, 4], labels=[1, 0, 0, 0])

plt.ylabel("Latitude", fontsize=15, labelpad=35)
plt.xlabel("Longitude", fontsize=15, labelpad=20)

cbar = m.colorbar(cs, location='right', pad="3%")
cbar.set_label('Temperature (K)', fontsize=13)

plt.clabel(cs, inline=True, fmt='%1.0f', fontsize=12, colors='k')
plt.title('2-meter temperature contour map', fontsize=15)
plt.show()
Plotting pseudo color maps:

 Use the pcolor() or pcolormesh() function to draw pseudo


color maps
fig = plt.figure(figsize=(7, 7))

m = Basemap(projection='cyl', llcrnrlon=32.5, llcrnrlat=3,
            urcrnrlon=49, urcrnrlat=15, resolution='i')

x, y = m(*np.meshgrid(lon, lat))

# pcolor(): draw a pseudocolor plot
# pcolormesh(): draw a pseudocolor plot (faster version for regular meshes)
cs = m.pcolormesh(x, y, np.squeeze(t2[4,:,:]))

m.drawmeridians(range(33, 48, 2), color='k', linewidth=1.0,
                dashes=[4, 4], labels=[0, 0, 0, 1])
m.drawparallels(range(3, 15, 2), color='k', linewidth=1.0,
                dashes=[4, 4], labels=[1, 0, 0, 0])

m.drawcoastlines()
m.drawmapboundary()
m.drawcountries(linewidth=1, linestyle='solid', color='k')

plt.ylabel("Latitude", fontsize=15, labelpad=35)
plt.xlabel("Longitude", fontsize=15, labelpad=20)

cbar = m.colorbar(cs, location='right', pad="3%")
cbar.set_label('Temperature (K)', fontsize=13)

plt.title('2-meter temperature pseudo-color map', fontsize=15)
plt.show()
Ultimate guide for Data Exploration in Python
using NumPy, Matplotlib and Pandas
Introduction

Exploring data sets and developing deep understanding about the data is one of the
most important skills every data scientist should possess. People estimate that the
time spent on these activities can go as high as 80% of the project time in some
cases.

Python has been gaining a lot of ground as preferred tool for data scientists lately,
and for the right reasons. Ease of learning, powerful libraries with integration of
C/C++, production readiness and integration with web stack are some of the main
reasons for this move lately.

In this guide, I will use NumPy, Matplotlib, Seaborn, and Pandas to perform data
exploration. These are powerful libraries to perform data exploration in Python.
The idea is to create a ready reference for some of the regular operations required
frequently. I am using an iPython Notebook to perform data exploration and would
recommend the same for its natural fit for exploratory analysis.

In case you missed it, I would suggest you to refer to the baby steps series of
Python to understand the basics of python programming.

 Learning Python for data analysis – with instructions on installation and


creating the environment
 Libraries and data structures
 Exploratory analysis in Python (using Pandas)
 Data Munging in Python (using Pandas)

Contents – Data Exploration

Here are the operations I’ll cover in this article (Refer to this article for similar
operations in SAS):
1. How to load data file(s) using Pandas?
2. How to convert a variable to different data type?
3. How to transpose a table/Pandas dataframe?
4. How to sort Data?
5. How to create plots (Histogram, Scatter, Box Plot)?
6. How to generate frequency tables?
7. How to do sampling of Data set?
Part 1: How to load data file(s) using Pandas?

Input data sets can be in various formats (.XLS, .TXT, .CSV, JSON ). In Python, it
is easy to load data from any source, due to its simple syntax and availability of
predefined libraries, such as Pandas. Here I will make use of Pandas itself.

Pandas features a number of functions for reading tabular data as a Pandas


DataFrame object. Below are the common functions that can be used to read data
(including read_csv in Pandas):

Loading data from a CSV file(s):

Code
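The CSV example is not shown above; a minimal sketch is given below (the file path mirrors the Excel example and is an assumption):

import pandas as pd

df = pd.read_csv("E:/EMP.csv")   # Load data from a CSV file
print(df)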

Loading data from excel file(s):

Code
df = pd.read_excel("E:/EMP.xlsx", "Data")  # Load the "Data" sheet of excel file EMP
print(df)

Output

Loading data from a txt file(s):

Code:

df = pd.read_csv("E:/Test.txt", sep='\t')  # Load data from a text file with tab '\t' delimiter
print(df)

Output

Part 2: How to convert a variable to a different data type?

Converting a variable's data type to another type is an important and common procedure we perform after loading data. Let's look at some of the commands to perform these conversions:
 Convert numeric variables to string variables and vice
versa

string_outcome = str(numeric_input)   #Converts numeric_input to string_outcome
integer_outcome = int(string_input)   #Converts string_input to integer_outcome
float_outcome = float(string_input)   #Converts string_input to float_outcome

The latter operations are especially useful when you read a value from the user using raw_input() (input() in Python 3). By default, the values are read as strings.

 Convert character date to Date:

There are multiple ways to do this. The simplest would be to use


the datetime library and strptime function. Here is the code:

from datetime import datetime

char_date = 'Apr 1 2015 1:20 PM'  #creating example character date
date_obj = datetime.strptime(char_date, '%b %d %Y %I:%M%p')
print(date_obj)

Part 3: How to transpose a Data set or dataframe using Pandas?

Here, I want to transpose Table A into Table B on the variable Product. This task
can be accomplished by using Pandas dataframe.pivot:
Code

#Transposing Pandas dataframe by a variable

df = pd.read_excel("E:/transpose.xlsx", "Sheet1")  # Load the "Sheet1" sheet of the excel file
print(df)
result = df.pivot(index='ID', columns='Product', values='Sales')
result

Output

Part 4: How to sort a Pandas DataFrame?


Sorting of data can be done using dataframe.sort_values() (the older dataframe.sort() is deprecated). It can be based on multiple variables, in either ascending or descending order.

Code

#Sorting Pandas Dataframe
df = pd.read_excel("E:/transpose.xlsx", "Sheet1")
#Sort by variable name(s); sort_values() replaces the deprecated sort()
print(df.sort_values(['Product','Sales'], ascending=[True, False]))

Above, we have a table with variables ID, Product and Sales. Now, we want to sort it by Product (ascending) and Sales (descending), as shown in table 2.

Part 5: How to create plots (Histogram, Scatter, Box Plot)?


Data visualization always helps to understand the data easily. Python has libraries like matplotlib and seaborn to create multiple graphs effectively. Let's look at some visualizations to understand the following behaviors of the variable(s):

 The distribution of age
 The relation between age and sales
 Whether sales are normally distributed or not

Histogram:

Code

#Plot Histogram
import matplotlib.pyplot as plt
import pandas as pd
df=pd.read_excel("E:/First.xlsx", "Sheet1")
#Plots in matplotlib reside within a figure object, use plt.figure
to create new figure
fig=plt.figure()
#Create one or more subplots using add_subplot, because you can't
create blank figure
ax = fig.add_subplot(1,1,1)
#Variable
ax.hist(df['Age'],bins = 5)
#Labels and Title
plt.title('Age distribution')
plt.xlabel('Age')
plt.ylabel('#Employee')
plt.show()

Output
Scatter plot:

Code

#Plots in matplotlib reside within a figure object, use plt.figure to create a new figure
fig=plt.figure()
#Create one or more subplots using add_subplot, because you can't create a blank figure
ax = fig.add_subplot(1,1,1)
#Variable
ax.scatter(df['Age'],df['Sales'])
#Labels and Title
plt.title('Sales and Age distribution')
plt.xlabel('Age')
plt.ylabel('Sales')
plt.show()

Output
Box-plot:

Code

import seaborn as sns


sns.boxplot(df['Age'])
sns.despine()

Output
Part 6: How to generate frequency tables with Pandas?

Frequency tables can be used to understand the distribution of a single categorical variable, or the joint distribution of n categorical variables.

Code

import pandas as pd
df=pd.read_excel("E:/First.xlsx", "Sheet1")
print(df)
test= df.groupby(['Gender','BMI'])
test.size()

Output

Part 7: How to sample a data set in Python?

To select a sample of a data set, we will use the numpy and random libraries. Sampling a data set always helps to understand the data quickly.

Let’s say, from EMP table, I want to select random sample of 5 employees.

Code

#Create sample from dataframe
import numpy as np
import pandas as pd
# create a random index (random.sample/xrange and df.ix are Python 2 era; use numpy instead)
rindex = np.random.choice(len(df), 5, replace=False)
# get 5 random rows from the dataframe df
dfr = df.iloc[rindex]
print(dfr)
# Alternatively: dfr = df.sample(n=5)

Output

Part 8: How to remove duplicate values of a variable in a Pandas Dataframe?

Often, we encounter duplicate observations. To tackle this in Python, we can use dataframe.drop_duplicates().
Code

#Remove duplicate values based on the variables "Gender" and "BMI"
rem_dup=df.drop_duplicates(['Gender', 'BMI'])
print(rem_dup)

Output

Part 9: How to group variables in Pandas to calculate count, average, sum?

To get the count, average and sum of a variable, I would suggest you use dataframe.describe() with Pandas groupby().

Let’s look at the code:

Code

test= df.groupby(['Gender'])
test.describe()

Output
Part 10: How to recognize and Treat missing values and outliers in Pandas?

To identify missing values, we can use dataframe.isnull(). You can also refer to the article "Data Munging in Python (using Pandas)", where we have done a case study on recognizing and treating missing and outlier values.

Code

# Identify missing values of dataframe


df.isnull()

Output
To treat missing values, there are various imputation methods available. You can refer to other articles for methods to detect outlier and missing values. Imputation methods for both missing and outlier values are quite similar. Here we will discuss general-purpose imputation methods to replace missing values. Let's do it using an example:

Code:

#Example to impute missing values in Age by the mean


import numpy as np
meanAge = np.mean(df.Age) #Using numpy mean function to
calculate the mean value
df.Age = df.Age.fillna(meanAge) #replacing missing values in the
DataFrame

Part 11: How to merge / join data sets and Pandas dataframes?

Joining / merging is one of the common operations required to integrate datasets from different sources. It can be handled effectively in Pandas using the merge function:
Code:

df_new = pd.merge(df1, df2, how = 'inner', left_index = True,


right_index = True) # merges df1 and df2 on index
# By changing how = 'outer', you can do outer join.
# Similarly how = 'left' will do a left join
# You can also specify the columns to join instead of indexes, which
are used by default.
