CS3361 Data Science Laboratory
LIST OF EXPERIMENTS:
1. Download, install and explore the features of NumPy, SciPy, Jupyter, Statsmodels and Pandas packages.
4. Reading data from text files, Excel and the web and exploring various commands for doing descriptive analytics on the Iris data set.
5. Use the diabetes data set from UCI and Pima Indians Diabetes data set for performing the following:
c. Multiple Regression analysis
d. Also compare the results of the above analysis for the two data sets.
a. Normal curves
d. Histograms
1. Pandas
Pandas is a perfect tool for data wrangling or munging. It is designed for quick and
easy data manipulation, reading, aggregation, and visualization.
Pandas takes data in a CSV or TSV file or a SQL database and creates a Python object
with rows and columns called a data frame. The data frame is very similar to a table
in statistical software such as Excel or SPSS.
This makes Pandas a foundation library in learning Python for Data Science.
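For instance, a minimal hedged sketch of this workflow (the file name employees.csv and its columns are invented for illustration):
import pandas as pd

# read a CSV file into a DataFrame (rows and columns, like a spreadsheet table)
df = pd.read_csv('employees.csv')
print(df.head())      # first five rows
print(df.describe())  # quick numeric summary of each column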
2. NumPy
NumPy is used to process arrays that store values of the same datatype. NumPy
facilitates math operations on arrays and their vectorization. This significantly
enhances performance and speeds up the execution time correspondingly.
Basic array operations: add, multiply, slice, flatten, reshape, index arrays
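A short sketch of these basic operations (the values are arbitrary):
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])

print(a + 10)           # add a scalar to every element
print(a * 2)            # multiply every element
print(a[:, 1])          # slice: second column
print(a.flatten())      # flatten to one dimension
print(a.reshape(3, 2))  # reshape to 3 rows x 2 columns
print(a[1, 2])          # index a single element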
3. SciPy
The SciPy library is one of the core packages that make up the SciPy Stack. Note that
there is a difference between the SciPy Stack and SciPy, the library. SciPy builds on the
NumPy array object and is part of the stack, which includes tools like Matplotlib,
Pandas, and SymPy, along with additional tools.
The SciPy library contains modules for efficient mathematical routines such as linear algebra,
interpolation, optimization, integration, and statistics. The main functionality of the
SciPy library is built upon NumPy and its arrays; SciPy makes significant use of NumPy.
SciPy uses arrays as its basic data structure. It has various modules to perform
common scientific programming tasks such as linear algebra, integration, calculus,
ordinary differential equations, and signal processing.
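As a small hedged example of the integration and linear algebra modules mentioned above (the numbers are arbitrary):
import numpy as np
from scipy import integrate, linalg

# definite integral of sin(x) from 0 to pi (the exact answer is 2.0)
value, error = integrate.quad(np.sin, 0, np.pi)
print(value)

# solve the linear system Ax = b
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([9.0, 8.0])
print(linalg.solve(A, b))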
4. Matplotlib
This is undoubtedly my favourite and a quintessential Python library. You can create
stories with the data visualized with Matplotlib. Another library from the SciPy Stack,
Matplotlib plots 2D figures.
Matplotlib is the plotting library for Python that provides an object-oriented API for
embedding plots into applications. It is a close resemblance to MATLAB embedded in
Python programming language.
From histograms, bar plots, scatter plots, and area plots to pie plots, Matplotlib can depict
a wide range of visualizations. With a bit of effort, you can create just about any
visualization with Matplotlib:
1. Line plots
2. Scatter plots
3. Area plots
4. Bar charts and Histograms
5. Pie charts
6. Stem plots
7. Contour plots
8. Quiver plots
9. Spectrograms
Matplotlib also facilitates adding labels, grids, legends, and other formatting entities.
Basically, everything that can be drawn!
5. Seaborn
So when you read the official documentation on Seaborn, it is defined as the data
visualization library based on Matplotlib that provides a high-level interface for
drawing attractive and informative statistical graphics. Putting it simply, seaborn is an
extension of Matplotlib with advanced features.
So, what is the difference between Matplotlib and Seaborn? Matplotlib is used for
basic plotting: bars, pies, lines, scatter plots and so on, whereas Seaborn provides a
variety of visualization patterns with less complex syntax.
Seaborn is a great companion to R visualization libraries like corrplot and ggplot.
Ex.No 2. Working with NumPy arrays.
Attributes of arrays: Determining the size, shape, memory consumption, and data
types of arrays
Indexing of arrays: Getting and setting the value of individual array elements
Slicing of arrays: Getting and setting smaller subarrays within a larger array
Reshaping of arrays: Changing the shape of a given array
Joining and splitting of arrays: Combining multiple arrays into one, and splitting
one array into many
First let's discuss some useful array attributes. We'll start by defining three random arrays,
a one-dimensional, two-dimensional, and three-dimensional array. We'll use NumPy's
random number generator, which we will seed with a set value in order to ensure that the
same random arrays are generated each time this code is run:
In [1]:
import numpy as np
np.random.seed(0)  # seed for reproducibility

x1 = np.random.randint(10, size=6)          # one-dimensional array
x2 = np.random.randint(10, size=(3, 4))     # two-dimensional array
x3 = np.random.randint(10, size=(3, 4, 5))  # three-dimensional array
Each array has attributes ndim (the number of dimensions), shape (the size of each
dimension), and size (the total size of the array):
In [2]:
print("x3 ndim: ", x3.ndim)
print("x3 shape:", x3.shape)
print("x3 size: ", x3.size)
x3 ndim: 3
x3 shape: (3, 4, 5)
x3 size: 60
Another useful attribute is the dtype, the data type of the array (which we discussed
previously in Understanding Data Types in Python):
In [3]:
print("dtype:", x3.dtype)
dtype: int64
Other attributes include itemsize, which lists the size (in bytes) of each array element,
and nbytes, which lists the total size (in bytes) of the array:
In [4]:
print("itemsize:", x3.itemsize, "bytes")
print("nbytes:", x3.nbytes, "bytes")
itemsize: 8 bytes
nbytes: 480 bytes
In general, we expect that nbytes is equal to itemsize times size.
In [8]:
x1[-1]
Out[8]:
9
In [9]:
x1[-2]
Out[9]:
7
In a multi-dimensional array, items can be accessed using a comma-separated tuple of
indices:
In [10]:
x2
Out[10]:
array([[3, 5, 2, 4],
[7, 6, 8, 8],
[1, 6, 7, 7]])
In [11]:
x2[0, 0]
Out[11]:
3
In [12]:
x2[2, 0]
Out[12]:
1
In [13]:
x2[2, -1]
Out[13]:
7
Values can also be modified using any of the above index notation:
In [14]:
x2[0, 0] = 12
x2
Out[14]:
array([[12, 5, 2, 4],
[ 7, 6, 8, 8],
[ 1, 6, 7, 7]])
Keep in mind that, unlike Python lists, NumPy arrays have a fixed type. This means, for
example, that if you attempt to insert a floating-point value to an integer array, the value
will be silently truncated. Don't be caught unaware by this behavior!
In [15]:
x1[0] = 3.14159 # this will be truncated!
x1
Out[15]:
array([3, 0, 3, 3, 7, 9])
Array slicing: just as we can use square brackets to access individual array elements, we
can also use them to access subarrays with the slice notation, marked by the colon (:) character:
x[start:stop:step]
If any of these are unspecified, they default to the values start=0, stop=size of
dimension, step=1. We'll take a look at accessing sub-arrays in one dimension and in
multiple dimensions.
One-dimensional subarrays
In [16]:
x = np.arange(10)
x
Out[16]:
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
In [17]:
x[:5] # first five elements
Out[17]:
array([0, 1, 2, 3, 4])
In [18]:
x[5:] # elements after index 5
Out[18]:
array([5, 6, 7, 8, 9])
In [19]:
x[4:7] # middle sub-array
Out[19]:
array([4, 5, 6])
In [20]:
x[::2] # every other element
Out[20]:
array([0, 2, 4, 6, 8])
In [21]:
x[1::2] # every other element, starting at index 1
Out[21]:
array([1, 3, 5, 7, 9])
A potentially confusing case is when the step value is negative. In this case, the defaults
for start and stop are swapped. This becomes a convenient way to reverse an array:
In [22]:
x[::-1] # all elements, reversed
Out[22]:
array([9, 8, 7, 6, 5, 4, 3, 2, 1, 0])
In [23]:
x[5::-2] # reversed every other from index 5
Out[23]:
array([5, 3, 1])
Multi-dimensional subarrays
Multi-dimensional slices work in the same way, with multiple slices separated by commas.
For example:
In [24]:
x2
Out[24]:
array([[12, 5, 2, 4],
[ 7, 6, 8, 8],
[ 1, 6, 7, 7]])
In [25]:
x2[:2, :3] # two rows, three columns
Out[25]:
array([[12, 5, 2],
[ 7, 6, 8]])
In [26]:
x2[:3, ::2] # all rows, every other column
Out[26]:
array([[12, 2],
[ 7, 8],
[ 1, 7]])
Finally, subarray dimensions can even be reversed together:
In [27]:
x2[::-1, ::-1]
Out[27]:
array([[ 7, 7, 6, 1],
[ 8, 8, 6, 7],
[ 4, 2, 5, 12]])
One commonly needed routine is accessing of single rows or columns of an array. This can
be done by combining indexing and slicing, using an empty slice marked by a single colon
(:):
In [28]:
print(x2[:, 0]) # first column of x2
[12 7 1]
In [29]:
print(x2[0, :]) # first row of x2
[12 5 2 4]
In the case of row access, the empty slice can be omitted for a more compact syntax:
In [30]:
print(x2[0]) # equivalent to x2[0, :]
[12 5 2 4]
Array slices return views rather than copies of the array data. For example, extract a 2×2
subarray with x2_sub = x2[:2, :2]; if we modify this subarray, the original array is changed as well:
In [33]:
x2_sub = x2[:2, :2]
x2_sub[0, 0] = 99
print(x2_sub)
[[99 5]
[ 7 6]]
In [34]:
print(x2)
[[99 5 2 4]
[ 7 6 8 8]
[ 1 6 7 7]]
This default behavior is actually quite useful: it means that when we work with large
datasets, we can access and process pieces of these datasets without the need to copy the
underlying data buffer.
Despite the convenience of array views, it is sometimes useful to explicitly copy the data
within an array or a subarray. This can be done with the copy() method:
In [35]:
x2_sub_copy = x2[:2, :2].copy()
print(x2_sub_copy)
[[99 5]
[ 7 6]]
If we now modify this subarray, the original array is not touched:
In [36]:
x2_sub_copy[0, 0] = 42
print(x2_sub_copy)
[[42 5]
[ 7 6]]
In [37]:
print(x2)
[[99 5 2 4]
[ 7 6 8 8]
[ 1 6 7 7]]
Reshaping of Arrays
Another useful type of operation is reshaping of arrays. The most flexible way of doing this
is with the reshape method. For example, if you want to put the numbers 1 through 9 in
a 3×3 grid, you can do the following:
In [38]:
grid = np.arange(1, 10).reshape((3, 3))
print(grid)
[[1 2 3]
[4 5 6]
[7 8 9]]
Note that for this to work, the size of the initial array must match the size of the reshaped
array. Where possible, the reshape method will use a no-copy view of the initial array, but
with non-contiguous memory buffers this is not always the case.
Another common reshaping pattern is the conversion of a one-dimensional array into a
two-dimensional row or column matrix:
In [39]:
x = np.array([1, 2, 3])
x.reshape((1, 3))  # row vector via reshape
Out[39]:
array([[1, 2, 3]])
Concatenation of arrays
Concatenation, or joining of two arrays in NumPy, is primarily accomplished using the
routines np.concatenate, np.vstack, and np.hstack. np.concatenate takes a tuple or list of
arrays as its first argument, as we can see here:
In [43]:
x = np.array([1, 2, 3])
y = np.array([3, 2, 1])
np.concatenate([x, y])
Out[43]:
array([1, 2, 3, 3, 2, 1])
You can also concatenate more than two arrays at once:
In [44]:
z = [99, 99, 99]
print(np.concatenate([x, y, z]))
[ 1 2 3 3 2 1 99 99 99]
It can also be used for two-dimensional arrays:
In [45]:
grid = np.array([[1, 2, 3],
[4, 5, 6]])
In [46]:
# concatenate along the first axis
np.concatenate([grid, grid])
Out[46]:
array([[1, 2, 3],
[4, 5, 6],
[1, 2, 3],
[4, 5, 6]])
In [47]:
# concatenate along the second axis (zero-indexed)
np.concatenate([grid, grid], axis=1)
Out[47]:
array([[1, 2, 3, 1, 2, 3],
[4, 5, 6, 4, 5, 6]])
For working with arrays of mixed dimensions, it can be clearer to use
the np.vstack (vertical stack) and np.hstack (horizontal stack) functions:
In [48]:
x = np.array([1, 2, 3])
grid = np.array([[9, 8, 7],
[6, 5, 4]])
Splitting of arrays
The opposite of concatenation is splitting, which is implemented by the
functions np.split, np.hsplit, and np.vsplit. For each of these, we can pass a list of
indices giving the split points:
In [50]:
x = [1, 2, 3, 99, 99, 3, 2, 1]
x1, x2, x3 = np.split(x, [3, 5])
print(x1, x2, x3)
[1 2 3] [99 99] [3 2 1]
Notice that N split points lead to N + 1 subarrays. The related
functions np.hsplit and np.vsplit are similar:
In [51]:
grid = np.arange(16).reshape((4, 4))
grid
Out[51]:
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11],
[12, 13, 14, 15]])
In [52]:
upper, lower = np.vsplit(grid, [2])
print(upper)
print(lower)
[[0 1 2 3]
[4 5 6 7]]
[[ 8 9 10 11]
[12 13 14 15]]
In [53]:
left, right = np.hsplit(grid, [2])
print(left)
print(right)
[[ 0 1]
[ 4 5]
[ 8 9]
[12 13]]
[[ 2 3]
[ 6 7]
[10 11]
[14 15]]
Similarly, np.dsplit will split arrays along the third axis.
import numpy as np
arr = np.array([1, 2, 3, 4, 5])
print(arr)
print(type(arr))
3. Working with Pandas data frames
import pandas as pd
a = [1, 7, 2]
myvar = pd.Series(a)
print(myvar)
0 1
1 7
2 2
dtype: int64
>>>
In this table, the first row contains the column labels (name, city, age, and py-
score). The first column holds the row labels (101, 102, and so on). All other cells
are filled with the data values.
There are several ways to create a Pandas DataFrame. In most cases, you’ll use
the DataFrame constructor and provide the data, labels, and other information.
You can pass the data as a two-dimensional list, tuple, or NumPy array. You can
also pass it as a dictionary or Pandas Series instance, or as one of several other
data types not covered in this tutorial.
For this example, assume you’re using a dictionary to pass the data:
>>>
>>> data = {
... 'name': ['Xavier', 'Ann', 'Jana', 'Yi', 'Robin', 'Amal', 'Nori'],
... 'city': ['Mexico City', 'Toronto', 'Prague', 'Shanghai',
... 'Manchester', 'Cairo', 'Osaka'],
... 'age': [41, 28, 33, 34, 38, 31, 37],
... 'py-score': [88.0, 79.0, 81.0, 80.0, 68.0, 61.0, 84.0]
... }
>>>
>>> l = {'x': [1, 2, 3], 'y': [2, 4, 8], 'z': 100}
>>> pd.DataFrame(l)
   x  y    z
0  1  2  100
1  2  4  100
2  3  8  100
Again, the dictionary keys are the column labels, and the dictionary values are the data values
in the DataFrame.
You can also use a nested list, or a list of lists, as the data values. If you do, then it’s wise to
explicitly specify the labels of columns, rows, or both when you create the DataFrame:
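The example itself is missing from this extract; a minimal sketch of what it could look like (the column and row labels are chosen for illustration):
>>> data_list = [[1, 2, 100],
...              [2, 4, 100],
...              [3, 8, 100]]
>>> pd.DataFrame(data_list, columns=['x', 'y', 'z'], index=[10, 11, 12])
    x  y    z
10  1  2  100
11  2  4  100
12  3  8  100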
CSV files contain plain text and are a well-known format that can be read by everyone,
including Pandas.
Example
Load the CSV into a DataFrame:
import pandas as pd
df = pd.read_csv('data.csv')
print(df.to_string())
Duration Pulse Maxpulse Calories
0 60 110 130 409.1
1 60 117 145 479.0
2 60 103 135 340.0
3 45 109 175 282.4
4 45 117 148 406.0
5 60 102 127 300.5
6 60 110 136 374.0
7 45 104 134 253.3
8 30 109 133 195.1
9 60 98 124 269.0
10 60 103 147 329.3
11 60 100 120 250.7
12 60 106 128 345.3
13 60 104 132 379.3
14 60 98 123 275.0
15 60 98 120 215.2
16 60 100 120 300.0
17 45 90 112 NaN
18 60 103 123 323.0
19 45 97 125 243.0
20 60 108 131 364.2
21 45 100 119 282.0
22 60 130 101 300.0
23 45 105 132 246.0
24 60 102 126 334.5
25 60 100 120 250.0
26 60 92 118 241.0
27 60 103 132 NaN
28 60 100 132 280.0
29 60 102 129 380.3
30 60 92 115 243.0
31 45 90 112 180.1
32 60 101 124 299.0
33 60 93 113 223.0
34 60 107 136 361.0
35 60 114 140 415.0
36 60 102 127 300.5
37 60 100 120 300.1
38 60 100 120 300.0
39 45 104 129 266.0
40 45 90 112 180.1
41 60 98 126 286.0
42 60 100 122 329.4
43 60 111 138 400.0
44 60 111 131 397.0
45 60 99 119 273.0
46 60 109 153 387.6
47 45 111 136 300.0
48 45 108 129 298.0
49 60 111 139 397.6
50 60 107 136 380.2
51 80 123 146 643.1
52 60 106 130 263.0
53 60 118 151 486.0
54 30 136 175 238.0
55 60 121 146 450.7
56 60 118 121 413.0
57 45 115 144 305.0
58 20 153 172 226.4
59 45 123 152 321.0
60 210 108 160 1376.0
61 160 110 137 1034.4
62 160 109 135 853.0
63 45 118 141 341.0
64 20 110 130 131.4
65 180 90 130 800.4
66 150 105 135 873.4
67 150 107 130 816.0
68 20 106 136 110.4
69 300 108 143 1500.2
70 150 97 129 1115.0
71 60 109 153 387.6
72 90 100 127 700.0
73 150 97 127 953.2
74 45 114 146 304.0
75 90 98 125 563.2
76 45 105 134 251.0
77 45 110 141 300.0
78 120 100 130 500.4
79 270 100 131 1729.0
80 30 159 182 319.2
81 45 149 169 344.0
82 30 103 139 151.1
83 120 100 130 500.0
84 45 100 120 225.3
85 30 151 170 300.1
86 45 102 136 234.0
87 120 100 157 1000.1
88 45 129 103 242.0
89 20 83 107 50.3
90 180 101 127 600.1
91 45 107 137 NaN
92 30 90 107 105.3
93 15 80 100 50.5
94 20 150 171 127.4
95 20 151 168 229.4
96 30 95 128 128.2
97 25 152 168 244.2
98 30 109 131 188.2
99 90 93 124 604.1
100 20 95 112 77.7
101 90 90 110 500.0
102 90 90 100 500.0
103 90 90 100 500.4
104 30 92 108 92.7
105 30 93 128 124.0
106 180 90 120 800.3
107 30 90 120 86.2
108 90 90 120 500.3
4. Reading data from text files, Excel and the web and exploring various commands for doing
descriptive analytics on the Iris data set.
Data insights for each visualization step are given below.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import metrics
sns.set()
Source Of Data
Loading data
iris_data = pd.read_csv('iris.csv')
iris_data
Complete Iris Dataset
Visual description of various features of Iris Species
iris_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 sepal_length 150 non-null float64
1 sepal_width 150 non-null float64
2 petal_length 150 non-null float64
3 petal_width 150 non-null float64
4 species 150 non-null object
dtypes: float64(4), object(1)
memory usage: 6.0+ KB
Data Insights:
Statistical Insight
iris_data.describe()
Data Insights:
1. Mean values
2. Standard Deviation
3. Minimum Values
4. Maximum Values
iris_data[iris_data.duplicated()]
Duplicate Entries
iris_data['species'].value_counts()
setosa 50
versicolor 50
virginica 50
Name: species, dtype: int64
Data Visualization
Species count
plt.title('Species Count')
sns.countplot(iris_data['species'])
Data Insight:
Uni-variate Analysis
plt.figure(figsize=(17,9))
plt.title('Comparison between various species based on sepal length and width')
sns.scatterplot(iris_data['sepal_length'], iris_data['sepal_width'], hue=iris_data['species'], s=50)
Data Insights:
plt.figure(figsize=(16,9))
plt.title('Comparison between various species based on petal length and width')
sns.scatterplot(iris_data['petal_length'], iris_data['petal_width'], hue=iris_data['species'], s=50)
Data Insights
Bi-variate Analysis
sns.pairplot(iris_data, hue="species", height=4)
Data Insights:
Checking Correlation
plt.figure(figsize=(10,11))
sns.heatmap(iris_data.corr(),annot=True)
plt.plot()
Heatmap
Data Insights:
iris_data.groupby('species').agg(['mean', 'median'])
Visualizing the distribution, mean and median using box plots & violin plots
Box Plots
Data Insights:
The violin plot shows the density of the length and width in the species. The thinner
parts denote lower density, whereas the fatter parts convey higher density.
fig, axes = plt.subplots(2, 2, figsize=(16,10))
sns.violinplot(y="petal_width", x="species", data=iris_data, orient='v', ax=axes[0, 0], inner='quartile')
sns.violinplot(y="petal_length", x="species", data=iris_data, orient='v', ax=axes[0, 1], inner='quartile')
sns.violinplot(y="sepal_length", x="species", data=iris_data, orient='v', ax=axes[1, 0], inner='quartile')
sns.violinplot(y="sepal_width", x="species", data=iris_data, orient='v', ax=axes[1, 1], inner='quartile')
plt.show()
Violin Plot
Data Insights:
d. Also compare the results of the above analysis for the two data sets.
In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('darkgrid')
%matplotlib inline
from matplotlib.ticker import FormatStrFormatter
import warnings
warnings.filterwarnings('ignore')
In [2]:
df = pd.read_csv('../input/pima-indians-diabetes-database/diabetes.csv')
df.head()
Out[2]:
   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  DiabetesPedigreeFunction  Age  Outcome
0            6      148             72             35        0  33.6                     0.627   50        1
1            1       85             66             29        0  26.6                     0.351   31        0
2            8      183             64              0        0  23.3                     0.672   32        1
3            1       89             66             23       94  28.1                     0.167   21        0
4            0      137             40             35      168  43.1                     2.288   33        1
Introduction:
In this notebook I am going to do some univariate statistical analysis. I will avoid cleaning
the data and dealing with missing values, so that I can see real insights from the data.
(A third analysis on this dataset, making inferences about the population, follows in a separate notebook.)
I will be using only the first 3 or 4 variables to show the workflow of univariate statistical
analysis.
In [3]:
df.shape
Out[3]:
(768, 9)
In [4]:
df.dtypes
Out[4]:
Pregnancies int64
Glucose int64
BloodPressure int64
SkinThickness int64
Insulin int64
BMI float64
DiabetesPedigreeFunction float64
Age int64
Outcome int64
dtype: object
The datatypes are fine, but we can change Outcome to a boolean datatype, which will save
space.
In [5]:
df['Outcome']=df['Outcome'].astype('bool')
In [6]:
df.dtypes['Outcome']
Out[6]:
dtype('bool')
In [7]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
Pregnancies 768 non-null int64
Glucose 768 non-null int64
BloodPressure 768 non-null int64
SkinThickness 768 non-null int64
Insulin 768 non-null int64
BMI 768 non-null float64
DiabetesPedigreeFunction 768 non-null float64
Age 768 non-null int64
Outcome 768 non-null bool
dtypes: bool(1), float64(2), int64(6)
memory usage: 48.9 KB
In [8]:
df.describe().T
Out[8]:
               count       mean        std  min     25%   50%    75%    max
Pregnancies    768.0   3.845052   3.369578  0.0   1.000   3.0   6.00  17.00
SkinThickness  768.0  20.536458  15.952218  0.0   0.000  23.0  32.00  99.00
BMI            768.0  31.992578   7.884160  0.0  27.300  32.0  36.60  67.10
...
Basic Summary:
The data is related to the healthcare industry, with 768 observations of 9 variables. The target
variable is Outcome. It looks like there are no missing values, and boolean, float and integer
datatypes are present. The descriptive analysis shows that the variables Glucose, BloodPressure,
SkinThickness, Insulin and BMI have a minimum value of 0, which does not make any sense; these
values are either missing or outliers, and I will treat them later. In the Pregnancies column the
minimum is 0 (probably a sign of no pregnancy), which is reasonable, but the maximum of 17 looks
questionable, and I will deal with it later as well. Variance among the predictor variables differs
on a large scale, so scaling the data will be helpful for predictive modelling.
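A hedged sketch of that scaling step using scikit-learn's StandardScaler (the choice of scaler and the column selection are illustrative, not part of the original notebook):
from sklearn.preprocessing import StandardScaler

predictors = df.drop('Outcome', axis=1)    # all predictor columns
scaler = StandardScaler()
scaled = scaler.fit_transform(predictors)  # zero mean, unit variance per column
df_scaled = pd.DataFrame(scaled, columns=predictors.columns)
print(df_scaled.describe().T[['mean', 'std']])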
Pregnancy Variable
In [9]:
preg_proportion = np.array(df['Pregnancies'].value_counts())
preg_month = np.array(df['Pregnancies'].value_counts().index)
preg_proportion_perc = np.array(np.round(preg_proportion/sum(preg_proportion),3)*100, dtype=int)
In [10]:
preg = pd.DataFrame({'month':preg_month, 'count_of_preg_prop':preg_proportion, 'percentage_proportion':preg_proportion_perc})
preg.set_index(['month'], inplace=True)
preg.head(10)
Out[10]:
       count_of_preg_prop  percentage_proportion
month
1                     135                     17
0                     111                     14
2                     103                     13
3                      75                      9
4                      68                      8
5                      57                      7
6                      50                      6
7                      45                      5
8                      38                      4
9                      28                      3
Note: the box plot of both categories shows that people with a higher number of pregnancies have a
higher risk of diabetes. (There is no statistical evidence; maybe I will test a hypothesis in my 3rd
notebook on this data.)
Glucose Variable
In [12]:
df.Glucose.describe()
Out[12]:
count 768.000000
mean 120.894531
std 31.972618
min 0.000000
25% 99.000000
50% 117.000000
75% 140.250000
max 199.000000
Name: Glucose, dtype: float64
Understanding Distribution
The distribution of Glucose level among patients is unimodal and roughly bell shaped,
centered at about 115 with most of the data between 90 and 140, a range of roughly 150,
and outliers present on the lower end (Glucose == 0).
Note: the box plot of both categories shows that people with a higher glucose level have a higher
risk of diabetes. (There is no statistical evidence; maybe I will test a hypothesis in my 3rd
notebook on this data.) We can also see that some outliers are present in the non-diabetic patient
observations.
df.BloodPressure.describe()
Out[15]:
count 768.000000
mean 69.105469
std 19.355807
min 0.000000
25% 62.000000
50% 72.000000
75% 80.000000
max 122.000000
Name: BloodPressure, dtype: float64
Understanding Distribution
The distribution of BloodPressure among patients is unimodal (this is not bimodal,
because BP = 0 does not make any sense and is an outlier) and bell shaped, centered at
about 65 with most of the data between 60 and 90, a range of roughly 100, and outliers
present on the lower end (BP == 0).
Note: the box plot of both categories shows that there is little association of BP with Diabetic vs
Non-Diabetic patients. (There is no statistical evidence; maybe I will test a hypothesis in
my 3rd notebook on this data.) We can also see that some outliers are present.
Please notice how outliers can change your whole understanding about data.
df.SkinThickness.describe()
Out[19]:
count 768.000000
mean 20.536458
std 15.952218
min 0.000000
25% 0.000000
50% 23.000000
75% 32.000000
max 99.000000
Name: SkinThickness, dtype: float64
Understanding Distribution
The distribution of SkinThickness among patients looks bimodal (but I think it is not truly
bimodal, because ST = 0 does not make any sense and may distort the distribution; I will
verify this) and it is bell shaped, centered at about 20 with most of the data between 15
and 45, a range of roughly 60, and outliers present on the lower end (ST == 0).
Note: the box plot of both categories shows that there is little association of ST with Diabetic
vs Non-Diabetic patients. (There is no statistical evidence; maybe I will test a hypothesis
in my 3rd notebook on this data.)
Dataset
Doing the same for Blood Pressure, we can see that the variable looks approximately
normally distributed. However, we cannot confirm that visually, so we perform the
Shapiro-Wilk test of normality. The null hypothesis (H0) is that the data is normal.
from scipy.stats import shapiro

stat, p = shapiro(dataset['BloodPressure'])
print('Statistics=%.3f, p=%.3f' % (stat, p))
Statistics=0.819, p=0.000
Since p < 0.05, we reject the null hypothesis: BloodPressure is not normally distributed.
We now split our processed dataset into training and test data. The test data size is
taken to be 15% of the entire data (which means 115 observations) and the model will be
trained on 653 observations.
#Splitting the data into training and test
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(data_x, Y,
test_size = 0.15, random_state = 45)
Plot showing frequency of 0 and 1 in Y train
The first model we fit on the training data is the Logistic Regression.
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression()
logreg.fit(x_train, y_train)
y_pred = logreg.predict(x_test)
print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(logreg.score(x_test, y_test)))
Accuracy of logistic regression classifier on test set: 0.73
Our Precision for the model stands at 0.722. This indicates that when our model classified
a patient as high risk, the patient actually had a high risk of getting diabetes about 72%
of the time.
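The cell that computes this precision is not shown in the extract; assuming y_test and y_pred from the logistic regression cell above, it could be computed as follows (the exact averaging option affects the number reported):
from sklearn.metrics import precision_score

# precision: of all patients predicted to be high risk, the fraction that truly are
print(precision_score(y_test, y_pred))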
The next model we fit on the training data is the Support Vector Machine (SVM). SVM uses
many kernels to classify the data. We use the rbf/Gaussian kernel to fit this model.
from sklearn.svm import SVC
from sklearn.metrics import f1_score, precision_score, recall_score

classifier_rbf = SVC(kernel='rbf')
classifier_rbf.fit(x_train, y_train)
y_pred = classifier_rbf.predict(x_test)
print('Accuracy of SVC (RBF) classifier on test set: {:.2f}'.format(classifier_rbf.score(x_test, y_test)))
Accuracy of SVC (RBF) classifier on test set: 0.75

print(f1_score(y_test, y_pred, average="macro"))
print(precision_score(y_test, y_pred, average="macro"))
print(recall_score(y_test, y_pred, average="macro"))
0.7431080565101182
0.7410256410256411
0.7481366459627329
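The Random Forest cell itself is not reproduced in this extract; a minimal sketch of how it might look (the hyperparameters are illustrative, not the original notebook's values):
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score

rf = RandomForestClassifier(n_estimators=100, random_state=45)
rf.fit(x_train, y_train)
y_pred_rf = rf.predict(x_test)

print('Accuracy of Random Forest classifier on test set: {:.2f}'.format(rf.score(x_test, y_test)))
print(precision_score(y_test, y_pred_rf, average='macro'))
print(recall_score(y_test, y_pred_rf, average='macro'))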
We get the highest accuracy for Random Forest, with the score reaching 88%. This implies
our model classified observations correctly 88% of the time.
The Precision score stood at 0.876, implying that 87.6% of the observations our model
placed in the high risk category actually belonged there. The Recall stood at 0.870.
Conclusion
We thus select the Random Forest Classifier as the right model due to its high accuracy,
precision and recall scores. One reason why the Random Forest Classifier showed improved
performance is the presence of outliers. As mentioned before, since Random Forest is not a
distance-based algorithm, it is not much affected by outliers, whereas distance-based
algorithms such as Logistic Regression and Support Vector Machines showed lower
performance.
d. Histograms
Motivation
Introduction
e) Statistics summary
Let's begin…!!
1. Import and get to know the data
2. Data Cleaning
e) Statistics summary
or, using seaborn:
sns.boxplot(x='target', y='oldpeak', data=df)
From the bar graph, we can observe that among disease patients, males outnumber females.
Fasting blood sugar, or fbs, is a diabetes indicator, with fbs > 120 mg/dl considered
diabetic (the True class). Here, we observe that the count for the True class is lower
than for the False class. However, if we look closely, there is a higher number of heart
disease patients without diabetes. This indicates that fbs might not be a strong feature
for differentiating between heart disease and non-disease patients.
profiling/archive/master.zip
6) Now open your favorite Python notebook or IDE in the active conda environment. In my
case, I used Jupyter Notebook.
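Once the environment is active, a hedged sketch of generating a report with pandas-profiling (the file name heart.csv is illustrative):
import pandas as pd
from pandas_profiling import ProfileReport

df = pd.read_csv('heart.csv')                         # illustrative file name
profile = ProfileReport(df, title='Profiling Report')
profile.to_file('profiling_report.html')              # writes an interactive HTML report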
Coastlines
Countries
Fill continents
The Basemap() function is used to set projection, bounding box, & resolution of
a map
Map projection:
b) Map center:
Example:
m = Basemap(projection='ortho', lon_0 = 25, lat_0 = 10)
c: crude
l: low
i: intermediate
h: high
f: full
Below is an example that uses the bounding box, map center, and map resolution arguments to
create and modify maps:
fig = plt.figure(figsize=(16, 16))
ax1 = plt.subplot2grid((3, 2), (0, 0))
ax2 = plt.subplot2grid((3, 2), (0, 1))
ax3 = plt.subplot2grid((3, 2), (1, 0))
ax4 = plt.subplot2grid((3, 2), (1, 1))

m1 = Basemap(llcrnrlon=32.5, llcrnrlat=3, urcrnrlon=48.5, urcrnrlat=15, ax=ax1, resolution='c')
m1.drawcoastlines()
m1.fillcontinents(color='tan', lake_color='lightblue')
m1.drawcountries(linewidth=1, linestyle='solid', color='k')
m1.drawmapboundary(fill_color='lightblue')
ax1.set_title("Crude resolution", fontsize=18)

m2 = Basemap(llcrnrlon=32.5, llcrnrlat=3, urcrnrlon=48.5, urcrnrlat=15, ax=ax2, resolution='l')
m2.drawcoastlines()
m2.fillcontinents(color='tan', lake_color='lightblue')
m2.drawcountries(linewidth=1, linestyle='solid', color='k')
m2.drawmapboundary(fill_color='lightblue')
ax2.set_title("Low resolution", fontsize=18)

m3 = Basemap(llcrnrlon=32.5, llcrnrlat=3, urcrnrlon=48.5, urcrnrlat=15, ax=ax3, resolution='i')
m3.drawcoastlines()
m3.fillcontinents(color='tan', lake_color='lightblue')
m3.drawcountries(linewidth=1, linestyle='solid', color='k')
m3.drawmapboundary(fill_color='lightblue')
ax3.set_title("Intermediate resolution", fontsize=18)

m4 = Basemap(llcrnrlon=32.5, llcrnrlat=3, urcrnrlon=48.5, urcrnrlat=15, ax=ax4, resolution='h')
m4.drawcoastlines()
m4.fillcontinents(color='tan', lake_color='lightblue')
m4.drawcountries(linewidth=1, linestyle='solid', color='k')
m4.drawmapboundary(fill_color='lightblue')
ax4.set_title("High resolution", fontsize=18)
Background relief maps
c) NOAA etopo
d) Shaded relief
land_color: sets the color of the land (by default it’s gray)
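A short hedged sketch of these background relief options (the bounding box is reused from the earlier examples):
from mpl_toolkits.basemap import Basemap
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(8, 8))
m = Basemap(llcrnrlon=32.5, llcrnrlat=3, urcrnrlon=48.5, urcrnrlat=15, resolution='i')
m.etopo()            # NOAA etopo relief background; m.shadedrelief() gives a shaded-relief background
m.drawcoastlines()
m.drawcountries()
plt.show()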
To read the netCDF data, we’ll use the Dataset class from
the netcdf4-python library
from netCDF4 import Dataset as dataset

# open the netCDF file (the file name below is illustrative; use the name of your downloaded data)
nc = dataset('era5_2m_temperature.nc')
Let’s read the coordinate variables (latitude, longitude, & time) &
data variable (2-meter temperature) of the netCDF file.
lat = nc.variables['latitude'][:]
lon = nc.variables['longitude'][:]
time = nc.variables['time'][:]
t2 = nc.variables['p2t'][:]
data
colormap
levels
fig = plt.figure(num=None, figsize=(7, 7))
m = Basemap(projection='cyl', llcrnrlon=32.5, llcrnrlat=3, urcrnrlon=49, urcrnrlat=15, resolution='i')

x, y = m(*np.meshgrid(lon, lat))
cs = m.contourf(x, y, np.squeeze(t2[4, :, :]), levels=100, cmap=plt.cm.jet)

m.drawcoastlines()
m.drawmapboundary()
m.drawcountries(linewidth=1, linestyle='solid', color='k')
m.drawmeridians(range(33, 48, 2), color='k', linewidth=1.0, dashes=[4, 4], labels=[0, 0, 0, 1])
m.drawparallels(range(3, 15, 2), color='k', linewidth=1.0, dashes=[4, 4], labels=[1, 0, 0, 0])

plt.ylabel("Latitude", fontsize=15, labelpad=35)
plt.xlabel("Longitude", fontsize=15, labelpad=20)

cbar = m.colorbar(cs, location='right', pad="3%")
cbar.set_label('Temperature (K)', fontsize=13)

plt.title('2-meter temperature filled contour map', fontsize=15)
plt.show()
Plotting contour maps:
Exploring data sets and developing deep understanding about the data is one of the
most important skills every data scientist should possess. People estimate that the
time spent on these activities can go as high as 80% of the project time in some
cases.
Python has been gaining a lot of ground as preferred tool for data scientists lately,
and for the right reasons. Ease of learning, powerful libraries with integration of
C/C++, production readiness and integration with web stack are some of the main
reasons for this move lately.
In this guide, I will use NumPy, Matplotlib, Seaborn, and Pandas to perform data
exploration. These are powerful libraries to perform data exploration in Python.
The idea is to create a ready reference for some of the regular operations required
frequently. I am using an iPython Notebook to perform data exploration and would
recommend the same for its natural fit for exploratory analysis.
In case you missed it, I would suggest you refer to the baby steps series on Python to
understand the basics of Python programming.
Here are the operations I’ll cover in this article (Refer to this article for similar
operations in SAS):
1. How to load data file(s) using Pandas?
2. How to convert a variable to different data type?
3. How to transpose a table/Pandas dataframe?
4. How to sort Data?
5. How to create plots (Histogram, Scatter, Box Plot)?
6. How to generate frequency tables?
7. How to do sampling of Data set?
Part 1: How to load data file(s) using Pandas?
Input data sets can be in various formats (.XLS, .TXT, .CSV, JSON ). In Python, it
is easy to load data from any source, due to its simple syntax and availability of
predefined libraries, such as Pandas. Here I will make use of Pandas itself.
Code
df = pd.read_excel("E:/EMP.xlsx", "Data")  # load the "Data" sheet of Excel file EMP
print(df)
Output
Code:
Output
The latter operations are especially useful when you read a value from the user with
input() (raw_input() in Python 2). By default, the values are read as strings.
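A hedged sketch of such conversions (the column name and prompt are illustrative):
# convert a DataFrame column to integer
df['Age'] = df['Age'].astype(int)

# convert user input (read as a string) to a number
age = int(input('Enter age: '))
print(age + 1)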
Here, I want to transpose Table A into Table B on the variable Product. This task
can be accomplished by using Pandas dataframe.pivot:
Code
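The original code screenshot is not reproduced here; a hedged sketch using the ID, Product and Sales variables described below (the sample values are invented):
# Table A: long format with one row per (ID, Product) pair
df_a = pd.DataFrame({'ID': [1, 1, 2, 2],
                     'Product': ['A', 'B', 'A', 'B'],
                     'Sales': [100, 150, 200, 50]})

# Table B: one row per ID, one column per Product
df_b = df_a.pivot(index='ID', columns='Product', values='Sales')
print(df_b)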
Output
Code
Above, we have a table with variables ID, Product and Sales. Now, we want to sort it by
Product and Sales (in descending order) as shown in table 2.
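A hedged sketch of that sort, reusing the illustrative df_a from the pivot example above:
# sort by Product (ascending) and, within each product, by Sales (descending)
sorted_df = df_a.sort_values(['Product', 'Sales'], ascending=[True, False])
print(sorted_df)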
Histogram:
Code
#Plot Histogram
import matplotlib.pyplot as plt
import pandas as pd
df=pd.read_excel("E:/First.xlsx", "Sheet1")
#Plots in matplotlib reside within a figure object, use plt.figure to create a new figure
fig=plt.figure()
#Create one or more subplots using add_subplot, because you can't create a blank figure
ax = fig.add_subplot(1,1,1)
#Variable
ax.hist(df['Age'],bins = 5)
#Labels and Title
plt.title('Age distribution')
plt.xlabel('Age')
plt.ylabel('#Employee')
plt.show()
Output
Scatter plot:
Code
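The code screenshot is missing from this extract; a hedged sketch (assuming the sheet also contains a Sales column):
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_excel("E:/First.xlsx", "Sheet1")
fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)
ax.scatter(df['Age'], df['Sales'])  # one point per employee
plt.title('Sales vs Age')
plt.xlabel('Age')
plt.ylabel('Sales')
plt.show()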
Output
Box-plot:
Code
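Again the screenshot is missing; a hedged sketch that groups Age by the Gender column used in the frequency table below:
df.boxplot(column='Age', by='Gender')  # one box per gender
plt.suptitle('')                        # remove the automatic grouping title
plt.title('Age distribution by Gender')
plt.show()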
Output
Part 6: How to generate frequency tables with Pandas?
Code
import pandas as pd
df=pd.read_excel("E:/First.xlsx", "Sheet1")
print(df)
test= df.groupby(['Gender','BMI'])
test.size()
Output
Let’s say, from EMP table, I want to select random sample of 5 employees.
Code
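A hedged sketch using pandas' sample method:
# pick 5 employees at random; random_state makes the draw reproducible
sample_df = df.sample(n=5, random_state=1)
print(sample_df)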
Output
Output
To understand the count, average and sum of variable, I would suggest you use
dataframe.describe() with Pandas groupby().
Code
test= df.groupby(['Gender'])
test.describe()
Output
Part 10: How to recognize and Treat missing values and outliers in Pandas?
To identify missing values , we can use dataframe.isnull(). You can also refer
article “Data Munging in Python (using Pandas)“, here we have done a case study
to recognize and treat missing and outlier values.
Code
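A hedged sketch of the missing-value check:
print(df.isnull().head())  # True wherever a value is missing
print(df.isnull().sum())   # count of missing values per column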
Output
To treat missing values, there are various imputation methods available. You can
refer these articles for methods to detect Outlier and Missing values. Imputation
methods for both missing and outlier values are almost similar. Here we will
discuss general case imputation methods to replace missing values. Let’s do it
using an example:
Code:
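A hedged sketch of simple mean/mode imputation (the column names are illustrative):
# numeric column: replace missing values with the column mean
df['Age'] = df['Age'].fillna(df['Age'].mean())

# categorical column: replace missing values with the most frequent value
df['Gender'] = df['Gender'].fillna(df['Gender'].mode()[0])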
Part 11: How to merge / join data sets and Pandas dataframes?
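As a hedged sketch of a typical merge on a shared key (the tables and key are illustrative):
import pandas as pd

emp = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Asha', 'Ravi', 'Meena']})
sal = pd.DataFrame({'ID': [1, 2, 4], 'Salary': [50000, 60000, 55000]})

# a left join keeps every employee and fills Salary with NaN where there is no match
merged = pd.merge(emp, sal, on='ID', how='left')
print(merged)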