
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

CS3361 - DATA SCIENCE LABORATORY

II YEAR/III SEMESTER
REGULATION 2021

LAB MANUAL

Name of the Student : …………………………………….

Register Number : ……………………………………

Dept/Section : ……………………………………

Year/Semester : ……………………………………
VISION OF THE INSTITUTION

To be a globally recognized Centre for Engineering and Technology by

imparting knowledge, relevant skills and human values to address global

challenges through applied research, innovation and entrepreneurship.

MISSION OF THE INSTITUTION

To impart quality technical education and inculcate a culture of lifelong learning.

To nurture industrial collaboration for building aspirations for careers in emerging technologies.

To create next generation socially responsible leaders through

holistic development and building professional competencies.

To facilitate a stimulating environment conducive to research, innovation and entrepreneurship.


ANNA UNIVERSITY, CHENNAI
NON-AUTONOMOUS AFFILIATED COLLEGES
REGULATIONS 2021
CHOICE BASED CREDIT SYSTEM

CS3361 DATA SCIENCE LABORATORY


L T P C
0 0 4 2

COURSE OBJECTIVES

1. To understand the Python libraries for data science.

2. To understand the basic statistical and probability measures for data science.

3. To learn descriptive analytics on benchmark data sets.

4. To apply correlation and regression analytics on standard data sets.

5. To present and interpret data using visualization packages in Python.

LIST OF EXPERIMENTS:

1. Download, install and explore the features of NumPy, SciPy, Jupyter, Statsmodels and Pandas packages.

2. Working with NumPy arrays

3. Working with Pandas data frames.

4. Reading data from text files, Excel and the web and exploring various commands for doing descriptive analytics on the Iris data set.

5. Use the diabetes data set from UCI and Pima Indians Diabetes data set for performing the following:

a. Univariate analysis: Frequency, Mean, Median, Mode, Variance, Standard Deviation, Skewness and Kurtosis.

b. Bivariate analysis: Linear and logistic regression modelling

c. Multiple Regression analysis.

d. Also compare the results of the above analysis for the two data sets.

6. Apply and explore various plotting functions on UCI data sets.

a. Normal curves

b. Density and contour plots.

c. Correlation and Scatter plots

d. Histograms

e. Three dimensional plotting.

7. Visualizing Geographic Data with Basemap

COURSE OUTCOMES:

At the end of this course, the students will be able to:

CO1: Make use of the Python libraries for data science.

CO2: Make use of the basic statistical and probability measures for data science.

CO3: Perform descriptive analytics on the benchmark data sets.

CO4: Perform correlation and regression analytics on standard data sets.

CO5: Present and interpret data using visualization packages in Python.


INDEX PAGE

EX.NO   DATE   TITLE   CO'S MAPPED   PO'S & PSO'S MAPPED   MARKS   SIGN

1    Download, install and explore the features of NumPy, SciPy, Jupyter, Statsmodels and Pandas packages.
2    Working with NumPy arrays
3    Working with Pandas data frames
4    Reading data from text files, Excel and the web and exploring various commands for doing descriptive analytics on the Iris data set.
5a   Use the diabetes data set from UCI and Pima Indians Diabetes data set for performing the following:
     Univariate analysis: Frequency, Mean, Median, Mode, Variance, Standard Deviation, Skewness and Kurtosis.
5b   Bivariate analysis: Linear and logistic regression modelling
5c   Multiple Regression analysis.
5d   Also compare the results of the above analysis for the two data sets.
6a   Apply and explore various plotting functions on UCI data sets: Normal curves
6b   Density and contour plots.
6c   Correlation and Scatter plots
6d   Histograms
6e   Three dimensional plotting.
7    Visualizing Geographic Data with Basemap

Content Beyond the Syllabus

8    Develop Python program for Correlation coefficient
9    Develop Python program for Simple Linear Regression
List of Experiments
1. Download, install and explore the features of NumPy, SciPy, Jupyter, Statsmodels and Pandas packages.

2.Working with Numpy arrays

3.Working with Pandas data frames

4.Reading data from text files, Excel and the web and exploring
various commands for doing descriptive analytics on the Iris
data set.

5.Use the diabetes data set from UCI and Pima Indians
Diabetes data set for performing the following:
a. Univariate analysis: Frequency, Mean, Median, Mode, Variance, Standard Deviation, Skewness and Kurtosis.
b. Bivariate analysis: Linear and logistic regression modeling
c. Multiple Regression analysis
d. Also compare the results of the above analysis for the two data sets.

6.Apply and explore various plotting functions on UCI data sets.


a. Normal curves
b. Density and contour plots
c. Correlation and scatter plots
d. Histograms
e. Three dimensional plotting

7.Visualizing Geographic Data with Basemap


Ex.No:1 Download, install and explore the features of NumPy,
SciPy, Jupyter, Statsmodels and Pandas Packages

Aim: To study the packages of Python.

Algorithm:
Step 1: Go to the command prompt.
Step 2: Type pip install numpy.
Step 3: The NumPy package is installed.
Step 4: Type pip install scipy; the SciPy package is installed.
Step 5: Type pip install jupyter; the Jupyter packages are installed.
Step 6: Type pip install statsmodels; the package is installed.
Step 7: Type pip install pandas; the package is installed.

Anaconda:

Anaconda is a distribution of the Python and R programming languages for scientific


computing (data science, machine learning applications, large-scale data processing, predictive
analytics, etc.), that aims to simplify package management and deployment.

Jupyter NoteBook:
The Jupyter Notebook is an open-source web application that allows you to create and
share documents that contain live code, equations, visualizations, and narrative text. Its uses
include data cleaning and transformation, numerical simulation, statistical modeling, data
visualization, machine learning, and much more.

NumPy:
NumPy is a Python library used for working with arrays. It also has functions for working in the domains of linear algebra, Fourier transforms, and matrices.

SciPy:
SciPy is a scientific computation library that uses NumPy underneath. SciPy stands for Scientific Python. It provides more utility functions for optimization, stats and signal processing. Like NumPy, SciPy is open source, so we can use it freely.

NumPy, which stands for Numerical Python, is used for the manipulation of elements of numerical array data. SciPy, which stands for Scientific Python, is used for numerical computations in Python. Both these packages provide extended functionality to work with Python.

Statsmodels:
Statsmodels is a popular library in Python that enables us to estimate and analyze various statistical models. It is built on numeric and scientific libraries like NumPy and SciPy. It includes various models of linear regression like ordinary least squares, generalized least squares, weighted least squares, etc.

Pandas
Pandas is a powerful library. It provides a huge set of important commands and features which are used to easily analyze your data. We can use Pandas to perform various tasks like filtering your data according to certain conditions, or segmenting and segregating the data according to preference, etc.
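As a quick smoke test of the four libraries described above, the short sketch below exercises one basic call from each; the toy numbers are illustrative assumptions, not part of the syllabus.

import numpy as np
from scipy import stats
import statsmodels.api as sm
import pandas as pd

arr = np.array([1.0, 2.0, 3.0, 4.0])       # NumPy: numerical array
print(stats.describe(arr))                  # SciPy: descriptive statistics
X = sm.add_constant(np.arange(4))           # Statsmodels: design matrix with intercept
model = sm.OLS(arr, X).fit()                # ordinary least squares fit
print(model.params)                         # estimated intercept and slope
df = pd.DataFrame({"value": arr})           # Pandas: tabular data
print(df.describe())                        # summary statistics of the column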

STEP 1: Install Python IDLE (preferably 3.12.10 or above).

STEP 2: Click the downloaded .exe file

Step 3.Now, the installation is set up.

STEP 4: The IDLE is shown as in the figure.

STEP 5: Now the pip installation is to be done; download pip from the following link: http://bootstrap.pypa.io/get-pip.py

Step 6: Open the command prompt and install the libraries for numpy, statsmodels, jupyter and pandas.
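Assuming pip is available on the PATH, the installs from Step 6 (plus SciPy, which Experiment 1 also covers) can be done in a single command:

pip install numpy scipy jupyter statsmodels pandas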
Type “Anaconda Download” in Google Chrome.

Scroll down the Anaconda products website and click the 64-bit installer.

Click the downloaded .exe file and install Anaconda

1. Click the "Just Me" option.

2. Anaconda is stored in the C:\Users path.

3. After installation, open the Anaconda Navigator.

4. Launch Jupyter Notebook for typing programs.

5. Click "New".

6. Click the Run button to get the output.

Inference:

Result :

Ex.No:2 WORKING WITH NUMPY ARRAYS

Aim: To work with numpy arrays

Algorithm:
Step 1: Import numpy as np.
Step 2: Create an array: variable_name = np.array([...]).
Step 3: Print the array attributes (type, ndim, shape, size, dtype).

# Python program to demonstrate basic array characteristics
import numpy as np

# Creating array object
arr = np.array([[1, 2, 3],
                [4, 2, 5]])

# Printing type of arr object
print("Array is of type: ", type(arr))

# Printing array dimensions (axes)
print("No. of dimensions: ", arr.ndim)

# Printing shape of array
print("Shape of array: ", arr.shape)

# Printing size (total number of elements) of array
print("Size of array: ", arr.size)

# Printing type of elements in array
print("Array stores elements of type: ", arr.dtype)

Output:
Array is of type: <class 'numpy.ndarray'>
No. of dimensions: 2
Shape of array: (2, 3)
Size of array: 6
Array stores elements of type: int64

Program to Perform Array Slicing

import numpy as np

a = np.array([[1, 2, 3], [3, 4, 5], [4, 5, 6]])
print(a)
print("After slicing")
print(a[1:])

Output

[[1 2 3]
[3 4 5]
[4 5 6]]
After slicing
[[3 4 5]
[4 5 6]]

Program to Perform Array Slicing

# array to begin with
import numpy as np

a = np.array([[1, 2, 3], [3, 4, 5], [4, 5, 6]])
print('Our array is:')
print(a)

# this returns array of items in the second column
print('The items in the second column are:')
print(a[..., 1])
print('\n')

# Now we will slice all items from the second row
print('The items in the second row are:')
print(a[1, ...])
print('\n')

# Now we will slice all items from column 1 onwards
print('The items column 1 onwards are:')
print(a[..., 1:])

Output:
Our array is:
[[1 2 3]
[3 4 5]
[4 5 6]]
The items in the second column are:
[2 4 5]
The items in the second row are:
[3 4 5]
The items column 1 onwards are:
[[2 3]
[4 5]
[5 6]]

Inference:

Result:

Ex.No:3 Working with Pandas Data Frames

Aim:

To work with Pandas data frame.

Algorithm:

Step 1: Import pandas.
Step 2: Create a data frame using a list.
Step 3: Create a data frame from a dict of ndarrays/lists.
Step 4: Delete rows and columns.

Pandas Data Frames:

Pandas is a Python library used for working with data sets.

It has functions for analyzing, cleaning, exploring, and manipulating data.

The name "Pandas" has a reference to both "Panel Data", and "Python Data Analysis"
and wascreated by Wes McKinney in 2008.

A Pandas DataFrame is a 2 dimensional data structure, like a 2 dimensional array, or a


tablewith rows and columns.
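The OUTPUT section further below shows the results of DataFrame-creation commands; the statements here are a minimal sketch reconstructed from those printed results (the exact values and labels are assumptions), illustrating Steps 1-4 of the algorithm.

import numpy as np
import pandas as pd

# DataFrame from a 2-D list with named rows and columns
df1 = pd.DataFrame([[1, 2], [3, 4]], index=['Row1', 'Row2'], columns=['Col1', 'Col2'])
print(df1)

# DataFrame from a 2-D ndarray (default integer row and column labels)
df2 = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6]]))
print(df2)

# DataFrame from a dict of lists (each key is a column)
df3 = pd.DataFrame({'A': [4, 5, 6, 7]})
print(df3)

# DataFrame from a dict, with country names as the index
df4 = pd.DataFrame({0: ['London', 'New Delhi', 'Washington', 'Brussels']},
                   index=['United Kingdom', 'India', 'United States', 'Belgium'])
print(df4)

# Deleting a row and a column (Step 4 of the algorithm)
print(df2.drop(0))             # drop the row with label 0
print(df2.drop(1, axis=1))     # drop the column with label 1

print(df2.shape)               # (2, 3)
print(len(df2.index))          # 2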

Program:
# importing the required module
import matplotlib.pyplot as plt

# x axis values
x = [1, 2, 3]
# corresponding y axis values
y = [2, 4, 1]

# plotting the points
plt.plot(x, y)

# naming the x axis
plt.xlabel('x - axis')
# naming the y axis
plt.ylabel('y - axis')

# giving a title to my graph
plt.title('My first graph!')

# function to show the plot
plt.show()

OUTPUT

      Col1  Col2
Row1     1     2
Row2     3     4

   0  1  2
0  1  2  3
1  4  5  6

   1  2  3
0  1  1  2
1  3  2  4

   A
0  4
1  5
2  6
3  7

                         0
United Kingdom      London
India            New Delhi
United States   Washington
Belgium           Brussels

(2, 3)
2

Inference:

Result:

Ex.No:4 Reading data from text files, Excel and the web and exploring various commands for doing descriptive analysis on the Iris data set.

Aim:
To read data from text files, Excel and the web, and explore various commands for doing descriptive analysis on the Iris data set.

Algorithm:

Step 1: Open Notepad and type some text.
Step 2: Save that text file to the Desktop or any other folder.
Step 3: Open PyCharm and type the code.
Step 4: Run the program.
Step 5: The output is displayed.

Exploratory Data Analysis

Exploratory Data Analysis (EDA) is a technique to analyze data using visual techniques. With this technique, we can get detailed information about the statistical summary of the data.

Iris Dataset
The Iris Dataset is considered the "Hello World" of data science. It contains five columns, namely Petal Length, Petal Width, Sepal Length, Sepal Width, and Species Type. Iris is a flowering plant; researchers have measured various features of the different iris flowers and recorded them digitally.
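Experiment 4 also calls for reading from text files, Excel and the web; a minimal sketch of the three readers follows (the file names and URL are illustrative assumptions; read_excel needs an engine such as openpyxl for .xlsx files):

import pandas as pd

# From a plain text file (comma-separated in this sketch)
df_txt = pd.read_table("iris.txt", sep=",")

# From an Excel workbook
df_xls = pd.read_excel("Iris.xlsx")

# Directly from the web: pd.read_csv accepts a URL
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
df_web = pd.read_csv(url, header=None,
                     names=["sepallength", "sepalwidth",
                            "petallength", "petalwidth", "class"])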

Read a web dataset:

import pandas as pd

# Reading the CSV file
df = pd.read_csv("Iris.csv")

# Printing top 5 rows
df.head()

Output:

I. Getting Information about the Dataset

1. Shape: the shape attribute gives the shape of the dataset.

df.shape

Output:
(150, 6)
The dataframe contains 6 columns and 150 rows.

2. info(): shows the columns and their data types.

df.info()
Output:

3. describe(): The describe() function applies basic statistical computations on the dataset, like extreme values, count of data points, standard deviation, etc. Any missing value or NaN value is automatically skipped. The describe() function gives a good picture of the distribution of data.

df.describe()
Output:

II. Checking Missing Values

1. isnull(): We will check if our data contains any missing values or not. Missing values can occur when no information is provided for one or more items or for a whole unit.

df.isnull().sum()
Output:

2. Checking Duplicates:
The pandas drop_duplicates() method helps in removing duplicates from the data frame:

data = df.drop_duplicates(subset="class")

Output:

3. Count:

The Series.value_counts() function returns a Series containing counts of unique values.

df.value_counts("class")

Output:

III. Data Visualization: We will use the Matplotlib and Seaborn libraries for data visualization. Matplotlib is an easy-to-use and amazing visualization library in Python. It is built on NumPy arrays, designed to work with the broader SciPy stack, and consists of several plots like line, bar, scatter, histogram, etc.
Seaborn is a library mostly used for statistical plotting in Python. It is built on top of Matplotlib and provides beautiful default styles and color palettes to make statistical plots more attractive.

Data visualization using the Matplotlib and Seaborn libraries:

# importing packages
import seaborn as sns
import matplotlib.pyplot as plt

sns.countplot(x='class', data=df)
plt.show()

Output:

Relation between variables

Hue: The hue parameter denotes which column decides the colour of the points.
legend(): A legend is an area describing the elements of the graph.
Bounding Box: bbox_to_anchor=[x0, y0] will create a bounding box with its lower left corner at position [x0, y0]. The legend will then be placed 'inside' this box and overlap it according to the specified loc parameter.
loc: The loc attribute in legend() is used to specify the location of the legend. The default value of loc is loc='best'.

Example 1: Comparing Sepal Length and Sepal Width

# importing packages
import seaborn as sns
import matplotlib.pyplot as plt

sns.scatterplot(x='sepallength', y='sepalwidth', hue='class', data=df)

# Placing Legend outside the Figure
plt.legend(bbox_to_anchor=(1, 1), loc=2)

plt.show()

Output:
From the above plot, we can infer that:
 Species Setosa has smaller sepal lengths but larger sepal widths.
 Versicolor species lies in the middle of the other two species in terms of sepal length and width.
 Species Virginica has larger sepal lengths but smaller sepal widths.

Example 2: Comparing Petal Length and Petal Width

# importing packages
import seaborn as sns
import matplotlib.pyplot as plt

sns.scatterplot(x='petallength', y='petalwidth', hue='class', data=df)

# Placing Legend outside the Figure
plt.legend(bbox_to_anchor=(1, 1), loc=2)

plt.show()

Output:

From the above plot, we can infer that:

 Species Setosa has smaller petal lengths and widths.
 Versicolor species lies in the middle of the other two species in terms of petal length and width.
 Species Virginica has the largest petal lengths and widths.

Let's plot all the columns' relationships using a pairplot. It can be used for multivariate analysis.
Example:
# importing packages
import seaborn as sns
import matplotlib.pyplot as plt

sns.pairplot(df.drop(['Id'], axis=1), hue='class', height=2)
Histograms

Histograms allow seeing the distribution of data for various columns. They can be used for univariate as well as bivariate analysis.
Example:
import seaborn as sns
import matplotlib.pyplot as plt

fig, axes = plt.subplots(2, 2, figsize=(10, 10))

axes[0, 0].set_title("Sepal Length")
axes[0, 0].hist(df['sepallength'], bins=7)

axes[0, 1].set_title("Sepal Width")
axes[0, 1].hist(df['sepalwidth'], bins=5)

axes[1, 0].set_title("Petal Length")
axes[1, 0].hist(df['petallength'], bins=6)

axes[1, 1].set_title("Petal Width")
axes[1, 1].hist(df['petalwidth'], bins=6)

plt.show()
Output:

Handling Correlation

Pandas dataframe.corr() is used to find the pairwise correlation of all columns in the dataframe. Any NA values are automatically excluded, and any non-numeric columns in the dataframe are ignored.

data = df.drop_duplicates(subset="class")
data.corr(method='pearson')

Output

Box Plots

We can use boxplots to see how a categorical value is distributed against other numerical values.

Example:
# importing packages
import seaborn as sns
import matplotlib.pyplot as plt

def graph(y):
    sns.boxplot(x="class", y=y, data=df)

plt.figure(figsize=(10, 10))

# Adding the subplot at the specified grid position
plt.subplot(221)
graph('sepallength')

plt.subplot(222)
graph('sepalwidth')

plt.subplot(223)
graph('petallength')

plt.subplot(224)
graph('petalwidth')

plt.show()
Output:

Handling Outliers
An outlier is a data item/object that deviates significantly from the rest of the (so-called normal) objects. Outliers can be caused by measurement or execution errors. The analysis for outlier detection is referred to as outlier mining. There are many ways to detect outliers, and the removal process is the same as removing a data item from the pandas dataframe.
Let's consider the iris dataset and plot the boxplot for the SepalWidthCm column.
Example:
# importing packages
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the dataset
df = pd.read_csv('iris_csv.csv')

sns.boxplot(x='sepalwidth', data=df)

Output:
Removing Outliers

For removing an outlier, one must follow the same process as removing an entry from the dataset, using its exact position in the dataset, because in all of the above detection methods the end result is the list of all data items that satisfy the outlier definition according to the method used.
Example: We will detect the outliers using the IQR and then remove them. We will also draw the boxplot to see if the outliers are removed or not.
import pandas as pd
import seaborn as sns
import numpy as np

# Load the dataset
df = pd.read_csv('iris_csv.csv')

# IQR
Q1 = np.percentile(df['sepalwidth'], 25, interpolation='midpoint')
Q3 = np.percentile(df['sepalwidth'], 75, interpolation='midpoint')
IQR = Q3 - Q1

print("Old Shape: ", df.shape)

# Upper bound
upper = np.where(df['sepalwidth'] >= (Q3 + 1.5 * IQR))
# Lower bound
lower = np.where(df['sepalwidth'] <= (Q1 - 1.5 * IQR))

# Removing the outliers
df.drop(upper[0], inplace=True)
df.drop(lower[0], inplace=True)

print("New Shape: ", df.shape)

sns.boxplot(x='sepalwidth', data=df)

Output:
Ex.No:4b: Various Commands on Data Frames

1. Pandas.DataFrame

pandas.DataFrame() is used to create a DataFrame in pandas. There are two ways to use this function. You can form a DataFrame column-wise by passing a dictionary into the pandas.DataFrame() function. Here, each key is a column, while the values are the rows:

import pandas
DataFrame = pandas.DataFrame({"A": [1, 3, 4], "B": [5, 9, 12]})
print(DataFrame)

   A   B
0  1   5
1  3   9
2  4  12

2. Read From and Write to Excel or CSV in pandas

You can read or write to Excel or CSV files with pandas.

import pandas as pd
df = pd.read_csv("iris_csv.csv")
print(df)

3. Get the Mean, Median, and Mode

We can also compute the central tendencies of each column in a DataFrame using pandas:

df.mean()

df.median()

df.mode()

4. DataFrame.transform

pandas' DataFrame.transform() modifies the values of a DataFrame. It accepts a function as an argument.

data = df.transform(lambda y: y * 3)
print(data)

5. DataFrame.isnull

This function returns a Boolean value and flags all rows containing null values as True:

df.isnull().sum()

sepallength 0
sepalwidth 0
petallength 0
petalwidth 0
class 0
dtype: int64

6. DataFrame.info

It returns the summary of non-missing values for each column:

df.info()

df.describe()

7. DataFrame.loc

loc is used to find the elements at a particular index. To view all items in the third row, for instance:

data = df.loc[2]
print(data)

8. DataFrame.max, min

Getting the maximum and minimum values using pandas is easy:

df.min()

df.max()

9. DataFrame.astype

The astype() function changes the data type of a particular column or DataFrame.

DataFrame.astype(str)

10. DataFrame.insert

The insert() function is used to add a new column to a DataFrame. It accepts three keywords: the column name, a list of its data, and its location, which is a column index.

DataFrame.insert(loc=0, column='C', value=[3, 4, 6])

print(DataFrame)

11. DataFrame.sum

The sum() function in pandas returns the sum of the values in each column; its cumulative counterpart is cumsum():

DataFrame.sum()

DataFrame.cumsum()
12. Correlation:

Want to find the correlation between integer or float columns? pandas can help you
achieve that using the corr() function.

DataFrame.corr()

13. DataFrame.add

The add() function is used to add a specific number to each value in a DataFrame. It works by iterating through the DataFrame and operating on each item.

DataFrame['A'].add(20)

14. DataFrame.sub

Like the addition function, you can also subtract a number from each value in a
DataFrame or specific column:

DataFrame['A'].sub(10)

15. DataFrame.mul

This is a multiplication version of the addition function of pandas:

DataFrame['A'].mul(10)

16. DataFrame.div

We can divide each data point in a column or DataFrame by a specific number:

DataFrame['A'].div(2)

17. DataFrame.std

Using the std() function, pandas also lets you compute the standard deviation for each column in a DataFrame. It works by iterating through each column in a dataset and calculating the standard deviation for each:

DataFrame.std()

18. DataFrame.melt

The melt() function in pandas flips the columns of a DataFrame into individual rows. It's like exposing the anatomy of a DataFrame, so it lets you view the value assigned to each column explicitly.

newDataFrame = DataFrame.melt()
print(newDataFrame)
19. DataFrame.pop

This function lets you remove a specified column from a pandas DataFrame. It accepts an item keyword, returns the popped column, and separates it from the rest of the DataFrame:

DataFrame.pop(item= 'B')

print(DataFrame)

20. DataFrame.dropna

The dropna() method removes all rows containing null values:

DataFrame.dropna(inplace=True)
print(DataFrame)

Inference:

Result:
Ex.No:5 Use the diabetes data set from UCI and Pima Indians Diabetes data set for performing the following: a) Univariate analysis: Frequency, Mean, Median, Mode, Variance, Standard Deviation, Skewness and Kurtosis

Aim:
To use the diabetes data set from UCI and the Pima Indians Diabetes data set for performing the following:
a) Univariate analysis: Frequency, Mean, Median, Mode, Variance, Standard Deviation, Skewness and Kurtosis
b) Bivariate analysis: Linear and logistic regression modeling
c) Multiple Regression analysis
d) Also compare the results of the above analysis for the two data sets.

Algorithm:

Step 1: Download the Pima Indians Diabetes dataset.
Step 2: Install the packages.
Step 3: Open PyCharm and type the following commands.
Step 4: The output will be displayed.

Read the diabetes data set:

import pandas as pd
df = pd.read_csv('diabetes.csv')
df.head()
df.shape

Output:
(768, 9)

df.dtypes
Output:
Pregnancies int64
Glucose int64
BloodPressure int64
SkinThickness int64
Insulin int64
BMI float64
DiabetesPedigreeFunction float64
Age int64
Outcome int64
dtype: object

df['Outcome']=df['Outcome'].astype('bool')
df.dtypes['Outcome']
Output:
dtype('bool')

df.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
Pregnancies                 768 non-null int64
Glucose                     768 non-null int64
BloodPressure               768 non-null int64
SkinThickness               768 non-null int64
Insulin                     768 non-null int64
BMI                         768 non-null float64
DiabetesPedigreeFunction    768 non-null float64
Age                         768 non-null int64
Outcome                     768 non-null bool
dtypes: bool(1), float64(2), int64(6)
memory usage: 48.9 KB

df.describe().T
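describe() covers count, mean, standard deviation and quartiles; for the remaining univariate measures named in the Aim (mode, variance, skewness, kurtosis and frequency), a minimal sketch on one column (the column choice is illustrative):

print(df['Glucose'].mean())                  # mean
print(df['Glucose'].median())                # median
print(df['Glucose'].mode()[0])               # mode (first modal value)
print(df['Glucose'].var())                   # variance
print(df['Glucose'].std())                   # standard deviation
print(df['Glucose'].skew())                  # skewness
print(df['Glucose'].kurtosis())              # kurtosis
print(df['Glucose'].value_counts().head())   # frequency of the most common values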

Pregnancy Proportion:

import numpy as np

preg_proportion = np.array(df['Pregnancies'].value_counts())
preg_month = np.array(df['Pregnancies'].value_counts().index)
preg_proportion_perc = np.array(np.round(preg_proportion / sum(preg_proportion), 3) * 100, dtype=int)

preg = pd.DataFrame({'month': preg_month,
                     'count_of_preg_prop': preg_proportion,
                     'percentage_proportion': preg_proportion_perc})
preg.set_index(['month'], inplace=True)
preg.head(10)

import seaborn as sns
import matplotlib.pyplot as plt

fig, axes = plt.subplots(nrows=3, ncols=2, dpi=120, figsize=(8, 6))

plot00 = sns.countplot('Pregnancies', data=df, ax=axes[0][0], color='green')
axes[0][0].set_title('Count', fontdict={'fontsize': 8})
axes[0][0].set_xlabel('Month of Preg.', fontdict={'fontsize': 7})
axes[0][0].set_ylabel('Count', fontdict={'fontsize': 7})
plt.tight_layout()

plot01 = sns.countplot('Pregnancies', data=df, hue='Outcome', ax=axes[0][1])
axes[0][1].set_title('Diab. VS Non-Diab.', fontdict={'fontsize': 8})
axes[0][1].set_xlabel('Month of Preg.', fontdict={'fontsize': 7})
axes[0][1].set_ylabel('Count', fontdict={'fontsize': 7})
plot01.axes.legend(loc=1)
plt.setp(axes[0][1].get_legend().get_texts(), fontsize='6')
plt.setp(axes[0][1].get_legend().get_title(), fontsize='6')
plt.tight_layout()

plot10 = sns.distplot(df['Pregnancies'], ax=axes[1][0])
axes[1][0].set_title('Pregnancies Distribution', fontdict={'fontsize': 8})
axes[1][0].set_xlabel('Pregnancy Class', fontdict={'fontsize': 7})
axes[1][0].set_ylabel('Freq/Dist', fontdict={'fontsize': 7})
plt.tight_layout()

plot11 = df[df['Outcome'] == False]['Pregnancies'].plot.hist(ax=axes[1][1], label='Non-Diab.')
plot11_2 = df[df['Outcome'] == True]['Pregnancies'].plot.hist(ax=axes[1][1], label='Diab.')
axes[1][1].set_title('Diab. VS Non-Diab.', fontdict={'fontsize': 8})
axes[1][1].set_xlabel('Pregnancy Class', fontdict={'fontsize': 7})
axes[1][1].set_ylabel('Freq/Dist', fontdict={'fontsize': 7})
plot11.axes.legend(loc=1)
plt.setp(axes[1][1].get_legend().get_texts(), fontsize='6')  # for legend text
plt.setp(axes[1][1].get_legend().get_title(), fontsize='6')  # for legend title
plt.tight_layout()

plot20 = sns.boxplot(df['Pregnancies'], ax=axes[2][0], orient='v')
axes[2][0].set_title('Pregnancies', fontdict={'fontsize': 8})
axes[2][0].set_xlabel('Pregnancy', fontdict={'fontsize': 7})
axes[2][0].set_ylabel('Five Point Summary', fontdict={'fontsize': 7})
plt.tight_layout()

plot21 = sns.boxplot(x='Outcome', y='Pregnancies', data=df, ax=axes[2][1])
axes[2][1].set_title('Diab. VS Non-Diab.', fontdict={'fontsize': 8})
axes[2][1].set_xlabel('Pregnancy', fontdict={'fontsize': 7})
axes[2][1].set_ylabel('Five Point Summary', fontdict={'fontsize': 7})
plt.xticks(ticks=[0, 1], labels=['Non-Diab.', 'Diab.'], fontsize=7)
plt.tight_layout()
plt.show()

Understanding Distribution
The distribution of Pregnancies in the data is unimodal and skewed to the right, centered at about 1 with most of the data between 0 and 15, a range of roughly 15, and outliers present on the higher end.
Glucose Variable
df.Glucose.describe()
Output:
count 768.000000
mean 120.894531
std 31.972618
min 0.000000
25% 99.000000
50% 117.000000
75% 140.250000
max 199.000000
Name: Glucose, dtype: float64
#sns.set_style('darkgrid')
fig, axes = plt.subplots(nrows=2, ncols=2, dpi=120, figsize=(8, 6))

plot00 = sns.distplot(df['Glucose'], ax=axes[0][0], color='green')
#axes[0][0].yaxis.set_major_formatter(FormatStrFormatter('%.3f'))
axes[0][0].set_title('Distribution of Glucose', fontdict={'fontsize': 8})
axes[0][0].set_xlabel('Glucose Class', fontdict={'fontsize': 7})
axes[0][0].set_ylabel('Count/Dist.', fontdict={'fontsize': 7})
plt.tight_layout()

plot01 = sns.distplot(df[df['Outcome'] == False]['Glucose'], ax=axes[0][1], color='green', label='Non Diab.')
sns.distplot(df[df.Outcome == True]['Glucose'], ax=axes[0][1], color='red', label='Diab')
axes[0][1].set_title('Distribution of Glucose', fontdict={'fontsize': 8})
axes[0][1].set_xlabel('Glucose Class', fontdict={'fontsize': 7})
axes[0][1].set_ylabel('Count/Dist.', fontdict={'fontsize': 7})
#axes[0][1].yaxis.set_major_formatter(FormatStrFormatter('%.3f'))
plot01.axes.legend(loc=1)
plt.setp(axes[0][1].get_legend().get_texts(), fontsize='6')
plt.setp(axes[0][1].get_legend().get_title(), fontsize='6')
plt.tight_layout()

plot10 = sns.boxplot(df['Glucose'], ax=axes[1][0], orient='v')
axes[1][0].set_title('Numerical Summary', fontdict={'fontsize': 8})
axes[1][0].set_xlabel('Glucose', fontdict={'fontsize': 7})
axes[1][0].set_ylabel('Five Point Summary (Glucose)', fontdict={'fontsize': 7})
plt.tight_layout()

plot11 = sns.boxplot(x='Outcome', y='Glucose', data=df, ax=axes[1][1])
axes[1][1].set_title('Numerical Summary (Outcome)', fontdict={'fontsize': 8})
axes[1][1].set_ylabel('Five Point Summary (Glucose)', fontdict={'fontsize': 7})
plt.xticks(ticks=[0, 1], labels=['Non-Diab.', 'Diab.'], fontsize=7)
axes[1][1].set_xlabel('Category', fontdict={'fontsize': 7})
plt.tight_layout()

plt.show()
Understanding Distribution
The distribution of Glucose level among patients is unimodal and roughly bell shaped, centered at about 115 with most of the data between 90 and 140, a range of roughly 150, and outliers present on the lower end (Glucose == 0).

To verify the distribution, keep only the non-zero entries of Glucose:

fig, axes = plt.subplots(nrows=1, ncols=2, dpi=120, figsize=(8, 4))

plot0 = sns.distplot(df[df['Glucose'] != 0]['Glucose'], ax=axes[0], color='green')
#axes[0].yaxis.set_major_formatter(FormatStrFormatter('%.3f'))
axes[0].set_title('Distribution of Glucose', fontdict={'fontsize': 8})
axes[0].set_xlabel('Glucose Class', fontdict={'fontsize': 7})
axes[0].set_ylabel('Count/Dist.', fontdict={'fontsize': 7})
plt.tight_layout()

plot1 = sns.boxplot(df[df['Glucose'] != 0]['Glucose'], ax=axes[1], orient='v')
axes[1].set_title('Numerical Summary', fontdict={'fontsize': 8})
axes[1].set_xlabel('Glucose', fontdict={'fontsize': 7})
axes[1].set_ylabel('Five Point Summary (Glucose)', fontdict={'fontsize': 7})
plt.tight_layout()

Blood Pressure variable

df.BloodPressure.describe()

Output:
count 768.000000
mean 69.105469
std 19.355807
min 0.000000
25% 62.000000
50% 72.000000
75% 80.000000
max 122.000000
Name: BloodPressure, dtype: float64
from matplotlib.ticker import FormatStrFormatter

fig, axes = plt.subplots(nrows=2, ncols=2, dpi=120, figsize=(8, 6))

plot00 = sns.distplot(df['BloodPressure'], ax=axes[0][0], color='green')
axes[0][0].yaxis.set_major_formatter(FormatStrFormatter('%.3f'))
axes[0][0].set_title('Distribution of BP', fontdict={'fontsize': 8})
axes[0][0].set_xlabel('BP Class', fontdict={'fontsize': 7})
axes[0][0].set_ylabel('Count/Dist.', fontdict={'fontsize': 7})
plt.tight_layout()

plot01 = sns.distplot(df[df['Outcome'] == False]['BloodPressure'], ax=axes[0][1], color='green', label='Non Diab.')
sns.distplot(df[df.Outcome == True]['BloodPressure'], ax=axes[0][1], color='red', label='Diab')
axes[0][1].set_title('Distribution of BP', fontdict={'fontsize': 8})
axes[0][1].set_xlabel('BP Class', fontdict={'fontsize': 7})
axes[0][1].set_ylabel('Count/Dist.', fontdict={'fontsize': 7})
axes[0][1].yaxis.set_major_formatter(FormatStrFormatter('%.3f'))
plot01.axes.legend(loc=1)
plt.setp(axes[0][1].get_legend().get_texts(), fontsize='6')
plt.setp(axes[0][1].get_legend().get_title(), fontsize='6')
plt.tight_layout()

plot10 = sns.boxplot(df['BloodPressure'], ax=axes[1][0], orient='v')
axes[1][0].set_title('Numerical Summary', fontdict={'fontsize': 8})
axes[1][0].set_xlabel('BP', fontdict={'fontsize': 7})
axes[1][0].set_ylabel('Five Point Summary (BP)', fontdict={'fontsize': 7})
plt.tight_layout()

plot11 = sns.boxplot(x='Outcome', y='BloodPressure', data=df, ax=axes[1][1])
axes[1][1].set_title('Numerical Summary (Outcome)', fontdict={'fontsize': 8})
axes[1][1].set_ylabel('Five Point Summary (BP)', fontdict={'fontsize': 7})
plt.xticks(ticks=[0, 1], labels=['Non-Diab.', 'Diab.'], fontsize=7)
axes[1][1].set_xlabel('Category', fontdict={'fontsize': 7})
plt.tight_layout()

plt.show()
Understanding Distribution
The distribution of BloodPressure among patients is unimodal (it is not bimodal, because BP = 0 does not make any sense and is an outlier) and bell shaped, centered at about 65 with most of the data between 60 and 90, a range of roughly 100, and outliers present on the lower end (BP == 0).

5b) Bivariate analysis: Linear and logistic regression modeling

import os
import pandas as pd
import random
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

os.chdir("C:/Users/Administrator/Desktop/DS")
df = pd.read_csv('diabetes.csv')
df.head()

sns.scatterplot(x=df.DiabetesPedigreeFunction, y=df.Glucose)
plt.ylim(0, 20000)

sns.scatterplot(x=df.BMI, y=df.Age)
plt.ylim(0, 20000)

sns.scatterplot(x=df.BloodPressure, y=df.Glucose)
plt.ylim(0, 20000)

plt.figure(figsize=(12, 8))
sns.kdeplot(data=df, x=df.Glucose, hue=df.Outcome, fill=True)
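The plots above explore the bivariate relationships; for the regression models named in the heading, a minimal sketch using scikit-learn (an assumption: scikit-learn is installed, and the column choices are illustrative):

from sklearn.linear_model import LinearRegression, LogisticRegression

# Linear regression: predict Glucose from BMI
X = df[['BMI']].values
y = df['Glucose'].values
lin = LinearRegression().fit(X, y)
print("slope:", lin.coef_[0], "intercept:", lin.intercept_)

# Logistic regression: predict the diabetes Outcome from Glucose
Xc = df[['Glucose']].values
yc = df['Outcome'].astype(int).values
log = LogisticRegression(max_iter=1000).fit(Xc, yc)
print("training accuracy:", log.score(Xc, yc))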
5c) Multiple Regression analysis

Algorithm:
Step 1: Import Libraries
Step 2: Import dataset.
Step 3: Define x and y.
Step 4: Train the model on the training set.
Step 5: Predict the test set results.
Step 6: Evaluate the model.
Step 7: Plot the results.
df.isnull().values.any()

Output:
False

## Counting cells with 0 values for each variable and publishing the counts below

(df.Pregnancies == 0).sum(), (df.Glucose == 0).sum(), (df.BloodPressure == 0).sum(), \
(df.SkinThickness == 0).sum(), (df.Insulin == 0).sum(), (df.BMI == 0).sum(), \
(df.DiabetesPedigreeFunction == 0).sum(), (df.Age == 0).sum()

Output:
(111, 5, 35, 227, 374, 11, 0, 0)

drop_Glu = df.index[df.Glucose == 0].tolist()
drop_BP = df.index[df.BloodPressure == 0].tolist()
drop_Skin = df.index[df.SkinThickness == 0].tolist()
drop_Ins = df.index[df.Insulin == 0].tolist()
drop_BMI = df.index[df.BMI == 0].tolist()
c = drop_Glu + drop_BP + drop_Skin + drop_Ins + drop_BMI
dia = df.drop(df.index[c])
dia.info()

Output:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 392 entries, 3 to 765
Data columns (total 9 columns):
 #  Column                    Non-Null Count  Dtype
 0  Pregnancies               392 non-null    int64
 1  Glucose                   392 non-null    int64
 2  BloodPressure             392 non-null    int64
 3  SkinThickness             392 non-null    int64
 4  Insulin                   392 non-null    int64
 5  BMI                       392 non-null    float64
 6  DiabetesPedigreeFunction  392 non-null    float64
 7  Age                       392 non-null    int64
 8  Outcome                   392 non-null    bool
dtypes: bool(1), float64(2), int64(6)
memory usage: 27.9 KB

cor = dia.corr(method='pearson')
cor

Output:

sns.heatmap(cor)
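The algorithm above also calls for training, predicting and evaluating a model; a minimal multiple-regression sketch on the cleaned dia frame (an assumption: scikit-learn is installed, and the feature choice is illustrative):

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Predict Glucose from several predictors (multiple regression)
features = ['BMI', 'Age', 'BloodPressure', 'Insulin']
X = dia[features].values
y = dia['Glucose'].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = LinearRegression().fit(X_train, y_train)

y_pred = model.predict(X_test)
print("Coefficients:", dict(zip(features, model.coef_)))
print("R^2 on the test set:", r2_score(y_test, y_pred))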

Inference:

Result:
.
Ex.No:6 Apply and explore various plotting functions
on UCI data sets

Aim:
To apply and explore various plotting functions on UCI data sets: Normal curves, Density and contour plots, Correlation and scatter plots, Histograms and Three dimensional plotting.

a. Normal curves
A probability distribution is a statistical function that describes the likelihood of obtaining the possible values that a random variable can take. By this, we mean the range of values that a parameter can take when we randomly pick up values from it. If we were asked to pick one adult randomly and asked what his/her height would be (assuming gender does not affect height), there is no way to know what the height will be. But if we have the distribution of heights of adults in the city, we can bet on the most probable outcome. A Normal Distribution is also known as a Gaussian distribution or, famously, the Bell Curve. People use both terms interchangeably; they mean the same thing. It is a continuous probability distribution.

Algorithm:

Step 1: Download the Heart dataset from Kaggle.
Step 2: Save it in Downloads or any other folder and install the packages.
Step 3: Apply the following commands on the dataset.
Step 4: The output will be displayed.

Code:
import numpy as np
import matplotlib.pyplot as plt

# Creating a series of data in the range 1-50.
x = np.linspace(1, 50, 200)

# Probability density function of the normal distribution.
def normal_dist(x, mean, sd):
    prob_density = (1 / (sd * np.sqrt(2 * np.pi))) * np.exp(-0.5 * ((x - mean) / sd) ** 2)
    return prob_density

# Calculate mean and standard deviation.
mean = np.mean(x)
sd = np.std(x)

# Apply the function to the data.
pdf = normal_dist(x, mean, sd)

# Plotting the results
plt.plot(x, pdf, color='red')
plt.xlabel('Data points')
plt.ylabel('Probability Density')
plt.show()

b. Density and contour plots

Contour plots, also called level plots, are a tool for doing multivariate analysis and visualizing 3-D plots in 2-D space. If we consider X and Y as the variables we want to plot, then the response Z will be plotted as slices on the X-Y plane, due to which contours are sometimes referred to as Z-slices or iso-responses.

Contour plots are widely used to visualize density, altitudes, or heights of mountains, as well as in the meteorological department. Due to such wide usage, matplotlib.pyplot provides a method contour() to make it easy for us to draw contour plots.

Code:
import matplotlib.pyplot as plt
import numpy as np

feature_x = np.arange(0, 50, 2)
feature_y = np.arange(0, 50, 3)

# Creating 2-D grid of features
[X, Y] = np.meshgrid(feature_x, feature_y)

fig, ax = plt.subplots(1, 1)

Z = np.cos(X / 2) + np.sin(Y / 4)

# plots contour lines
ax.contour(X, Y, Z)

ax.set_title('Contour Plot')
ax.set_xlabel('feature_x')
ax.set_ylabel('feature_y')

plt.show()
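The heading covers density plots as well; a minimal kernel-density sketch with seaborn (the random sample is an illustrative assumption):

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# 1-D density estimate of normally distributed samples
np.random.seed(0)
sample = np.random.randn(1000)
sns.kdeplot(sample, fill=True)
plt.title('Density Plot')
plt.show()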

c. Correlation and scatter plots

Correlation means an association; it is a measure of the extent to which two variables are related.

1. Positive Correlation: When two variables increase together and decrease together, they are positively correlated. '1' is a perfect positive correlation. For example, demand and profit are positively correlated: the more the demand for the product, the more the profit.

2. Negative Correlation: When one variable increases while the other variable decreases, and vice-versa, they are negatively correlated. For example, if the distance between magnets increases, their attraction decreases, and vice-versa. '-1' is a perfect negative correlation.

3. Zero Correlation (No Correlation): When two variables don't seem to be linked at all. '0' means no correlation. For example, the amount of tea you drink and your level of intelligence.

Code:
import pandas as pd
import seaborn as sns

con = pd.read_csv('concrete.csv')
con

list(con.columns)
con.head()

con['cement'] = con['cement'].astype('category')
con.describe(include='category')

sns.scatterplot(x="water", y="coarseagg", data=con)

ax = sns.scatterplot(x="water", y="coarseagg", data=con)
ax.set_title("Concrete Strength vs. Fly ash")
ax.set_xlabel("coarseagg")

sns.lmplot(x="water", y="coarseagg", data=con)

d. Histograms:
A histogram is basically used to represent data provided in the form of some groups. It is an accurate method for the graphical representation of numerical data distribution. It is a type of bar plot where the X-axis represents the bin ranges while the Y-axis gives information about frequency.

Creating a Histogram
To create a histogram, the first step is to create bins of the ranges, then distribute the whole range of the values into a series of intervals, and count the values which fall into each of the intervals. Bins are clearly identified as consecutive, non-overlapping intervals of variables. The matplotlib.pyplot.hist() function is used to compute and create a histogram of x.

Code:
from matplotlib import pyplot as plt
import numpy as np

# Creating dataset
a = np.array([22, 87, 5, 43, 56,
              73, 55, 54, 11,
              20, 51, 5, 79, 31, 27])

# Creating histogram
fig, ax = plt.subplots(figsize=(10, 7))
ax.hist(a, bins=[0, 25, 50, 75, 100])

# Show plot
plt.show()

Code:
import matplotlib.pyplot as plt
import numpy as np
from matplotlib import colors
from matplotlib.ticker import PercentFormatter

# Creating dataset
np.random.seed(23685752)
N_points = 10000
n_bins = 20

# Creating distribution
x = np.random.randn(N_points)
y = .8 ** x + np.random.randn(10000) + 25

# Creating histogram
fig, axs = plt.subplots(1, 1, figsize=(10, 7), tight_layout=True)
axs.hist(x, bins=n_bins)

# Show plot
plt.show()

e. Three dimensional plotting

Matplotlib was introduced keeping in mind only two-dimensional plotting. But at the time when the 1.0 release occurred, the 3d utilities were developed upon the 2d ones, and thus we have 3d implementations of data available today! The 3d plots are enabled by importing the mplot3d toolkit. In this section, we will deal with 3d plots using matplotlib.

Code:
from mpl_toolkits import mplot3d
import numpy as np
import matplotlib.pyplot as plt

fig = plt.figure()

# syntax for 3-D projection
ax = plt.axes(projection='3d')

# defining axes
z = np.linspace(0, 1, 100)
x = z * np.sin(25 * z)
y = z * np.cos(25 * z)
c = x + y

# syntax for plotting
ax.scatter(x, y, z, c=c)
ax.set_title('3d Scatter plot')
plt.show()

Inference:

Result:
Ex.No:7 Visualizing Geographic Data with Basemap

Aim:

To visualize Geographic Data with Basemap.

One common type of visualization in data science is that of geographic data. Matplotlib's main tool for this type of visualization is the Basemap toolkit, which is one of several Matplotlib toolkits that live under the mpl_toolkits namespace. Admittedly, Basemap feels a bit clunky to use, and often even simple visualizations take much longer to render than you might hope. More modern solutions such as Leaflet or the Google Maps API may be a better choice for more intensive map visualizations. Still, Basemap is a useful tool for Python users to have in their virtual toolbelts. In this section, we'll show several examples of the type of map visualization that is possible with this toolkit.

Algorithm:
Step 1: Install Basemap. If a zip file is downloaded, extract the original file.
Step 2: import packages.
Step 3: Save that in downloads or any other Folder.
Step 4: Apply these following Commands.
Step 5: The output will display.

Code:
from mpl_toolkits.basemap import Basemap
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(8, 8))
m = Basemap(projection='lcc', resolution=None,
            width=8E6, height=8E6,
            lat_0=45, lon_0=-100)
m.etopo(scale=0.5, alpha=0.5)

# Map (long, lat) to (x, y) for plotting
x, y = m(-122.3, 47.6)
plt.plot(x, y, 'ok', markersize=5)
plt.text(x, y, ' Seattle', fontsize=12)

from mpl_toolkits.basemap import Basemap
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(12, 12))
m = Basemap()
m.drawcoastlines()
m.drawcoastlines(linewidth=1.0, linestyle='dashed', color='red')
plt.title("Coastlines", fontsize=20)
plt.show()

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import geopandas as gpd
import shapefile as shp
from shapely.geometry import Point

sns.set_style('whitegrid')

fp = r'Maps_with_python\india-polygon.shp'
map_df = gpd.read_file(fp)
map_df_copy = gpd.read_file(fp)
map_df.plot(markersize=5)
Inference:

Result :
Ex. No.: 8 Correlation coefficient

Aim:
To write a python program to compute correlation coefficient.

ALGORITHM
Step 1: Start the program.
Step 2: Import the math package.
Step 3: Define the correlation coefficient function.
Step 4: Calculate the correlation using the formula.
Step 5: Print the result.
Step 6: Stop the process.
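The formula used in Step 4 is Pearson's correlation coefficient:

r = (n·ΣXY − ΣX·ΣY) / √((n·ΣX² − (ΣX)²) · (n·ΣY² − (ΣY)²))

where n is the number of observations and the sums run over the paired values of X and Y; the program below computes exactly these sums.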

Program:
# Python program to find the correlation coefficient.
import math

# Function that returns the correlation coefficient.
def correlationCoefficient(X, Y, n):
    sum_X = 0
    sum_Y = 0
    sum_XY = 0
    squareSum_X = 0
    squareSum_Y = 0
    i = 0
    while i < n:
        # sum of elements of array X.
        sum_X = sum_X + X[i]

        # sum of elements of array Y.
        sum_Y = sum_Y + Y[i]

        # sum of X[i] * Y[i].
        sum_XY = sum_XY + X[i] * Y[i]

        # sum of squares of array elements.
        squareSum_X = squareSum_X + X[i] * X[i]
        squareSum_Y = squareSum_Y + Y[i] * Y[i]

        i = i + 1

    # Use the formula for calculating the correlation coefficient.
    corr = float(n * sum_XY - sum_X * sum_Y) / \
           float(math.sqrt((n * squareSum_X - sum_X * sum_X) *
                           (n * squareSum_Y - sum_Y * sum_Y)))
    return corr

# Driver code
X = [15, 18, 21, 24, 27]
Y = [25, 25, 27, 31, 32]

# Find the size of the array.
n = len(X)

# Function call to correlationCoefficient.
print('{0:.6f}'.format(correlationCoefficient(X, Y, n)))

Output :
0.953463
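As a quick cross-check (assuming NumPy is installed), np.corrcoef gives the same value:

import numpy as np

X = [15, 18, 21, 24, 27]
Y = [25, 25, 27, 31, 32]
print(np.corrcoef(X, Y)[0, 1])   # ≈ 0.953463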

Inference:

Result:
Ex. No.: 9 Simple Linear Regression

Aim:
To write a python program for Simple Linear Regression

ALGORITHM

Step 1: Start the Program


Step 2: Import numpy and matplotlib package
Step 3: Define coefficient function
Step 4: Calculate cross-deviation and deviation about x
Step 5: Calculate regression coefficients
Step 6: Plot the Linear regression and define main function
Step 7: Print the result
Step 8: Stop the process
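The coefficients in Steps 4 and 5 come from the least-squares formulas:

b₁ = SS_xy / SS_xx, where SS_xy = Σxy − n·x̄·ȳ and SS_xx = Σx² − n·x̄²
b₀ = ȳ − b₁·x̄

Here x̄ and ȳ are the sample means; the program below computes exactly these quantities.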

Program:
import numpy as np
import matplotlib.pyplot as plt

def estimate_coef(x, y):
    # number of observations/points
    n = np.size(x)

    # mean of x and y vector
    m_x = np.mean(x)
    m_y = np.mean(y)

    # calculating cross-deviation and deviation about x
    SS_xy = np.sum(y * x) - n * m_y * m_x
    SS_xx = np.sum(x * x) - n * m_x * m_x

    # calculating regression coefficients
    b_1 = SS_xy / SS_xx
    b_0 = m_y - b_1 * m_x

    return (b_0, b_1)

def plot_regression_line(x, y, b):
    # plotting the actual points as scatter plot
    plt.scatter(x, y, color="m", marker="o", s=30)

    # predicted response vector
    y_pred = b[0] + b[1] * x

    # plotting the regression line
    plt.plot(x, y_pred, color="g")

    # putting labels
    plt.xlabel('x')
    plt.ylabel('y')

    # function to show plot
    plt.show()

def main():
    # observations / data
    x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
    y = np.array([1, 3, 2, 5, 7, 8, 8, 9, 10, 12])

    # estimating coefficients
    b = estimate_coef(x, y)
    print("Estimated coefficients:\nb_0 = {} \
    \nb_1 = {}".format(b[0], b[1]))

    # plotting regression line
    plot_regression_line(x, y, b)

if __name__ == "__main__":
    main()

Output :

Estimated coefficients:
b_0 = -0.0586206896552
b_1 = 1.45747126437

Graph:

Inference:

Result:
