
IPS ACADEMY, INSTITUTE OF ENGINEERING AND SCIENCE, INDORE

COMPUTER SCIENCE & ENGINEERING (IoT) DEPARTMENT

LAB MANUAL
(2024-25)

Introduction to Data Science


(PCC-CIOT701)

Name:

Year: Semester:
Class Roll No.: Enrollment No.:


CONTENTS

1. Vision & Mission of the Institute
2. Vision & Mission of the Department
3. PEOs
4. POs
5. COs
6. PSOs
7. Laboratory Regulations and Safety Rules
8. Index
9. Experiments


Vision of the Institute

To be the fountainhead of novel ideas & innovations in science & technology & persist to
be a foundation of pride for all Indians.

Mission of the Institute

• To provide value-based, broad education in Engineering, Technology and Science, where students are urged to develop their professional skills.

• To inculcate dedication, hard work, sincerity, integrity and ethics in building up the overall professional personality of our students and faculty.

• To inculcate a spirit of entrepreneurship and innovation in passing-out students.

• To instigate sponsored research and provide consultancy services in technical, educational and industrial areas.


Vision of the Department


Attaining global recognition in computer science and engineering education, research and
training to meet the growing needs of the industry and society.

Mission of the Department

Provide quality undergraduate and postgraduate education, in both the theoretical and applied
foundations of computer science, and train students to effectively apply this education to solve
real-world problems, thus amplifying their potential for lifelong high-quality careers.


Program Education Objectives (PEOs)

1. To prepare students for successful careers in the software industry that meet the needs of Indian and multinational companies.

2. To develop the skills among students to analyze real-world problems and implement computer engineering solutions, including in multidisciplinary projects.

3. To provide students with a solid foundation in mathematical, scientific and engineering fundamentals to solve engineering problems and to pursue higher studies.

4. To develop the ability to work with the core competences of computer science & engineering, i.e. software engineering, hardware structure & networking concepts, so that one can find feasible solutions to real-world problems.

5. To instill in students a professional and ethical attitude, effective communication skills, teamwork skills, a multidisciplinary approach, and an ability to relate engineering issues to the broader social context.

6. To motivate students toward perseverance in lifelong learning and to introduce them to professional ethics and codes of professional practice.


Program Outcomes (POs)

PO1. Engineering knowledge: Apply the knowledge of mathematics, science, engineering fundamentals, and an engineering specialization to the solution of complex engineering problems.

PO2. Problem analysis: Identify, formulate, research literature, and analyze complex engineering problems reaching substantiated conclusions using first principles of mathematics, natural sciences, and engineering sciences.

PO3. Design/development of solutions: Design solutions for complex engineering problems and design system components or processes that meet the specified needs with appropriate consideration for the public health and safety, and the cultural, societal, and environmental considerations.

PO4. Conduct investigations of complex problems: Use research-based knowledge and research methods including design of experiments, analysis and interpretation of data, and synthesis of the information to provide valid conclusions.

PO5. Modern tool usage: Create, select, and apply appropriate techniques, resources, and modern engineering and IT tools including prediction and modeling to complex engineering activities with an understanding of the limitations.

PO6. The engineer and society: Apply reasoning informed by the contextual knowledge to assess societal, health, safety, legal and cultural issues and the consequent responsibilities relevant to the professional engineering practice.

PO7. Environment and sustainability: Understand the impact of the professional engineering solutions in societal and environmental contexts, and demonstrate the knowledge of, and need for, sustainable development.

PO8. Ethics: Apply ethical principles and commit to professional ethics and responsibilities and norms of the engineering practice.

PO9. Individual and team work: Function effectively as an individual, and as a member or leader in diverse teams, and in multidisciplinary settings.

PO10. Communication: Communicate effectively on complex engineering activities with the engineering community and with society at large, such as being able to comprehend and write effective reports and design documentation, make effective presentations, and give and receive clear instructions.


PO11. Project management and finance: Demonstrate knowledge and understanding of the engineering and management principles and apply these to one's own work, as a member and leader in a team, to manage projects and in multidisciplinary environments.

PO12. Life-long learning: Recognize the need for, and have the preparation and ability to engage in independent and life-long learning in the broadest context of technological change.


Course Outcomes (COs)


1. Demonstrate proficiency in the statistical analysis of data.
2. Build and assess data-based models.
3. Execute statistical analyses with professional statistical software.
4. Demonstrate skill in data management.
5. Apply data science concepts and methods to solve problems in real-world contexts, and communicate these solutions effectively.


Program Specific Outcomes (PSOs)

1. Apply concepts in core areas of computer science - Data Structures, Database Management Systems, Operating Systems, Computer Networks and Software Engineering - to solve technical issues.
2. Apply software engineering principles and practices to provide software solutions.
3. Ability to work in a team and apply the knowledge acquired to develop new real-life systems, and to adapt to the societal needs of the future.


Laboratory Regulations and Safety Rules

1. Do not enter the laboratory without prior permission.
2. While entering the lab, students should wear their ID cards.
3. Students should come in proper uniform.
4. Students should not use mobile phones inside the laboratory.
5. Students should sign the LOGIN REGISTER before entering the laboratory.
6. Students should bring their observation and record notebooks to the laboratory.
7. Do not change any computer settings.
8. Students should maintain silence inside the laboratory.
9. After completing the laboratory exercise, make sure to SHUT DOWN the system properly.


INDEX

S.No | Experiment | Date | Signature

1. Intro to R tool for Data Analytics (CO-1)
2. Basic Statistics & Visualization in R (CO-3)
3. To understand K-Means Clustering (CO-2)
4. Design a program to implement pyplot from matplotlib & visualize a dataframe (CO-1)
5. Understanding Linear Regression (CO-2)
6. Apply pre-processing techniques on the Boston Housing data set using Python with various operations; use clustering techniques. (CO-5)
7. Understanding Naive Bayes Classifier (CO-4)
8. Understanding Decision Trees (CO-4)
9. Apply various operations and techniques on the Haberman Cancer Survival data set using Python libraries. (CO-5)
10. Apply various operations and techniques on the Titanic data set. (CO-1)



EXPERIMENT – 1

AIM: Introduction to R tool for Data Analytics

R is an open-source programming language that is widely used as a statistical software and data analysis tool. R generally comes with a command-line interface and is available on widely used platforms such as Windows, Linux, and macOS.
It was designed by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand, and is currently developed by the R Development Core Team. R is an implementation of the S programming language, combined with lexical scoping semantics inspired by Scheme.

The project was conceived in 1992, with an initial version released in 1995 and a stable beta version in 2000.

Why R Programming Language?

 R programming is used as a leading tool for machine learning, statistics, and data analysis. Objects, functions, and packages can easily be created in R.
 It is a platform-independent language, which means it can be used on all operating systems.
 It is an open-source, free language, so anyone can install it in any organization without purchasing a license.
 R is not only a statistics package; it also integrates with other languages (C, C++). Thus, you can easily interact with many data sources and statistical packages.
 The R programming language has a vast community of users, and it is growing day by day.
 R is currently one of the most requested programming languages in the data science job market.

Features of R Programming Language

Statistical Features of R:
 Basic Statistics: The most common basic statistics are the mean, mode, and median, together known as "measures of central tendency." Using the R language, we can measure central tendency very easily.
 Static graphics: R is rich with facilities for creating and developing interesting static graphics. R contains functionality for many plot types including graphic maps, mosaic plots, biplots, and more.
 Probability distributions: Probability distributions play a vital role in statistics, and using R we can easily handle various types of probability distributions such as the Binomial distribution, Normal distribution, Chi-squared distribution and many more.
 Data analysis: R provides a large, coherent and integrated collection of tools for data analysis.

Programming Features of R:
 R Packages: One of the major features of R is its wide availability of libraries. R has CRAN (the Comprehensive R Archive Network), a repository holding more than 10,000 packages.
 Distributed Computing: Distributed computing is a model in which components of a software system are shared among multiple computers to improve efficiency and performance. Two packages for distributed programming in R, ddR and multidplyr, were released in November 2015.

Programming in R:
Since R is syntactically similar to other widely used languages, it is easy to code and learn. R programs can be written in any of the widely used IDEs such as RStudio, Rattle, Tinn-R, etc. After writing the program, save the file with the extension .r. To run the program, use the following command on the command line:
Rscript file_name.r

Advantages of R:
 R is the most comprehensive statistical analysis package; new technologies and concepts often appear first in R.
 The R programming language is open source, so you can run R anywhere and at any time.



 R is suitable for the GNU/Linux and Windows operating systems.
 R programming is cross-platform and runs on any operating system.
 In R, everyone is welcome to provide new packages, bug fixes, and code enhancements.

Disadvantages of R:
 In the R programming language, the standard of some packages is less than perfect.
 R commands give little thought to memory management, so R may consume all available memory.
 In R, there is essentially nobody to complain to if something doesn't work.
 R is much slower than other programming languages such as Python and MATLAB.

Applications of R:
 We use R for data science. It gives us a broad variety of statistics-related libraries and provides an environment for statistical computing and design.
 R is used by many quantitative analysts as a programming tool. It helps in data importing and cleaning.
 R is a prevalent language among data analysts and research programmers, and it is used as a fundamental tool in finance.
 Tech giants like Google, Facebook, Bing, Twitter, Accenture, and Wipro use R nowadays.



VIVA QUESTIONS

Q1) List some of the functions that R provides.

Ans. Functions that R provides include:

 Mean
 Median
 Distribution
 Covariance
 Regression
 Non-linear
 Mixed Effects
 GLM
 GAM, etc.

Q2) Explain how you can start the R Commander GUI.


Ans. Typing the command library("Rcmdr") into the R console starts the R Commander GUI.



EXPERIMENT – 2

AIM: Basic Statistics and Visualization in R.

The popular data visualization tools that are available are Tableau, Plotly, R, Google Charts, Infogram,
and Kibana. The various data visualization platforms have different capabilities, functionality, and use
cases.

They also require different skill sets. This experiment discusses the use of R for data visualization.

R is a language that is designed for statistical computing, graphical data analysis, and scientific research. It
is usually preferred for data visualization as it offers flexibility and minimum required coding through its
packages.

Consider the following airquality data set for visualization in R:



Types of Data Visualizations

Some of the various types of visualizations offered by R are:

Bar Plot

There are two types of bar plots, horizontal and vertical, which represent data points as horizontal or vertical bars of lengths proportional to the value of the data item. They are generally used for plotting continuous and categorical variables. By setting the horiz parameter to TRUE or FALSE, we get horizontal or vertical bar plots respectively.

# Horizontal bar plot for
# ozone concentration in air
data(airquality)
barplot(airquality$Ozone,
        main = 'Ozone Concentration in air',
        xlab = 'ozone levels', horiz = TRUE)

Output:



# Vertical bar plot for
# ozone concentration in air
barplot(airquality$Ozone, main = 'Ozone Concentration in air',
        xlab = 'ozone levels', col = 'blue', horiz = FALSE)

Output:

Histogram

A histogram is like a bar chart as it uses bars of varying height to represent data distribution. However, in a
histogram values are grouped into consecutive intervals called bins. In a Histogram, continuous values are
grouped and displayed in these bins whose size can be varied.

data(airquality)

hist(airquality$Temp,
     main = "La Guardia Airport's Maximum Temperature (Daily)",
     xlab = "Temperature (Fahrenheit)",
     xlim = c(50, 125), col = "yellow",
     freq = TRUE)

Output



For a histogram, the parameter xlim can be used to specify the interval within which all values are to be displayed.
Another parameter, freq, when set to TRUE denotes the frequency of the various values in the histogram; when set to FALSE, probability densities are represented on the y-axis such that the total area of the histogram adds up to one.

Box Plot
The statistical summary of the given data is presented graphically using a boxplot. A boxplot depicts
information like the minimum and maximum data point, the median value, first and third quartile, and
interquartile range.

data(airquality)

boxplot(airquality$Wind,
        main = "Average wind speed at La Guardia Airport",
        xlab = "Miles per hour", ylab = "Wind",
        col = "orange", border = "brown",
        horizontal = TRUE, notch = TRUE)

Output



Scatter Plot

A scatter plot is composed of many points on a Cartesian plane. Each point denotes the value taken by two
parameters and helps us easily identify the relationship between them.

# Scatter plot for ozone concentration per month
data(airquality)

plot(airquality$Ozone, airquality$Month,
     main = "Scatterplot Example",
     xlab = "Ozone Concentration in parts per billion",
     ylab = "Month of observation", pch = 19)

Output



VIVA QUESTIONS

Q 1) Explain RStudio.

Ans. RStudio is an integrated development environment which allows us to interact with R more readily. RStudio is similar to the standard RGui, but it is considered more user-friendly. This IDE has various drop-down menus, windows with multiple tabs, and many customization options. The first time we open RStudio, we see three windows; the fourth window is hidden by default.

Q 2) What are the advantages of R?

Ans. The advantages are:

 It is used for managing and manipulating data.
 No license restrictions.
 Free and open-source software.
 The graphical capabilities of R are good.
 It runs on many operating systems and different hardware, including 32- and 64-bit processors.



EXPERIMENT – 3

AIM: To understand K-Means Clustering.

K-Means Clustering Algorithm

K-Means Clustering is an unsupervised learning algorithm that is used to solve clustering problems in machine learning and data science. In this topic, we will learn what the K-means clustering algorithm is and how the algorithm works, along with a Python implementation of K-means clustering.

What is K-Means Algorithm?

K-Means Clustering is an unsupervised learning algorithm which groups an unlabeled dataset into different clusters. Here K defines the number of pre-defined clusters that need to be created in the process: if K=2, there will be two clusters; for K=3, there will be three clusters; and so on.

It is an iterative algorithm that divides the unlabeled dataset into k different clusters in such a way that each data point belongs to only one group of points with similar properties.

It allows us to cluster the data into different groups and is a convenient way to discover the categories of groups in an unlabeled dataset on its own, without the need for any training.

It is a centroid-based algorithm, where each cluster is associated with a centroid. The main aim of this
algorithm is to minimize the sum of distances between the data point and their corresponding clusters.

The algorithm takes the unlabeled dataset as input, divides the dataset into k clusters, and repeats the process until it finds the best clusters. The value of k should be predetermined in this algorithm.

The k-means clustering algorithm mainly performs two tasks:

o Determines the best value for the K center points or centroids by an iterative process.

o Assigns each data point to its closest k-center. The data points which are near a particular k-center create a cluster.

Hence each cluster has data points with some commonalities, and it is away from other clusters.

The below diagram explains the working of the K-means Clustering Algorithm:



How does the K-Means Algorithm Work?

The working of the K-Means algorithm is explained in the below steps:

Step-1: Select the number K to decide the number of clusters.

Step-2: Select random K points or centroids. (It can be other from the input dataset).

Step-3: Assign each data point to their closest centroid, which will form the predefined K clusters.

Step-4: Calculate the variance and place a new centroid of each cluster.

Step-5: Repeat the third step, which means reassign each data point to the new closest centroid of each cluster.

Step-6: If any reassignment occurs, then go to step-4 else go to FINISH.

Step-7: The model is ready.

Let's understand the above steps by considering the visual plots:

Suppose we have two variables M1 and M2. The x-y axis scatter plot of these two variables is given below:

o Let's take the number k of clusters, i.e., K=2, to identify the dataset and to put the points into different clusters. It means here we will try to group these data points into two different clusters.



o We need to choose some random k points or centroids to form the clusters. These points can be either points from the dataset or any other points. So, here we are selecting the below two points as k points, which are not part of our dataset. Consider the below image:

o Now we will assign each data point of the scatter plot to its closest K-point or centroid. We will compute it by applying some mathematics to calculate the distance between two points. So, we will draw a median between both centroids. Consider the below image:

From the above image, it is clear that points on the left side of the line are near the K1 or blue centroid, and points to the right of the line are close to the yellow centroid. Let's color them blue and yellow for clear visualization.



o As we need to find the closest cluster, we will repeat the process by choosing a new centroid. To choose the new centroids, we will compute the center of gravity of these centroids, and will find new centroids as below:

o Next, we will reassign each data point to the new centroid. For this, we will repeat the same process of finding a median line. The median will be like the below image:



From the above image, we can see one yellow point is on the left side of the line, and two blue points are to the right of the line. So, these three points will be assigned to new centroids.

As reassignment has taken place, we will again go to step-4, which is finding new centroids or K-points.

o We will repeat the process by finding the center of gravity of the centroids, so the new centroids will be as shown in the below image:



o As we got the new centroids, we will again draw the median line and reassign the data points. So, the image will be:

o We can see in the above image that there are no dissimilar data points on either side of the line, which means our model is formed. Consider the below image:



As our model is ready, we can now remove the assumed centroids, and the two final clusters will be as shown in the below image:
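The same procedure can be run in Python, as mentioned at the start of this experiment. Below is a minimal sketch using scikit-learn's KMeans on a small synthetic two-variable dataset standing in for M1 and M2 (the data, K=2, and random seeds are illustrative assumptions, not part of the original experiment):

import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the two variables M1 and M2 above
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(loc=[2, 2], scale=0.5, size=(20, 2)),
               rng.normal(loc=[7, 7], scale=0.5, size=(20, 2))])

# K = 2 as in the walkthrough above
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

print("Final centroids:\n", kmeans.cluster_centers_)
print("Cluster sizes:", np.bincount(labels))

fit_predict repeats the assignment and centroid-update steps described above until the centroids stop moving, then returns the cluster label of each point.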

VIVA QUESTIONS

Q.1) What is K-Means clustering?

Ans. K-Means Clustering is an unsupervised learning algorithm which groups an unlabeled dataset into different clusters. Here K defines the number of pre-defined clusters that need to be created in the process: if K=2, there will be two clusters; for K=3, there will be three clusters; and so on.

Q.2) How does the K-Means algorithm work?

Ans. The working of the K-Means algorithm is explained in the below steps:

Step-1: Select the number K to decide the number of clusters.

Step-2: Select random K points or centroids. (It can be other from the input dataset).

Step-3: Assign each data point to their closest centroid, which will form the predefined K clusters.

Step-4: Calculate the variance and place a new centroid of each cluster.

Step-5: Repeat the third step, which means reassign each data point to the new closest centroid of each cluster.

Step-6: If any reassignment occurs, then go to step-4 else go to FINISH.

Step-7: The model is ready.

EXPERIMENT – 4

AIM: Design a program to import pyplot from Matplotlib and visualize our DataFrame.

Program:

import pandas as pd              # importing necessary libraries
import matplotlib.pyplot as plt

# creating a dataframe
df = pd.DataFrame({
    'name': ['rohit', 'rahul', 'virat', 'shreyas', 'rishabh', 'ravindra',
             'shardul', 'axar', 'harshal', 'yuzi', 'bumrah'],
    'age':  [34, 29, 33, 27, 24, 33, 30, 27, 31, 31, 28],
    'test': [43, 40, 96, 0, 25, 56, 4, 4, 0, 0, 24],
    'odi':  [227, 38, 254, 21, 18, 168, 15, 38, 0, 56, 67],
    't20':  [119, 55, 91, 29, 39, 50, 22, 15, 0, 50, 51]
})
df['limited_overs'] = df['odi'] + df['t20']   # derived column
print(df)

Fig. 4.1 – This is our created raw dataframe



# to draw a scatterplot
df.plot(kind='scatter', x='test', y='limited_overs', color='red')
plt.show()

Fig. 4.2 – Scatterplot of test vs limited overs

df['age'].hist()  # to plot a histogram using visualization tools

Fig. 4.3 – Histogram of age vs frequency



# plotting a double line graph
ax = plt.gca()
df.plot(kind='line', x='name', y='test', ax=ax)
df.plot(kind='line', x='name', y='limited_overs', color='red', ax=ax)
plt.show()

Fig. 4.4 – Our line graph showing each player's experience.



VIVA QUESTIONS

Q.1) What is matplotlib?


Ans. Matplotlib is a Python library extensively used to plot graphs based on dataframes. Various kinds of plots can be drawn using the Matplotlib library.

Q.2) Which matplotlib operations have been performed in the above program?


Ans. In the above program, we’ve plotted a histogram based on age of players, a line graph based on players’
experience & a scatterplot based on the test & limited overs cricket played by the players.

Q.3) According to the above analysis, which player is the most experienced and which is the least experienced in cricket?

Ans. According to our analysis, Virat Kohli is the most experienced player, while Harshal Patel is the least experienced player, who is yet to make his debut.



EXPERIMENT – 5

AIM: Understanding Linear Regression

Linear Regression in Machine Learning

Linear regression is one of the easiest and most popular Machine Learning algorithms. It is a statistical
method that is used for predictive analysis. Linear regression makes predictions for continuous/real or
numeric variables such as sales, salary, age, product price, etc.

The linear regression algorithm shows a linear relationship between a dependent (y) variable and one or more independent (x) variables, hence it is called linear regression. Since linear regression shows a linear relationship, it finds how the value of the dependent variable changes according to the value of the independent variable.

The linear regression model provides a sloped straight line representing the relationship between the
variables. Consider the below image:

Mathematically, we can represent a linear regression as:

y = a0 + a1x + ε

Here,

Y = dependent variable (target variable)

X = independent variable (predictor variable)



a0 = intercept of the line (gives an additional degree of freedom)
a1 = linear regression coefficient (scale factor to each input value)
ε = random error

The values for x and y variables are training datasets for Linear Regression model representation.

Types of Linear Regression

Linear regression can be further divided into two types of algorithm:

o Simple Linear Regression:
If a single independent variable is used to predict the value of a numerical dependent variable, then such a linear regression algorithm is called Simple Linear Regression.

o Multiple Linear Regression:
If more than one independent variable is used to predict the value of a numerical dependent variable, then such a linear regression algorithm is called Multiple Linear Regression.

Linear Regression Line

A straight line showing the relationship between the dependent and independent variables is called a regression line. A regression line can show two types of relationship:

o Positive Linear Relationship:
If the dependent variable increases on the Y-axis as the independent variable increases on the X-axis, then such a relationship is termed a positive linear relationship.

o Negative Linear Relationship:
If the dependent variable decreases on the Y-axis as the independent variable increases on the X-axis, then such a relationship is called a negative linear relationship.

Finding the best fit line:

When working with linear regression, our main goal is to find the best-fit line, which means that the error between the predicted values and actual values should be minimized. The best-fit line will have the least error.

Different values for the weights or coefficients of the line (a0, a1) give different lines of regression, so we need to calculate the best values for a0 and a1 to find the best-fit line. To calculate this, we use the cost function.

Cost function:
o The different values for the weights or coefficients of the line (a0, a1) give different lines of regression, and the cost function is used to estimate the values of the coefficients for the best-fit line.
o The cost function optimizes the regression coefficients or weights. It measures how a linear regression model is performing.
o We can use the cost function to find the accuracy of the mapping function, which maps the input variable to the output variable. This mapping function is also known as the Hypothesis function.

For Linear Regression, we use the Mean Squared Error (MSE) cost function, which is the average of the squared errors between the predicted values and actual values. It can be written as:

MSE = (1/N) Σi (Yi - (a1xi + a0))²

Where,

N = total number of observations
Yi = actual value
(a1xi + a0) = predicted value

Residuals: The distance between an actual value and a predicted value is called a residual. If the observed points are far from the regression line, the residuals will be high, and so the cost function will be high. If the scatter points are close to the regression line, the residuals will be small, and hence the cost function will be small.

Gradient Descent:
o Gradient descent is used to minimize the MSE by calculating the gradient of the cost function.
o A regression model uses gradient descent to update the coefficients of the line by reducing the cost function.
o It is done by randomly selecting starting values for the coefficients and then iteratively updating the values to reach the minimum cost function, as in the sketch below.
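A minimal sketch of this loop for simple linear regression in Python with NumPy (the training data, learning rate, and iteration count are illustrative assumptions):

import numpy as np

# Illustrative data: y is roughly 4 + 3x plus noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 2, size=100)
y = 4 + 3 * x + rng.normal(scale=0.5, size=100)

a0, a1 = 0.0, 0.0           # arbitrary starting coefficients
lr, n_iters = 0.1, 1000     # assumed learning rate and iteration count

for _ in range(n_iters):
    error = (a0 + a1 * x) - y            # predicted minus actual
    a0 -= lr * 2 * error.mean()          # gradient of MSE with respect to a0
    a1 -= lr * 2 * (error * x).mean()    # gradient of MSE with respect to a1

print(f"a0 = {a0:.3f}, a1 = {a1:.3f}")   # should approach 4 and 3
print(f"MSE = {(((a0 + a1 * x) - y) ** 2).mean():.4f}")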

Model Performance:

The goodness of fit determines how well the line of regression fits the set of observations. The process of finding the best model out of various models is called optimization. It can be achieved by the method below:

1. R-squared method:

o R-squared is a statistical method that determines the goodness of fit.
o It measures the strength of the relationship between the dependent and independent variables on a scale of 0-100%.
o A high value of R-squared indicates less difference between the predicted values and actual values, and hence represents a good model.
o It is also called the coefficient of determination, or the coefficient of multiple determination for multiple regression.
o It can be calculated from the formula below:

R-squared = Explained variation / Total variation = 1 - (Residual sum of squares / Total sum of squares)
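Continuing the gradient-descent sketch above, R-squared can be computed directly from this definition (the variables x, y, a0 and a1 carry over from that sketch):

# R-squared = 1 - (residual sum of squares / total sum of squares)
y_pred = a0 + a1 * x
ss_res = ((y - y_pred) ** 2).sum()
ss_tot = ((y - y.mean()) ** 2).sum()
print(f"R-squared = {1 - ss_res / ss_tot:.3f}")  # near 1 for a good fit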



Assumptions of Linear Regression

Below are some important assumptions of Linear Regression. These are formal checks to perform while building a Linear Regression model, which ensure the best possible result from the given dataset.

o Linear relationship between the features and target:
Linear regression assumes a linear relationship between the dependent and independent variables.
o Small or no multicollinearity between the features:
Multicollinearity means high correlation between the independent variables. Due to multicollinearity, it may be difficult to find the true relationship between the predictors and the target variable; or, we can say, it is difficult to determine which predictor variable is affecting the target variable and which is not. So, the model assumes either little or no multicollinearity between the features or independent variables.
o Homoscedasticity assumption:
Homoscedasticity is a situation in which the error term is the same for all values of the independent variables. With homoscedasticity, there should be no clear pattern in the distribution of data in the scatter plot.
o Normal distribution of error terms:
Linear regression assumes that the error terms follow a normal distribution. If the error terms are not normally distributed, then confidence intervals will become either too wide or too narrow, which may cause difficulties in finding coefficients. This can be checked using a q-q plot (see the sketch after this list): if the plot shows a straight line without any deviation, the errors are normally distributed.
o No autocorrelation:
The linear regression model assumes no autocorrelation in the error terms. Any correlation in the error terms will drastically reduce the accuracy of the model. Autocorrelation usually occurs if there is a dependency between residual errors.
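For example, a q-q plot of the error terms can be drawn with SciPy's probplot (a sketch that reuses the fitted coefficients from the earlier gradient-descent block; SciPy and Matplotlib are assumed dependencies):

import matplotlib.pyplot as plt
from scipy import stats

residuals = y - (a0 + a1 * x)   # error terms of the fitted line
stats.probplot(residuals, dist="norm", plot=plt)  # sample quantiles vs normal quantiles
plt.title("Q-Q plot of residuals")
plt.show()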



VIVA QUESTIONS

Q1) Is Overfitting a possibility with linear regression?

Ans. Yes, Overfitting is possible even with linear regression. This happens when multiple linear
regression is used to fit an extremely high-degree polynomial. When the parameters of such a model
are learned, they will fit too closely to the training data, fitting even the noise, and thereby fail to
generalize on test data.

Q2) Is it necessary to remove outliers? Why or why not?

Ans. Yes, it is necessary to remove outliers as they can have a huge impact on the model's predictions. Take, for instance, plots 3 and 4 of Anscombe's quartet. It is apparent from those plots that outliers cause a significant change in the best-fit line in comparison to what it would have been in their absence.



EXPERIMENT – 6

AIM: Apply pre-processing techniques on the Boston Housing data set using Python with various operations; use clustering techniques.

Program:

import numpy as np
import pandas as pd
import visuals as vs  # supplementary course code (assumed to be provided with the lab)
from sklearn.model_selection import ShuffleSplit  # cross_validation was renamed in newer scikit-learn

# Pretty display for notebooks
%matplotlib inline

# Load the Boston housing dataset
data = pd.read_csv('housing.csv')
prices = data['MEDV']
features = data.drop('MEDV', axis=1)

# Success
print('Boston housing dataset has {0} data points with {1} variables each'.format(*data.shape))

Fig. 6.1 – Output for the above query

# TODO: Minimum price of the data
minimum_price = np.min(prices)
# Alternative using pandas
# minimum_price = prices.min()

# TODO: Maximum price of the data
maximum_price = np.max(prices)
# Alternative using pandas
# maximum_price = prices.max()

# TODO: Mean price of the data
mean_price = np.mean(prices)
# Alternative using pandas
# mean_price = prices.mean()

# TODO: Median price of the data
median_price = np.median(prices)
# Alternative using pandas
# median_price = prices.median()

# TODO: Standard deviation of prices of the data
std_price = np.std(prices)
# Alternative using pandas
# std_price = prices.std(ddof=0)

# There are other statistics you can calculate too, like quartiles
first_quartile = np.percentile(prices, 25)
third_quartile = np.percentile(prices, 75)
inter_quartile = third_quartile - first_quartile

# Show the calculated statistics
print("Statistics for Boston housing dataset:\n")
print("Minimum price: ${:,.2f}".format(minimum_price))
print("Maximum price: ${:,.2f}".format(maximum_price))
print("Mean price: ${:,.2f}".format(mean_price))
print("Median price: ${:,.2f}".format(median_price))
print("Standard deviation of prices: ${:,.2f}".format(std_price))
print("First quartile of prices: ${:,.2f}".format(first_quartile))
print("Third quartile of prices: ${:,.2f}".format(third_quartile))
print("Interquartile (IQR) of prices: ${:,.2f}".format(inter_quartile))



Fig. 6.2 – Statistics related to the dataset

# Using pyplot
import matplotlib.pyplot as plt
plt.figure(figsize=(20, 5))

# i: index
for i, col in enumerate(features.columns):
    # 3 plots here, hence 1, 3
    plt.subplot(1, 3, i + 1)
    x = data[col]
    y = prices
    plt.plot(x, y, 'o')
    # Create regression line
    plt.plot(np.unique(x), np.poly1d(np.polyfit(x, y, 1))(np.unique(x)))
    plt.title(col)
    plt.xlabel(col)
    plt.ylabel('prices')

Fig. 6.3 – Graphs plotted pertaining to our dataset


VIVA QUESTIONS

Q.1) Define clustering.

Ans. Clustering is the task of dividing the population or data points into a number of groups such that data points in the same group are more similar to each other than to those in other groups. In simple words, the aim is to segregate groups with similar traits and assign them into clusters.

Q.2) List some different types of clustering.

Ans. There are more than 100 known clustering algorithms. A few of the popularly used ones are:

• Density-Based Methods: These methods consider the clusters as dense regions having some similarities among themselves and differences from the lower-density regions of the space. These methods have good accuracy and the ability to merge two clusters. Examples: DBSCAN (Density-Based Spatial Clustering of Applications with Noise), OPTICS (Ordering Points To Identify Clustering Structure), etc.
• Hierarchical Based Methods: The clusters formed in this method form a tree-type structure based on the hierarchy. New clusters are formed using the previously formed ones. It is divided into two categories:
• Agglomerative (bottom-up approach)
• Divisive (top-down approach)

Q.3) Write some applications of clustering.

Ans. Applications of clustering in different fields:
• Marketing: It can be used to characterize and discover customer segments for marketing purposes.
• Biology: It can be used for classification among different species of plants and animals.
• Libraries: It is used for clustering different books on the basis of topics and information.
• Insurance: It is used to acknowledge customers and their policies and to identify frauds.



EXPERIMENT – 7

AIM: Understanding Naïve Bayes Classifier.

Naïve Bayes Classifier Algorithm

o The Naïve Bayes algorithm is a supervised learning algorithm, based on Bayes' theorem and used for solving classification problems.
o It is mainly used in text classification with high-dimensional training datasets.
o The Naïve Bayes classifier is one of the simplest and most effective classification algorithms; it helps in building fast machine learning models that can make quick predictions.
o It is a probabilistic classifier, which means it predicts on the basis of the probability of an object.
o Some popular applications of the Naïve Bayes algorithm are spam filtering, sentiment analysis, and classifying articles.

Why is it called Naïve Bayes?

The Naïve Bayes algorithm is comprised of two words, Naïve and Bayes, which can be described as:

o Naïve: It is called naïve because it assumes that the occurrence of a certain feature is independent of the occurrence of other features. For example, if a fruit is identified on the basis of color, shape, and taste, then a red, spherical, and sweet fruit is recognized as an apple. Each feature individually contributes to identifying it as an apple, without depending on the others.
o Bayes: It is called Bayes because it depends on the principle of Bayes' theorem.

Bayes' Theorem:

o Bayes' theorem is also known as Bayes' rule or Bayes' law. It is used to determine the probability of a hypothesis with prior knowledge, and it depends on conditional probability.
o The formula for Bayes' theorem is given as:

P(A|B) = P(B|A) · P(A) / P(B)

Where,

P(A|B) is the Posterior probability: the probability of hypothesis A given the observed event B.

P(B|A) is the Likelihood probability: the probability of the evidence given that hypothesis A is true.

P(A) is the Prior probability: the probability of the hypothesis before observing the evidence.

P(B) is the Marginal probability: the probability of the evidence.

Working of Naïve Bayes' Classifier:

Working of Naïve Bayes' Classifier can be understood with the help of the below example:

Suppose we have a dataset of weather conditions and a corresponding target variable "Play". Using this dataset, we need to decide whether we should play or not on a particular day according to the weather conditions. To solve this problem, we need to follow the steps below:

1. Convert the given dataset into frequency tables.
2. Generate a likelihood table by finding the probabilities of the given features.
3. Now, use Bayes' theorem to calculate the posterior probability.

Problem: If the weather is sunny, should the player play or not?

Solution: To solve this, first consider the dataset: 14 days of weather conditions with the target Play (Yes/No).



Frequency table for the weather conditions (the Sunny row and the totals are recovered from the calculation below; the remaining rows of the original table are not reproduced here):

Weather | Yes | No
Sunny   | 3   | 2
Total   | 10  | 4

Likelihood table for the weather condition Sunny:

P(Sunny) = 5/14 ≈ 0.35, P(Sunny|Yes) = 3/10 = 0.3, P(Sunny|No) = 2/4 = 0.5



Applying Bayes' theorem:

P(Yes|Sunny) = P(Sunny|Yes) * P(Yes) / P(Sunny)

P(Sunny|Yes) = 3/10 = 0.3

P(Sunny) = 0.35

P(Yes) = 0.71

So P(Yes|Sunny) = 0.3 * 0.71 / 0.35 = 0.60

P(No|Sunny) = P(Sunny|No) * P(No) / P(Sunny)

P(Sunny|No) = 2/4 = 0.5

P(No) = 0.29

P(Sunny) = 0.35

So P(No|Sunny) = 0.5 * 0.29 / 0.35 = 0.41

As we can see from the above calculation, P(Yes|Sunny) > P(No|Sunny).

Hence on a sunny day, the player can play the game.
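The same arithmetic can be scripted. A short Python sketch using the counts recovered above (3 sunny days with Play = Yes and 2 with Play = No, out of 10 Yes and 4 No days in 14 total):

# Counts from the frequency table above
total, n_yes, n_no = 14, 10, 4
sunny_yes, sunny_no = 3, 2

p_sunny = (sunny_yes + sunny_no) / total    # P(Sunny) = 5/14 ≈ 0.35
p_yes, p_no = n_yes / total, n_no / total   # P(Yes) ≈ 0.71, P(No) ≈ 0.29

p_yes_given_sunny = (sunny_yes / n_yes) * p_yes / p_sunny
p_no_given_sunny = (sunny_no / n_no) * p_no / p_sunny

print(f"P(Yes|Sunny) = {p_yes_given_sunny:.2f}")  # 0.60
print(f"P(No|Sunny)  = {p_no_given_sunny:.2f}")   # 0.40 exactly; the 0.41 above comes from rounded intermediates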

Advantages of Naïve Bayes Classifier:

o Naïve Bayes is one of the fastest and easiest ML algorithms for predicting the class of a dataset.
o It can be used for binary as well as multi-class classification.
o It performs well in multi-class predictions as compared to the other algorithms.
o It is the most popular choice for text classification problems.



Disadvantages of Naïve Bayes Classifier:
o Naïve Bayes assumes that all features are independent or unrelated, so it cannot learn relationships between features.

Applications of Naïve Bayes Classifier:

o It is used for credit scoring.
o It is used in medical data classification.
o It can be used in real-time predictions because the Naïve Bayes classifier is an eager learner.
o It is used in text classification, such as spam filtering and sentiment analysis.

Types of Naïve Bayes Model:

There are three types of Naive Bayes Model, which are given below:

o Gaussian: The Gaussian model assumes that features follow a normal distribution. This means that if predictors take continuous values instead of discrete ones, the model assumes these values are sampled from the Gaussian distribution.
o Multinomial: The Multinomial Naïve Bayes classifier is used when the data is multinomially distributed. It is primarily used for document classification problems, i.e., determining which category a particular document belongs to, such as Sports, Politics, or Education. The classifier uses the frequencies of words as the predictors.
o Bernoulli: The Bernoulli classifier works similarly to the Multinomial classifier, but the predictor variables are independent Boolean variables, such as whether a particular word is present or not in a document. This model is also well known for document classification tasks.
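All three variants are available in scikit-learn as GaussianNB, MultinomialNB, and BernoulliNB. A minimal sketch on synthetic data matched to each model's assumptions (the data shapes and values are invented for illustration):

import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=200)                        # two classes

X_cont = rng.normal(size=(200, 3)) + y[:, None]         # continuous features -> Gaussian
X_counts = rng.poisson(3, size=(200, 3)) + y[:, None]   # word-count features -> Multinomial
X_bool = (X_counts > 3).astype(int)                     # presence/absence features -> Bernoulli

for model, X in [(GaussianNB(), X_cont),
                 (MultinomialNB(), X_counts),
                 (BernoulliNB(), X_bool)]:
    print(type(model).__name__, "training accuracy:", model.fit(X, y).score(X, y))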



VIVA QUESTIONS

Q1) What is Naive Bayes?


Ans. Naive Bayes is based on the mathematical concept of the Bayes theorem as the name suggests. It is a
collection of multiple algorithms which are based on the common idea of using Bayes theorem. These
algorithms assume that all the predictors are independent of each other and do not affect each other. All the
features contribute independently in calculating the probability for that class. That’s why it’s called naive.

It is a supervised classification algorithm. Naive Bayes also assumes that all the features have an equal effect on
the outcome.

Q2) How does Naive Bayes work?


Ans. It calculates two kinds of probabilities: the prior probability for each class, and the conditional probability for each class given each feature value. All these probabilities are calculated from the training data; after training, new data points can be classified using Bayes' theorem. Naive Bayes can also be trained in a semi-supervised manner using a mixture of labeled and unlabeled data.



EXPERIMENT – 8

AIM: Understanding Decision Tree

Decision Tree
o A Decision Tree is a supervised learning technique that can be used for both classification and regression problems, but mostly it is preferred for solving classification problems. It is a tree-structured classifier, where internal nodes represent the features of a dataset, branches represent the decision rules, and each leaf node represents the outcome.
o In a decision tree, there are two kinds of nodes: the Decision Node and the Leaf Node. Decision nodes are used to make decisions and have multiple branches, whereas leaf nodes are the outputs of those decisions and do not contain any further branches.
o The decisions or tests are performed on the basis of the features of the given dataset.
o It is a graphical representation for getting all the possible solutions to a problem/decision based on given conditions.
o It is called a decision tree because, similar to a tree, it starts with the root node, which expands into further branches and constructs a tree-like structure.
o In order to build a tree, we use the CART algorithm, which stands for Classification And Regression Tree algorithm.
o A decision tree simply asks a question, and based on the answer (Yes/No), it further splits the tree into subtrees.
o The diagram below explains the general structure of a decision tree:

Note: A decision tree can contain categorical data (YES/NO) as well as numeric data.



Why use Decision Trees?

There are various algorithms in machine learning, so choosing the best algorithm for the given dataset and problem is the main point to remember while creating a machine learning model. Below are two reasons for using a decision tree:

o Decision trees usually mimic human thinking ability while making a decision, so they are easy to understand.
o The logic behind a decision tree can be easily understood because it shows a tree-like structure.

Decision Tree Terminologies

Root Node: Root node is from where the decision tree starts. It represents the entire dataset, which further
gets divided into two or more homogeneous sets.

Leaf Node: Leaf nodes are the final output node, and the tree cannot be segregated further after getting a
leaf node.

Splitting: Splitting is the process of dividing the decision node/root node into sub-nodes according to the
given conditions.



Branch/Sub Tree: A tree formed by splitting the tree.

Pruning: Pruning is the process of removing the unwanted branches from the tree.

Parent/Child node: The root node of the tree is called the parent node, and other nodes are called the child
nodes.

How does the Decision Tree algorithm Work?

In a decision tree, for predicting the class of the given dataset, the algorithm starts from the root node of the
tree. This algorithm compares the values of root attribute with the record (real dataset) attribute and, based on
the comparison, follows the branch and jumps to the next node.

For the next node, the algorithm again compares the attribute value with the other sub-nodes and moves further. It continues the process until it reaches a leaf node of the tree. The complete process can be better understood using the algorithm below:

o Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
o Step-2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM).
o Step-3: Divide S into subsets that contain possible values for the best attribute.
o Step-4: Generate the decision tree node which contains the best attribute.
o Step-5: Recursively make new decision trees using the subsets of the dataset created in Step-3. Continue this process until a stage is reached where you cannot classify the nodes further; the final nodes are leaf nodes.

Example: Suppose there is a candidate who has a job offer and wants to decide whether to accept the offer or not. To solve this problem, the decision tree starts with the root node (the Salary attribute, chosen by ASM). The root node splits further into the next decision node (distance from the office) and one leaf node based on the corresponding labels. The next decision node further splits into one decision node (cab facility) and one leaf node. Finally, the decision node splits into two leaf nodes (Accepted offer and Declined offer). Consider the below diagram:
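In scikit-learn, this procedure is implemented by DecisionTreeClassifier (a CART implementation). A minimal sketch on a toy encoding of the job-offer example; the feature encoding and labels below are invented for illustration:

from sklearn.tree import DecisionTreeClassifier, export_text

# Toy encoding of the example: [salary_ok, near_office, cab_facility]
X = [[1, 1, 1], [1, 1, 0], [1, 0, 1], [1, 0, 0], [0, 1, 1], [0, 0, 0]]
y = [1, 1, 1, 0, 0, 0]   # 1 = accept the offer, 0 = decline (invented labels)

tree = DecisionTreeClassifier(criterion="entropy")  # split using information gain
tree.fit(X, y)
print(export_text(tree, feature_names=["salary_ok", "near_office", "cab_facility"]))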



Attribute Selection Measures

While implementing a decision tree, the main issue that arises is how to select the best attribute for the root node and for the sub-nodes. To solve such problems there is a technique called the Attribute Selection Measure, or ASM. With this measurement, we can easily select the best attribute for the nodes of the tree. There are two popular techniques for ASM:

o Information Gain
o Gini Index

1. Information Gain:
o Information gain is the measurement of the change in entropy after the segmentation of a dataset based on an attribute.
o It calculates how much information a feature provides us about a class.
o According to the value of information gain, we split the node and build the decision tree.
o A decision tree algorithm always tries to maximize the value of information gain, and the node/attribute having the highest information gain is split first. It can be calculated using the formula below:

Information Gain = Entropy(S) - [(Weighted Avg) × Entropy(each feature)]

Entropy: Entropy is a metric to measure the impurity in a given attribute. It specifies randomness in data. Entropy can be calculated as:

Entropy(S) = -P(yes)·log2 P(yes) - P(no)·log2 P(no)

Where,

o S = total number of samples
o P(yes) = probability of yes
o P(no) = probability of no

2. Gini Index:
o The Gini index is a measure of impurity or purity used while creating a decision tree in the CART (Classification And Regression Tree) algorithm.
o An attribute with a low Gini index should be preferred over one with a high Gini index.
o It only creates binary splits, and the CART algorithm uses the Gini index to create binary splits.
o The Gini index can be calculated using the formula below:

Gini Index = 1 - Σj (Pj)²
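Both measures are straightforward to compute. A short NumPy sketch that evaluates entropy, the Gini index, and the information gain of one binary split (the parent node and the split are invented for illustration):

import numpy as np

def entropy(labels):
    # Entropy(S) = -sum(p * log2(p)) over the class proportions
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def gini(labels):
    # Gini = 1 - sum(p_j ** 2)
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1 - (p ** 2).sum()

s = np.array(["yes"] * 9 + ["no"] * 5)   # parent node: 9 yes, 5 no
left, right = s[:8], s[8:]               # an invented binary split
weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(s)

print("Entropy(S):", round(entropy(s), 3))
print("Gini(S):", round(gini(s), 3))
print("Information Gain:", round(entropy(s) - weighted, 3))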

Pruning: Getting an Optimal Decision tree

Pruning is a process of deleting the unnecessary nodes from a tree in order to get the optimal decision tree.

A too-large tree increases the risk of overfitting, while a small tree may not capture all the important features of the dataset. A technique that decreases the size of the learning tree without reducing accuracy is known as pruning. There are mainly two types of tree pruning technology used:

o Cost Complexity Pruning
o Reduced Error Pruning

Advantages of the Decision Tree

o It is simple to understand, as it follows the same process a human follows while making a decision in real life.
o It can be very useful for solving decision-related problems.
o It helps in thinking about all the possible outcomes of a problem.
o There is less requirement for data cleaning compared to other algorithms.

Disadvantages of the Decision Tree

o A decision tree can contain many layers, which makes it complex.
o It may have an overfitting issue, which can be resolved using the Random Forest algorithm.
o For more class labels, the computational complexity of the decision tree may increase.



VIVA QUESTIONS

Q1) What is the Decision Tree Algorithm?

Ans. A Decision Tree is a supervised machine learning algorithm that can be used for both Regression and
Classification problem statements. It divides the complete dataset into smaller subsets while at the same time
an associated Decision Tree is incrementally developed.

The final output of the Decision Trees is a Tree having Decision nodes and leaf nodes. A Decision Tree
can operate on both categorical and numerical data.

Q2) Explain the CART Algorithm for Decision Trees.

Ans. CART stands for Classification And Regression Trees. It is a greedy algorithm that greedily searches for an optimum split at the top level, then repeats the same process at each of the subsequent levels.

It does not verify whether a split will lead to the lowest possible impurity several levels down. The solution provided by the greedy algorithm is not guaranteed to be optimal, but it often produces a solution that is reasonably good, since finding the optimal tree is an NP-Complete problem that requires exponential time complexity.

This makes the problem intractable even for small training sets, which is why we must settle for a "reasonably good" solution instead of an optimal one.



EXPERIMENT – 9

AIM: Apply various operations and techniques on the Haberman Cancer Survival data set using Python libraries.

Program:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

sns.set()

# read/load the csv file, providing some headers for the file
cancer_df = pd.read_csv('../input/haberman.csv', header=None,
                        names=['age', 'year_of_treatment', 'positive_lymph_nodes',
                               'survival_status_after_5_years'])
print(cancer_df.head())

cancer_df['survival_status_after_5_years'] = cancer_df['survival_status_after_5_years'].map({1: "yes", 2: "no"})
cancer_df['survival_status_after_5_years'] = cancer_df['survival_status_after_5_years'].astype('category')
print(cancer_df.head())



print("Number of rows: " + str(cancer_df.shape[0])) print("Number of columns: " + str(cancer_df.shape[1]))
print("Columns: " + ", ".join(cancer_df.columns))

print("Target variable distribution") print(cancer_df.iloc[:,-1].value_counts()) print("*"*50)


print(cancer_df.iloc[:,-1].value_counts(normalize = True)) #printing some details about

the dataset



fig, axes = plt.subplots(1, 3, figsize=(15, 5))
for idx, feature in enumerate(list(cancer_df.columns)[:-1]):
    sns.boxplot(x='survival_status_after_5_years', y=feature, data=cancer_df, ax=axes[idx])
plt.show()

# plotting boxplots related to the dataset



VIVA QUESTIONS

Q.1) What is a boxplot?

Ans. A boxplot is a graph that gives a good indication of how the values in the data are spread out. Although boxplots may seem primitive in comparison to a histogram or density plot, they have the advantage of taking up less space, which is useful when comparing distributions between many groups or datasets.

Q.2) What is the use of seaborn library?

Ans. Seaborn is an open-source Python library built on top of matplotlib. It is used for data visualization and exploratory data analysis. Seaborn works easily with dataframes and the Pandas library, and the graphs created can be customized easily. Below are a few benefits of data visualization:

 Graphs can help us find data trends that are useful in any machine learning or forecasting project.
 Graphs make it easier to explain your data to non-technical people.
 Visually attractive graphs can make presentations and reports much more appealing to the reader.



EXPERIMENT – 10

AIM: Apply various operations and techniques on the Titanic data set.

Program:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

a = pd.read_csv('titanic.csv')  # assumed file name; the original program omits the load step

# pie chart of passengers by port of embarkation
b = a.loc[a['Embarked'] == 'Q']
c = a.loc[a['Embarked'] == 'S']
d = a.loc[a['Embarked'] == 'C']
label = np.array(['Queenstown', 'Southampton', 'Cherbourg'])
e = np.array([len(b.index), len(c.index), len(d.index)])
plt.pie(e, labels=label)
plt.show()

plt.bar(a['Age'], a['relative'])  # plotting a bar graph of age vs relatives ('relative' is assumed to be a precomputed column)

Fig. 10.2 – Bar graph of age vs relatives



plt.hist(a['Age'])  # plotting a histogram based on age

Fig. 10.3 – The age histogram

# procedure for plotting a triple bar graph based on survival status
f = len(a.loc[(a['Survived'] == 1) & (a['Sex'] == 'male')])
g = len(a.loc[(a['Survived'] == 1) & (a['Sex'] == 'female')])
h = len(a.loc[(a['Survived'] == 0) & (a['Sex'] == 'male')])
i = len(a.loc[(a['Survived'] == 0) & (a['Sex'] == 'female')])
s = len(a.loc[(a['Survived'].isnull()) & (a['Sex'] == 'male')])
t = len(a.loc[(a['Survived'].isnull()) & (a['Sex'] == 'female')])
p = np.array([f, g])
q = np.array([h, i])
r = np.array([s, t])
print(p, q, r)

gender = np.arange(len(['male', 'female']))
plt.bar(gender - 0.2, p, 0.2, label='Survivors')
plt.bar(gender, q, 0.2, label='Succumbers')
plt.bar(gender + 0.2, r, 0.2, label='Unknown')
plt.xticks(gender, ['male', 'female'])
plt.title('Survivors vs Succumbers vs Unknown')
plt.xlabel('Gender')
plt.ylabel('No. of people')
plt.legend()
plt.show()



Fig. 10.4 – A bar graph based on survival status

# plotting a bar graph based on travel class
j = len(a.loc[a['Pclass'] == 1])
k = len(a.loc[a['Pclass'] == 2])
l = len(a.loc[a['Pclass'] == 3])
m = np.array([j, k, l])
nlabel = [1, 2, 3]
plt.bar(nlabel, m, width=0.4)

Fig. 10.5 – Bar graph based on travel class



plt.violinplot(a['Pclass'])  # plotting a violin plot

Fig. 10.6 – A violin plot comparing travel classes of people



VIVA QUESTIONS

Q.1) What is Machine Learning?


Ans. Machine learning (ML) is the study of computer algorithms that canimprove automatically through
experience and by the use of data. It is seen as apart
of artificial intelligence. Machine learning algorithms build a model based on sample data, known as
training data, in order to make predictions ordecisions without being explicitly programmed to doso.

Q.2) How can machine learning be classified?


Ans. At a broad level, machine learning can be classified into three types:
1. Supervised learning
2. Unsupervised learning
3. Reinforcement learning

Supervised learning is a type of machine learning method in which we provide sample labeled data to the
machine learning system in order to train it, and on that basis, it predicts the output.

Unsupervised learning is a learning method in which a machine learns without any supervision. The training is
provided to the machine with the set of data that has not been labeled, classified, or categorized, and the algorithm
needs to act on that data without any supervision.

Reinforcement learning is a feedback-based learning method, in which a learning agent gets a reward for each right action and a penalty for each wrong action. The agent learns automatically from this feedback and improves its performance.

