
Practical Data Science: A Complete Guide to Data Analysis and Machine Learning Using Python

Danish Ather, Rajneesh Kler, Ibrohimbek Yusupov, Khakimjon Zaynidinov

Introduction
This book provides a comprehensive guide to data analysis and
machine learning using Python, designed to equip readers with the
necessary tools and knowledge to excel in the field of data science.
It covers a wide range of topics, from data manipulation using
libraries like Pandas and NumPy to advanced scientific computing
with SciPy, making it suitable for both beginners and experienced
practitioners.

Key Topics Covered


• Introduction to Data Analysis with Python
– Data Manipulation and Analysis using Pandas
– Numerical Computing with NumPy
– Advanced Scientific Computing with SciPy
• Machine Learning Techniques with Scikit-Learn
• Statistical Methods and Data Science
• Practical Applications and Case Studies

Features
• Step-by-step tutorials with practical examples
• Hands-on practice questions with solutions
• Integration of Python libraries for comprehensive data analysis

Target Audience
This book is ideal for students, data enthusiasts, and professionals
looking to acquire new skills and face the challenges of the modern
world of data science.

Preface
With data increasingly described as the new oil that is transforming
industries and products, making sense of ever-growing volumes of data
has become a critical necessity. Practical Data Science: A Complete
Guide to Data Analysis and Machine Learning Using Python was written
to give anyone interested in data science a solid grounding in data
processing and machine learning.

This book is designed to guide readers through the end-to-end process
of data analysis and machine learning with Python, one of the most
effective languages in this area. The topics range from introductory
material, such as data manipulation with Pandas, to intermediate
subjects including Numerical Python (NumPy), Scientific Python (SciPy),
machine learning with Scikit-Learn, and more. Throughout the guide we
have sought to combine theory with practical examples so that it
benefits both newcomers and practitioners.

It is our intent to encourage interest in the subject and to provide a
clear understanding of data science for a general audience. We know
that data science can feel overwhelming, especially given its breadth
and its pace of development. Every chapter is therefore designed to
build on the previous one so that learning is progressive. The
incorporation of examples, applications, and activities is another
strategy used to present the ideas discussed in this book.

It is our hope that this book will be more than a reference manual for
the field of data science: rather, readers should be able to treat it
as a guide to the territory, one that allows them to navigate it with
both speed and purpose. As an introductory text, it is intended for
students, data enthusiasts, and professionals who want to acquire new
skills and meet the challenges of the modern world of data science.


Acknowledgments
Finishing this book has been a greatly inspiring experience, and it
was made possible by the support of several people.

We are grateful to our parents, who have always supported and
encouraged us in every endeavor we undertake. We would also like to
thank our families for believing in our abilities, encouraging us
through everything we have achieved, and motivating us to pursue the
things we love and to see this book through to completion.

We want to express our deep gratitude to our families, without whom
this work could not have been completed and who have been a constant
support during these years. They have been patient and understanding,
and their love has enabled us to carry on no matter how hard things
became. We are grateful to them, as without their support this would
not have been possible.

We would also like to thank all of our mentors and colleagues, who
willingly gave their time to teach us, share their insight, and
provide valuable feedback. They have made invaluable contributions
toward making this book informative and up to date in its
presentation. We would like to express special appreciation for the
helpful comments we received from our fellow students, which helped us
make the necessary adjustments to some of the material presented here.

We would also like to express our gratitude to the Python community
and to the authors of the libraries and tools described in this book.
Their commitment to creating and sustaining these accessible tools has
helped make data science widely available. This book relies heavily on
the efforts of this community, and we hope to see it continue to grow.

Lastly, we would like to thank the readers who have chosen this book
as a companion on their journey to becoming data scientists. Your
passion and interest are the main motivation behind this work, and we
hope that this book gives you the information you need on your way to
becoming a true data science expert.

We are grateful to all of you for the encouragement, ideas, and help
you provided in the process of completing this guide. This book owes
as much to your determination as to its authors.


Contents
1 Introduction to Data Analysis with Python 7
  1.1 Introduction to Data Analysis 8
  1.2 Pandas: Data Manipulation and Analysis 8
    1.2.1 What is Pandas? 8
    1.2.2 Key Concepts 8
    1.2.3 Creating DataFrames and Series 9
    1.2.4 Basic DataFrame Operations 10
    1.2.5 Data Selection and Filtering 12
    1.2.6 Handling Missing Data 14
    1.2.7 Data Aggregation and Grouping 15
    1.2.8 Merging and Joining DataFrames 16
    1.2.9 Working with Dates and Times 18
  1.3 NumPy: Numerical Computing 20
    1.3.1 Introduction to NumPy 20
    1.3.2 Creating Arrays 20
    1.3.3 Array Operations 22
    1.3.4 Indexing, Slicing, and Reshaping 23
    1.3.5 Broadcasting 25
    1.3.6 Mathematical Functions 26
    1.3.7 Linear Algebra with NumPy 28
    1.3.8 Random Number Generation 29
    1.3.9 Real-World Applications 30
  1.4 SciPy: Advanced Scientific Computing 32
    1.4.1 Overview of SciPy 32
    1.4.2 Optimization 32
    1.4.3 Integration 34
    1.4.4 Interpolation 35
    1.4.5 Linear Algebra 36
    1.4.6 Signal and Image Processing 38
    1.4.7 Statistics with SciPy 39
    1.4.8 Practical Applications 41
  1.5 Practice Questions and Solutions 43
    1.5.1 Practice Questions on Pandas 43
    1.5.2 Practice Questions on NumPy 44
    1.5.3 Practice Questions on SciPy 45
  1.6 Conclusion 47

2 Pandas Library 49
  2.1 Introduction to Pandas Library 49
    2.1.1 Why Use Pandas? 49
    2.1.2 Installation 50
  2.2 Creating DataFrames and Series 51
    2.2.1 Creating DataFrames 51
    2.2.2 Creating Series 54
  2.3 Basic DataFrame Operations 56
    2.3.1 Viewing and Inspecting DataFrames 56
    2.3.2 Sorting and Filtering Data 59
  2.4 Handling Missing Data 63
    2.4.1 Detecting Missing Data 63
    2.4.2 Handling Missing Data 64
  2.5 Data Aggregation and Grouping 67
    2.5.1 Grouping and Aggregation 67
  2.6 Merging and Joining DataFrames 70
    2.6.1 Merging DataFrames 70
  2.7 Working with Dates and Times 74
    2.7.1 Handling DateTime Data 74
  2.8 Case Studies and Practical Examples 78
    2.8.1 Case Study 1: Analyzing Sales Data 78
    2.8.2 Case Study 2: Customer Segmentation 80
    2.8.3 Case Study 3: Time Series Forecasting 81
  2.9 Practice Questions and Solutions 84
    2.9.1 Practice Questions on Pandas 84
  2.10 Conclusion 87

3 Scikit-Learn Library 89
  3.1 Introduction to Scikit-Learn 89
    3.1.1 Why Use Scikit-Learn? 89
    3.1.2 Installation 90
  3.2 Supervised Learning 91
    3.2.1 Linear Regression 91
    3.2.2 Logistic Regression 94
    3.2.3 Decision Trees 97
  3.3 Unsupervised Learning 100
    3.3.1 K-Means Clustering 100
    3.3.2 Principal Component Analysis (PCA) 102
  3.4 Model Evaluation and Parameter Tuning 106
    3.4.1 Cross-Validation 106
    3.4.2 Hyperparameter Tuning 108
  3.5 Practice Questions and Solutions 111
    3.5.1 Practice Questions on Scikit-Learn 111
  3.6 Conclusion 116

4 Statistics 117
  4.1 Introduction to Statistics 117
  4.2 Descriptive Statistics 117
    4.2.1 Key Measures 118
    4.2.2 Example: Descriptive Statistics Calculation 118
  4.3 Basics of Grouping 121
    4.3.1 Example 1: Grouping and Aggregation by Class 121
    4.3.2 Example 2: Grouping by Multiple Columns 122
  4.4 Analysis of Variance (ANOVA) 124
    4.4.1 Example 1: One-Way ANOVA 124
    4.4.2 Example 2: Two-Way ANOVA 125
  4.5 Correlation 127
    4.5.1 Example 1: Correlation Calculation 127
    4.5.2 Example 2: Correlation Matrix 128
  4.6 Polynomial Regression and Pipelines 130
    4.6.1 Example 1: Polynomial Regression with Pipelines 130
    4.6.2 Example 2: Using Pipelines for Scaling and Regression 132
  4.7 R-squared and MSE for In-Sample Evaluation 134
    4.7.1 Example 1: Calculating R-squared and MSE 134
    4.7.2 Example 2: Comparing Models Using R-squared and MSE 135
  4.8 Prediction and Decision Making 138
    4.8.1 Example 1: Prediction and Decision Making in Business 138
    4.8.2 Example 2: Decision Making in Healthcare 139
  4.9 Practice Questions and Solutions 142
    4.9.1 Practice Questions on Statistics 142
  4.10 Conclusion 147

5 Model Refinement 149
  5.1 Introduction to Model Refinement 149
  5.2 Grid Search 149
    5.2.1 Example 1: Grid Search for Ridge Regression 150
    5.2.2 Example 2: Grid Search for Support Vector Machine (SVM) 151
  5.3 Model Refinement 154
    5.3.1 Example 1: Refining a Linear Regression Model with Cross-Validation 154
    5.3.2 Example 2: Refining with Regularization Techniques 156
  5.4 Binning 158
    5.4.1 Example 1: Binning Continuous Data into Equal Width Bins 158
    5.4.2 Example 2: Binning into Equal Frequency Bins 159
  5.5 Indicator Variables (Dummy Variables) 161
    5.5.1 Example 1: Creating Indicator Variables for Categorical Data 161
    5.5.2 Example 2: Creating Indicator Variables for Multiple Categorical Features 162
  5.6 Practice Questions and Solutions 164
    5.6.1 Practice Questions on Model Refinement 164
  5.7 Conclusion 168
Chapter 1 Introduction to Data Analysis with Python

1.1 Introduction to Data Analysis


Data analysis consists of processes such as data inspection, cleansing,
transformation, and modeling, carried out to generate valuable insights,
draw conclusions, and support decision making. Thanks to its rich
ecosystem of libraries, Python has emerged as one of the most popular
languages for data analysis. This chapter introduces three essential
libraries for data analysis in Python: Pandas, NumPy, and SciPy.
Together they cover data management and numerical and scientific
computation, capabilities that are central to the work of data
scientists and analysts. A brief sketch below illustrates how they
typically work together.
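To make this concrete, here is a minimal, hypothetical sketch of how the
three libraries can cooperate in a tiny analysis; the column names and
values are invented for illustration only:

import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical measurements; in practice these would come from a file or database
df = pd.DataFrame({"height_cm": [160, 172, 181, 169, 175],
                   "weight_kg": [55.0, 68.5, 80.2, 63.1, 71.4]})

# Pandas for inspection, NumPy for element-wise numerical work,
# SciPy for a simple statistical test
print(df.describe())
bmi = df["weight_kg"] / (df["height_cm"].to_numpy() / 100) ** 2
corr, p_value = stats.pearsonr(df["height_cm"], df["weight_kg"])
print(bmi.round(1))
print(corr, p_value)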

1.2 Pandas: Data Manipulation and


Analysis
1.2.1 What is Pandas?

Pandas is an open-source Python library for data analysis, particularly
suited to manipulating numerical tables and time series data. It is
built on top of NumPy, and its core data structures are the Series, for
one-dimensional labeled data, and the DataFrame, for two-dimensional
tabular data.

1.2.2 Key Concepts


• DataFrame: a two-dimensional, size-mutable, potentially heterogeneous
structure organized in labeled rows and columns.

• Series: a one-dimensional object that stores an array of data together
with an array of labels, its index.

• Index: the labels of the rows and columns of a DataFrame, which make
it easy to select data (see the short sketch after this list).
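As a quick, minimal illustration of these three concepts (the names and
values are invented for this sketch, not taken from the book's examples):

import pandas as pd

# A tiny DataFrame with an explicit row index
df = pd.DataFrame({"Age": [25, 30]}, index=["Alice", "Bob"])

print(df.index)       # row labels:    Index(['Alice', 'Bob'], dtype='object')
print(df.columns)     # column labels: Index(['Age'], dtype='object')
print(df["Age"])      # a single column is a Series
print(df.loc["Bob"])  # label-based row selection via the index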

1.2.3 Creating DataFrames and Series

Example 1: Creating a DataFrame from a Dictionary


import pandas as pd

# Creating a DataFrame from a dictionary
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago']
}
df = pd.DataFrame(data)
print(df)

Explanation: This example creates a DataFrame using a dictionary


where the keys are column names and the values are lists
representing column data.

Example 2: Creating a DataFrame from a List of Dictionaries


# Creating a DataFrame from a list of dictionaries
data = [
    {'Name': 'Alice', 'Age': 25, 'City': 'New York'},
    {'Name': 'Bob', 'Age': 30, 'City': 'Los Angeles'},
    {'Name': 'Charlie', 'Age': 35, 'City': 'Chicago'}
]

df = pd.DataFrame(data)
print(df)
Explanation: A DataFrame is created from a list of dictionaries, which
is useful when each row of data can be represented as a dictionary.

Example 3: Creating a Series


# Creating a Series from a list
ages = pd.Series([25, 30, 35], name='Age')

# Creating a Series from a dictionary
city_populations = pd.Series({'New York': 8419000, 'Los Angeles': 3980000,
                              'Chicago': 2716000}, name='Population')

print(ages)
print(city_populations)

Explanation: A Series can be created from a list or a dictionary. If


created from a dictionary, the keys become the index of the Series.

1.2.4 Basic DataFrame Operations

Example 1: Viewing Data in a DataFrame


# Displaying the first few rows of the DataFrame
print(df.head())     # By default, shows the first 5 rows

# Displaying the last few rows
print(df.tail(2))    # Shows the last 2 rows

# Displaying basic statistics of numerical columns
print(df.describe())

# Displaying the DataFrame's structure
print(df.info())

Explanation: The head() and tail() methods are used to quickly view
the top and bottom rows of the DataFrame. The describe() method
provides summary statistics of numerical columns, and info()
displays the structure of the DataFrame.

Example 2: Modifying Data in a DataFrame

# Adding a new column with default values
df['Salary'] = [50000, 60000, 70000]

# Modifying an existing entry
df.loc[0, 'Age'] = 26    # Changing Alice's age to 26

# Adding a calculated column
df['Salary After Tax'] = df['Salary'] * 0.7

# Renaming columns
df.rename(columns={'City': 'Location'}, inplace=True)

print(df)

Explanation: This example demonstrates adding new columns,


modifying specific values, performing calculations to create new
columns, and renaming columns using the rename() method.

Example 3: Deleting Data


# Dropping a column
df.drop('Salary After Tax', axis=1, inplace=True)

# Dropping a row
df.drop(1, axis=0, inplace=True)    # Drops the row with index 1

print(df)

Explanation: The drop() method is used to remove rows or columns


from a DataFrame. Specifying axis=1 drops a column, while axis=0
drops a row.

1.2.5 Data Selection and Filtering

Example 1: Selecting Data from a DataFrame


# Selecting a column
ages = df['Age']
print(ages)

# Selecting multiple columns
names_and_locations = df[['Name', 'Location']]
print(names_and_locations)

# Selecting rows by index position
first_row = df.iloc[0]      # Selecting the first row
print(first_row)

# Selecting rows by index label
row_by_label = df.loc[0]    # Assuming the default integer index
print(row_by_label)
Explanation:
• df['Age'] selects the 'Age' column as a Pandas Series.
• df[['Name', 'Location']] selects multiple columns and returns a new DataFrame.
• df.iloc[0] selects the first row based on its position (0-indexed).
• df.loc[0] selects the first row using its label (index), which is 0 in this case.
Example 2: Filtering Data in a DataFrame
# Filtering rows where Age is greater than 30
age_above_30 = df[df['Age'] > 30]
print(age_above_30)

# Filtering rows where Salary is between 55000 and 65000
salary_range = df[(df['Salary'] > 55000) & (df['Salary'] < 65000)]
print(salary_range)

# Filtering rows based on string values
location_ny = df[df['Location'] == 'New York']
print(location_ny)
Explanation:
• df[df[’Age’] > 30] filters rows where the ’Age’ column value is
greater than 30.
• Logical operators like & (and) are used to combine multiple
conditions.
• Filtering by string values allows selection based on categorical
data.
Example 3: Advanced Filtering with Multiple Conditions
# Filtering with multiple conditions using 'or'
high_salary_or_young = df[(df['Salary'] > 60000) | (df['Age'] < 30)]
print(high_salary_or_young)

# Filtering using the 'isin' method
location_filter = df[df['Location'].isin(['New York', 'Chicago'])]
print(location_filter)

# Using 'not' conditions
not_new_york = df[~df['Location'].str.contains('New York')]
print(not_new_york)

Explanation: Advanced filtering techniques include using | (or), isin()


for multiple matching values, and negation with ~ to exclude certain
criteria.
1.2.6 Handling Missing Data

Example 1: Detecting Missing Data


# Creating a DataFrame with missing values
data = {
    'Name': ['Alice', 'Bob', None],
    'Age': [25, None, 35],
    'Location': ['New York', 'Los Angeles', None]
}
df = pd.DataFrame(data)

# Checking for missing values
print(df.isnull())

# Counting missing values in each column
print(df.isnull().sum())

Explanation: The isnull() method returns a DataFrame of the same


shape, with True where data is missing. The sum() method counts
the number of missing values per column.

Example 2: Filling Missing Data


# Filling missing values with a default value
df['Age'].fillna(df['Age'].mean(), inplace=True)   # Filling with the mean of the column
df['Name'].fillna('Unknown', inplace=True)         # Filling with a default string

print(df)

Explanation: The fillna() method fills missing values with specified


values, such as the mean of a column or a default string.

Example 3: Dropping Missing Data

# Dropping rows with any missing values
df_cleaned = df.dropna()

# Dropping rows with missing values in a specific column
df_cleaned_specific = df.dropna(subset=['Location'])

print(df_cleaned)
print(df_cleaned_specific)

Explanation: The dropna() method removes rows with missing data,


which can be applied to the entire DataFrame or specific columns.
1.2.7 Data Aggregation and Grouping

Example 1: Grouping Data


# Grouping by a single column
grouped = df.groupby('Location')

# Aggregating data
average_age = grouped['Age'].mean()
total_salary = grouped['Salary'].sum()

print(average_age)
print(total_salary)

Explanation: The groupby() method is used to split the data into


groups based on a column. Aggregation functions like mean() and
sum() are then applied to these groups.

Example 2: Multiple Aggregations


# Grouping by multiple columns and performing multiple aggregations
grouped = df.groupby(['Location', 'Name']).agg({
    'Age': ['mean', 'max'],
    'Salary': 'sum'
})

print(grouped)

Explanation: The agg() method allows performing multiple


aggregations on grouped data, providing detailed summaries of the
dataset.

Example 3: Pivot Tables


# Creating a pivot table
pivot = df.pivot_table(values='Salary', index='Location', columns='Name',
                       aggfunc='sum')

print(pivot)

Explanation: Pivot tables provide a way to summarize data, similar to


Excel pivot tables, allowing aggregation of data with customizable
rows and columns.
1.2.8 Merging and Joining DataFrames

Example 1: Merging DataFrames


# Creating two DataFrames
df1 = pd.DataFrame({
    'Name': ['Alice', 'Bob'],
    'Age': [25, 30]
})

df2 = pd.DataFrame({
    'Name': ['Alice', 'Bob'],
    'Salary': [50000, 60000]
})

# Merging on the 'Name' column
merged_df = pd.merge(df1, df2, on='Name')
print(merged_df)

Explanation: The merge() function combines DataFrames based on a


common column or index. This is particularly useful for combining
related datasets.

Example 2: Different Types of Joins


# Left join
left_join = pd.merge(df1, df2, on='Name', how='left')

# Right join
right_join = pd.merge(df1, df2, on='Name', how='right')

# Outer join
outer_join = pd.merge(df1, df2, on='Name', how='outer')

# Inner join
inner_join = pd.merge(df1, df2, on='Name', how='inner')

print(left_join)
print(right_join)
print(outer_join)
print(inner_join)

Explanation: Different types of joins control how DataFrames are


combined:
• left: Includes all rows from the left DataFrame, and matched rows
from the right.
• right: Includes all rows from the right DataFrame, and matched
rows from the left.
• outer: Includes all rows from both DataFrames, matching where
possible.

• inner: Includes only rows with matching values in both DataFrames.

Example 3: Concatenating DataFrames


# Creating DataFrames to concatenate
df1 = pd.DataFrame({
    'Name': ['Alice', 'Bob'],
    'Age': [25, 30]
})

df2 = pd.DataFrame({
    'Name': ['Charlie', 'David'],
    'Age': [35, 40]
})

# Concatenating along rows (default)
concat_df = pd.concat([df1, df2])

# Concatenating along columns
concat_df_columns = pd.concat([df1, df2], axis=1)

print(concat_df)
print(concat_df_columns)

Explanation: The concat() function combines DataFrames along a


particular axis (rows or columns), which is useful for appending data
or merging along specific dimensions.

1.2.9 Working with Dates and Times

Example 1: Handling DateTime Data

# Creating a DataFrame with date data
data = {
    'Date': ['2023-01-01', '2023-01-02', '2023-01-03'],
    'Value': [100, 150, 200]
}
df = pd.DataFrame(data)

# Converting strings to DateTime objects
df['Date'] = pd.to_datetime(df['Date'])

# Extracting components
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month
df['Day'] = df['Date'].dt.day

print(df)

Explanation: The pd.to_datetime() function converts date strings to
DateTime objects, allowing extraction of individual components such as
year, month, and day.

Example 2: DateTime Indexing and Resampling


# Setting the 'Date' column as the index
df.set_index('Date', inplace=True)

# Resampling data to calculate monthly averages
monthly_avg = df.resample('M').mean()

print(monthly_avg)

Explanation: Setting a DateTime column as the index allows for


powerful time-based operations, like resampling data to different
time frequencies (e.g., monthly, weekly).

Example 3: DateTime Calculations

# Adding a new column with time differences
df['Next Day'] = df['Date'] + pd.Timedelta(days=1)

# Calculating time differences between rows
df['Diff'] = df['Date'].diff()

print(df)

Explanation: DateTime calculations include adding time deltas and


calculating differences between dates, which are essential for time
series analysis.
ANALYSIS WITH PYTHON

1.3 NumPy: Numerical Computing


1.3.1 Introduction to NumPy

NumPy (Numerical Python) is the core package for creating and
manipulating arrays and for matrix calculations. It is highly optimized
for these data structures and is therefore a powerful tool for
numerical and scientific computation. NumPy arrays resemble Python
lists, but they are far more efficient and come with a much richer set
of mathematical functions; a short comparative sketch follows.
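As a brief illustration of that point (a sketch, not a benchmark), the
snippet below contrasts a plain Python list with a NumPy array for the
same element-wise operation:

import numpy as np

values = [1, 2, 3, 4, 5]

# Pure-Python approach: an explicit comprehension over the list
squares_list = [v ** 2 for v in values]

# NumPy approach: the operation is applied to the whole array at once
arr = np.array(values)
squares_arr = arr ** 2

print(squares_list)   # [1, 4, 9, 16, 25]
print(squares_arr)    # [ 1  4  9 16 25]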

1.3.2 Creating Arrays

Example 1: Creating Arrays from Lists

import numpy as np

# Creating a one-dimensional array
arr1 = np.array([1, 2, 3, 4])

# Creating a two-dimensional array (matrix)
arr2 = np.array([[1, 2, 3], [4, 5, 6]])

# Creating an array with a range of values
arr3 = np.arange(0, 10, 2)   # Start at 0, end before 10, step by 2

print(arr1)
print(arr2)
print(arr3)

Explanation: Arrays can be created from Python lists using np.array().
The arange() function creates an array of values within a given range
with a given step size.

Example 2: Creating Arrays with Zeros and Ones


# Creating an array filled with zeros
zeros = np.zeros((2, 3))    # 2 rows, 3 columns

# Creating an array filled with ones
ones = np.ones((3, 2))      # 3 rows, 2 columns

# Creating an identity matrix
identity = np.eye(3)        # 3x3 identity matrix

print(zeros)
print(ones)
print(identity)

Explanation: Arrays can be initialized with default values using


functions like zeros(), ones(), and eye() for identity matrices.

Example 3: Creating Arrays with Random Numbers


# Creating an array with random values
random_arr = np.random.rand(2, 3)    # 2 rows, 3 columns of random values between 0 and 1

# Creating an array with random integers
random_ints = np.random.randint(0, 10, size=(2, 3))   # Integers between 0 and 10

# Creating a normally distributed random array
random_normal = np.random.randn(3, 3)   # 3x3 matrix with values from a standard normal distribution

print(random_arr)
print(random_ints)
print(random_normal)

Explanation: NumPy’s random module provides functions to generate


random numbers, including uniform distributions (rand), integers
(randint), and normal distributions (randn).

1.3.3 Array Operations

Example 1: Element-wise Operations

# Creating an array
arr = np.array([10, 20, 30, 40])

# Performing arithmetic operations
arr_plus_5 = arr + 5      # Adds 5 to each element
arr_times_2 = arr * 2     # Multiplies each element by 2
arr_squared = arr ** 2    # Squares each element

print(arr_plus_5)
print(arr_times_2)
print(arr_squared)

Explanation: NumPy allows arithmetic operations to be performed


element-wise on arrays, which means applying an operation to each
element individually.

Example 2: Matrix Operations


# Creating matrices
matrix1 = np.array([[1, 2], [3, 4]])
matrix2 = np.array([[5, 6], [7, 8]])

# Matrix addition
matrix_sum = matrix1 + matrix2

# Matrix multiplication (element-wise)
matrix_mult = matrix1 * matrix2

# Dot product (matrix multiplication)
dot_product = np.dot(matrix1, matrix2)

print(matrix_sum)
print(matrix_mult)
print(dot_product)
Explanation: NumPy supports both element-wise matrix operations
and matrix multiplication using the dot() function.

Example 3: Mathematical Functions

# Creating an array
arr = np.array([0, np.pi/2, np.pi])

# Applying trigonometric functions
sin_values = np.sin(arr)
cos_values = np.cos(arr)

# Applying exponential and logarithmic functions
exp_values = np.exp(arr)
log_values = np.log(np.array([1, np.e, np.e**2]))

print(sin_values)
print(cos_values)
print(exp_values)
print(log_values)

Explanation: NumPy provides a wide range of mathematical


functions, including trigonometric, exponential, and logarithmic
functions, which can be applied element-wise to arrays.

1.3.4 Indexing, Slicing, and Reshaping

Example 1: Basic Indexing and Slicing

# Creating an array
arr = np.array([1, 2, 3, 4, 5, 6])

# Accessing elements
element = arr[2]     # Third element (0-indexed)

# Slicing arrays
slice1 = arr[1:4]    # Elements from index 1 to 3
slice2 = arr[:3]     # First three elements
slice3 = arr[3:]     # Elements from index 3 onwards

print(element)
print(slice1)
print(slice2)
print(slice3)

Explanation: Indexing and slicing allow accessing specific elements


or subarrays within a NumPy array.
Example 2: Reshaping Arrays

# Creating an array with 6 elements
arr = np.array([1, 2, 3, 4, 5, 6])

# Reshaping the array into a 2x3 matrix
reshaped_arr = arr.reshape((2, 3))   # Now has 2 rows and 3 columns

# Flattening a 2D array back into a 1D array
flattened_arr = reshaped_arr.flatten()

print(reshaped_arr)
print(flattened_arr)

Explanation: The reshape() function changes the shape of an array


without altering its data, and flatten() converts a multi-dimensional
array into a one-dimensional array.

Example 3: Advanced Indexing


# Creating a 3x3 matrix
matrix = np.array([[10, 20, 30], [40, 50, 60], [70, 80, 90]])

# Accessing specific elements using advanced indexing
elements = matrix[[0, 1, 2], [0, 1, 2]]   # Accesses (0,0), (1,1), (2,2)

# Boolean indexing
bool_index = matrix[matrix > 50]          # Elements greater than 50

print(elements)
print(bool_index)

Explanation: Advanced indexing techniques include using arrays of


indices and Boolean indexing to select elements based on conditions.

1.3.5 Broadcasting

Example 1: Broadcasting Scalars


# Creating an array
arr = np.array([1, 2, 3])

# Broadcasting a scalar addition
broadcast_add = arr + 5

# Broadcasting a scalar multiplication
broadcast_mult = arr * 3

print(broadcast_add)
print(broadcast_mult)

Explanation: Broadcasting allows NumPy to perform operations


between arrays of different shapes by expanding the smaller array
across the larger array.
Example 2: Broadcasting Arrays

# Creating arrays of different shapes
arr1 = np.array([1, 2, 3])             # Shape (3,)
arr2 = np.array([[10], [20], [30]])    # Shape (3, 1)

# Broadcasting operation: adding arr1 to each row of arr2
result = arr1 + arr2

print(result)

Explanation: In this example, arr1 is broadcasted across the rows of


arr2, allowing for element-wise addition even though the arrays have
different shapes.

Example 3: Broadcasting with Multi-Dimensional Arrays


# Creating a 2x3 matrix
matrix = np.array([[1, 2, 3], [4, 5, 6]])

# Broadcasting a 1D array along the second dimension
vector = np.array([10, 20, 30])

# Adding the vector to each row of the matrix
broadcast_result = matrix + vector

print(broadcast_result)

Explanation: Broadcasting extends the lower-dimensional array


(vector) across the higher-dimensional array (matrix), allowing
element-wise addition across rows.

1.3.6 Mathematical Functions

Example 1: Applying Trigonometric Functions

# Creating an array with angles in radians
angles = np.array([0, np.pi/4, np.pi/2, np.pi])

# Applying sine, cosine, and tangent functions
sin_values = np.sin(angles)
cos_values = np.cos(angles)
tan_values = np.tan(angles)

print(sin_values)
print(cos_values)
print(tan_values)

Explanation: NumPy provides trigonometric functions like sin(),


cos(), and tan() that operate elementwise on arrays.

Example 2: Exponential and Logarithmic Functions


# Creating an array
arr = np.array([1, 2, 3])

# Applying the exponential function
exp_values = np.exp(arr)

# Applying the natural logarithm
log_values = np.log(arr)

# Applying logarithm base 10
log10_values = np.log10(arr)

print(exp_values)
print(log_values)
print(log10_values)

Explanation: Exponential and logarithmic functions are essential for


various calculations in data analysis and can be applied element-wise
to arrays.

Example 3: Statistical Functions


# Creating an array
arr = np.array([10, 20, 30, 40, 50])

# Calculating mean, median, and standard deviation
mean_value = np.mean(arr)
median_value = np.median(arr)
std_value = np.std(arr)

print(mean_value)
print(median_value)
print(std_value)

Explanation: NumPy includes functions for statistical calculations,


such as mean(), median(), and std() for standard deviation.

1.3.7 Linear Algebra with NumPy

Example 1: Solving Linear Equations

from numpy.linalg import solve

# Define a system of linear equations: Ax = b
A = np.array([[3, 2], [1, 4]])   # Coefficients
b = np.array([5, 6])             # Constants

# Solving for x using the 'solve' function
x = solve(A, b)

print('Solution:', x)

Explanation: The solve() function from NumPy’s linear algebra


module solves systems of linear equations defined by matrix A and
vector b.

Example 2: Matrix Inversion


from numpy.linalg import inv

# Creating a square matrix
matrix = np.array([[1, 2], [3, 4]])

# Calculating the inverse of the matrix
inverse_matrix = inv(matrix)

print(inverse_matrix)

Explanation: Matrix inversion is used in various applications, such as


solving linear equations and transforming data. The inv() function
computes the inverse of a square matrix.

Example 3: Eigenvalues and Eigenvectors

from numpy.linalg import eig

# Creating a matrix
matrix = np.array([[4, -2], [1, 1]])

# Calculating eigenvalues and eigenvectors
eigenvalues, eigenvectors = eig(matrix)

print('Eigenvalues:', eigenvalues)
print('Eigenvectors:', eigenvectors)

Explanation: Eigenvalues and eigenvectors are core concepts in linear
algebra. NumPy provides the eig() function to compute the eigenvalues
and eigenvectors of a given matrix.

1.3.8 Random Number Generation

Example 1: Uniform Distribution


# Generating random numbers from a uniform distribution
uniform_random = np.random.uniform(low=0.0, high=1.0, size=(3, 3))

print(uniform_random)

Explanation: The uniform() function generates random numbers


within a specified range and is often used in simulations and
randomized trials.
Example 2: Normal Distribution

# Generating random numbers from a normal distribution
normal_random = np.random.normal(loc=0.0, scale=1.0, size=(3, 3))

print(normal_random)

Explanation: The normal() function generates random numbers from a
normal distribution defined by the location parameter (loc, the mean)
and the scale parameter (scale, the standard deviation); it is commonly
used in statistical modelling.

Example 3: Random Integers


# Generating random integers
random_ints = np.random.randint(low=0, high=10, size=(3, 3))

print(random_ints)

Explanation: The randint() function generates random integers within a
specified range, which is useful when sampling or partitioning data.

1.3.9 Real-World Applications

Example 1: Data Normalization


# Creating a dataset
data = np.array([10, 20, 30, 40, 50])

# Normalizing the data to a range of 0 to 1
normalized_data = (data - np.min(data)) / (np.max(data) - np.min(data))

print(normalized_data)

Explanation: Data normalization scales data values to a standard range,
a step widely used in machine learning and statistical analysis.
Example 2: Simulating Data for Analysis

# Simulating normally distributed data
simulated_data = np.random.normal(loc=100, scale=15, size=1000)

# Calculating summary statistics
mean = np.mean(simulated_data)
std = np.std(simulated_data)

print('Mean:', mean)
print('Standard Deviation:', std)

Explanation: Simulating data is a powerful technique for testing


hypotheses, modeling scenarios, and validating statistical methods.

Example 3: Monte Carlo Simulation


# Monte Carlo simulation to estimate the value of Pi
num_points = 1000000
points = np.random.rand(num_points, 2)   # Generate random points in a unit square
inside_circle = np.sum(np.sqrt(points[:, 0]**2 + points[:, 1]**2) <= 1)
pi_estimate = (inside_circle / num_points) * 4

print('Estimated Pi:', pi_estimate)

Explanation: Monte Carlo simulations use random sampling to solve


problems that may be deterministic in principle, such as estimating
the value of Pi.

1.4 SciPy: Advanced Scientific Computing


1.4.1 Overview of SciPy

SciPy is built on top of NumPy and extends it with many additional
routines for scientific computation. It contains subpackages for
optimization, integration, interpolation, eigenvalue problems,
algebraic equations, signal processing, statistics, and other advanced
calculations. Data analysis, engineering, and scientific computing
would be hard to imagine without SciPy; the sketch below lists the
subpackages used in the rest of this chapter.
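For orientation, this minimal sketch imports the SciPy subpackages used
later in the chapter; the one-line comments are informal summaries, not
exhaustive descriptions of each module:

# Representative SciPy subpackages used in this chapter
from scipy import optimize     # function minimization, curve fitting
from scipy import integrate    # numerical integration, ODE solvers
from scipy import interpolate  # 1-D and N-D interpolation
from scipy import linalg       # decompositions, linear solvers
from scipy import signal       # filtering, convolution, correlation
from scipy import stats        # descriptive statistics, hypothesis tests

print(optimize.__name__, integrate.__name__, stats.__name__)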

1.4.2 Optimization

Example 1: Finding the Minimum of a Function

from scipy.optimize import minimize

# Define a simple function to minimize: f(x) = (x - 3)^2
def func(x):
    return (x - 3) ** 2

# Use the minimize function to find the minimum
result = minimize(func, x0=0)   # Starting guess is x = 0

print('Minimum found at x =', result.x)

Explanation: The minimize() function from SciPy’s optimization


module is used to find the minimum of a given function, which is
crucial in many fields including machine learning and engineering.

Example 2: Constrained Optimization

from scipy.optimize import minimize

# Define a function to minimize: f(x, y) = x^2 + y^2
def func(variables):
    x, y = variables
    return x**2 + y**2

# Define a constraint: x + y = 1
constraints = ({'type': 'eq',
                'fun': lambda variables: variables[0] + variables[1] - 1})

# Use the minimize function with constraints
result = minimize(func, x0=[0, 0], constraints=constraints)

print('Minimum found at:', result.x)

Explanation: Constrained optimization allows for minimizing


functions while respecting certain conditions or constraints, which is
common in practical applications.

Example 3: Maximizing a Function

from scipy.optimize import minimize

# Define the function to maximize: f(x) = -(x - 3)^2
def func(x):
    return -(x - 3) ** 2

# Maximize f by minimizing its negative (the standard trick)
result = minimize(lambda x: -func(x), x0=0)

print('Maximum found at x =', result.x)

Explanation: Maximizing a function can be achieved by minimizing its


negative, which is a standard approach in optimization problems.

1.4.3 Integration

Example 1: Numerical Integration

from scipy.integrate import quad

# Define a function to integrate: f(x) = x^2 + 1
def integrand(x):
    return x ** 2 + 1

# Use quad to integrate the function from 0 to 2
result, error = quad(integrand, 0, 2)

print('Integral result =', result)

Explanation: The quad() function performs numerical integration of a


function over a specified interval, which is essential in many fields of
science and engineering.

Example 2: Double Integration


from scipy.integrate import dblquad

# Define the function to integrate
def integrand(x, y):
    return x * y

# Define the limits for y (0 to 1) and x (0 to y)
result, error = dblquad(integrand, 0, 1, lambda x: 0, lambda x: x)

print('Double integral result =', result)

Explanation: The dblquad() function computes double integrals over


a two-dimensional region, which is useful in physics and
engineering.

Example 3: Solving Differential Equations

from scipy.integrate import solve_ivp

# Define the differential equation: dy/dx = x + y
def dydx(x, y):
    return x + y

# Solve the differential equation with initial condition y(0) = 1
solution = solve_ivp(dydx, [0, 5], [1])

print('Solution:', solution.y)

Explanation: Solving differential equations is crucial in modeling
dynamic systems, and SciPy's solve_ivp() function provides a robust
method for these calculations.

1.4.4 Interpolation

Example 1: Linear Interpolation


from scipy.interpolate import interp1d
import numpy as np

# Given data points
x = np.array([0, 1, 2, 3])
y = np.array([1, 3, 2, 5])

# Creating a linear interpolation function
linear_interp = interp1d(x, y)

# Interpolating at a new point
new_y = linear_interp(1.5)

print('Interpolated value at 1.5:', new_y)

Explanation: Interpolation estimates values between known data points.
Linear interpolation is one of the simplest methods: it estimates
values along the straight line between two adjacent points.

Example 2: Cubic Spline Interpolation

from scipy.interpolate import CubicSpline

# Given data points
x = np.array([0, 1, 2, 3])
y = np.array([1, 3, 2, 5])

# Creating a cubic spline interpolation function
cubic_interp = CubicSpline(x, y)

# Interpolating at a new point
new_y = cubic_interp(1.5)

print('Cubic spline interpolated value at 1.5:', new_y)

Explanation: Cubic spline interpolation provides a smoother estimate


compared to linear interpolation by fitting a cubic polynomial
between data points.
Example 3: 2D Interpolation
from scipy.interpolate import griddata

# Given data points
points = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
values = np.array([0, 1, 1, 0])

# Interpolating at a new point
grid_z0 = griddata(points, values, (0.5, 0.5), method='linear')

print('2D interpolated value at (0.5, 0.5):', grid_z0)

Explanation: 2D interpolation estimates values on a surface from


known points, useful in geospatial analysis and image processing.

1.4.5 Linear Algebra

Example 1: Solving Linear Equations

from scipy.linalg import solve

# Define a system of linear equations: Ax = b
A = np.array([[3, 2], [1, 4]])   # Coefficients
b = np.array([5, 6])             # Constants

# Solving for x using the 'solve' function
x = solve(A, b)

print('Solution:', x)

Explanation: Solving systems of linear equations is fundamental in


engineering and science, and the solve() function from SciPy’s linear
algebra module provides an efficient solution.

Example 2: Matrix Decomposition

from scipy.linalg import lu

# Creating a matrix
matrix = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# Performing LU decomposition
P, L, U = lu(matrix)

print('L:', L)
print('U:', U)

Explanation: Matrix decomposition techniques, like LU


decomposition, are used in numerical analysis for solving linear
systems and matrix inversion.

Example 3: Singular Value Decomposition (SVD)


from scipy.linalg import svd

# Creating a matrix
matrix = np.array([[1, 2], [3, 4], [5, 6]])

# Performing SVD
U, S, Vh = svd(matrix)

print('U:', U)
print('S:', S)
print('Vh:', Vh)

Explanation: SVD is used in signal processing, statistics, and data


compression. It decomposes a matrix into three other matrices,
revealing its essential features.

1.4.6 Signal and Image Processing

Example 1: Signal Filtering


from scipy.signal import butter, filtfilt

# Creating a Butterworth filter
b, a = butter(N=3, Wn=0.05)   # 3rd order, cutoff frequency of 0.05

# Sample signal
signal = np.sin(np.linspace(0, 10, 100))

# Applying the filter
filtered_signal = filtfilt(b, a, signal)

print('Filtered signal:', filtered_signal)

Explanation: Signal filtering removes noise from data, and SciPy


provides various filters like the Butterworth filter to smooth signals.

Example 2: Image Processing with 2D Convolution


from scipy.ndimage import convolve
import numpy as np

# Creating a sample 2D array (image)
image = np.array([[1, 2, 1], [0, 1, 0], [2, 1, 2]])

# Creating a filter (kernel)
kernel = np.array([[1, 0, -1], [1, 0, -1], [1, 0, -1]])

# Applying the 2D convolution
filtered_image = convolve(image, kernel)

print('Filtered image:', filtered_image)

Explanation: Convolution is a fundamental operation in image


processing, used for tasks like edge detection and image filtering.
Example 3: Fourier Transform
from scipy.fft import fft, ifft

# Creating a sample signal
signal = np.array([0, 1, 0, -1])

# Performing the Fourier Transform
fft_signal = fft(signal)

# Performing the Inverse Fourier Transform
ifft_signal = ifft(fft_signal)

print('FFT:', fft_signal)
print('Inverse FFT:', ifft_signal)

Explanation: Fourier Transforms decompose signals into frequency


components, widely used in signal processing and time series
analysis.

1.4.7 Statistics with SciPy

Example 1: Descriptive Statistics

from scipy import stats

# Creating a dataset
data = np.array([10, 20, 30, 40, 50])

# Calculating descriptive statistics
mean = stats.tmean(data)
median = np.median(data)   # SciPy has no median helper; NumPy's median is used here
std_dev = stats.tstd(data)

print('Mean:', mean)
print('Median:', median)
print('Standard Deviation:', std_dev)

Explanation: SciPy provides comprehensive statistical functions,


including mean, median, and standard deviation, which are
foundational in data analysis.

Example 2: Hypothesis Testing


from scipy.stats import ttest_1samp

# Sample data
data = np.array([10, 12, 14, 16, 18])

# Performing a one-sample t-test
t_stat, p_value = ttest_1samp(data, 15)

print('T-statistic:', t_stat)
print('P-value:', p_value)
Explanation: Hypothesis testing is a way of making decisions about
populations based on sample data; the one-sample t-test compares the
sample mean to a hypothesized value.

Example 3: Correlation Analysis


from scipy.stats import pearsonr

# Sample data
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 6, 8, 10])

# Calculating the Pearson correlation coefficient
corr_coefficient, p_value = pearsonr(x, y)

print('Correlation Coefficient:', corr_coefficient)
print('P-value:', p_value)

Explanation: Correlation analysis measures the relationship between


variables, with the Pearson correlation coefficient quantifying linear
relationships.

1.4.8 Practical Applications

Example 1: Curve Fitting


from scipy.optimize import curve_fit

# Defining a linear function
def linear_func(x, a, b):
    return a * x + b

# Sample data
x_data = np.array([1, 2, 3, 4, 5])
y_data = np.array([2.2, 4.1, 6.0, 8.3, 10.1])

# Fitting the curve
params, covariance = curve_fit(linear_func, x_data, y_data)

print('Fitted parameters:', params)

Explanation: Curve fitting estimates the parameters of a function


that best fits the data, which is essential in data modeling and
predictions.
Example 2: Signal Denoising
from scipy.signal import medfilt

# Creating a noisy signal
signal = np.array([1, 2, 3, 2, 1, 100, 1, 2, 3])

# Applying a median filter to remove noise
filtered_signal = medfilt(signal, kernel_size=3)

print('Filtered signal:', filtered_signal)

Explanation: Signal denoising removes noise from data, making it


easier to analyze underlying trends or patterns.

Example 3: Time Series Forecasting

from scipy.signal import correlate

# Creating sample time series data
time_series1 = np.array([1, 2, 3, 4, 5, 6, 7, 8])
time_series2 = np.array([3, 4, 5, 6, 7, 8, 9, 10])

# Calculating the cross-correlation
cross_corr = correlate(time_series1, time_series2)

print('Cross-correlation:', cross_corr)

Explanation: Cross-correlation measures the similarity between two


time series, which is useful in forecasting and analyzing relationships
in temporal data.

1.5 Practice Questions and Solutions


1.5.1 Practice Questions on Pandas

Question 1: Creating DataFrames


Create a DataFrame from the following data and
display the first 3 rows:

• Names: John, Jane, Jim, Jack


• Ages: 28, 34, 29, 40
• Cities: Boston, Austin, Denver, Seattle

Solution:
import pandas as pd

data = {
    'Name': ['John', 'Jane', 'Jim', 'Jack'],
    'Age': [28, 34, 29, 40],
    'City': ['Boston', 'Austin', 'Denver', 'Seattle']
}
df = pd.DataFrame(data)
print(df.head(3))
Question 2: Data Selection and Filtering

Using the DataFrame created in Question 1, filter and display rows


where Age is greater than 30.
Solution:
filtered_df = df[df['Age'] > 30]
print(filtered_df)

Question 3: Handling Missing Data

Create a DataFrame with missing data using the following dictionary,
then fill missing values in the 'Age' column with the mean value.

• Names: John, Jane, None, Jack


• Ages: 28, None, 29, 40
• Cities: Boston, None, Denver, Seattle

Solution:
data = {
    'Name': ['John', 'Jane', None, 'Jack'],
    'Age': [28, None, 29, 40],
    'City': ['Boston', None, 'Denver', 'Seattle']
}
df = pd.DataFrame(data)
df['Age'].fillna(df['Age'].mean(), inplace=True)
print(df)

1.5.2 Practice Questions on NumPy


Question 1: Array Creation
Create a one-dimensional NumPy array with values from 1 to 10, and
reshape it into a 2x5 matrix.

Solution:

import numpy as np

arr = np.arange(1, 11)
reshaped_arr = arr.reshape(2, 5)
print(reshaped_arr)

Question 2: Array Operations


Given the array arr = np.array([10, 20, 30, 40]), calculate the square
of each element. Solution:
arr = np . array ([10 , 20 , 30 , 40])
squared_arr = arr ** 2
print ( squared_arr )

Question 3: Broadcasting
Create a 3x3 matrix with values from 1 to 9 and add a vector [10,
20, 30] to each row using broadcasting.
Solution:
matrix = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
vector = np.array([10, 20, 30])
broadcast_result = matrix + vector
print(broadcast_result)

1.5.3 Practice Questions on SciPy

Question 1: Optimization
Use SciPy to find the minimum of the function f(x) = (x − 2)².

Solution:

from scipy.optimize import minimize

def func(x):
    return (x - 2) ** 2

result = minimize(func, x0=0)
print('Minimum at:', result.x)
Question 2: Numerical Integration

Integrate the function f(x) = x³ from 0 to 1 using SciPy.

Solution:

from scipy.integrate import quad

def integrand(x):
    return x ** 3

result, _ = quad(integrand, 0, 1)
print('Integral result:', result)
Question 3: Linear Algebra

Solve the system of equations 3x + 2y = 5 and x + 4y = 6 using SciPy.

Solution:

from scipy.linalg import solve
import numpy as np

A = np.array([[3, 2], [1, 4]])
b = np.array([5, 6])
solution = solve(A, b)
print('Solution:', solution)

1.6 Conclusion
This chapter provides an introduction to data analysis using Python's
powerful libraries, specifically Pandas, NumPy, and SciPy. Students and
practitioners can use these libraries for data management, numerical
computation, data analysis, and scientific computing to solve practical
problems.
Chapter 2 Pandas Library
2.1 Introduction to Pandas Library
Pandas is an open-source library of tools for analyzing data in Python,
widely used in data science and AI. It delivers simple but highly
effective data structures, such as DataFrames and Series, designed for
analyzing structured data such as tables, time series, and large
datasets. Pandas works smoothly alongside other Python libraries,
including NumPy, Matplotlib, and Scikit-learn, which is one reason it
is so widely adopted by data professionals.

2.1.1 Why Use Pandas?

Pandas simplifies data handling and analysis through its intuitive API
and powerful functionality (a short sketch of a typical workflow
follows this list):

• Data Loading: easily load data from sources such as CSV files,
Microsoft Excel, MS SQL, MySQL, PostgreSQL, and JSON.

• Data Cleaning: impute missing data, identify and remove duplicates,
and convert data types.

• Data Transformation: join or merge datasets, and reshape or pivot
data into a more convenient layout.

• Data Analysis: perform exploratory data analysis with descriptive
statistics, group-by operations, and aggregations.

• Data Visualization: plot data directly through the Matplotlib and
Seaborn integrations.
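As a compact, hedged illustration of that workflow, the sketch below
runs a typical load–clean–summarize pass; the file name sales.csv and
the columns region, units, and price are placeholders invented for this
example:

import pandas as pd

# Load (hypothetical file with columns: region, units, price)
df = pd.read_csv("sales.csv")

# Clean: drop exact duplicates and fill missing unit counts with 0
df = df.drop_duplicates()
df["units"] = df["units"].fillna(0)

# Transform and analyze: revenue per region
df["revenue"] = df["units"] * df["price"]
summary = df.groupby("region")["revenue"].sum()

# Visualize through the Matplotlib integration (requires Matplotlib)
summary.plot(kind="bar", title="Revenue by region")
print(summary)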

2.1.2 Installation

To install Pandas, you can use the following command in your


Python environment:
pip install pandas

2.2 Creating DataFrames and Series


2.2.1 Creating DataFrames

A DataFrame is a table-like structure in which data is arranged in
labeled rows and columns. It is similar to a table in SQL or a
spreadsheet in Excel, and it is one of the fundamental data structures
in Pandas.

Example 1: Creating a DataFrame from a Dictionary


import pandas as pd

# Creating a DataFrame from a dictionary of lists
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago']
}

df = pd.DataFrame(data)
print(df)
Explanation:

• In this example, a DataFrame is built from a dictionary in which


the keys are names of columns and the values are lists of data in the
corresponding columns.

• This method is particularly useful when data is already available in


the form of Python dictionaries which is quite common when the
data is obtained or preprocessed through scripts.
• This means that each key-value pair in the dictionary will become a
column in the DataFrame, thus enabling easy construction and
modification of tabular data.

Example 2: Creating a DataFrame from a List of Dictionaries


# Creating a DataFrame from a list of dictionaries
data = [
    {'Name': 'Alice', 'Age': 25, 'City': 'New York'},
    {'Name': 'Bob', 'Age': 30, 'City': 'Los Angeles'},
    {'Name': 'Charlie', 'Age': 35, 'City': 'Chicago'}
]

df = pd.DataFrame(data)
print(df)


Explanation:
• This approach is useful when each record can naturally be modeled as
a dictionary.

• Every element of the list is a dictionary, and each dictionary
becomes one row of the DataFrame.

• Pandas automatically aligns the dictionary keys to form the column
names. This makes the method very flexible for data entry and
management, especially with large or more complicated datasets.

Example 3: Creating DataFrames from CSV Files


# Reading a DataFrame from a CSV file
df = pd.read_csv('data.csv')
print(df.head())  # Displays the first 5 rows
Explanation:

• Pandas can load data from various file formats, with CSV being one of the most common due to its simplicity and widespread use.

• The read_csv() function reads data from a CSV file into a DataFrame, allowing for easy data manipulation and analysis.

• The head() function is useful for quickly inspecting the structure of the loaded data by displaying the first few rows.

Example 4: Creating DataFrames from Excel Files


# Reading a DataFrame from an Excel file
df = pd.read_excel('data.xlsx', sheet_name='Sheet1')
print(df.head())
Explanation:

• The read_excel() function is another way of loading data into Pandas; it reads data from Excel workbooks.

• The sheet_name parameter lets you specify which sheet to read when the source workbook contains several sheets.

• This is especially convenient when working with financial or reporting data that is maintained in MS Excel.

2.2.2 Creating Series

A Series is a one-dimensional array-like object containing a sequence


of values and an associated array of data labels, called the index.

Example 1: Creating a Series from a List


# Creating a Series from a list
ages = pd.Series([25, 30, 35], name='Age')
print(ages)

Explanation:

• A Series can be created from a Python list, where the elements of the list become the values of the Series.

• The name parameter is optional; it labels the Series for easier identification when several Series are in use.

• By default, the index is an integer sequence starting at zero, and it can be changed if necessary.

Example 2: Creating a Series from a Dictionary


# Creating a Series from a dictionary
city_populations = pd.Series(
    {'New York': 8419000, 'Los Angeles': 3980000, 'Chicago': 2716000},
    name='Population'
)
print(city_populations)

Explanation:

• When a Series is created from a dictionary, the keys become the index and the values become the data.

• This approach is useful for creating Series with meaningful labels, such as city names or other categorical identifiers.

• Series created this way can be easily aligned with other Series or DataFrames based on the index.

Example 3: Creating a Series with a Custom Index


# Creating a Series with a custom index
temperatures = pd.Series(
    [70, 65, 80],
    index=['New York', 'Los Angeles', 'Chicago'],
    name='Temperature'
)
print(temperatures)

Explanation:

• The index parameter allows you to specify custom labels for each
element in the Series, providing greater flexibility in data selection
and alignment.

• Custom indices are particularly useful when aligning Series with other data structures or when the index itself carries significant meaning, such as timestamps or categorical labels.
2.3 Basic DataFrame Operations
It is important to have a good understanding of fundamental
DataFrame operations in order to perform data manipulations.

2.3.1 Viewing and Inspecting DataFrames

Previewing and exploring the data is usually the first step in data
analysis; it gives an overall picture of the structure and contents of
the dataset.

Example 1: Inspecting the Structure of a DataFrame


# Displaying the first few rows
print(df.head())   # By default, shows the first 5 rows
# Displaying the last few rows
print(df.tail(3))  # Shows the last 3 rows
# Displaying the DataFrame's structure
print(df.info())
# Displaying basic statistics of numerical columns
print(df.describe())

Explanation:

• The head() method shows the first rows of the DataFrame, while tail() shows the last rows, which is convenient for large datasets.

• The info() method summarizes the DataFrame: number of rows, number of columns, non-null counts, and data types, which helps spot problems such as missing or wrongly typed data.

• The describe() method computes simple statistics such as the mean, standard deviation, and percentiles for numerical columns, giving a general idea of the distribution and dispersion of values.
Example 2: Accessing Columns and Rows
# Accessing a single column
ages = df['Age']
print(ages)

# Accessing multiple columns
subset = df[['Name', 'City']]
print(subset)

# Accessing rows by index
first_row = df.iloc[0]
print(first_row)

# Accessing rows by label
first_row_label = df.loc[0]
print(first_row_label)

Explanation:

• A single column is selected with bracket notation using the column name, and multiple columns are selected by passing a list of column names.

• Rows can be accessed by position with iloc and by label with loc, which is useful when selecting data by row names or custom indices.

Example 3: Modifying DataFrames


# Adding a new column
df['Salary'] = [50000, 60000, 70000]
# Modifying an existing value
df.loc[0, 'Age'] = 26  # Changing Alice's age
# Dropping a column
df.drop('Salary', axis=1, inplace=True)
# Dropping a row
df.drop(0, axis=0, inplace=True)  # Drops the first row
print(df)
Explanation:

• New columns are added by assigning a list or a Series to a new column name, which is commonly used to enrich a dataset with additional features or calculations.

• Individual values are corrected or updated with loc, referring to the row and column labels.

• Columns and rows are deleted with the drop() method; axis=1 drops a column and axis=0 drops a row.

• Passing inplace=True applies the operation to the existing DataFrame instead of returning a modified copy.
Example 4: Renaming Columns
# Renaming columns
df.rename(columns={'Age': 'Years', 'City': 'Location'}, inplace=True)
print(df)
Explanation:

• The rename() method renames columns, which is useful when a dataset needs more descriptive column names.

• This operation is especially handy when standardizing column names across several datasets or conforming to naming conventions required for reporting and analysis.

2.3.2 Sorting and Filtering Data

Sorting and filtering are basic operations required in the investigation


of datasets.
Example 1: Sorting DataFrames
# Sorting by a single column
df_sorted = df.sort_values(by='Age')

# Sorting by multiple columns
df_sorted_multi = df.sort_values(by=['Location', 'Age'], ascending=[True, False])

print(df_sorted)
print(df_sorted_multi)
Explanation:
• The sort_values() method sorts the DataFrame by one or several columns.

• Sorting can be ascending or descending, controlled by the ascending parameter, which accepts a list when sorting by multiple columns.

• Sorting is useful for arranging the data into a meaningful order before analysis and presentation.

Example 2: Filtering DataFrames


# Filtering rows where Age is greater than 30
age_above_30 = df[df['Age'] > 30]

# Filtering rows based on string values
location_filter = df[df['Location'].str.contains('New')]

print(age_above_30)
print(location_filter)
Explanation:

• Filtering is achieved by applying conditions directly to the DataFrame, resulting in a Boolean mask that selects rows meeting the criteria.

• String-based filtering can use methods like str.contains() to match patterns within text columns, which is useful for selecting subsets of data based on partial matches.

Example 3: Advanced Filtering with Multiple Conditions


# Filtering with multiple conditions using 'and' and 'or'
filtered_df = df[(df['Age'] > 25) & (df['Location'] == 'New York')]
# Filtering with a condition on index labels
filtered_by_index = df.loc[df.index > 0]

print(filtered_df)
print(filtered_by_index)

Explanation:

• Multiple conditions can be combined using logical operators such


as & (and), | (or), and ~ (not).
• Parentheses are required around each condition to ensure proper
evaluation, and indexing methods such as loc can also be used for
filtering based on index labels.

Example 4: Conditional Assignment


# Creating a new column based on a condition
df['Age Group'] = ['Youth' if age < 30 else 'Adult' for age in df['Age']]

# Using np.where for conditional assignment
import numpy as np

df['Status'] = np.where(df['Age'] > 30, 'Senior', 'Junior')
print(df)

Explanation:

• Conditional assignment creates new columns from logical conditions, which is helpful for segmenting data.

• Vectorized functions such as NumPy's np.where() perform conditional assignment much faster than Python loops on large datasets.

2.4 Handling Missing Data


Dealing with missing data is one of the important stages in data
preprocessing since missing values can distort the results of
statistical models and data visualization.

2.4.1 Detecting Missing Data

Example 1: Identifying Missing Data


# Creating a DataFrame with missing values
data = {
    'Name': ['Alice', 'Bob', None],
    'Age': [25, None, 35],
    'Location': ['New York', 'Los Angeles', None]
}
df = pd.DataFrame(data)

# Checking for missing values
print(df.isnull())

# Counting missing values in each column
print(df.isnull().sum())

# Checking for missing values in a specific column
missing_ages = df['Age'].isnull()
print(missing_ages)

Explanation:

• The isnull() function returns a DataFrame of the same shape as the input, containing True where data is missing, which makes the missing values easy to inspect.

• Counting the missing values per column shows how much data is missing in each column, so an appropriate handling strategy can be chosen.

• Checking specific columns narrows down which parts of the dataset need further attention for cleaning or imputation.

2.4.2 Handling Missing Data

Example 1: Filling Missing Data


# Filling missing values with a default value
df['Age'] = df['Age'].fillna(df['Age'].mean())  # Filling with the mean of the column
df['Name'] = df['Name'].fillna('Unknown')       # Filling with a default string

# Filling with forward fill and backward fill
df['Location'] = df['Location'].ffill()  # Forward fill
df['Location'] = df['Location'].bfill()  # Backward fill

print(df)
Explanation:

• fillna() replaces missing values with a chosen value, such as the mean, median, or mode of the column.

• Forward fill (ffill) and backward fill (bfill) replace missing values with the previous or next non-missing value, respectively.

• These techniques are especially appropriate for time series data, where trends and continuity prevail.

Example 2: Dropping Missing Data

# Dropping rows with any missing values
df_cleaned = df.dropna()

# Dropping rows with missing values in a specific column
df_cleaned_specific = df.dropna(subset=['Location'])

# Dropping columns with missing values
df_cleaned_columns = df.dropna(axis=1)

print(df_cleaned)
print(df_cleaned_specific)
print(df_cleaned_columns)

Explanation:

• Dropping rows or columns that contain missing values (complete-case analysis) is a simple technique, appropriate when the amount of missing data is small or unimportant.

• The dropna() function can operate on the whole DataFrame or be restricted to specific columns or rows, making the process flexible.

• A dataset can sometimes be simplified by dropping columns in which a high percentage of values are missing.

Example 3: Imputing Missing Data with Custom Functions


# Custom function to impute missing values
def impute_age(row):
    if pd.isnull(row['Age']):
        return df['Age'].mean()  # Use the mean as a placeholder
    else:
        return row['Age']

# Applying the custom function using apply
df['Age'] = df.apply(impute_age, axis=1)
print(df)
Explanation:

• Custom functions offer more flexibility than the built-in fill methods, since the imputation logic can incorporate domain knowledge or additional calculations.

• The apply() method runs the custom function on each row (axis=1) or column, so the imputed value can depend on other values in the same row or column.

2.5 Data Aggregation and Grouping


Grouping and aggregation help in handling large datasets by splitting
the data into groups and applying aggregate functions to each group.

2.5.1 Grouping and Aggregation

Example 1: Grouping Data


# Grouping by a single column and calculating the mean
grouped = df.groupby('Location')['Age'].mean()

# Grouping by multiple columns and counting occurrences
grouped_count = df.groupby(['Location', 'Age']).size()

print(grouped)
print(grouped_count)
Explanation:

• The groupby() method splits the data into groups based on the values of one or more columns.

• Aggregation methods such as mean(), sum(), and count() can then be applied to the grouped data to obtain summary information.

• Grouping by multiple variables is especially useful for large datasets, since it enables multi-level analysis and deeper insight into how variables relate.

Example 2: Multiple Aggregations


# Grouping by location and performing multiple aggregations
grouped = df.groupby('Location').agg({
    'Age': ['mean', 'max', 'min'],
    'Salary': ['sum', 'mean']
})

print(grouped)
Explanation:

• The agg() function applies more than one aggregation function to the grouped data, producing a richer summary.

• Aggregations can be specified per column, so each column can be summarized in its own way.

• This is especially useful in analysis and reporting where several metrics are needed at once.

Example 3: Pivot Tables


# Creating a pivot table
pivot = df.pivot_table(values='Salary', index='Location', columns='Name', aggfunc='sum')

# Adding margins for grand totals
pivot_with_totals = df.pivot_table(values='Salary', index='Location', columns='Name',
                                   aggfunc='sum', margins=True)

print(pivot)
print(pivot_with_totals)
Explanation:

• Pivot tables in Pandas are analogous to those in spreadsheets, presenting aggregated data organized by rows and columns.

• Setting margins=True adds grand totals, which is useful for seeing the overall sums alongside the breakdown.

• Pivot tables help rearrange data and analyze relationships between two or more variables, making complex data easier to understand.

Example 4: Cross Tabulations


# Creating a cross-tabulation to compare two columns
cross_tab = pd.crosstab(df['Location'], df['Age Group'])

print(cross_tab)
Explanation:

• Cross-tabulations are a means of generating contingency tables of


two categorical variables and enable an understanding of the
association or distribution of these variables.

• This technique is very common in statistical analysis and reporting


and it is very useful especially in the marketing and social science
fields.

2.6 Merging and Joining DataFrames


Merging and joining DataFrames allows you to combine datasets in
various ways, which is essential for data integration and analysis.

2.6.1 Merging DataFrames

Example 1: Merging DataFrames on a Common Column


# Creating two DataFrames
df1 = pd.DataFrame({
    'Name': ['Alice', 'Bob'],
    'Age': [25, 30]
})

df2 = pd.DataFrame({
    'Name': ['Alice', 'Bob'],
    'Salary': [50000, 60000]
})

# Merging on the 'Name' column
merged_df = pd.merge(df1, df2, on='Name')
print(merged_df)
Explanation:
• The merge() function joins two DataFrames on a common column or index.

• This is especially useful for related datasets, for example joining sales data with customer data or combining data from different sources.

• Merging consolidates multiple datasets into one, enabling analysis across all of the combined data.
Example 2: Different Types of Joins
# Left join
left_join = pd.merge(df1, df2, on='Name', how='left')
# Right join
right_join = pd.merge(df1, df2, on='Name', how='right')
# Outer join
outer_join = pd.merge(df1, df2, on='Name', how='outer')
# Inner join
inner_join = pd.merge(df1, df2, on='Name', how='inner')

print(left_join)
print(right_join)
print(outer_join)
print(inner_join)

Explanation:
• Different types of joins control how DataFrames are combined:

– Left Join: Includes all rows from the left DataFrame and matched
rows from the right.

– Right Join: Includes all rows from the right DataFrame and
matched rows from the left.

– Outer Join: Includes all rows from both DataFrames, matching


where possible and filling with NaNs where not.

– Inner Join: Includes only rows with matching values in both


DataFrames.
• Joins are fundamental operations for combining data from multiple
sources, enabling comprehensive data analysis and reporting.

Example 3: Concatenating DataFrames


# Creating DataFrames to concatenate
df1 = pd.DataFrame({
    'Name': ['Alice', 'Bob'],
    'Age': [25, 30]
})

df2 = pd.DataFrame({
    'Name': ['Charlie', 'David'],
    'Age': [35, 40]
})

# Concatenating along rows (default)
concat_df = pd.concat([df1, df2])
# Concatenating along columns
concat_df_columns = pd.concat([df1, df2], axis=1)

print(concat_df)
print(concat_df_columns)

Explanation:

• The concat() function joins DataFrames along a chosen axis (rows or columns), appending data below or beside the existing data.

• This is useful for adding more rows or for joining DataFrames with different sets of columns. Concatenation is typical when data of a similar format are combined, for example data for different objects or from different time points.

Example 4: Joining on Index

# Creating DataFrames with indices
df1 = pd.DataFrame({
    'Age': [25, 30]
}, index=['Alice', 'Bob'])

df2 = pd.DataFrame({
    'Salary': [50000, 60000]
}, index=['Alice', 'Bob'])

# Joining DataFrames on their indices
joined_df = df1.join(df2)

print(joined_df)
Explanation:
• Joining on indices combines DataFrames based on their row labels, which is beneficial when the index carries meaning.

• This method provides an easy way to integrate data when column-based joins are not possible and the indices are an important part of the data structure.

2.7 Working with Dates and Times


Handling dates and times correctly is important, since they are central
to time series analysis.

2.7.1 Handling DateTime Data

Example 1: Converting Strings to DateTime Objects


# Creating a DataFrame with date data
data = {
    'Date': ['2023-01-01', '2023-01-02', '2023-01-03'],
    'Value': [100, 150, 200]
}
df = pd.DataFrame(data)

# Converting strings to DateTime objects
df['Date'] = pd.to_datetime(df['Date'])

# Extracting components
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month
df['Day'] = df['Date'].dt.day

print(df)

Explanation:

• The pd.to_datetime() function converts date strings to the DateTime type, so date and time operations can be performed on them.

• Components such as year, month, or day can be extracted with the dt accessor, which is helpful for sorting or categorizing data by time.

• Using DateTime objects makes time series manipulation easier and time-based calculations more reliable.
Example 2: DateTime Indexing and Resampling
# Setting the ’ Date ’ column as the index df . set_index ( ’ Date ’ , inplace = True )

# Resampling data to calculate monthly → averages


monthly_avg = df . resample ( ’M ’). mean ()

# Resampling data to calculate weekly sums weekly_sum = df . resample ( ’W ’). sum ()


print ( monthly_avg )
print ( weekly_sum )
Explanation:

• Setting a DateTime column as the index enables a number of time-based operations, such as resampling.

• Resampling changes the frequency of the data, which is essential for aggregating a time series (for example, averaging values per month or per year).

• Time-based resampling is important for summarizing time series data, forecasting, and detecting trends.

Example 3: Calculating Time Differences


# Creating a DataFrame with start and end times
data = {
    'Start': ['2023-01-01 08:00:00', '2023-01-01 09:30:00'],
    'End': ['2023-01-01 08:30:00', '2023-01-01 10:00:00']
}
df = pd.DataFrame(data)

# Converting strings to DateTime objects
df['Start'] = pd.to_datetime(df['Start'])
df['End'] = pd.to_datetime(df['End'])

# Calculating time differences
df['Duration'] = df['End'] - df['Start']

# Calculating duration in minutes
df['Duration in Minutes'] = df['Duration'].dt.total_seconds() / 60

print(df)
Explanation:

• DateTime arithmetic, such as adding time deltas and taking differences between two timestamps, is central to time series analysis.

• Many applications, for example event timing, scheduling, and measuring elapsed time between two events, require calculating durations or intervals.

• Working with timed data reveals useful time-based measures, such as processing time, response rate, and temporal distribution.

Example 4: Shifting and Lagging Data

# Creating a time series DataFrame
data = {
    'Date': pd.date_range(start='2023-01-01', periods=5, freq='D'),
    'Value': [100, 150, 200, 250, 300]
}
df = pd.DataFrame(data)
df.set_index('Date', inplace=True)

# Shifting data forward by one day
df['Shifted'] = df['Value'].shift(1)

# Calculating the difference between current and previous day
df['Difference'] = df['Value'] - df['Shifted']

print(df)
Explanation:

• With the help of shift() method, data can be shifted forward or


backward by a given number of periods which is quite helpful in
analyzing the lagged or lead data in a time series.

• Subtracting the shifted values enables one to assess changes that


occur at different time intervals including daily fluctuations, trends or
seasonal patterns.

• These methods are quite useful in time series analysis, financial


modeling and in any other application relating to sequential
information.

2.8 Case Studies and Practical Examples


The following case studies provide practical applications of Pandas,
demonstrating how it can be used to solve real-world problems.
2.8.1 Case Study 1: Analyzing Sales Data

Problem: You have a dataset containing sales data for a company,


including columns for the date of sale, item sold, and sales amount.
Your task is to analyze the sales trends over time and identify the
top-performing items.

Steps:
1. Load the sales data from a CSV file.

2. Convert the date column to DateTime format and set it as the


index.

3. Calculate monthly sales totals.


4. Identify the top 5 items by sales amount.
5. Visualize the sales trends over time.

Solution:
import pandas as pd
import matplotlib.pyplot as plt

# Load the sales data
sales_data = pd.read_csv('sales_data.csv')

# Convert the 'Date' column to DateTime and set as index
sales_data['Date'] = pd.to_datetime(sales_data['Date'])
sales_data.set_index('Date', inplace=True)

# Calculate monthly sales totals
monthly_sales = sales_data.resample('M')['Amount'].sum()

# Identify the top 5 items by sales amount
top_items = sales_data.groupby('Item')['Amount'].sum().sort_values(ascending=False).head(5)

# Plotting the sales trends
plt.figure(figsize=(10, 6))
plt.plot(monthly_sales, marker='o')
plt.title('Monthly Sales Trends')
plt.xlabel('Month')
plt.ylabel('Total Sales Amount')
plt.grid(True)
plt.show()

print(monthly_sales)
print(top_items)
Explanation:

• Resampling the sales data makes it easier to analyze trends over different time frames, such as monthly or quarterly.

• Grouping by item makes it easy to determine the best performers, which can inform stock and promotional strategies.

• Plotting the sales trends makes it easier to identify seasonal fluctuations in sales activity.

2.8.2 Case Study 2: Customer Segmentation

Problem: A retail company wants to segment its customers based on


their purchase history. You have data on customer purchases,
including customer ID, purchase date, and amount. Your goal is to
create customer segments based on their total spend and frequency
of purchases.

Steps:
1. Load the customer data from a CSV file.
2. Aggregate the data by customer ID to calculate total spend and
purchase frequency.
3. Create customer segments based on spend and frequency.
4. Analyze the characteristics of each segment.

Solution:

import pandas as pd

# Load the customer data
customer_data = pd.read_csv('customer_data.csv')

# Aggregate by customer ID to calculate total spend and frequency
customer_summary = customer_data.groupby('CustomerID').agg({
    'Amount': 'sum',
    'PurchaseDate': 'count'
}).rename(columns={'Amount': 'TotalSpend', 'PurchaseDate': 'Frequency'})

# Create customer segments based on spend
customer_summary['Segment'] = pd.cut(
    customer_summary['TotalSpend'],
    bins=[0, 100, 500, 1000, 5000],
    labels=['Low', 'Medium', 'High', 'Very High']
)

# Analyzing segment characteristics
segment_analysis = customer_summary.groupby('Segment').agg({
    'TotalSpend': ['mean', 'count'],
    'Frequency': 'mean'
})

print(customer_summary)
print(segment_analysis)
Explanation:

• Aggregating data by customer allows for analysis at the individual level, helping identify high-value customers.

• Segmentation can be used to tailor marketing strategies, improve customer service, and optimize resource allocation.

• Analyzing the characteristics of each segment provides insights into customer behavior, enabling data-driven decision-making.

2.8.3 Case Study 3: Time Series Forecasting

Problem: You have a dataset containing daily stock prices for a


company, including the date and closing price. Your goal is to create
a forecast model to predict future stock prices based on historical
data.

Steps:
1. Load the stock price data from a CSV file.
2. Convert the date column to DateTime format and set it as the index.
3. Use a simple moving average to forecast future prices.
4. Evaluate the forecast accuracy using historical data.
Solution:
import pandas as pd

# Load the stock price data
stock_data = pd.read_csv('stock_data.csv')

# Convert the 'Date' column to DateTime and set as index
stock_data['Date'] = pd.to_datetime(stock_data['Date'])
stock_data.set_index('Date', inplace=True)

# Calculate the 7-day moving average
stock_data['7-Day MA'] = stock_data['Close'].rolling(window=7).mean()

# Forecast future prices based on the moving average
forecast = stock_data['7-Day MA'].shift(-7)  # Shift by 7 days to simulate forecasting

# Evaluating forecast accuracy
stock_data['Error'] = stock_data['Close'] - forecast

print(stock_data)
print(forecast)
Explanation:

• Moving averages smooth out short-term fluctuations and highlight longer-term trends in the data.

• Forecasting using historical data can help in making informed decisions in financial markets, supply chain management, and other areas.

• Evaluating forecast accuracy by comparing the forecast to actual values helps in assessing the reliability of the forecasting method.

2.9 Practice Questions and Solutions


2.9.1 Practice Questions on Pandas

Question 1: Creating DataFrames


Create a DataFrame from the following data and
display the first 3 rows:

• Names: John, Jane, Jim, Jack


• Ages: 28, 34, 29, 40
• Cities: Boston, Austin, Denver, Seattle
Solution:
import pandas as pd

data = {
    'Name': ['John', 'Jane', 'Jim', 'Jack'],
    'Age': [28, 34, 29, 40],
    'City': ['Boston', 'Austin', 'Denver', 'Seattle']
}
df = pd.DataFrame(data)
print(df.head(3))
Question 2: Data Selection and Filtering

Using the DataFrame created in Question 1, filter and display rows


where Age is greater than 30.
Solution:
filtered_df = df[df['Age'] > 30]
print(filtered_df)

Question 3: Handling Missing Data Create a DataFrame with missing


data using the following dictionary, then fill missing values in the
’Age’ column with the mean value.

• Names: John, Jane, None, Jack


• Ages: 28, None, 29, 40
• Cities: Boston, None, Denver, Seattle

Solution:
data = {
    'Name': ['John', 'Jane', None, 'Jack'],
    'Age': [28, None, 29, 40],
    'City': ['Boston', None, 'Denver', 'Seattle']
}
df = pd.DataFrame(data)
df['Age'] = df['Age'].fillna(df['Age'].mean())
print(df)
Question 4: Aggregation and Grouping
Using the DataFrame created in Question 1, group by ’City’ and
calculate the mean age.
Solution:
grouped = df.groupby('City')['Age'].mean()
print(grouped)

Question 5: Merging DataFrames


Create another DataFrame with the following data and merge it with
the DataFrame from Question 1 on the ’Name’ column:

• Names: John, Jane, Jim, Jack


• Salaries: 60000, 65000, 70000, 75000
Solution:
data2 = {
    'Name': ['John', 'Jane', 'Jim', 'Jack'],
    'Salary': [60000, 65000, 70000, 75000]
}

df2 = pd.DataFrame(data2)

merged_df = pd.merge(df, df2, on='Name')
print(merged_df)


2.10 Conclusion
This chapter provides an in-depth introduction to the Pandas library
for data analysis in Python. Through detailed explanations,
examples, and practical applications, students and practitioners can
learn to effectively handle, manipulate, and analyze data, perform
numerical computations, and apply advanced data transformation
techniques using Pandas. The practice questions reinforce the
concepts covered, ensuring a comprehensive understanding of
Pandas’ capabilities.
Chapter 3 Scikit-Learn Library
3.1 Introduction to Scikit-Learn
Scikit-learn is a versatile open-source Python library that provides
simple and efficient tools for data mining and machine learning. Built
on NumPy, SciPy, and Matplotlib, Scikit-learn integrates seamlessly
with other Python libraries, making it an essential tool for developing
machine learning models. It offers a wide range of supervised and
unsupervised learning algorithms, model evaluation metrics,
preprocessing tools, and hyperparameter tuning techniques.

3.1.1 Why Use Scikit-Learn?

Scikit-learn is popular due to its ease of use, flexibility, and


comprehensive set of features:

• Wide Range of Algorithms: Includes algorithms for classification,


regression, clustering, dimensionality reduction, and more.

• Model Evaluation: Provides metrics and tools for evaluating the


performance of machine learning models, such as cross-validation
and various scoring methods.

• Preprocessing: Includes tools for data preprocessing such as scaling, normalization, handling missing values, and encoding categorical variables (a brief sketch follows this list).

• Parameter Tuning: Offers techniques like Grid Search and Random Search for hyperparameter optimization, allowing for the fine-tuning of model performance.

• Integration: Works seamlessly with other libraries like Pandas and Matplotlib for data manipulation and visualization, enhancing the end-to-end machine learning workflow.
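
Since the preprocessing tools are not demonstrated elsewhere in this chapter, here is a minimal sketch, using a small invented dataset, of scaling numeric features and one-hot encoding a categorical column with Scikit-learn's preprocessing module:

from sklearn.preprocessing import StandardScaler, OneHotEncoder
import numpy as np

# Invented toy data: two numeric columns and one categorical column
X_num = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])
X_cat = np.array([['red'], ['blue'], ['red']])

# Scale numeric features to zero mean and unit variance
scaler = StandardScaler()
X_num_scaled = scaler.fit_transform(X_num)

# One-hot encode the categorical column (toarray() converts the sparse result)
encoder = OneHotEncoder()
X_cat_encoded = encoder.fit_transform(X_cat).toarray()

print(X_num_scaled)
print(X_cat_encoded)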

3.1.2 Installation

To install Scikit-learn, use the following command in your Python


environment:
pip install scikit-learn

3.2 Supervised Learning


Supervised learning involves training a model on labeled data, where
the algorithm learns the mapping between input features and output
labels. This section covers some common supervised learning
algorithms in Scikit-learn, including detailed examples and
explanations.

3.2.1 Linear Regression

Linear Regression is a fundamental algorithm used for predicting


continuous target variables. It assumes a linear relationship between
the input features and the target variable.

Example 1: Simple Linear Regression


from sklearn.linear_model import LinearRegression
import numpy as np

# Sample data
X = np.array([[1], [2], [3], [4], [5]])  # Input features
y = np.array([2, 4, 5, 4, 5])            # Target variable

# Creating and fitting the model
model = LinearRegression()
model.fit(X, y)

# Making predictions
predictions = model.predict(np.array([[6]]))
print("Predicted value for input 6:", predictions)
Explanation:
• Linear Regression fits a line to the data that minimizes the sum of squared residuals between the predicted and actual values.

• The fit() method is used to train the model on the provided data, while the predict() method is used to make predictions on new data.

• The model learns the slope (coefficient) and intercept of the line that best fits the data, making it suitable for predicting continuous outcomes; these learned parameters can be inspected as shown below.
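
Continuing the example above, the learned parameters are exposed as attributes of the fitted model:

# Inspecting the parameters learned by the fitted model above
print("Slope (coefficient):", model.coef_)
print("Intercept:", model.intercept_)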

Example 2: Evaluating Linear Regression

from sklearn.metrics import mean_squared_error, r2_score

# Making predictions on the training data
predictions = model.predict(X)

# Calculating evaluation metrics
mse = mean_squared_error(y, predictions)
r2 = r2_score(y, predictions)

print("Mean Squared Error:", mse)
print("R-squared:", r2)
Explanation:

• Mean Squared Error (MSE): Measures the average of the squared


differences between predicted and actual values. Lower values
indicate better performance.

• R-squared: Indicates the proportion of the variance in the target


variable that is predictable from the input features. Values closer to
1 indicate a better fit.

• These metrics help evaluate how well the linear regression model captures the underlying relationship between the features and the target variable; their standard formulas are shown below.
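
For reference, with $y_i$ the actual values, $\hat{y}_i$ the predictions, and $\bar{y}$ the mean of the actual values, the two metrics are defined as

\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2,
\qquad
R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}.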

Example 3: Multiple Linear Regression

# Sample data with multiple features
X = np.array([[1, 2], [2, 3], [3, 4], [4, 5], [5, 6]])
y = np.array([3, 5, 7, 9, 11])

# Creating and fitting the model
model = LinearRegression()
model.fit(X, y)

# Making predictions
predictions = model.predict(np.array([[6, 7]]))
print("Predicted value for input [6, 7]:", predictions)
Explanation:

• Multiple Linear Regression extends Simple Linear Regression by


considering multiple input features, making it suitable for more
complex datasets.

• The model learns the coefficients for each feature, allowing it to


make predictions based on multiple inputs, thus capturing more
complex relationships.

• This approach is commonly used in scenarios where the target


variable is influenced by several factors, such as predicting house
prices based on size, location, and other features.

3.2.2 Logistic Regression

Logistic Regression is used for binary classification problems, where


the goal is to predict one of two possible classes. It estimates the
probability of a class label using the logistic function, which outputs
values between 0 and 1.

Example 1: Binary Classification with Logistic Regression


from sklearn.linear_model import LogisticRegression

# Sample data
X = np.array([[1], [2], [3], [4], [5], [6]])  # Input features
y = np.array([0, 0, 0, 1, 1, 1])              # Target labels

# Creating and fitting the model
model = LogisticRegression()
model.fit(X, y)

# Making predictions
predictions = model.predict(np.array([[3.5]]))
print("Predicted class for input 3.5:", predictions)
Explanation:
• Logistic Regression outputs probabilities for each class, and the class with the highest probability is chosen as the prediction.

• The model uses the logistic function (sigmoid), shown below, to map predicted values to probabilities, making it suitable for classification tasks.

• It is particularly useful for binary classification problems, such as spam detection, fraud detection, and disease diagnosis.
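
For reference, the logistic (sigmoid) function maps any real-valued score $z$ (a linear combination of the features) to a value between 0 and 1, which is interpreted as a class probability:

\sigma(z) = \frac{1}{1 + e^{-z}},
\qquad
P(y = 1 \mid x) = \sigma(w^{\top}x + b).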

Example 2: Evaluating Logistic Regression


from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Making predictions on the training data
predictions = model.predict(X)

# Calculating evaluation metrics
accuracy = accuracy_score(y, predictions)
conf_matrix = confusion_matrix(y, predictions)
class_report = classification_report(y, predictions)

print("Accuracy:", accuracy)
print("Confusion Matrix:\n", conf_matrix)
print("Classification Report:\n", class_report)
Explanation:

• Accuracy: Measures the proportion of correct predictions out of all


predictions made. Higher values indicate better performance.

• Confusion Matrix: Provides a summary of the prediction results,


showing the counts of true positive, true negative, false positive, and
false negative predictions.

• Classification Report: Provides detailed metrics including precision,


recall, and F1-score for each class, offering a more comprehensive
evaluation of the model’s performance.
• These metrics are crucial for assessing the effectiveness of
classification models, especially in applications where class
distribution is imbalanced.

Example 3: Multiclass Classification

# Sample data for multiclass classification
X = np.array([[1], [2], [3], [4], [5], [6]])
y = np.array([0, 1, 2, 0, 1, 2])

# Creating and fitting the model with the multiclass option
model = LogisticRegression(multi_class='multinomial', solver='lbfgs')
model.fit(X, y)

# Making predictions
predictions = model.predict(np.array([[3]]))
print("Predicted class for input 3:", predictions)
Explanation:

• Logistic Regression can be extended to multiclass problems using


the ’multinomial’ option, allowing for predictions across multiple
classes.

• The solver parameter specifies the optimization algorithm used to


fit the model, with ’lbfgs’ being a popular choice for multiclass
problems due to its efficiency in handling large datasets and multiple
classes.

• This approach is useful in scenarios where the target variable can


take more than two possible classes, such as predicting the type of
flower species or classifying handwritten digits.

3.2.3 Decision Trees

Decision Trees are a versatile machine learning algorithm capable of


performing both classification and regression tasks. They work by
recursively splitting the data into subsets based on the value of input
features, creating a tree-like model of decisions.
Example 1: Decision Tree for Classification

from sklearn.tree import DecisionTreeClassifier

# Sample data
X = np.array([[1], [2], [3], [4], [5], [6]])
y = np.array([0, 0, 1, 1, 1, 0])

# Creating and fitting the Decision Tree model
model = DecisionTreeClassifier(random_state=0)
model.fit(X, y)

# Making predictions
predictions = model.predict(np.array([[3.5]]))
print("Predicted class for input 3.5:", predictions)
Explanation:

• Decision Trees split the data based on feature values that best
separate the target classes, continuing recursively to create a tree
structure.

• Each node in the tree represents a decision point, and each leaf
node represents a class label.

• Decision Trees are easy to interpret and visualize, making them


useful for understanding the decision-making process of the model.
Example 2: Visualizing Decision Trees
from sklearn.tree import plot_tree
import matplotlib.pyplot as plt

# Visualizing the Decision Tree
plt.figure(figsize=(10, 6))
plot_tree(model, filled=True, feature_names=['Feature'], class_names=['Class 0', 'Class 1'])
plt.title("Decision Tree Visualization")
plt.show()

Explanation:

• Visualizing Decision Trees helps in understanding the decision-making process of the model by showing the sequence of splits based on feature values.

• Each node in the tree diagram contains information about the feature used for the split, the threshold value, and the distribution of samples among the classes.

• This visualization is particularly valuable in scenarios where model interpretability is crucial, such as in medical decision-making or credit scoring.

Example 3: Decision Tree for Regression

from sklearn.tree import DecisionTreeRegressor

# Sample data
X = np.array([[1], [2], [3], [4], [5], [6]])
y = np.array([2, 4, 6, 8, 10, 12])

# Creating and fitting the Decision Tree Regressor model
model = DecisionTreeRegressor(random_state=0)
model.fit(X, y)

# Making predictions
predictions = model.predict(np.array([[7]]))
print("Predicted value for input 7:", predictions)
Explanation:

• Decision Trees can also be used for regression analysis, where the goal is to predict continuous values rather than class labels.

• In regression trees, the model splits the data based on feature values that best reduce the variance in the target variable within each subset.

• This approach is useful for capturing non-linear relationships between input features and the target variable, making it suitable for complex regression problems.

3.3 Unsupervised Learning


Unsupervised learning involves training models on unlabeled data to
identify patterns, structures, or relationships within the data. This
section covers common unsupervised learning algorithms in
Scikit-learn, with detailed examples and explanations.

3.3.1 K-Means Clustering

K-Means clustering is an unsupervised technique that partitions a dataset
into a chosen number of clusters based on feature similarity. Its goal is
to minimize the variation within clusters while maximizing the variation
between clusters; the objective it minimizes is written out below.
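
Formally, with $\mu_k$ the centroid of cluster $C_k$, K-Means seeks the cluster assignment that minimizes the within-cluster sum of squares

J = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^{2}.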

Example 1: Basic K-Means Clustering

from sklearn.cluster import KMeans

# Sample data
X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])

# Creating and fitting the K-Means model
model = KMeans(n_clusters=2, random_state=0)
model.fit(X)

# Predicting cluster labels
labels = model.predict(X)

print("Cluster labels:", labels)
print("Cluster centers:", model.cluster_centers_)
Explanation:

• K-Means clustering assigns data points to clusters based on their distances to the cluster centroids, which are updated iteratively to minimize within-cluster variance.

• The n_clusters parameter specifies the number of clusters, while random_state ensures reproducibility of results.

• The algorithm outputs the cluster label for each data point and the coordinates of the cluster centroids, which represent the centers of the clusters.

Example 2: Visualizing K-Means Clustering

import matplotlib.pyplot as plt

# Plotting the clustered data
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis')
plt.scatter(model.cluster_centers_[:, 0], model.cluster_centers_[:, 1],
            s=200, c='red', marker='X')
plt.title('K-Means Clustering')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.grid(True)
plt.show()

Explanation:

• Visualizing clustered data helps in understanding the separation and grouping of data points, as well as the positioning of cluster centroids.

• Plotting can reveal whether clusters are well-separated or if overlaps exist, providing insights into the effectiveness of the clustering.

• Visualization is especially useful in exploratory data analysis, where understanding the structure of the data is key.
Example 3: Evaluating Clustering with Silhouette Score
from sklearn.metrics import silhouette_score

# Calculating the silhouette score
sil_score = silhouette_score(X, labels)
print("Silhouette Score:", sil_score)

Explanation:

• Silhouette Score: Measures how similar each data point is to its own cluster compared to other clusters. Values close to 1 indicate well-defined clusters, while values close to 0 indicate overlapping clusters; the per-point formula is given below.

• The Silhouette Score is commonly used to assess the quality of clustering and determine the optimal number of clusters, providing a quantitative measure of clustering performance.

• High silhouette scores suggest that clusters are well-separated, making the clustering meaningful and useful for analysis.
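
For a single point $i$, with $a(i)$ the mean distance to the other points in its own cluster and $b(i)$ the mean distance to the points of the nearest other cluster, the silhouette value is

s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}},

and the score reported by silhouette_score() is the average of $s(i)$ over all points.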
3.3.2 Principal Component Analysis (PCA)

Principal Component Analysis (PCA) is a dimensionality reduction


technique that transforms data into a lower-dimensional space while
preserving as much variance as possible. It helps simplify complex
datasets by reducing the number of features while retaining the
most important information.

Example 1: Applying PCA for Dimensionality Reduction


from sklearn.decomposition import PCA

# Sample data with 3 features
X = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])

# Creating and fitting the PCA model to reduce to 2 dimensions
model = PCA(n_components=2)
X_reduced = model.fit_transform(X)

print("Reduced data:\n", X_reduced)

Explanation:

• PCA reduces the dimensionality of the dataset by projecting it onto a lower-dimensional subspace that captures the most variance in the data.

• The n_components parameter specifies the number of principal components to keep, effectively controlling the dimensionality of the reduced dataset.

• PCA is widely used in scenarios where high-dimensional data can be challenging to visualize, analyze, or process efficiently.

Example 2: Visualizing PCA Results

# Plotting the reduced data
plt.scatter(X_reduced[:, 0], X_reduced[:, 1])
plt.title('PCA Dimensionality Reduction')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.grid(True)
plt.show()

Explanation:
• Visualizing the reduced data in two dimensions helps in understanding how PCA has transformed the original dataset and highlights patterns or clusters in the data.

• This visualization is particularly useful when reducing high-dimensional data to two or three dimensions for exploratory data analysis.

• By visualizing principal components, you can identify directions in the data that capture the most variation, guiding further analysis and feature selection.

Example 3: Explained Variance Ratio

# Checking the explained variance ratio
explained_variance = model.explained_variance_ratio_
print("Explained Variance Ratio:", explained_variance)
Explanation:

• The explained variance ratio gives the fraction of the total variance in the dataset that each principal component explains, indicating how faithfully the compressed data represents the original data.

• A high explained variance means that the selected principal components retain most of the information in the data, making PCA an efficient tool for dimensionality reduction.

• The explained variance ratio helps decide how many components to retain when reducing dimensions without giving up too much of the variability in the data.

Example 4: Using PCA for Feature Selection

# Sample data with redundant features
X = np.array([[1, 2, 1], [2, 3, 2], [3, 4, 3], [4, 5, 4]])

# Applying PCA to reduce redundant features
model = PCA(n_components=2)
X_reduced = model.fit_transform(X)

print("Original shape:", X.shape)
print("Reduced shape:", X_reduced.shape)

Explanation:

• PCA can also be used for feature selection by reducing the dataset to components that capture the most variance, effectively removing redundant or less informative features.

• This approach is useful in preprocessing steps to simplify models and reduce overfitting, especially in datasets with highly correlated features.

• By reducing the number of features, PCA helps speed up training times and improve the interpretability of models.

3.4 Model Evaluation and Parameter Tuning
Evaluating and tuning models is essential for optimizing performance
and ensuring that the machine learning models generalize well to
new data. This section covers key techniques in model evaluation
and hyperparameter tuning.

3.4.1 Cross-Validation

Cross-validation is a technique for assessing how a machine learning


model will perform on an independent dataset. It involves splitting
the dataset into multiple folds, training the model on some folds,
and evaluating it on the remaining folds.

Example 1: K-Fold Cross-Validation


from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression

# Sample data
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([2, 4, 5, 4, 5])

# Creating the Linear Regression model
model = LinearRegression()

# Performing K-Fold Cross-Validation with 5 folds
scores = cross_val_score(model, X, y, cv=5)

print("Cross-Validation Scores:", scores)
print("Mean Score:", scores.mean())
Explanation:

• Cross-validation helps in evaluating the model’s performance


across different subsets of the data, providing a more robust
estimate of its accuracy and generalizability.

• The cv parameter specifies the number of folds to use, with higher


values generally providing more stable estimates at the cost of
increased computation.

• K-Fold Cross-Validation is widely used for hyperparameter tuning


and model selection, as it provides a balance between bias and
variance in performance estimates.

Example 2: Stratified K-Fold Cross-Validation


from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression

# Sample data
X = np.array([[1], [2], [3], [4], [5], [6]])
y = np.array([0, 0, 0, 1, 1, 1])

# Creating the Logistic Regression model
model = LogisticRegression()

# Performing Stratified K-Fold Cross-Validation
skf = StratifiedKFold(n_splits=3)
scores = cross_val_score(model, X, y, cv=skf)

print("Stratified Cross-Validation Scores:", scores)

• Stratified K-Fold Cross-Validation ensures that each fold has a


similar distribution of classes, which is particularly important for
imbalanced datasets.
• This approach helps in maintaining the representativeness of the
folds, leading to more accurate and reliable evaluation results.

• Stratification is crucial in classification tasks, especially when the


target classes are imbalanced, as it prevents biased performance
estimates.

3.4.2 Hyperparameter Tuning

Hyperparameter tuning involves optimizing the hyperparameters of a


model to improve its performance. This process is crucial for
achieving the best possible results from machine learning
algorithms.

Example 1: Grid Search


from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

# Sample data
X = np.array([[1], [2], [3], [4], [5], [6]])
y = np.array([0, 0, 0, 1, 1, 1])

# Creating the Random Forest model
model = RandomForestClassifier(random_state=0)

# Defining the hyperparameter grid
param_grid = {
    'n_estimators': [10, 50, 100],
    'max_depth': [None, 10, 20]
}

# Performing Grid Search
grid_search = GridSearchCV(model, param_grid, cv=3)
grid_search.fit(X, y)

print("Best Parameters:", grid_search.best_params_)
print("Best Score:", grid_search.best_score_)

Explanation:
• Grid Search exhaustively searches over a specified parameter grid,
training and evaluating the model on each combination of
parameters.

• This approach allows for systematic optimization of


hyperparameters, although it can be computationally expensive for
large grids.

• Grid Search is widely used for fine-tuning complex models, such as


ensemble methods, to identify the best set of hyperparameters for
optimal performance.

Example 2: Randomized Search


from sklearn.model_selection import RandomizedSearchCV

# Performing Randomized Search
random_search = RandomizedSearchCV(model, param_grid, cv=3, n_iter=5, random_state=0)
random_search.fit(X, y)

print("Best Parameters:", random_search.best_params_)
print("Best Score:", random_search.best_score_)

Explanation:

• Randomized Search samples a specified number of parameter settings from the grid, providing a more efficient but less exhaustive search compared to Grid Search.

• This method balances the trade-off between computational cost and the thoroughness of the search, making it suitable for large parameter spaces.

• Randomized Search is useful when computational resources are limited or when the parameter space is too large for a full Grid Search.
3.5 Practice Questions and Solutions
3.5.1 Practice Questions on Scikit-Learn

Question 1: Linear Regression

Using the provided data below, create a Linear Regression model


and predict the value for input X = 7.

• Input Features (X): [1, 2, 3, 4, 5, 6]


• Target Variable (y): [2, 4, 5, 4, 5, 6]
Solution:
from sklearn.linear_model import LinearRegression
import numpy as np

# Sample data
X = np.array([[1], [2], [3], [4], [5], [6]])
y = np.array([2, 4, 5, 4, 5, 6])

# Creating and fitting the model
model = LinearRegression()
model.fit(X, y)

# Making a prediction
prediction = model.predict(np.array([[7]]))
print("Predicted value for input 7:", prediction)
Question 2: Logistic Regression
Create a Logistic Regression model using the data below and
evaluate its accuracy.
• Input Features (X): [1, 2, 3, 4, 5, 6]
• Target Labels (y): [0, 0, 0, 1, 1, 1]
Solution:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Sample data
X = np.array([[1], [2], [3], [4], [5], [6]])
y = np.array([0, 0, 0, 1, 1, 1])

# Creating and fitting the model
model = LogisticRegression()
model.fit(X, y)

# Making predictions
predictions = model.predict(X)

# Calculating accuracy
accuracy = accuracy_score(y, predictions)
print("Accuracy:", accuracy)

Question 3: K-Means Clustering


Apply K-Means Clustering to the data below with 2 clusters and
visualize the result.
• Data Points (X): [[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]]
Solution:
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import numpy as np

# Sample data
X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])

# Creating and fitting the model
model = KMeans(n_clusters=2, random_state=0)
model.fit(X)

# Predicting cluster labels
labels = model.predict(X)

# Visualizing the clustered data
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis')
plt.scatter(model.cluster_centers_[:, 0], model.cluster_centers_[:, 1],
            s=200, c='red', marker='X')
plt.title('K-Means Clustering')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.grid(True)
plt.show()

Question 4: Principal Component Analysis (PCA)


Reduce the dimensionality of the following data to 2 dimensions
using PCA and display the transformed data.

• Data Points (X): [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
Solution:
from sklearn.decomposition import PCA
import numpy as np

# Sample data
X = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])

# Creating and fitting the PCA model to reduce to 2 dimensions
model = PCA(n_components=2)
X_reduced = model.fit_transform(X)
print("Reduced data:\n", X_reduced)

Question 5: Hyperparameter Tuning with Grid Search


Use Grid Search to find the best parameters for a
RandomForestClassifier using the following data and hyperparameter
grid.

• Input Features (X): [1, 2, 3, 4, 5, 6]


• Target Labels (y): [0, 0, 0, 1, 1, 1]

• Hyperparameter Grid: n_estimators = [10, 50], max_depth = [None, 10]
Solution:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
import numpy as np

# Sample data
X = np.array([[1], [2], [3], [4], [5], [6]])
y = np.array([0, 0, 0, 1, 1, 1])

# Creating the model
model = RandomForestClassifier(random_state=0)

# Defining the hyperparameter grid
param_grid = {
    'n_estimators': [10, 50],
    'max_depth': [None, 10]
}

# Performing Grid Search
grid_search = GridSearchCV(model, param_grid, cv=3)
grid_search.fit(X, y)

print("Best Parameters:", grid_search.best_params_)
print("Best Score:", grid_search.best_score_)

3.6 Conclusion
Chapter 3 provides a comprehensive introduction to Scikit-learn,
covering essential machine learning algorithms, model evaluation
techniques, and parameter tuning strategies. Through detailed
examples and practical applications, students and practitioners can
learn to build, evaluate, and optimize machine learning models using
Scikit-learn, empowering them to make data-driven decisions and
predictions.
Chapter 4 Statistics
4.1 Introduction to Statistics
Statistics is a branch of mathematics that deals with the collection,
analysis, interpretation, and presentation of data. In data science
and machine learning, statistical methods are used to describe data
distributions, test hypotheses, and make data-driven decisions. This
module covers essential statistical concepts, including descriptive
statistics, grouping, ANOVA, correlation, polynomial regression, and
in-sample evaluation metrics such as R-squared and Mean Squared
Error (MSE).

4.2 Descriptive Statistics


Descriptive statistics provide summaries and descriptions of data.
Key measures include mean, median, mode, variance, standard
deviation, range, and percentiles. These statistics help in
understanding the central tendency, dispersion, and distribution
shape of a dataset.

4.2.1 Key Measures

• Mean: The arithmetic average of the data points.


• Median: The middle value when the data points are sorted in
ascending order.
• Mode: The most frequently occurring value in the dataset.
• Range: The difference between the maximum and minimum
values.
• Variance: Measures the spread of the data points around the
mean.
• Standard Deviation: The square root of the variance, indicating the
average distance from the mean.

• Percentiles: Values below which a given percentage of observations


in a group fall.
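For reference, the first of these measures can be written compactly for observations x_1, ..., x_n (np.var() and np.std() used below compute the population versions shown here, with denominator n):

\[
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad
\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2, \qquad
\sigma = \sqrt{\sigma^2}
\]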

4.2.2 Example: Descriptive Statistics Calculation

Sample Data:

Student  Scores
A        85
B        90
C        78
D        88
E        92
F        70
G        95
H        80
I        85
J        77

Calculations:
import numpy as np
from scipy.stats import mode

# Sample data
scores = np.array([85, 90, 78, 88, 92, 70, 95, 80, 85, 77])

# Mean
mean_score = np.mean(scores)
# Median
median_score = np.median(scores)
# Mode
mode_score = mode(scores)
# Variance
variance_score = np.var(scores)
# Standard Deviation
std_dev_score = np.std(scores)
# Range
range_score = np.ptp(scores)

# Percentiles
percentile_25 = np.percentile(scores, 25)
percentile_50 = np.percentile(scores, 50)  # Same as median
percentile_75 = np.percentile(scores, 75)

print(f"Mean: {mean_score}, Median: {median_score}, Mode: {mode_score.mode[0]}")
print(f"Variance: {variance_score}, Standard Deviation: {std_dev_score}")
print(f"Range: {range_score}")
print(f"25th Percentile: {percentile_25}, 50th Percentile: {percentile_50}, 75th Percentile: {percentile_75}")

Explanation:
• The np.mean() function calculates the average score.
• The np.median() function identifies the central value of the
dataset.
• The mode() function from scipy.stats determines the most frequent
score.

• Variance and standard deviation measure the spread of scores around the mean, providing insight into the variability of the data.

• The range provides a basic measure of dispersion, indicating the span between the smallest and largest values.

• Percentiles help to understand the data distribution and identify cutoff points for different portions of the dataset.

4.3 Basics of Grouping


Grouping splits data into subgroups so that analysis can be carried out at a finer level of detail. Combined with aggregation functions, grouping computes summary values for each subset, helping to identify patterns and trends across groups.

4.3.1 Example 1: Grouping and Aggregation by Class

Sample Data:

Student  Class     Scores
A        Science   85
B        Science   90
C        Arts      78
D        Arts      88
E        Science   92
F        Arts      75
G        Commerce  80
H        Commerce  85
I        Commerce  78
J        Science   87

Calculations:

import pandas as pd

# Sample data
data = {
    'Student': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'],
    'Class': ['Science', 'Science', 'Arts', 'Arts', 'Science', 'Arts', 'Commerce', 'Commerce', 'Commerce', 'Science'],
    'Scores': [85, 90, 78, 88, 92, 75, 80, 85, 78, 87]
}

df = pd.DataFrame(data)

# Grouping by 'Class' and calculating mean scores
grouped = df.groupby('Class')['Scores'].mean()

# Grouping by 'Class' and calculating multiple aggregations
grouped_multi = df.groupby('Class')['Scores'].agg(['mean', 'std', 'count'])

print("Mean Scores by Class:\n", grouped)
print("\nDetailed Group Statistics:\n", grouped_multi)
Explanation:
• The groupby() function groups data based on a specified column,
here ’Class’.

• Aggregation functions such as mean(), std() (standard deviation),


and count() can be applied to grouped data to calculate summary
statistics.

• This helps in understanding the performance and variability within


different groups, which can inform targeted interventions or
strategies.

4.3.2 Example 2: Grouping by Multiple Columns

Sample Data:

Student  Class     Gender  Scores
A        Science   Male    85
B        Science   Female  90
C        Arts      Female  78
D        Arts      Male    88
E        Science   Male    92
F        Arts      Female  75
G        Commerce  Male    80
H        Commerce  Female  85
I        Commerce  Female  78
J        Science   Female  87

Calculations:

# Adding the 'Gender' column from the table above to the DataFrame from Example 1
df['Gender'] = ['Male', 'Female', 'Female', 'Male', 'Male', 'Female', 'Male', 'Female', 'Female', 'Female']

# Grouping by 'Class' and 'Gender', then calculating mean scores
grouped_multi_columns = df.groupby(['Class', 'Gender'])['Scores'].mean()

print("Mean Scores by Class and Gender:\n", grouped_multi_columns)


Explanation:

• Grouping by multiple columns allows for more refined analysis, such as examining differences within and between subgroups.

• This approach can highlight interactions between variables (e.g., class and gender), aiding in more detailed understanding and tailored decision-making.

4.4 Analysis of Variance (ANOVA)


ANOVA is a statistical method used to compare means of three or
more groups to determine if at least one group mean is significantly
different from the others. It is commonly used in situations where
more than two groups need comparison, providing a way to test
hypotheses about group differences.
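In a one-way ANOVA the test statistic compares between-group to within-group variability:

\[
F = \frac{\text{MS}_{\text{between}}}{\text{MS}_{\text{within}}}
  = \frac{\sum_{k=1}^{K} n_k\,(\bar{x}_k - \bar{x})^2 \,/\, (K-1)}
         {\sum_{k=1}^{K}\sum_{i=1}^{n_k} (x_{ki} - \bar{x}_k)^2 \,/\, (N-K)},
\]

where K is the number of groups, N the total number of observations, n_k and \bar{x}_k the size and mean of group k, and \bar{x} the overall mean.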

4.4.1 Example 1: One-Way ANOVA

Sample Data:

Student  Class     Scores
A        Science   85
B        Science   90
C        Arts      78
D        Arts      88
E        Commerce  92
F        Commerce  85
G        Science   87
H        Arts      75
I        Commerce  80
J        Science   91

Calculations:

from scipy.stats import f_oneway

# Scores for each group
science_scores = [85, 90, 87, 91]
arts_scores = [78, 88, 75]
commerce_scores = [92, 85, 80]

# Performing ANOVA
f_statistic, p_value = f_oneway(science_scores, arts_scores, commerce_scores)

print(f"F-statistic: {f_statistic}, p-value: {p_value}")
Explanation:

• The f_oneway() function performs a one-way ANOVA test, comparing the means of multiple groups.

• The F-statistic measures the ratio of between-group variance to within-group variance.

• A low p-value (typically less than 0.05) indicates that there are significant differences between the group means, suggesting that not all group means are equal.

• ANOVA helps in determining whether observed differences among group means are statistically significant or due to random variation.

4.4.2 Example 2: Two-Way ANOVA

Two-Way ANOVA is used when there are two independent variables.


It examines the effect of each independent variable on the
dependent variable, as well as any interaction effects between them.

Sample Data:

Student  Class     Study Method  Scores
A        Science   Online        85
B        Science   Offline       90
C        Arts      Online        78
D        Arts      Offline       88
E        Commerce  Online        92
F        Commerce  Offline       85
G        Science   Online        87
H        Arts      Online        75
I        Commerce  Offline       80
J        Science   Offline       91

Note: Two-Way ANOVA requires specific data formatting and statistical packages beyond basic Python libraries like scipy. A detailed implementation would typically involve using libraries like statsmodels.
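As a minimal sketch (assuming the statsmodels package is available, and renaming the 'Study Method' column to 'Method' so it can be used in a formula), a two-way ANOVA on the table above could be set up as follows:

import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Sample data from the table above ('Study Method' renamed to 'Method')
df = pd.DataFrame({
    'Class': ['Science', 'Science', 'Arts', 'Arts', 'Commerce',
              'Commerce', 'Science', 'Arts', 'Commerce', 'Science'],
    'Method': ['Online', 'Offline', 'Online', 'Offline', 'Online',
               'Offline', 'Online', 'Online', 'Offline', 'Offline'],
    'Scores': [85, 90, 78, 88, 92, 85, 87, 75, 80, 91]
})

# Fitting an OLS model with both factors and their interaction
model = ols('Scores ~ C(Class) + C(Method) + C(Class):C(Method)', data=df).fit()

# Two-way ANOVA table (Type II sums of squares)
anova_table = sm.stats.anova_lm(model, typ=2)
print(anova_table)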

4.5 Correlation
Correlation measures the strength and direction of a linear
relationship between two variables. The correlation coefficient
ranges from -1 to 1:

• -1: Perfect negative correlation (as one variable increases, the


other decreases).
• 0: No correlation (no linear relationship).
• 1: Perfect positive correlation (as one variable increases, the other
also increases).
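The coefficient reported by np.corrcoef() in the examples below is the Pearson correlation coefficient, defined for paired observations (x_i, y_i) as

\[
r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\;\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}.
\]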

4.5.1 Example 1: Correlation Calculation

Sample Data:

Student  Hours Studied  Scores
A        2              50
B        3              60
C        4              70
D        5              80
E        6              90

Calculations:

import numpy as np

# Sample data
hours_studied = np.array([2, 3, 4, 5, 6])
scores = np.array([50, 60, 70, 80, 90])

# Calculating the correlation coefficient
correlation = np.corrcoef(hours_studied, scores)[0, 1]

print(f"Correlation Coefficient: {correlation}")
Explanation:

• The np.corrcoef() function calculates the correlation coefficient


matrix. The coefficient between the two variables is extracted from
the matrix.

• A correlation coefficient close to 1 indicates a strong positive


relationship, meaning that as hours studied increases, scores also
increase proportionally.

• Understanding correlation helps in identifying relationships


between variables, which is crucial for making predictions and
decisions in data analysis.

4.5.2 Example 2: Correlation Matrix

A correlation matrix shows the correlation coefficients between


multiple variables in a dataset, allowing for the assessment of
relationships between all pairs of variables.

Sample Data:

Student  Hours Studied  Scores  Attendance (%)
A        2              50      80
B        3              60      85
C        4              70      90
D        5              80      95
E        6              90      100

Calculations:

# Sample data
data = np.array([
    [2, 50, 80],
    [3, 60, 85],
    [4, 70, 90],
    [5, 80, 95],
    [6, 90, 100]
])

# Calculating the correlation matrix
correlation_matrix = np.corrcoef(data, rowvar=False)

print("Correlation Matrix:\n", correlation_matrix)
Explanation:

• A correlation matrix provides a comprehensive view of how each variable is related to others, identifying strong and weak correlations across multiple dimensions.

• This is particularly useful in feature selection and multivariate analysis, guiding decisions about which variables to include in predictive models.

• High correlation between features may indicate multicollinearity, which can affect model performance and interpretation.

4.6 Polynomial Regression and Pipelines


Polynomial Regression is an extension of linear regression where the
relationship between the independent variable and the dependent
variable is modeled as an nth degree polynomial. Pipelines in Scikit-
learn help automate machine learning workflows by chaining
together multiple steps, such as preprocessing and model fitting.
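In equation form, a polynomial regression of degree n models the target as

\[
y = \beta_0 + \beta_1 x + \beta_2 x^2 + \cdots + \beta_n x^n + \varepsilon,
\]

which remains linear in the coefficients \beta_j, so they can still be estimated by ordinary least squares.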

4.6.1 Example 1: Polynomial Regression with Pipelines

Sample Data:

X  Y
1  2
2  4
3  6
4  8
5  11
Calculations:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
import numpy as np

# Sample data
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([2, 4, 6, 8, 11])

# Creating a pipeline for polynomial regression
pipeline = Pipeline([
    ('poly_features', PolynomialFeatures(degree=2)),
    ('linear_regression', LinearRegression())
])

# Fitting the model
pipeline.fit(X, y)

# Making predictions
predictions = pipeline.predict(np.array([[6]]))
print(f"Predicted value for input 6: {predictions}")
Explanation:

• PolynomialFeatures transforms the input data to include polynomial


terms up to the specified degree (in this case, degree 2).

• The Pipeline object chains the transformation step with a linear


regression model, simplifying the workflow and ensuring that the
transformation is consistently applied.

• Polynomial regression is useful when modeling non-linear


relationships between variables, such as quadratic or cubic trends.

• Pipelines make the process efficient and reduce the risk of errors in
data processing and model training.

4.6.2 Example 2: Using Pipelines for Scaling and Regression


Pipelines can also be used to include data scaling, which ensures
that features are on the same scale before applying regression.

Sample Data:

X  Y
1  3
2  6
3  9
4  12
5  15

Calculations:

from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
import numpy as np

# Sample data
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([3, 6, 9, 12, 15])

# Creating a pipeline with scaling and regression
pipeline = make_pipeline(
    StandardScaler(),
    LinearRegression()
)

# Fitting the model
pipeline.fit(X, y)

# Making predictions
predictions = pipeline.predict(np.array([[6]]))
print(f"Predicted value for input 6: {predictions}")
Explanation:

• StandardScaler standardizes features by removing the mean and scaling to unit variance.

• Using pipelines ensures that scaling is applied consistently during training and prediction, improving model performance and reliability.

• This approach is particularly important for algorithms sensitive to feature scaling, such as regression models and neural networks.

4.7 R-squared and MSE for In-Sample Evaluation
R-squared and Mean Squared Error (MSE) are metrics used to
evaluate the performance of regression models. R-squared measures
the proportion of the variance in the dependent variable that is
predictable from the independent variable(s), while MSE measures
the average of the squared differences between the predicted and
actual values.
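For predictions \hat{y}_i of actual values y_i with mean \bar{y}, the two metrics are

\[
\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2, \qquad
R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}.
\]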

4.7.1 Example 1: Calculating R-squared and MSE

Sample Data:

Actual Predicted
3 2.8
7 7.1
5 5.2
9 9.3
11 10.9

Calculations:

from sklearn.metrics import r2_score, mean_squared_error
import numpy as np

# Sample data
actual = np.array([3, 7, 5, 9, 11])
predicted = np.array([2.8, 7.1, 5.2, 9.3, 10.9])

# Calculating R-squared
r2 = r2_score(actual, predicted)

# Calculating Mean Squared Error
mse = mean_squared_error(actual, predicted)

print(f"R-squared: {r2}, Mean Squared Error: {mse}")
Explanation:

• The r2_score() function calculates the R-squared value, providing a measure of how well the model explains the variance in the target variable.

• The mean_squared_error() function computes the average squared differences between predicted and actual values, assessing the model's prediction accuracy.

• High R-squared values (closer to 1) and low MSE values indicate better model performance, making these metrics essential for evaluating regression models.

• These metrics help determine if a model is a good fit for the data, guiding further model tuning or selection.

4.7.2 Example 2: Comparing Models Using R-squared and MSE

Scenario: Compare the performance of linear and polynomial regression models using R-squared and MSE.

Sample Data:

X  Y
1  2
2  4.1
3  6.5
4  8.7
5  11.2

Calculations:
# Sample data
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([2, 4.1, 6.5, 8.7, 11.2])

# Linear regression model
linear_model = LinearRegression()
linear_model.fit(X, y)
linear_pred = linear_model.predict(X)

# Polynomial regression model
poly_pipeline = Pipeline([
    ('poly_features', PolynomialFeatures(degree=2)),
    ('linear_regression', LinearRegression())
])
poly_pipeline.fit(X, y)
poly_pred = poly_pipeline.predict(X)

# Evaluation metrics
linear_r2 = r2_score(y, linear_pred)
linear_mse = mean_squared_error(y, linear_pred)
poly_r2 = r2_score(y, poly_pred)
poly_mse = mean_squared_error(y, poly_pred)

print(f"Linear Model -> R-squared: {linear_r2}, MSE: {linear_mse}")
print(f"Polynomial Model -> R-squared: {poly_r2}, MSE: {poly_mse}")

Explanation:

• Comparing R-squared and MSE values between models helps identify which model better captures the relationship in the data.

• Polynomial regression often provides a better fit for non-linear relationships, as indicated by higher R-squared and lower MSE compared to linear regression.

• Model comparison is crucial in selecting the best approach for prediction, ensuring accurate and reliable results.

4.8 Prediction and Decision Making


Prediction involves using trained models to forecast future outcomes
based on new data. Decision-making processes often involve
comparing model predictions to evaluate potential outcomes or
guide strategic choices.

4.8.1 Example 1: Prediction and Decision Making in Business

Scenario:
A company uses a regression model to predict future sales based on
advertising spend. The goal is to decide on the advertising budget
for the next quarter to maximize sales.

Sample Data:

Advertising Spend ( $) Sales (Units)


1000 150
2000 250
3000 350
4000 450
5000 550

Calculations:

import numpy as np
from sklearn.linear_model import LinearRegression

# Sample data
X = np.array([[1000], [2000], [3000], [4000], [5000]])
y = np.array([150, 250, 350, 450, 550])

# Creating and fitting the model
model = LinearRegression()
model.fit(X, y)

# Predicting sales for $6000 advertising spend
predicted_sales = model.predict(np.array([[6000]]))

print(f"Predicted Sales for $6000 Advertising Spend: {predicted_sales[0]}")

Explanation:

• The model uses linear regression to learn the relationship between advertising spend and sales.

• Predictions can be made for various budget scenarios to guide decision-making, such as optimizing the advertising budget to achieve desired sales targets (see the sketch after this list).

• By analyzing the impact of different advertising levels on sales, the company can make informed decisions about resource allocation.

• This approach can be extended to various business decisions, including inventory management, pricing strategies, and market expansion plans.
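As a minimal sketch reusing the fitted model above, several hypothetical candidate budgets can be scored in one call to support the comparison described in these points:

# Hypothetical candidate budgets for the next quarter
candidate_budgets = np.array([[5500], [6000], [6500], [7000]])

# Predicting sales for each budget scenario with the model fitted above
scenario_sales = model.predict(candidate_budgets)

for budget, sales in zip(candidate_budgets.ravel(), scenario_sales):
    print(f"Advertising spend ${budget}: predicted sales {sales:.0f} units")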

4.8.2 Example 2: Decision Making in Healthcare

Scenario:

A hospital uses a logistic regression model to predict the likelihood of


patients being readmitted within 30 days based on demographic and
clinical data. This helps allocate resources for high-risk patients.

Sample Data:

Patient  Age  Severity Score  Readmitted (Yes/No)
1        65   7               Yes
2        50   4               No
3        80   9               Yes
4        45   3               No
5        70   6               Yes

Calculations:

from sklearn.linear_model import LogisticRegression
import numpy as np

# Sample data
X = np.array([[65, 7], [50, 4], [80, 9], [45, 3], [70, 6]])
y = np.array([1, 0, 1, 0, 1])  # 1 = Yes, 0 = No

# Creating and fitting the logistic regression model
model = LogisticRegression()
model.fit(X, y)

# Predicting readmission risk for a new patient
new_patient = np.array([[60, 8]])
readmission_prob = model.predict_proba(new_patient)

print(f"Probability of Readmission for the new patient: {readmission_prob[0][1]}")

Explanation:

• Logistic regression is used to model the probability of a binary outcome, such as readmission (Yes/No).

• The model predicts the probability of readmission based on input features, aiding in resource allocation for high-risk patients.

• Predictive analytics in healthcare supports decision-making processes that improve patient outcomes and optimize operational efficiency.
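For reference, logistic regression obtains this probability by passing a linear combination of the features (here age x_1 and severity score x_2) through the logistic function,

\[
P(y = 1 \mid x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2)}},
\]

and the predict_proba() call above returns this probability for each class.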

4.9 Practice Questions and Solutions


4.9.1 Practice Questions on Statistics

Question 1: Descriptive Statistics

Given the scores data below, calculate the mean, median, mode,
variance, standard deviation, range, and percentiles.

• Scores: [70, 75, 80, 85, 90, 95, 100]


Solution:

import numpy as np
from scipy.stats import mode

# Sample data
scores = np.array([70, 75, 80, 85, 90, 95, 100])

# Calculating descriptive statistics
mean_score = np.mean(scores)
median_score = np.median(scores)
mode_score = mode(scores)
variance_score = np.var(scores)
std_dev_score = np.std(scores)
range_score = np.ptp(scores)
percentile_25 = np.percentile(scores, 25)
percentile_50 = np.percentile(scores, 50)  # Same as median
percentile_75 = np.percentile(scores, 75)

print(f"Mean: {mean_score}, Median: {median_score}, Mode: {mode_score.mode[0]}")
print(f"Variance: {variance_score}, Standard Deviation: {std_dev_score}")
print(f"Range: {range_score}")
print(f"25th Percentile: {percentile_25}, 50th Percentile: {percentile_50}, 75th Percentile: {percentile_75}")

Question 2: Grouping and Aggregation

Using the student data below, group by 'Class' and calculate the mean and count of scores for each class.
• Students: [’A’, ’B’, ’C’, ’D’, ’E’, ’F’, ’G’]
• Classes: [’Science’, ’Science’, ’Arts’, ’Arts’, ’Commerce’, ’Commerce’,
’Science’]
• Scores: [85, 90, 78, 88, 92, 85, 87]
Solution:

import pandas as pd

# Sample data
data = {
    'Student': ['A', 'B', 'C', 'D', 'E', 'F', 'G'],
    'Class': ['Science', 'Science', 'Arts', 'Arts', 'Commerce', 'Commerce', 'Science'],
    'Scores': [85, 90, 78, 88, 92, 85, 87]
}

df = pd.DataFrame(data)

# Grouping by 'Class' and calculating mean and count
grouped = df.groupby('Class')['Scores'].agg(['mean', 'count'])

print(grouped)
Question 3: ANOVA
Perform a one-way ANOVA on the following groups to determine if
their means are significantly different.

• Group A: [85, 90, 88]
• Group B: [78, 85, 80]
• Group C: [92, 95, 93]

Solution:

from scipy.stats import f_oneway

# Scores for each group
group_a = [85, 90, 88]
group_b = [78, 85, 80]
group_c = [92, 95, 93]

# Performing ANOVA
f_statistic, p_value = f_oneway(group_a, group_b, group_c)

print(f"F-statistic: {f_statistic}, p-value: {p_value}")
Question 4: Correlation
Calculate the correlation between the number of study hours and
scores.

• Hours Studied: [1, 2, 3, 4, 5]


• Scores: [50, 60, 70, 80, 90]

Solution:

import numpy as np

# Sample data
hours_studied = np.array([1, 2, 3, 4, 5])
scores = np.array([50, 60, 70, 80, 90])

# Calculating the correlation coefficient
correlation = np.corrcoef(hours_studied, scores)[0, 1]

print(f"Correlation Coefficient: {correlation}")
Question 5: Polynomial Regression

Use polynomial regression to fit the data below and predict the output for input X = 6.

• X: [1, 2, 3, 4, 5]
• Y: [2, 4, 6, 8, 11]

Solution:

from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
import numpy as np

# Sample data
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([2, 4, 6, 8, 11])

# Creating a pipeline for polynomial regression
pipeline = Pipeline([
    ('poly_features', PolynomialFeatures(degree=2)),
    ('linear_regression', LinearRegression())
])

# Fitting the model
pipeline.fit(X, y)

# Making predictions
predictions = pipeline.predict(np.array([[6]]))
print(f"Predicted value for input 6: {predictions}")
Question 6: R-squared and MSE
Calculate the R-squared and MSE for the given actual and predicted
values.
• Actual: [3, 7, 5, 9, 11]
• Predicted: [2.8, 7.1, 5.2, 9.3, 10.9]
Solution:

from sklearn.metrics import r2_score, mean_squared_error
import numpy as np

# Sample data
actual = np.array([3, 7, 5, 9, 11])
predicted = np.array([2.8, 7.1, 5.2, 9.3, 10.9])

# Calculating R-squared
r2 = r2_score(actual, predicted)

# Calculating Mean Squared Error
mse = mean_squared_error(actual, predicted)

print(f"R-squared: {r2}, Mean Squared Error: {mse}")

4.10 Conclusion
This chapter provides a comprehensive overview of essential
statistical concepts, including descriptive statistics, grouping, ANOVA,
correlation, polynomial regression, and in-sample evaluation metrics
like R-squared and MSE. Through detailed examples, tabular data,
and practice questions, students and practitioners can gain a solid
understanding of statistical analysis techniques and their applications
in data-driven decision-making.
Chapter 5 Model Refinement
5.1 Introduction to Model Refinement
Model refinement is an essential step in the machine learning
pipeline, focusing on improving model performance through
optimization of parameters, feature transformation, and data
preprocessing techniques. Key methods include Grid Search for
hyperparameter tuning, iterative model refinement through
cross-validation, binning for data simplification, and indicator
(dummy) variables for categorical encoding. These techniques help
achieve a balance between model complexity and generalizability,
leading to robust predictive models.

5.2 Grid Search


Grid Search is a powerful hyperparameter tuning technique that
systematically explores combinations of model parameters to find
the optimal set that maximizes performance. It evaluates each
parameter combination using cross-validation, ensuring the chosen
parameters generalize well to unseen data.

5.2.1 Example 1: Grid Search for Ridge Regression

Sample Data:

Feature  Target
1        3
2        5
3        7
4        10
5        12
6        15
7        18
8        20
9        23
10       25

Grid Search Implementation:

from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Ridge
import numpy as np

# Sample data
X = np.array([[1], [2], [3], [4], [5], [6], [7], [8], [9], [10]])
y = np.array([3, 5, 7, 10, 12, 15, 18, 20, 23, 25])

# Defining the model
model = Ridge()

# Defining the hyperparameter grid
param_grid = {
    'alpha': [0.01, 0.1, 1, 10, 100],
    'solver': ['auto', 'svd', 'cholesky', 'lsqr']
}

# Performing Grid Search with 5-fold cross-validation
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(X, y)

# Best parameters and score
print("Best Parameters:", grid_search.best_params_)
print("Best Score:", -grid_search.best_score_)

Explanation:

• The GridSearchCV function performs exhaustive evaluation of


combinations of hyperparameters, using cross-validation to ensure
the reliability of results.

• In this example, various values of alpha (the regularization


strength) and different solvers are tested to find the best
configuration that minimizes the mean squared error.

• Optimal hyperparameter tuning through Grid Search enhances


model performance, ensuring that it generalizes well to new data.

5.2.2 Example 2: Grid Search for Support Vector Machine (SVM)

Sample Data:

Feature 1  Feature 2  Target
1          2          0
2          3          0
3          4          1
4          5          1
5          6          1
6          7          0
7          8          1
8          9          1
9          10         0
10         11         1
Grid Search Implementation:

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
import numpy as np

# Sample data
X = np.array([[1, 2], [2, 3], [3, 4], [4, 5], [5, 6], [6, 7], [7, 8], [8, 9], [9, 10], [10, 11]])
y = np.array([0, 0, 1, 1, 1, 0, 1, 1, 0, 1])

# Defining the model
model = SVC()

# Defining the hyperparameter grid
param_grid = {
    'C': [0.1, 1, 10, 100],
    'kernel': ['linear', 'poly', 'rbf'],
    'degree': [2, 3, 4],  # Only applicable if 'poly' is selected
    'gamma': ['scale', 'auto']
}

# Performing Grid Search with 3-fold cross-validation
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=3, scoring='accuracy')
grid_search.fit(X, y)

# Best parameters and score
print("Best Parameters:", grid_search.best_params_)
print("Best Score:", grid_search.best_score_)

Explanation:

• Support Vector Machines (SVMs) have multiple hyperparameters, such as C (penalty parameter), kernel type, and gamma, which can significantly affect the model's performance.

• Grid Search tests various combinations of these parameters to identify the best set that maximizes accuracy, thus optimizing the decision boundary of the SVM.

• Proper tuning of SVM parameters is crucial for achieving a well-fitted model that can handle both linear and non-linear separations.

5.3 Model Refinement


Model refinement involves systematically improving a model by
adjusting its parameters, modifying the features, and enhancing the
data representation. Techniques such as cross-validation,
regularization, and feature engineering play vital roles in this
iterative process.

5.3.1 Example 1: Refining a Linear Regression Model with Cross-Validation

Sample Data:

Years of Experience Salary ($)


1 35000
2 40000
3 45000
4 50000
5 55000
6 60000
7 65000
8 70000
9 75000
10 80000

Cross-Validation Implementation:

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
import numpy as np

# Sample data
X = np.array([[1], [2], [3], [4], [5], [6], [7], [8], [9], [10]])
y = np.array([35000, 40000, 45000, 50000, 55000, 60000, 65000, 70000, 75000, 80000])

# Defining the model
model = LinearRegression()

# Performing cross-validation with 10 folds
scores = cross_val_score(model, X, y, cv=10, scoring='neg_mean_squared_error')

# Calculating mean and standard deviation of the scores
mean_score = np.mean(-scores)
std_dev_score = np.std(-scores)

print(f"Mean MSE: {mean_score}, Standard Deviation of MSE: {std_dev_score}")
Explanation:

• Cross-validation divides the data into multiple folds, allowing the


model to be trained and validated on different subsets, providing a
robust estimate of its performance.

• The negative mean squared error (MSE) scoring metric quantifies


the model’s prediction accuracy, with lower values indicating better
performance.

• Refinement through cross-validation helps detect overfitting,


guiding adjustments to the model’s complexity and feature selection
for improved generalization.

5.3.2 Example 2: Refining with Regularization Techniques

Regularization methods such as Lasso and Ridge regression add


penalties to the model’s loss function, discouraging overly complex
models and reducing the risk of overfitting.
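Up to implementation-specific scaling of the terms, the two penalized objectives can be written as

\[
L_{\text{ridge}}(\beta) = \sum_{i}(y_i - \hat{y}_i)^2 + \alpha\sum_{j}\beta_j^{2}, \qquad
L_{\text{lasso}}(\beta) = \sum_{i}(y_i - \hat{y}_i)^2 + \alpha\sum_{j}\lvert\beta_j\rvert,
\]

where the parameter \alpha controls the strength of the penalty.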

Sample Data:

Feature Target
1 1.5
2 2.3
3 3.6
4 4.8
5 5.9
6 6.5
7 7.8
8 8.1
9 9.3
10 10.5

Lasso Regression with Regularization:

from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score
import numpy as np

# Sample data
X = np.array([[1], [2], [3], [4], [5], [6], [7], [8], [9], [10]])
y = np.array([1.5, 2.3, 3.6, 4.8, 5.9, 6.5, 7.8, 8.1, 9.3, 10.5])

# Defining the model with Lasso regularization
model = Lasso(alpha=0.1)

# Performing cross-validation with 5 folds
scores = cross_val_score(model, X, y, cv=5, scoring='neg_mean_squared_error')

# Calculating mean and standard deviation of the scores
mean_score = np.mean(-scores)
std_dev_score = np.std(-scores)

print(f"Mean MSE with Lasso: {mean_score}, Standard Deviation of MSE: {std_dev_score}")

Explanation:

• Lasso regression adds an L1 penalty to the loss function, encouraging sparsity by shrinking less important coefficients to zero, effectively performing feature selection.

• Regularization techniques like Lasso and Ridge help in balancing model complexity and prediction accuracy, making the model more robust to variations in the data.

• Refinement through regularization is particularly useful when dealing with high-dimensional data or when preventing overfitting is a priority.

5.4 Binning
Binning, or discretization, is a technique used to transform
continuous variables into discrete categories. It simplifies the data,
reduces the impact of noise, and can enhance model performance,
particularly in algorithms that prefer categorical inputs.

5.4.1 Example 1: Binning Continuous Data into Equal Width Bins

Sample Data:

Age Income ($)


22 28000
25 32000
30 40000
35 48000
40 60000
45 72000
50 85000
55 90000
60 100000
65 110000

Equal Width Binning Implementation:

import pandas as pd

# Sample data
ages = pd.Series([22, 25, 30, 35, 40, 45, 50, 55, 60, 65])

# Binning ages into equal width intervals
age_bins = pd.cut(ages, bins=5, labels=['20-30', '30-40', '40-50', '50-60', '60-70'])

# Creating DataFrame
df = pd.DataFrame({'Age': ages, 'Age Group': age_bins})
print(df)
Explanation:

• Equal width binning divides the range of the variable into equally spaced intervals, regardless of the data distribution, simplifying analysis by grouping similar values.

• Binning is useful for reducing the impact of outliers and noise, and for transforming continuous variables into categories that are easier to interpret and model.

• This approach is especially beneficial in decision trees, where binning can help in creating more meaningful splits based on categorical values.

5.4.2 Example 2: Binning into Equal Frequency Bins

Equal frequency binning, by contrast, places roughly the same number of observations in each bin, which is helpful when the data distribution is uneven.

Equal Frequency Binning Implementation:

# Binning ages into equal frequency intervals
age_bins_equal_freq = pd.qcut(ages, q=4, labels=['Q1', 'Q2', 'Q3', 'Q4'])

# Adding binned data to DataFrame
df['Age Group (Equal Frequency)'] = age_bins_equal_freq
print(df)
Explanation:

• Equal frequency binning (pd.qcut()) adjusts the bin intervals so


that each contains a similar number of data points, balancing the
representation of each group.

• This method is particularly effective for normalizing data


distributions, making the categories more comparable, and reducing
the skewness of variables.

• Binning enhances the interpretability of the data and can improve


the performance of models that are sensitive to the distribution of
input variables.


5.5 Indicator Variables (Dummy Variables)


Indicator variables, also known as dummy variables, convert categorical data into a numerical form that models can handle. This makes it possible to incorporate categorical data into models that expect numerical inputs, such as regression models and most machine learning algorithms.

5.5.1 Example 1: Creating Indicator Variables for Categorical Data

Sample Data:

Customer  Product Category
1         Electronics
2         Clothing
3         Home
4         Electronics
5         Home
6         Clothing
7         Electronics
8         Home
9         Clothing
10        Electronics

Dummy Variables Implementation:

import pandas as pd

# Sample data
data = {
    'Customer': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'Product Category': ['Electronics', 'Clothing', 'Home', 'Electronics', 'Home',
                         'Clothing', 'Electronics', 'Home', 'Clothing', 'Electronics']
}
df = pd.DataFrame(data)

# Creating dummy variables for 'Product Category'
df_dummies = pd.get_dummies(df, columns=['Product Category'], drop_first=True)

print(df_dummies)
Explanation:

• The pd.get_dummies() function creates a binary variable for each category in the specified column, putting the data in a form that machine learning algorithms can accept.

• Setting drop_first=True excludes the first category to prevent multicollinearity, keeping the resulting dummy variables linearly independent.

• Indicator variables are very useful in regression and especially classification, where the model needs to interpret and exploit categorical information without implying any order or distance between categories.

5.5.2 Example 2: Creating Indicator Variables for Multiple Categorical Features

Sample Data:

Customer  Region  Product Category
1         North   Electronics
2         South   Clothing
3         East    Home
4         West    Electronics
5         North   Home
6         South   Clothing
7         East    Electronics
8         West    Home
9         North   Clothing
10        South   Electronics

Dummy Variables for Multiple Features:


# Creating dummy → ’ Product
df_dummies_multiple = pd . get_dummies (df , → columns =[ ’ Region ’ , ’ Product ␣
Category ’ → ] , drop_first = True )
variables for ’ Region ’ and Category ’
print ( df_dummies_multiple )
Explanation:

• Dummy variables can be created for several categorical variables at once, producing a richer set of attributes with more detailed information.

• This approach uses categorical data to full effect, since the model can differentiate between categories, enhancing its accuracy and interpretability.

• The use of dummy variables for categorical data is crucial in model development for correct discretization and for robustness, specifically in regression and classification analysis.

5.6 Practice Questions and Solutions


5.6.1 Practice Questions on Model Refinement

Question 1: Grid Search

Perform a grid search to find the best hyperparameters for a Ridge


regression model using the following data:

• Features: [[1], [2], [3], [4], [5], [6], [7], [8], [9], [10]]
• Target: [3, 5, 7, 10, 12, 15, 18, 20, 23, 25]
• Hyperparameter Grid: alpha = [0.1, 1, 10], solver = [’auto’, ’svd’]
Solution:

from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Ridge
import numpy as np

# Sample data
X = np.array([[1], [2], [3], [4], [5], [6], [7], [8], [9], [10]])
y = np.array([3, 5, 7, 10, 12, 15, 18, 20, 23, 25])

# Defining the model
model = Ridge()

# Defining the hyperparameter grid
param_grid = {
    'alpha': [0.1, 1, 10],
    'solver': ['auto', 'svd']
}

# Performing Grid Search
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=3, scoring='r2')
grid_search.fit(X, y)

# Best parameters and score
print("Best Parameters:", grid_search.best_params_)
print("Best Score:", grid_search.best_score_)

Question 2: Model Refinement


Use cross-validation to refine a Linear Regression model on the
following data and report the mean and standard deviation of the
MSE:

• Features: [[5], [10], [15], [20], [25], [30], [35], [40], [45], [50]]
• Target: [8, 16, 24, 32, 40, 48, 56, 64, 72, 80]
Solution:

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
import numpy as np

# Sample data
X = np.array([[5], [10], [15], [20], [25], [30], [35], [40], [45], [50]])
y = np.array([8, 16, 24, 32, 40, 48, 56, 64, 72, 80])

# Defining the model
model = LinearRegression()

# Performing cross-validation
scores = cross_val_score(model, X, y, cv=5, scoring='neg_mean_squared_error')

# Calculating mean and standard deviation of the scores
mean_score = np.mean(-scores)
std_dev_score = np.std(-scores)

print(f"Mean MSE: {mean_score}, Standard Deviation of MSE: {std_dev_score}")

Question 3: Binning
Bin the following age data into equal width and equal frequency
categories and add these as new columns in the DataFrame:

• Ages: [23, 27, 31, 36, 42, 47, 53, 59, 65, 71]
Solution:

import pandas as pd

# Sample data
ages = pd.Series([23, 27, 31, 36, 42, 47, 53, 59, 65, 71])

# Equal width binning
age_bins = pd.cut(ages, bins=4, labels=['20-35', '35-50', '50-65', '65-80'])

# Equal frequency binning
age_bins_equal_freq = pd.qcut(ages, q=4, labels=['Q1', 'Q2', 'Q3', 'Q4'])

# Creating DataFrame
df = pd.DataFrame({'Age': ages, 'Age Group (Equal Width)': age_bins, 'Age Group (Equal Frequency)': age_bins_equal_freq})
print(df)

Question 4: Indicator Variables


Create indicator variables for the following categorical data on
customer preferences:
• Regions: [’North’, ’South’, ’East’, ’West’, ’North’, ’East’, ’South’,
’West’]
• Preferences: [’High’, ’Low’, ’Medium’, ’High’, ’Low’, ’Medium’, ’High’,
’Low’]
Solution:

import pandas as pd

# Sample data
data = {
    'Region': ['North', 'South', 'East', 'West', 'North', 'East', 'South', 'West'],
    'Preference': ['High', 'Low', 'Medium', 'High', 'Low', 'Medium', 'High', 'Low']
}
df = pd.DataFrame(data)

# Creating dummy variables for 'Region' and 'Preference'
df_dummies = pd.get_dummies(df, columns=['Region', 'Preference'], drop_first=True)

print(df_dummies)

5.7 Conclusion
This comprehensive chapter covers essential model refinement
techniques, including Grid Search for hyperparameter tuning,
iterative model refinement through cross-validation and
regularization, binning continuous variables into categories, and
creating indicator variables for categorical data. By applying these
techniques, students and practitioners can enhance their models’
accuracy, interpretability, and robustness, ultimately leading to more
reliable and effective predictive analytics.
