Practical Data Science
Introduction
This book provides a comprehensive guide to data analysis and
machine learning using Python, designed to equip readers with the
necessary tools and knowledge to excel in the field of data science.
It covers a wide range of topics from data manipulation using
libraries like Pandas and NumPy to advanced scientific computing
with SciPy, making it suitable for both beginners and experienced
practitioners.
Features
• Step-by-step tutorials with practical examples
• Hands-on practice questions with solutions
• Integration of Python libraries for comprehensive data analysis
Target Audience
This book is ideal for students, data enthusiasts, and professionals
looking to acquire new skills and face the challenges of the modern
world of data science.
Preface
With data becoming the new oil that is revolutionizing industries and products, making sense of the ever-growing volumes of data has become a critical necessity. Practical Data Science: An Introduction to Data Analysis and Machine Learning with Python: A Complete Guide has been written to give readers sufficient data processing and machine learning knowledge to get started in the field of data science.
It is our desire that this book be more than just a reference manual for the field of data science: rather, readers should be able to view it as a guide through this territory, one that allows them to navigate it with both speed and purpose. As an introductory text, this book is intended for students, data enthusiasts, and professionals who want to acquire new skills and face the challenges of the modern world of data science.
Acknowledgments
Finishing this book has been a greatly inspiring experience, and it was made possible by the support of several people.
Lastly, I would like to thank the readers who have selected this book as their companion on their journey to becoming data scientists. Your passion and interest are the top motivators behind this work, and I hope that this book offers you enough information to be useful on your way to becoming a true data science expert.
Contents
1 Introduction to Data Analysis with Python
  1.1 Introduction to Data Analysis
  1.2 Pandas: Data Manipulation and Analysis

2 Pandas Library
  2.1 Introduction to Pandas Library
    2.1.1 Why Use Pandas?
    2.1.2 Installation
  2.2 Creating DataFrames and Series
    2.2.1 Creating DataFrames
    2.2.2 Creating Series
  2.3 Basic DataFrame Operations
    2.3.1 Viewing and Inspecting DataFrames
    2.3.2 Sorting and Filtering Data
  2.4 Handling Missing Data
    2.4.1 Detecting Missing Data
    2.4.2 Handling Missing Data
  2.5 Data Aggregation and Grouping
    2.5.1 Grouping and Aggregation
  2.6 Merging and Joining DataFrames
    2.6.1 Merging DataFrames
  2.7 Working with Dates and Times
    2.7.1 Handling DateTime Data
  2.8 Case Studies and Practical Examples
    2.8.1 Case Study 1: Analyzing Sales Data
    2.8.2 Case Study 2: Customer Segmentation
    2.8.3 Case Study 3: Time Series Forecasting
  2.9 Practice Questions and Solutions
    2.9.1 Practice Questions on Pandas
  2.10 Conclusion

3 Scikit-Learn Library
  3.1 Introduction to Scikit-Learn
    3.1.1 Why Use Scikit-Learn?
    3.1.2 Installation
  3.2 Supervised Learning
    3.2.1 Linear Regression
    3.2.2 Logistic Regression
    3.2.3 Decision Trees
  3.3 Unsupervised Learning
    3.3.1 K-Means Clustering
    3.3.2 Principal Component Analysis (PCA)
  3.4 Model Evaluation and Parameter Tuning
    3.4.1 Cross-Validation
    3.4.2 Hyperparameter Tuning
  3.5 Practice Questions and Solutions
    3.5.1 Practice Questions on Scikit-Learn
  3.6 Conclusion

4 Statistics
  4.1 Introduction to Statistics
  4.2 Descriptive Statistics
    4.2.1 Key Measures
    4.2.2 Example: Descriptive Statistics Calculation
  4.3 Basics of Grouping
    4.3.1 Example 1: Grouping and Aggregation by Class
    4.3.2 Example 2: Grouping by Multiple Columns
  4.4 Analysis of Variance (ANOVA)
    4.4.1 Example 1: One-Way ANOVA
    4.4.2 Example 2: Two-Way ANOVA
  4.5 Correlation
    4.5.1 Example 1: Correlation Calculation
    4.5.2 Example 2: Correlation Matrix
  4.6 Polynomial Regression and Pipelines
    4.6.1 Example 1: Polynomial Regression with Pipelines
    4.6.2 Example 2: Using Pipelines for Scaling and Regression
  4.7 R-squared and MSE for In-Sample Evaluation
    4.7.1 Example 1: Calculating R-squared and MSE
    4.7.2 Example 2: Comparing Models Using R-squared and MSE
  4.8 Prediction and Decision Making
    4.8.1 Example 1: Prediction and Decision Making in Business
    4.8.2 Example 2: Decision Making in Healthcare
  4.9 Practice Questions and Solutions
    4.9.1 Practice Questions on Statistics
  4.10 Conclusion
• Series: a one-dimensional labeled object that stores an array of data together with an array of labels, the index.
Explanation: The head() and tail() methods are used to quickly view
the top and bottom rows of the DataFrame. The describe() method
provides summary statistics of numerical columns, and info()
displays the structure of the DataFrame.
# Filtering rows where Salary is between 55000 and 65000
salary_range = df[(df['Salary'] > 55000) & (df['Salary'] < 65000)]
print(salary_range)

# Filtering using the 'isin' method
location_filter = df[df['Location'].isin(['New York', 'Chicago'])]
print(location_filter)

# Using 'not' conditions
not_new_york = df[~df['Location'].str.contains('New York')]
print(not_new_york)
print(df)

Example 3: Dropping Missing Data

# Dropping rows with any missing values
df_cleaned = df.dropna()
print(df_cleaned)
print(df_cleaned_specific)

# Aggregating data
average_age = grouped['Age'].mean()
total_salary = grouped['Salary'].sum()
print(average_age)
print(total_salary)
print(pivot)
df2 = pd.DataFrame({
    'Name': ['Alice', 'Bob'],
    'Salary': [50000, 60000]
})

# Merging on the 'Name' column
merged_df = pd.merge(df1, df2, on='Name')
print(merged_df)

print(left_join)
print(right_join)
print(outer_join)
print(inner_join)
df2 = pd.DataFrame({
    'Name': ['Charlie', 'David'],
    'Age': [35, 40]
})

# Concatenating along rows (the default)
concat_df = pd.concat([df1, df2])

# Concatenating along columns
concat_df_columns = pd.concat([df1, df2], axis=1)

print(concat_df)
print(concat_df_columns)

data = {
    'Date': ['2023-01-01', '2023-01-02', '2023-01-03'],
    'Value': [100, 150, 200]
}
df = pd.DataFrame(data)
print(monthly_avg)

print(arr1)
print(arr2)
print(arr3)
Explanation: Arrays can also be created from Python lists using np.array(). The arange() function creates an array of values within a given range, spaced by a specified step.

print(zeros)
print(ones)
print(identity)
# Element-wise operations
arr_plus_5 = arr + 5       # Adds 5 to each element
arr_times_2 = arr * 2      # Multiplies each element by 2
arr_squared = arr ** 2     # Squares each element

# Matrix addition
matrix_sum = matrix1 + matrix2

# Element-wise multiplication
matrix_mult = matrix1 * matrix2

# Dot product (matrix multiplication)
dot_product = np.dot(matrix1, matrix2)

print(matrix_sum)
print(matrix_mult)
print(dot_product)
Explanation: NumPy supports both element-wise matrix operations
and matrix multiplication using the dot() function.
print(sin_values)
print(cos_values)
print(exp_values)
print(log_values)

slice1 = arr[1:4]   # Elements from index 1 to 3
slice2 = arr[:3]    # First three elements
slice3 = arr[3:]    # Elements from index 3 onwards

print(reshaped_arr)
print(flattened_arr)

# Boolean indexing
bool_index = matrix[matrix > 50]   # Elements greater than 50
print(elements)
print(bool_index)
1.3.5 Broadcasting
print(result)
print(broadcast_result)

# Trigonometric functions
sin_values = np.sin(angles)
cos_values = np.cos(angles)
tan_values = np.tan(angles)
print(sin_values)
print(cos_values)
print(tan_values)
print(exp_values)
print(log_values)
print(log10_values)
print(mean_value)
print(median_value)
print(std_value)
print(uniform_random)
print(normal_random)
1.4.2 Optimization
# Define a constraint: x + y = 1
constraints = ({'type': 'eq', 'fun': lambda variables: variables[0] + variables[1] - 1})
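To show how such a constraint is typically used, here is a minimal sketch of a constrained minimization with scipy.optimize.minimize; the objective function and starting point below are illustrative assumptions, not the book's exact example.

from scipy.optimize import minimize

# Illustrative objective: minimize x^2 + y^2 subject to the constraint x + y = 1
def objective(variables):
    x, y = variables
    return x**2 + y**2

constraints = ({'type': 'eq', 'fun': lambda variables: variables[0] + variables[1] - 1})

result = minimize(objective, x0=[0.5, 0.5], constraints=constraints)
print('Optimal variables:', result.x)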
1.4.3 Integration
Example: Solving a differential equation with solve_ivp.

print('Solution:', solution.y)
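The equation setup itself is not visible in this extract; as a minimal sketch (the exponential-decay equation below is an illustrative assumption), scipy.integrate.solve_ivp can be used like this:

import numpy as np
from scipy.integrate import solve_ivp

# Illustrative ODE: dy/dt = -2y with y(0) = 1
def decay(t, y):
    return -2 * y

solution = solve_ivp(decay, t_span=(0, 5), y0=[1], t_eval=np.linspace(0, 5, 11))
print('Solution:', solution.y)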
1.4.4 Interpolation
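The interpolation example itself is missing from this extract; a minimal sketch using scipy.interpolate.interp1d with illustrative data points:

import numpy as np
from scipy.interpolate import interp1d

# Illustrative data points
x = np.array([0, 1, 2, 3, 4])
y = np.array([0, 2, 1, 3, 2])

# Build a linear interpolation function and evaluate it between the samples
f = interp1d(x, y, kind='linear')
print(f(1.5))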
print('U:', U)
print('S:', S)
print('Vh:', Vh)

# Sample data
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 6, 8, 10])
# Sample data
x_data = np.array([1, 2, 3, 4, 5])
y_data = np.array([2.2, 4.1, 6.0, 8.3, 10.1])

# Linear model to fit (its definition is not visible in the extract; a standard choice)
def linear_func(x, a, b):
    return a * x + b

# Fitting the curve
params, covariance = curve_fit(linear_func, x_data, y_data)
print('Fitted parameters:', params)
time_series1 = np.array([1, 2, 3, 4, 5, 6, 7, 8])   # leading values inferred; only the tail appears in the extract
time_series2 = np.array([3, 4, 5, 6, 7, 8, 9, 10])
Solution:
import pandas as pd

data = {
    'Name': ['John', 'Jane', 'Jim', 'Jack'],
    'Age': [28, 34, 29, 40],
    'City': ['Boston', 'Austin', 'Denver', 'Seattle']
}
df = pd.DataFrame(data)
print(df.head(3))
Question 2: Data Selection and Filtering
Solution:
data = {
    'Name': ['John', 'Jane', None, 'Jack'],
    'Age': [28, None, 29, 40],
    'City': ['Boston', None, 'Denver', 'Seattle']
}
df = pd.DataFrame(data)
df['Age'].fillna(df['Age'].mean(), inplace=True)
print(df)
Question 3: Broadcasting
Create a 3x3 matrix with values from 1 to 9 and add a vector [10,
20, 30] to each row using broadcasting.
Solution:
import numpy as np

matrix = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
vector = np.array([10, 20, 30])
broadcast_result = matrix + vector
print(broadcast_result)
Question 1: Optimization
Use SciPy to find the minimum of the function
f(x) = (x − 2)².
Solution:
from scipy.optimize import minimize

def func(x):
    return (x - 2) ** 2

result = minimize(func, x0=0)
print('Minimum at:', result.x)
Question 2: Numerical Integration

Integrate the function f(x) = x³ from 0 to 1 using SciPy.
Solution:
from scipy.integrate import quad

def integrand(x):
    return x ** 3

result, _ = quad(integrand, 0, 1)
print('Integral result:', result)
Question 3: Linear Algebra

1.6 Conclusion

This chapter provides an introduction to data analysis using Python's core libraries: Pandas, NumPy, and SciPy. Students and practitioners can use these libraries for data management, numerical computation, data analysis, and scientific computing to solve practical problems.
Chapter 2 Pandas Library
2.1 Introduction to Pandas Library
Pandas is an open-source Python library for data analysis, widely used in data science and AI. It provides simple yet highly effective data structures, DataFrames and Series, designed for analyzing structured data such as tables, time series, and large datasets. Pandas integrates smoothly with other Python libraries, including NumPy, Matplotlib, and Scikit-learn, which is one reason it is so widely adopted by data professionals. Pandas simplifies data handling and analysis through its intuitive API and powerful functionalities:
2.1.2 Installation
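Pandas is typically installed with pip (or with conda in an Anaconda environment):

pip install pandas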
df = pd.DataFrame(data)
print(df)

Explanation:

• The data is a list of dictionaries, and each dictionary becomes one row in the DataFrame.

Explanation:

• The sheet_name parameter lets you specify which sheet to read when the source Excel file contains several sheets.
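A minimal sketch of reading one sheet from an Excel workbook; the file and sheet names here are illustrative assumptions:

import pandas as pd

# Read a single sheet from a multi-sheet workbook (names are illustrative)
df = pd.read_excel('sales.xlsx', sheet_name='Q1')
print(df.head())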
Explanation:

• When you create a Series from a dictionary, the keys become the index labels and the values become the data.
• This approach is useful for creating Series with meaningful labels, such as city names or other categorical identifiers.
Explanation:
• The index parameter allows you to specify custom labels for each
element in the Series, providing greater flexibility in data selection
and alignment.
Explanation:
of numerical
Explanation:
• To delete a column or a row, use the drop() method; axis=1 removes a column and axis=0 (the default) removes a row.
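A minimal self-contained sketch of drop(); the DataFrame and the names used here are illustrative:

import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Salary': [50000, 60000]})

# axis=1 removes a column; axis=0 (the default) removes a row by its index label
df_no_salary = df.drop('Salary', axis=1)
df_no_first_row = df.drop(0, axis=0)
print(df_no_salary)
print(df_no_first_row)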
print(df_sorted)
print(df_sorted_multi)
Explanation:
• The sort_values() method sorts the DataFrame by one or more columns.
• Sorting can be ascending or descending, controlled by the ascending parameter, which can also take a list of booleans when sorting by several columns.
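As a minimal illustration of both points (the DataFrame and column names are assumptions):

import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Carol'],
                   'Age': [30, 25, 35],
                   'Salary': [60000, 50000, 55000]})

# Sort by a single column, descending
df_sorted = df.sort_values(by='Age', ascending=False)

# Sort by several columns, each with its own direction
df_sorted_multi = df.sort_values(by=['Salary', 'Age'], ascending=[True, False])
print(df_sorted)
print(df_sorted_multi)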
print(filtered_df)
print(filtered_by_index)

Explanation:

# ... 'Senior', 'Junior')   (statement truncated in the source)
print(df)

Explanation:
Explanation:
print(df)

Explanation:

• Replace missing values with fillna(), using the mean, median, or mode of the column.
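A minimal sketch of filling missing values with the column mean (median or mode work the same way); the data is illustrative:

import pandas as pd
import numpy as np

df = pd.DataFrame({'Age': [28, np.nan, 29, 40]})

# Replace missing ages with the mean age
df['Age'] = df['Age'].fillna(df['Age'].mean())
print(df)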
Example 2: Dropping Missing Data

# Dropping rows with any missing values
df_cleaned = df.dropna()
print(df_cleaned)
print(df_cleaned_specific)
print(df_cleaned_columns)
Explanation:
• A dataset can sometimes be simplified by dropping columns in which a high percentage of values are missing.

print(grouped)
print(grouped_count)
Explanation:
print(grouped)
Explanation:
• The agg() function lets you apply more than one aggregation function to grouped data, giving a richer summary.
• Aggregations can be specified per column, so you can define exactly how each summary is calculated.
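A minimal sketch of per-column aggregations with agg(); the grouping column and data are assumptions:

import pandas as pd

df = pd.DataFrame({'Team': ['A', 'A', 'B', 'B'],
                   'Age': [25, 30, 35, 40],
                   'Salary': [50000, 60000, 55000, 65000]})

# Different aggregations per column on the grouped data
summary = df.groupby('Team').agg({'Age': 'mean', 'Salary': ['min', 'max']})
print(summary)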
print(pivot)
print(pivot_with_totals)
Explanation:
print(cross_tab)
Explanation:
df2 = pd.DataFrame({
    'Name': ['Alice', 'Bob'],
    'Salary': [50000, 60000]
})

# Merging on the 'Name' column
merged_df = pd.merge(df1, df2, on='Name')
print(merged_df)

Explanation:

• The merge() function joins two or more DataFrames on a common column or index.
Explanation:
• Different types of joins control how DataFrames are combined (a short sketch follows this list):
  – Left Join: includes all rows from the left DataFrame and the matched rows from the right.
  – Right Join: includes all rows from the right DataFrame and the matched rows from the left.
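A minimal sketch of how the how= argument of pd.merge() selects the join type; the two DataFrames below are illustrative:

import pandas as pd

df1 = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [30, 25]})
df2 = pd.DataFrame({'Name': ['Alice', 'Charlie'], 'Salary': [50000, 70000]})

left_join = pd.merge(df1, df2, on='Name', how='left')
right_join = pd.merge(df1, df2, on='Name', how='right')
outer_join = pd.merge(df1, df2, on='Name', how='outer')
inner_join = pd.merge(df1, df2, on='Name', how='inner')
print(left_join)
print(right_join)
print(outer_join)
print(inner_join)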
df2 = pd.DataFrame({
    'Name': ['Charlie', 'David'],
    'Age': [35, 40]
})

# Concatenating along rows (the default)
concat_df = pd.concat([df1, df2])

# Concatenating along columns
concat_df_columns = pd.concat([df1, df2], axis=1)

print(concat_df)
print(concat_df_columns)

Explanation:

• Concatenation is useful for adding more rows of data or for joining DataFrames with different sets of columns. It is typical when data of a similar format are combined but come from different objects or different time points.
df2 = pd.DataFrame({
    'Salary': [50000, 60000]
}, index=['Alice', 'Bob'])

data = {
    'Date': ['2023-01-01', '2023-01-02', '2023-01-03'],
    'Value': [100, 150, 200]
}
df = pd.DataFrame(data)

# Converting to DateTime before extracting components
df['Date'] = pd.to_datetime(df['Date'])

# Extracting components
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month
df['Day'] = df['Date'].dt.day

print(df)

Explanation:

• Setting a DateTime column as the index enables a number of time-based operations, such as resampling.
• Working with timestamped data makes it possible to learn a lot about time-based measures, including processing time, response rate, and temporal distribution.
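A minimal sketch of setting a DateTime index and resampling; the data and the monthly frequency are illustrative:

import pandas as pd

df = pd.DataFrame({'Date': ['2023-01-01', '2023-01-02', '2023-01-03'],
                   'Value': [100, 150, 200]})
df['Date'] = pd.to_datetime(df['Date'])
df = df.set_index('Date')

# Resample to monthly frequency and average the values
monthly_avg = df['Value'].resample('M').mean()
print(monthly_avg)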
print(df)

Explanation:
Steps:
1. Load the sales data from a CSV file.
Solution:
import pandas as pd
import matplotlib.pyplot as plt

# Load the sales data
sales_data = pd.read_csv('sales_data.csv')

# Convert the 'Date' column to DateTime and set as index
sales_data['Date'] = pd.to_datetime(sales_data['Date'])
sales_data.set_index('Date', inplace=True)

# Identify the top 5 items by sales amount
top_items = sales_data.groupby('Item')['Amount'].sum().sort_values(ascending=False).head(5)
Steps:
1. Load the customer data from a CSV file.
2. Aggregate the data by customer ID to calculate total spend and
purchase frequency.
3. Create customer segments based on spend and frequency.
4. Analyze the characteristics of each segment.

Solution:
import pandas as pd

# Load the customer data
customer_data = pd.read_csv('customer_data.csv')

# Segment-level summary (the per-customer aggregation that produces customer_summary
# is not visible in this extract)
segment_analysis = customer_summary.groupby('Segment').agg({
    'TotalSpend': ['mean', 'count'],
    'Frequency': 'mean'
})

print(customer_summary)
print(segment_analysis)
Explanation:
Steps:
1. Load the stock price data from a CSV file.
2. Convert the date column to DateTime format and set it as the index.
3. Use a simple moving average to forecast future prices.
4. Evaluate the forecast accuracy using historical data.
Solution:
import pandas as pd

# Load the stock price data
stock_data = pd.read_csv('stock_data.csv')

# Convert the 'Date' column to DateTime and set as index
stock_data['Date'] = pd.to_datetime(stock_data['Date'])
stock_data.set_index('Date', inplace=True)

data = {
    'Name': ['John', 'Jane', 'Jim', 'Jack'],
    'Age': [28, 34, 29, 40],
    'City': ['Boston', 'Austin', 'Denver', 'Seattle']
}
df = pd.DataFrame(data)
print(df.head(3))
Question 2: Data Selection and Filtering
Solution:
data = {
    'Name': ['John', 'Jane', None, 'Jack'],
    'Age': [28, None, 29, 40],
    'City': ['Boston', None, 'Denver', 'Seattle']
}
df = pd.DataFrame(data)
df['Age'].fillna(df['Age'].mean(), inplace=True)
print(df)
Question 4: Aggregation and Grouping
Using the DataFrame created in Question 1, group by ’City’ and
calculate the mean age.
Solution:
grouped = df.groupby('City')['Age'].mean()
print(grouped)
2.10 Conclusion
This chapter provides an in-depth introduction to the Pandas library
for data analysis in Python. Through detailed explanations,
examples, and practical applications, students and practitioners can
learn to effectively handle, manipulate, and analyze data, perform
numerical computations, and apply advanced data transformation
techniques using Pandas. The practice questions reinforce the
concepts covered, ensuring a comprehensive understanding of
Pandas’ capabilities.
Chapter 3 Scikit-Learn Library
3.1 Introduction to Scikit-Learn
Scikit-learn is a versatile open-source Python library that provides
simple and efficient tools for data mining and machine learning. Built
on NumPy, SciPy, and Matplotlib, Scikit-learn integrates seamlessly
with other Python libraries, making it an essential tool for developing
machine learning models. It offers a wide range of supervised and
unsupervised learning algorithms, model evaluation metrics,
preprocessing tools, and hyperparameter tuning techniques.
3.1.2 Installation
# Sample data
X = np.array([[1], [2], [3], [4], [5]])   # Input features
y = np.array([2, 4, 5, 4, 5])             # Target variable

# Making predictions
predictions = model.predict(np.array([[6]]))
print("Predicted value for input 6:", predictions)
Explanation:
• Linear Regression fits a line to the data that minimizes the sum of
squared residuals between the predicted and actual values.
• These metrics help evaluate how well the linear regression model
captures the underlying relationship between the features and the
target variable.
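The model-creation and fitting steps are not visible in this extract; a minimal self-contained sketch of the usual pattern with the sample data above:

from sklearn.linear_model import LinearRegression
import numpy as np

X = np.array([[1], [2], [3], [4], [5]])
y = np.array([2, 4, 5, 4, 5])

# Creating and fitting the model, then predicting for a new input
model = LinearRegression()
model.fit(X, y)
print(model.predict(np.array([[6]])))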
# Making predictions
predictions = model.predict(np.array([[3.5]]))
print("Predicted class for input 3.5:", predictions)

Explanation:

• Logistic Regression outputs probabilities for each class, and the class with the highest probability is chosen as the prediction.

class_report = classification_report(y, predictions)
print(class_report)
Explanation:
# Making predictions
predictions = model.predict(np.array([[3]]))
print("Predicted class for input 3:", predictions)

Explanation:

# Making predictions
predictions = model.predict(np.array([[3.5]]))
print("Predicted class for input 3.5:", predictions)
Explanation:
• Decision Trees split the data based on feature values that best
separate the target classes, continuing recursively to create a tree
structure.
• Each node in the tree represents a decision point, and each leaf
node represents a class label.
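For completeness, a minimal self-contained sketch of fitting a decision tree classifier; the sample data is illustrative:

from sklearn.tree import DecisionTreeClassifier
import numpy as np

X = np.array([[1], [2], [3], [4], [5], [6]])
y = np.array([0, 0, 0, 1, 1, 1])

# Fit the tree and predict the class of a new input
model = DecisionTreeClassifier(random_state=0)
model.fit(X, y)
print(model.predict(np.array([[3.5]])))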
Explanation:
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis')
plt.scatter(model.cluster_centers_[:, 0], model.cluster_centers_[:, 1], s=200, c='red', marker='X')
plt.title('K-Means Clustering')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.grid(True)
plt.show()

Explanation:
print("Silhouette Score:", sil_score)

• Silhouette Score: measures how similar each point is to its own cluster compared with other clusters; values close to 1 indicate well-separated clusters.

# Plotting the PCA-reduced data (the variable name for the transformed data is assumed)
plt.scatter(X_pca[:, 0], X_pca[:, 1])
plt.title('PCA Dimensionality Reduction')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.grid(True)
plt.show()
Explanation:
• Visualizing the reduced data in two dimensions helps in
understanding how PCA has transformed the original dataset and
highlights patterns or clusters in the data.
Example 4: Using PCA for Feature Selection

# Sample data with redundant features
X = np.array([[1, 2, 1], [2, 3, 2], [3, 4, 3], [4, 5, 4]])

3.4.1 Cross-Validation

# Sample data
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([2, 4, 5, 4, 5])

# Creating the Linear Regression model
model = LinearRegression()

print("Cross-Validation Scores:", scores)
print("Mean Score:", scores.mean())
# Sample data
X = np.array([[1], [2], [3], [4], [5], [6]])
y = np.array([0, 0, 0, 1, 1, 1])

# Creating the Logistic Regression model
model = LogisticRegression()

# Sample data
X = np.array([[1], [2], [3], [4], [5], [6]])
y = np.array([0, 0, 0, 1, 1, 1])

# Creating the Random Forest model
model = RandomForestClassifier(random_state=0)

}

# Performing Grid Search
grid_search = GridSearchCV(model, param_grid, cv=3)
grid_search.fit(X, y)
Explanation:
• Grid Search exhaustively searches over a specified parameter grid,
training and evaluating the model on each combination of
parameters.
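Since the parameter grid itself is truncated in this extract, here is a minimal self-contained sketch of the same pattern; the grid values below are illustrative assumptions:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
import numpy as np

X = np.array([[1], [2], [3], [4], [5], [6]])
y = np.array([0, 0, 0, 1, 1, 1])

# Illustrative hyperparameter grid
param_grid = {'n_estimators': [10, 50], 'max_depth': [2, None]}

grid_search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
grid_search.fit(X, y)
print("Best Parameters:", grid_search.best_params_)
print("Best Score:", grid_search.best_score_)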
Explanation:
# Sample data
X = np.array([[1], [2], [3], [4], [5], [6]])
y = np.array([2, 4, 5, 4, 5, 6])

# Making a prediction
prediction = model.predict(np.array([[7]]))
print("Predicted value for input 7:", prediction)
Question 2: Logistic Regression
Create a Logistic Regression model using the data below and
evaluate its accuracy.
• Input Features (X): [1, 2, 3, 4, 5, 6]
• Target Labels (y): [0, 0, 0, 1, 1, 1]
Solution:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import numpy as np

# Sample data
X = np.array([[1], [2], [3], [4], [5], [6]])
y = np.array([0, 0, 0, 1, 1, 1])

# Fitting and predicting (steps implied by the question; not visible in the extract)
model = LogisticRegression()
model.fit(X, y)
predictions = model.predict(X)

# Calculating accuracy
accuracy = accuracy_score(y, predictions)
print("Accuracy:", accuracy)
# Sample data
X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])

# Creating and fitting the model
model = KMeans(n_clusters=2, random_state=0)
model.fit(X)

# Predicting cluster labels
labels = model.predict(X)

# Visualizing the clustered data
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis')
plt.scatter(model.cluster_centers_[:, 0], model.cluster_centers_[:, 1], s=200, c='red', marker='X')
plt.title('K-Means Clustering')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.grid(True)
plt.show()
• Data Points (X): [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
Solution:
from sklearn.decomposition import PCA
import numpy as np

# Sample data
X = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])

# Sample data
X = np.array([[1], [2], [3], [4], [5], [6]])
y = np.array([0, 0, 0, 1, 1, 1])

# Creating the model
model = RandomForestClassifier(random_state=0)
3.6 Conclusion
Chapter 3 provides a comprehensive introduction to Scikit-learn,
covering essential machine learning algorithms, model evaluation
techniques, and parameter tuning strategies. Through detailed
examples and practical applications, students and practitioners can
learn to build, evaluate, and optimize machine learning models using
Scikit-learn, empowering them to make data-driven decisions and
predictions.
Chapter 4 Statistics
4.1 Introduction to Statistics
Statistics is a branch of mathematics that deals with the collection,
analysis, interpretation, and presentation of data. In data science
and machine learning, statistical methods are used to describe data
distributions, test hypotheses, and make data-driven decisions. This
module covers essential statistical concepts, including descriptive
statistics, grouping, ANOVA, correlation, polynomial regression, and
in-sample evaluation metrics such as R-squared and Mean Squared
Error (MSE).
Student   Scores
A         85
B         90
C         78
D         88
E         92
F         70
G         95
H         80
I         85
J         77
Calculations:
import numpy as np
from scipy.stats import mode

# Sample data
scores = np.array([85, 90, 78, 88, 92, 70, 95, 80, 85, 77])

# Mean
mean_score = np.mean(scores)

# Median
median_score = np.median(scores)

# Mode
mode_score = mode(scores)

# Variance
variance_score = np.var(scores)

# Standard Deviation
std_dev_score = np.std(scores)

# Range
range_score = np.ptp(scores)
Explanation:
• The np.mean() function calculates the average score.
• The np.median() function identifies the central value of the
dataset.
• The mode() function from scipy.stats determines the most frequent
score.
Sample Data:

Student   Class      Scores
A         Science    85
B         Science    90
C         Arts       78
D         Arts       88
E         Science    92
F         Arts       75
G         Commerce   80
H         Commerce   85
I         Commerce   78
J         Science    87

data = {
    'Student': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'],
    'Class': ['Science', 'Science', 'Arts', 'Arts', 'Science', 'Arts', 'Commerce', 'Commerce', 'Commerce', 'Science'],
    'Scores': [85, 90, 78, 88, 92, 75, 80, 85, 78, 87]
}
df = pd.DataFrame(data)
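The grouping step itself does not appear in this extract; a minimal sketch continuing from the DataFrame built above:

# Grouping by 'Class' and calculating the mean score per class
grouped = df.groupby('Class')['Scores'].mean()
print(grouped)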
Sample Data:

Student   Class      Gender   Scores
A         Science    Male     85
B         Science    Female   90
C         Arts       Female   78
D         Arts       Male     88
E         Science    Male     92
F         Arts       Female   75
G         Commerce   Male     80
H         Commerce   Female   85
I         Commerce   Female   78
J         Science    Female   87

Calculations:

# Grouping by 'Class' and 'Gender', then calculating mean scores
grouped_multi_columns = df.groupby(['Class', 'Gender'])['Scores'].mean()
Sample Data:
Student   Class      Scores
A         Science    85
B         Science    90
C         Arts       78
D         Arts       88
E         Commerce   92
F         Commerce   85
G         Science    87
H         Arts       75
I         Commerce   80
J         Science    91
# Performing ANOVA
f_statistic, p_value = f_oneway(science_scores, arts_scores, commerce_scores)
• A low p-value (typically less than 0.05) indicates that there are
significant differences between the group means, suggesting that
not all group means are equal.
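The group arrays passed to f_oneway() are not shown in this extract; a minimal sketch that builds them from the sample table above:

from scipy.stats import f_oneway
import numpy as np

# Scores per class, taken from the sample table
science_scores = np.array([85, 90, 87, 91])
arts_scores = np.array([78, 88, 75])
commerce_scores = np.array([92, 85, 80])

f_statistic, p_value = f_oneway(science_scores, arts_scores, commerce_scores)
print('F-statistic:', f_statistic)
print('p-value:', p_value)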
Sample Data:
Student   Class      Study Method   Scores
A         Science    Online         85
B         Science    Offline        90
C         Arts       Online         78
D         Arts       Offline        88
E         Commerce   Online         92
F         Commerce   Offline        85
G         Science    Online         87
H         Arts       Online         75
I         Commerce   Offline        80
J         Science    Offline        91
4.5 Correlation
Correlation measures the strength and direction of a linear relationship between two variables. The correlation coefficient ranges from -1 to 1: values near 1 indicate a strong positive relationship, values near -1 a strong negative relationship, and values near 0 little or no linear relationship.
Sample Data:
Calculations:
import numpy as np

# Sample data
hours_studied = np.array([2, 3, 4, 5, 6])
scores = np.array([50, 60, 70, 80, 90])
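The correlation calculation itself is not visible here; a minimal sketch continuing from the arrays above:

# Pearson correlation coefficient between hours studied and scores
correlation = np.corrcoef(hours_studied, scores)[0, 1]
print('Correlation coefficient:', correlation)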
Sample Data:
Calculations:
# Sample data (each column is a variable)
data = np.array([
    [2, 50, 80],
    [3, 60, 85],
    [4, 70, 90],
    [5, 80, 95],
    [6, 90, 100]
])

# Calculating the correlation matrix
correlation_matrix = np.corrcoef(data, rowvar=False)
print("Correlation Matrix:\n", correlation_matrix)
Explanation:
Sample Data:
X   Y
1   2
2   4
3   6
4   8
5   11
Calculations:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
import numpy as np

# Sample data
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([2, 4, 6, 8, 11])

# Pipeline: polynomial features followed by linear regression
# (the first step's name is not visible in the extract; 'polynomial_features' is assumed)
pipeline = Pipeline([
    ('polynomial_features', PolynomialFeatures(degree=2)),
    ('linear_regression', LinearRegression())
])
• Pipelines make the process efficient and reduce the risk of errors in
data processing and model training.
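The fitting and prediction steps are not visible in this extract; a minimal sketch continuing from the pipeline object defined above:

# Fit the polynomial pipeline and predict on the training inputs
pipeline.fit(X, y)
print(pipeline.predict(X))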
Sample Data:
X   Y
1   3
2   6
3   9
4   12
5   15
Calculations:
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
import numpy as np

# Sample data
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([3, 6, 9, 12, 15])
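The pipeline construction is not visible in this extract; a minimal sketch of scaling followed by regression with make_pipeline, using the sample data above:

from sklearn.linear_model import LinearRegression

# Scale the feature, then fit a linear regression, in one pipeline
pipeline = make_pipeline(StandardScaler(), LinearRegression())
pipeline.fit(X, y)
print(pipeline.predict([[6]]))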
Sample Data:
Actual   Predicted
3        2.8
7        7.1
5        5.2
9        9.3
11       10.9
Calculations:
from sklearn.metrics import r2_score, mean_squared_error
import numpy as np

# Sample data
actual = np.array([3, 7, 5, 9, 11])
predicted = np.array([2.8, 7.1, 5.2, 9.3, 10.9])

# Calculating R-squared
r2 = r2_score(actual, predicted)

# Calculating Mean Squared Error
mse = mean_squared_error(actual, predicted)

print(f"R-squared: {r2}, Mean Squared Error: {mse}")

Explanation:

• The r2_score() function calculates the R-squared value, a measure of how well the model explains the variance in the target variable.
• These metrics help determine if a model is a good fit for the data,
guiding further model tuning or selection.
Sample Data:
X   Y
1   2
2   4.1
3   6.5
4   8.7
5   11.2

Calculations:

# Sample data
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([2, 4.1, 6.5, 8.7, 11.2])
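The model-fitting steps that produce linear_pred and poly_pred are missing from this extract; a minimal sketch of one way they could be computed from the data above:

from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Linear model predictions
linear_model = LinearRegression().fit(X, y)
linear_pred = linear_model.predict(X)

# Degree-2 polynomial model predictions
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)
poly_model = LinearRegression().fit(X_poly, y)
poly_pred = poly_model.predict(X_poly)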
# Evaluation metrics
linear_r2 = r2_score(y, linear_pred)
linear_mse = mean_squared_error(y, linear_pred)
poly_r2 = r2_score(y, poly_pred)
poly_mse = mean_squared_error(y, poly_pred)

print(f"Linear Model -> R-squared: {linear_r2}, MSE: {linear_mse}")
print(f"Polynomial Model -> R-squared: {poly_r2}, MSE: {poly_mse}")
Explanation:
Scenario:
A company uses a regression model to predict future sales based on
advertising spend. The goal is to decide on the advertising budget
for the next quarter to maximize sales.
Sample Data:
Calculations:
import numpy as np
from sklearn.linear_model import LinearRegression

# Sample data
X = np.array([[1000], [2000], [3000], [4000], [5000]])
y = np.array([150, 250, 350, 450, 550])
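The fitting and prediction steps are not shown in this extract; a minimal sketch (the new advertising spend of 6000 is an illustrative value):

# Fit the model and predict sales for a candidate advertising spend
model = LinearRegression()
model.fit(X, y)
print(model.predict(np.array([[6000]])))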
Explanation:
• The model uses linear regression to learn the relationship between
advertising spend and sales.
Scenario:
Sample Data:
Patient   Age   Severity Score   Readmitted (Yes/No)
1         65    7                Yes
2         50    4                No
3         80    9                Yes
4         45    3                No
5         70    6                Yes
Calculations:
from sklearn.linear_model import LogisticRegression
import numpy as np

# Sample data: [Age, Severity Score]; remaining rows completed from the table above
X = np.array([[65, 7], [50, 4], [80, 9], [45, 3], [70, 6]])
y = np.array([1, 0, 1, 0, 1])   # Readmitted: Yes = 1, No = 0
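Continuing the example, a minimal sketch of fitting the model and scoring a new patient; the new patient's values are illustrative:

# Fit the model and predict readmission for a new patient (age 60, severity 5)
model = LogisticRegression()
model.fit(X, y)
print(model.predict(np.array([[60, 5]])))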
Explanation:
Given the scores data below, calculate the mean, median, mode,
variance, standard deviation, range, and percentiles.
percentile_25 = np.percentile(scores, 25)
percentile_50 = np.percentile(scores, 50)   # Same as the median
percentile_75 = np.percentile(scores, 75)
data = {
    'Student': ['A', 'B', 'C', 'D', 'E', 'F', 'G'],
    'Class': ['Science', 'Science', 'Arts', 'Arts', 'Commerce', 'Commerce', 'Science'],
    'Scores': [85, 90, 78, 88, 92, 85, 87]
}
df = pd.DataFrame(data)

# Grouping by class (the grouping statement itself is not visible in the extract)
grouped = df.groupby('Class')['Scores'].mean()
print(grouped)
Question 3: ANOVA
Perform a one-way ANOVA on the following groups to determine if
their means are significantly different.
Solution:
from scipy . stats import f_oneway
Solution:
import numpy as np

# Sample data
hours_studied = np.array([1, 2, 3, 4, 5])
scores = np.array([50, 60, 70, 80, 90])
• X: [1, 2, 3, 4, 5]
• Y: [2, 4, 6, 8, 11]
Solution:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
import numpy as np

# Sample data
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([2, 4, 6, 8, 11])
# Sample data
actual = np.array([3, 7, 5, 9, 11])
predicted = np.array([2.8, 7.1, 5.2, 9.3, 10.9])

# Calculating R-squared
r2 = r2_score(actual, predicted)

# Calculating Mean Squared Error
mse = mean_squared_error(actual, predicted)

print(f"R-squared: {r2}, Mean Squared Error: {mse}")
4.10 Conclusion
This chapter provides a comprehensive overview of essential
statistical concepts, including descriptive statistics, grouping, ANOVA,
correlation, polynomial regression, and in-sample evaluation metrics
like R-squared and MSE. Through detailed examples, tabular data,
and practice questions, students and practitioners can gain a solid
understanding of statistical analysis techniques and their applications
in data-driven decision-making.
Chapter 5 Model Refinement
5.1 Introduction to Model Refinement
Model refinement is an essential step in the machine learning
pipeline, focusing on improving model performance through
optimization of parameters, feature transformation, and data
preprocessing techniques. Key methods include Grid Search for
hyperparameter tuning, iterative model refinement through
cross-validation, binning for data simplification, and indicator
(dummy) variables for categorical encoding. These techniques help
achieve a balance between model complexity and generalizability,
leading to robust predictive models.
Sample Data:
Feature   Target
1         3
2         5
3         7
4         10
5         12
6         15
7         18
8         20
9         23
10        25
# Sample data (the feature and target values follow the table above)
X = np.array([[1], [2], [3], [4], [5], [6], [7], [8], [9], [10]])
y = np.array([3, 5, 7, 10, 12, 15, 18, 20, 23, 25])

# The hyperparameter grid definition is truncated in the extract; it ends with a 'lsqr' solver option.

print("Best Parameters:", grid_search.best_params_)
print("Best Score:", -grid_search.best_score_)
Explanation:
Sample Data:
Feature 1   Feature 2   Target
1           2           0
2           3           0
3           4           1
4           5           1
5           6           1
6           7           0
7           8           1
8           9           1
9           10          0
10          11          1
Grid Search Implementation:
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
import numpy as np

# Sample data
X = np.array([[1, 2], [2, 3], [3, 4], [4, 5], [5, 6], [6, 7], [7, 8], [8, 9], [9, 10], [10, 11]])

# Grid search with 3-fold cross-validation
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=3, scoring='accuracy')
grid_search.fit(X, y)

# Best parameters and score
print("Best Parameters:", grid_search.best_params_)
print("Best Score:", grid_search.best_score_)
Explanation:
Sample Data:
Cross-Validation Implementation:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
import numpy as np
Sample Data:
Feature   Target
1         1.5
2         2.3
3         3.6
4         4.8
5         5.9
6         6.5
7         7.8
8         8.1
9         9.3
10        10.5
print(f"Mean MSE with Lasso: {mean_score}, Standard Deviation of MSE: {std_dev_score}")
Explanation:
5.4 Binning
Binning, or discretization, is a technique used to transform
continuous variables into discrete categories. It simplifies the data,
reduces the impact of noise, and can enhance model performance,
particularly in algorithms that prefer categorical inputs.
Sample Data:
# Binning ages into equal-width intervals
age_bins = pd.cut(ages, bins=5, labels=['20-30', '30-40', '40-50', '50-60', '60-70'])

# Creating DataFrame
df = pd.DataFrame({'Age': ages, 'Age Group': age_bins})
print(df)
Explanation:
• Binning is useful for reducing the impact of outliers and noise, and
for transforming continuous variables into categories that are easier
to interpret and model.
Equal-frequency binning, by contrast, puts roughly the same number of observations into each bin, which is helpful for skewed or uneven data distributions; a short sketch follows.
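A minimal sketch of equal-frequency binning with pd.qcut(); the ages and quartile labels are illustrative:

import pandas as pd

ages = pd.Series([23, 27, 31, 36, 42, 47, 53, 59, 65, 71])

# Each quartile bin receives roughly the same number of observations
age_bins_equal_freq = pd.qcut(ages, q=4, labels=['Q1', 'Q2', 'Q3', 'Q4'])
print(age_bins_equal_freq)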
Sample Data:
Customer   Product Category
1          Electronics
2          Clothing
3          Home
4          Electronics
5          Home
6          Clothing
7          Electronics
8          Home
9          Clothing
10         Electronics

data = {
    'Customer': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'Product Category': ['Electronics', 'Clothing', 'Home', 'Electronics', 'Home', 'Clothing', 'Electronics', 'Home', 'Clothing', 'Electronics']
}
df = pd.DataFrame(data)
print(df_dummies)
Explanation:
• The pd.get_dummies() function prepares the data by creating a binary indicator variable for each category in the specified column, so that machine learning algorithms can accept it.
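A minimal sketch of pd.get_dummies(); the column name and categories are taken from the sample table above:

import pandas as pd

df = pd.DataFrame({'Product Category': ['Electronics', 'Clothing', 'Home', 'Electronics']})

# One binary indicator column per category
df_dummies = pd.get_dummies(df, columns=['Product Category'])
print(df_dummies)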
Sample Data:
• Features: [[1], [2], [3], [4], [5], [6], [7], [8], [9], [10]]
• Target: [3, 5, 7, 10, 12, 15, 18, 20, 23, 25]
• Hyperparameter Grid: alpha = [0.1, 1, 10], solver = [’auto’, ’svd’]
Solution:
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Ridge
import numpy as np

# Sample data (from the question above)
X = np.array([[1], [2], [3], [4], [5], [6], [7], [8], [9], [10]])
y = np.array([3, 5, 7, 10, 12, 15, 18, 20, 23, 25])

# Grid search over the hyperparameter grid given in the question
param_grid = {'alpha': [0.1, 1, 10], 'solver': ['auto', 'svd']}
grid_search = GridSearchCV(Ridge(), param_grid, cv=3)
grid_search.fit(X, y)

print("Best Parameters:", grid_search.best_params_)
print("Best Score:", grid_search.best_score_)
• Features: [[5], [10], [15], [20], [25], [30], [35], [40], [45], [50]]
• Target: [8, 16, 24, 32, 40, 48, 56, 64, 72, 80]
Solution:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
import numpy as np

# Sample data (features and target from the question above)
X = np.array([[5], [10], [15], [20], [25], [30], [35], [40], [45], [50]])
y = np.array([8, 16, 24, 32, 40, 48, 56, 64, 72, 80])
Question 3: Binning
Bin the following age data into equal width and equal frequency
categories and add these as new columns in the DataFrame:
• Ages: [23, 27, 31, 36, 42, 47, 53, 59, 65, 71]
Solution:
import pandas as pd

# Sample data
ages = pd.Series([23, 27, 31, 36, 42, 47, 53, 59, 65, 71])

# Equal-width binning
age_bins = pd.cut(ages, bins=4, labels=['20-35', '35-50', '50-65', '65-80'])

# Equal-frequency binning
age_bins_equal_freq = pd.qcut(ages, q=4, labels=['Q1', 'Q2', 'Q3', 'Q4'])

# Creating DataFrame
df = pd.DataFrame({
    'Age': ages,
    'Age Group (Equal Width)': age_bins,
    'Age Group (Equal Frequency)': age_bins_equal_freq
})
print(df)

print(df_dummies)
5.7 Conclusion
This comprehensive chapter covers essential model refinement
techniques, including Grid Search for hyperparameter tuning,
iterative model refinement through cross-validation and
regularization, binning continuous variables into categories, and
creating indicator variables for categorical data. By applying these
techniques, students and practitioners can enhance their models’
accuracy, interpretability, and robustness, ultimately leading to more
reliable and effective predictive analytics.