DAP_3_module
• Pandas offers two primary data structures — Series and DataFrame — optimized
for fast and flexible data operations.
Feature | Description
Powerful Grouping Operations | Group by, pivot tables, aggregation
• Real-world datasets are often messy, and pandas provides powerful tools to clean,
transform, and prepare data.
• Works seamlessly with big data when combined with Dask or Vaex.
• Built on NumPy arrays: it inherits the speed and vectorized functionality of NumPy.
• Data is stored in blocks: Homogeneous types (like integers or floats) are grouped
together in memory for efficiency.
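A minimal sketch of what this means in practice (column names and values are illustrative):
import pandas as pd
df = pd.DataFrame({'a': [1, 2], 'b': [1.5, 2.5], 'c': ['x', 'y']})
print(df.dtypes)             # int64, float64, object: one block per type
print(df[['a']].to_numpy())  # the underlying NumPy array behind a column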
df.loc[0] # label-based
df.iloc[0] # position-based
3. GroupBy Operations
• Perform operations like sum, mean, count per group (e.g., total sales by city).
df.groupby('City')['Sales'].sum()
6. Time-Series Data
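A minimal sketch of pandas time-series support (the date range and values below are illustrative):
import pandas as pd
rng = pd.date_range('2023-01-01', periods=6, freq='D')   # daily DatetimeIndex
ts = pd.Series([1, 2, 3, 4, 5, 6], index=rng)
print(ts.resample('2D').sum())   # aggregate every two days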
• A Series is a one-dimensional labeled array that can hold any type of data
(integers, strings, floats, Python objects, etc.).
import pandas as pd
# Creating a Series (values are illustrative)
s = pd.Series([10, 20, 30, 40])
print(s)
import pandas as pd
# Creating a DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],   # illustrative values
    'Age': [25, 30, 35],
    'City': ['New York', 'Paris', 'London']
}
df = pd.DataFrame(data)
print(df)
df = pd.DataFrame([
[1, 'A'],
[2, 'B'],
[3, 'C']
], columns=['ID', 'Letter'])
print(df)
• Columns can hold different types of data (int, float, string, boolean, object, etc.).
Creating a DataFrame
From Dictionary
import pandas as pd
data = {
    'Name': ['Alice', 'Bob'],   # illustrative values
    'Age': [25, 30]
}
df = pd.DataFrame(data)
print(df)
1. Viewing Data
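A minimal sketch of the common viewing methods, assuming df is the DataFrame created above:
df.head()      # first 5 rows
df.tail(3)     # last 3 rows
df.shape       # (rows, columns)
df.info()      # column types and non-null counts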
2. Selecting Data
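A minimal sketch of common selection operations (column names assume the df above):
df['Name']            # select one column (Series)
df[['Name', 'Age']]   # select multiple columns (DataFrame)
df.loc[0]             # row by label
df.iloc[0]            # row by position
df[df['Age'] > 28]    # boolean filtering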
4. Adding/Modifying Columns
df['Country'] = 'USA'        # add a new column
df['Age'] = df['Age'] + 1    # modify an existing column
print(df)
5. Deleting Columns/Rows
# Delete column
df.drop('Country', axis=1, inplace=True)
# Delete row
df.drop(0, axis=0, inplace=True)
6. Renaming Columns
df.rename(columns={'Name': 'FullName'}, inplace=True)
7. Sorting
# Sort by column
df.sort_values('Age')
df.sort_values('Age', ascending=False)   # descending
8. Handling Missing Data
# Detect missing
df.isnull()
# Drop missing
df.dropna(inplace=True)
# Fill missing
df.fillna('Unknown', inplace=True)
9. Aggregations
Function | Example
Sum | df['Age'].sum()
Mean | df['Age'].mean()
Count | df['City'].count()
10. GroupBy
df.groupby('City')['Age'].mean()
11. Merging, Joining, and Concatenating
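A minimal sketch of the three operations, using two small illustrative DataFrames:
import pandas as pd
df1 = pd.DataFrame({'id': [1, 2], 'name': ['Alice', 'Bob']})
df2 = pd.DataFrame({'id': [1, 2], 'score': [90, 85]})
pd.concat([df1, df1], ignore_index=True)       # stack rows (df1 duplicated just for illustration)
pd.merge(df1, df2, on='id')                    # SQL-style join on a key
df1.set_index('id').join(df2.set_index('id'))  # join on index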
13. Reshaping
df.stack()     # move column labels into the row index (wide → long)
df.unstack()   # move an index level back to columns (long → wide)
Data Preprocessing
What is Data Preprocessing?
Data Preprocessing is the process of cleaning, transforming, and organizing raw data
into a usable format for machine learning models.
Real-world data is often incomplete, inconsistent, noisy, or unstructured, and
preprocessing improves the quality of the data.
Without preprocessing, models can give poor predictions, lower accuracy, or even fail to
run.
Key preprocessing steps:
• Handling Missing Values
• Handling Duplicates
• Normalizing Data
• Encoding Categorical Data
• Data Transformation
1) Handling Missing Values
Missing values occur when some data points are not recorded.
If not handled properly, they can lead to biased models or errors during training.
Common ways to handle missing data:
import numpy as np
import pandas as pd
data = {'Age': [25, np.nan, 30], 'Salary': [50000, 60000, np.nan]}   # illustrative values
df = pd.DataFrame(data)
# Fill missing Age with mean
df['Age'].fillna(df['Age'].mean(), inplace=True)
# Fill missing Salary with median
df['Salary'].fillna(df['Salary'].median(), inplace=True)
print(df)
2) Handling Duplicates
Method | Code
Detect duplicates | df.duplicated()
Remove duplicates | df.drop_duplicates()
import pandas as pd
data = {'Name': ['Alice', 'Bob', 'Alice'], 'Age': [25, 30, 25]}   # illustrative duplicate row
df = pd.DataFrame(data)
df = df.drop_duplicates()
print(df)
3) Normalizing Data
Normalization means scaling numerical data into a specific range (like 0–1).
It ensures that all features contribute equally to the result.
Especially important for models based on distance (e.g., KNN, SVM).
Method | Description
Min-Max Scaling | Rescales values to a fixed range, usually 0-1
Z-Score Standardization | Centers values at mean 0 with standard deviation 1
from sklearn.preprocessing import MinMaxScaler
data = [[10], [20], [30]]   # illustrative values
scaler = MinMaxScaler()
normalized_data = scaler.fit_transform(data)
print(normalized_data)
4) Encoding Categorical Data
Categorical values (like colors or city names) must be converted to numbers before most ML models can use them.
import pandas as pd
from sklearn.preprocessing import LabelEncoder
data = {'Color': ['Red', 'Green', 'Blue', 'Green']}   # illustrative values
df = pd.DataFrame(data)
# Label Encoding
encoder = LabelEncoder()
df['Color_encoded'] = encoder.fit_transform(df['Color'])
print(df)
5) Data Transformation
Common transformations include the log, square-root, and Box-Cox transformations, used to reduce skew and stabilize variance.
import numpy as np
data = np.array([1, 10, 100, 1000])   # illustrative skewed values
log_transformed = np.log(data)
print(log_transformed)
Data Wrangling
Introduction to Data Wrangling:
Data wrangling, also known as data munging, is the process of collecting, organizing,
cleaning, enriching, validating, and publishing data for analysis.
It is a critical step in data science because real-world data is often messy, incomplete, and
inconsistent.
Effective data wrangling ensures that the data is usable, reliable, and insightful for further
analysis, reporting, or machine learning.
1. Discovery
• This step involves identifying and collecting raw data from multiple sources such as databases, CSV files, APIs, web scraping, or manual records.
• The goal is to bring all relevant data into a single working environment.
import pandas as pd
data = pd.read_csv('sales_data.csv')
Ways to do:
• pd.read_csv / pd.read_excel / pd.read_json for files.
• Database queries, API calls, or web scraping for remote sources.
2. Organization
• Understand the dataset's structure, inspect columns, data types, null values, and
basic statistics.
• Helps in planning what cleaning and transformation are needed.
print(data.head())
# Summary statistics
print(data.describe())
print(data.info())
Ways to do:
• Visual inspection or exploratory data analysis (EDA) using libraries like matplotlib,
seaborn.
3. Cleaning
data_cleaned = data.dropna()                  # drop rows with missing values
data_cleaned = data_cleaned.drop_duplicates() # remove duplicate rows
data['price'] = data['price'].fillna(data['price'].mean())   # or impute instead of dropping
Ways to do:
• Drop or impute missing values.
• Remove duplicates.
• Correct data types and inconsistent entries.
4. Data Enrichment
• Improve the dataset by adding new features, transforming variables, normalizing
data, and aggregating information.
# Normalize a column (min-max; 'price' is illustrative)
data['price_norm'] = (data['price'] - data['price'].min()) / (data['price'].max() - data['price'].min())
Ways to do:
• Feature engineering.
5. Validation
• Validate the data to ensure it meets business requirements, logical rules, and
statistical thresholds.
# Flag rows that violate a business rule (e.g., negative prices; column is illustrative)
invalid_sales = data[data['price'] < 0]
print(invalid_sales)
Ways to do:
• Rule-based checks (value ranges, allowed categories).
• Statistical checks (outlier detection, distribution sanity checks).
6. Publish
• Save the clean, validated dataset in appropriate formats (CSV, JSON, database) and
document the wrangling process for reproducibility.
data.to_csv('cleaned_sales_data.csv', index=False)
Ways to do:
• Export with to_csv, to_json, or to_sql.
• Document each wrangling step for reproducibility.
Format | Explanation
Analytical Base Table (ABT) | A structured table where each row is an entity (e.g., customer) and each column is a feature; used in AI/ML.
Document Library Data | Text-based data stored in documents; needs techniques like text mining and NLP for analysis.
• Data Consistency:
Uniform and error-free data enhances analysis accuracy.
• Improved Insights:
Clean and enriched data helps in generating actionable business insights.
• Cost Efficiency:
Reduces time and computational resources required for downstream data tasks.
Example sales_data.csv (header row assumed; columns: Product, Sales, Quantity, Region):
Product,Sales,Quantity,Region
A,100,10,North
B,200,,South
C,150,20,North
A,100,10,North
D,300,15,
import pandas as pd
# Step 1: Discovery
data = pd.read_csv('sales_data.csv')
# Step 2: Organization
print(data.head())
print(data.info())
# Step 3: Cleaning
data = data.drop_duplicates()
data['Sales'] = data['Sales'].fillna(0)
data['Region'] = data['Region'].fillna('Unknown')
# Step 5: Validation
print(data[data['Sales'] < 0])   # expect no negative sales
# Step 6: Publish
data.to_csv('cleaned_sales_data.csv', index=False)
Once loaded, the data often requires cleaning to address issues like missing values,
duplicates, or inconsistencies.
Describing the data is then essential to understand its structure, content, and
characteristics. This process is foundational in data science and machine learning
workflows.
A CSV file is a plain-text file that stores tabular data, where each row represents a record,
and columns are separated by commas (or other delimiters like tabs or semicolons). It is
widely used due to its simplicity, compatibility with tools like Excel, and support in
programming languages like Python.
• Lightweight: Smaller file size compared to Excel or JSON for tabular data.
• Structured Data: Ideal for datasets with rows and columns (e.g., sales records,
customer data).
• Cleaning: Ensures data quality by fixing errors, missing values, and inconsistencies.
• Describing: Provides insights into the data’s structure, distribution, and potential
issues, guiding further preprocessing.
Loading or importing involves reading a CSV file into a Python environment, typically as a
DataFrame (a tabular data structure) for easy manipulation. Python’s pandas library is the
most popular tool for this task due to its flexibility and functionality.
1. Using pandas.read_csv():
o Key Parameters: filepath, sep (delimiter), header, encoding, usecols, dtype, nrows.
o Example:
import pandas as pd
df = pd.read_csv('data.csv')
print(df.head())
2. Using the built-in csv module:
o Methods: csv.reader (rows as lists), csv.DictReader (rows as dicts).
o Example:
import csv
with open('data.csv', newline='') as file:
    reader = csv.DictReader(file)
    for row in reader:
        print(row)
• File Not Found: Ensure the file path is correct (use absolute paths or check the
working directory with os.getcwd()).
• Encoding Errors: Specify the correct encoding (e.g., 'latin1' for non-UTF-8 files).
• Delimiter Mismatch: Use the correct sep parameter (e.g., sep=';' for semicolon-
separated files).
Data cleaning involves fixing errors, inconsistencies, and missing values in the loaded
dataset to ensure it is accurate and usable. CSV files often contain issues like missing
data, duplicates, incorrect formats, or outliers that must be addressed.
Cleaning Techniques
1. Handling Missing Values:
o Methods:
▪ Drop Missing Values: Remove rows or columns with missing data (df.dropna()).
▪ Fill Missing Values: Impute with mean, median, mode, or a constant (df.fillna(...)).
2. Removing Duplicates:
o Identify Duplicates: df.duplicated()
o Remove Duplicates: df.drop_duplicates()
3. Standardizing Formats:
o Standardize Text: df['city'] = df['city'].str.lower().str.strip()
o Standardize Dates: df['date'] = pd.to_datetime(df['date'])
o Fix Typos: df['city'] = df['city'].replace({'New Yrok': 'New York'})
1. df.info():
o Displays column names, data types, non-null counts, and memory usage.
o Example:
df.info()
o Use Case: Check for missing values and incorrect data types.
2. df.describe():
o Provides statistical measures for numerical columns (count, mean, std, min, max, quartiles).
o Example:
df.describe()
o Example:
df['city'].value_counts()              # frequency of each city
df.nunique()                           # unique values per column
df.isnull().sum()                      # missing values per column
df.groupby('city')['salary'].mean()    # average salary per city
• sqlite3 is a built-in Python library to work with SQLite databases (no installation
needed).
• You use SQL commands (like SELECT, INSERT, UPDATE) through Python to interact
with the database.
• SQLAlchemy (an ORM) abstracts away SQL complexity by allowing you to interact with databases using Python classes.
• Supports many database engines: SQLite, PostgreSQL, MySQL, etc.
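A minimal sketch, assuming SQLAlchemy is installed (table and column names are illustrative):
from sqlalchemy import create_engine, Column, Integer, String
from sqlalchemy.orm import declarative_base, Session

Base = declarative_base()

class User(Base):
    __tablename__ = 'users'
    id = Column(Integer, primary_key=True)
    name = Column(String)

engine = create_engine('sqlite:///example.db')   # engine for a local SQLite file
Base.metadata.create_all(engine)                 # create the table if it doesn't exist
with Session(engine) as session:
    session.add(User(name='Alice'))
    session.commit()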
Operation | Meaning
execute | Run SQL queries (CREATE, INSERT, UPDATE, SELECT, DELETE).
commit | Save pending changes to the database.
fetchall / fetchone | Retrieve the results of a SELECT query.
close | Close the connection when done.
import sqlite3
# Connect (creates the file if it doesn't exist)
conn = sqlite3.connect('example.db')
cursor = conn.cursor()
cursor.execute('''CREATE TABLE IF NOT EXISTS users
                  (id INTEGER PRIMARY KEY, name TEXT, age INTEGER)''')
The SQL inside execute() can be CREATE, INSERT, UPDATE, DELETE, SELECT, etc.
• Make sure to use IF NOT EXISTS to avoid errors if the table already exists.
cursor.execute('''
CREATE TABLE IF NOT EXISTS employees (
    id INTEGER PRIMARY KEY,
    name TEXT,
    department TEXT
)
''')
conn = sqlite3.connect('example.db')
cursor = conn.cursor()
cursor.execute('''
CREATE TABLE IF NOT EXISTS employees (
    id INTEGER PRIMARY KEY,
    name TEXT,
    age INTEGER
)
''')
Using ? and a tuple (("John Doe", "Finance")) protects against SQL Injection attacks
(safer practice).
cursor.execute('INSERT INTO employees (name, age) VALUES (?, ?)', ('Alice', 30))
conn.commit()
cursor.execute('SELECT * FROM employees')
rows = cursor.fetchall()
for row in rows:
    print(row)
• After making changes (inserting, updating, deleting), you need to commit them.
conn.commit()
• After a SELECT query, you can fetch all rows at once into a list.
rows = cursor.fetchall()
for row in rows:
    print(row)
• Instead of fetching all at once, you can fetch one row at a time.
row = cursor.fetchone()
print(row)
Useful when you know the query returns only one or few rows.
• After finishing database operations, you must close the connection properly.
conn.close()
Python, with its powerful libraries like Pandas, NumPy, and others, offers numerous
methods to clean data and remove extraneous information.
Data cleaning, also known as data cleansing or scrubbing, is the process of detecting and
correcting (or removing) errors, inconsistencies, and inaccuracies in a dataset. It
addresses issues such as missing values, duplicates, incorrect formats, outliers, and
irrelevant data to improve data quality. Clean data is essential for reliable analysis,
visualization, and machine learning model performance.
What is Stripping Out Extraneous Information?
Stripping out extraneous information means removing data that is irrelevant to the analysis, such as unused columns, irrelevant rows, or stray characters and whitespace.
• Enhances Analysis: Clean data leads to reliable insights and model performance.
• Focuses Analysis: Keeps only relevant data for the task at hand.
Before diving into cleaning methods, it’s important to understand the typical issues found
in datasets:
1. Missing Values: Absent entries (e.g., NaN, None, blanks).
2. Duplicates: Repeated rows or records.
3. Incorrect Data Types: Numbers or dates stored as text.
4. Outliers: Extreme values that distort statistics.
5. Inconsistent Formatting: Mixed case, extra spaces, varied date formats.
6. Irrelevant Data: Columns or rows not needed for the task.
7. Typos and Errors: Misspellings or invalid entries (e.g., 'New Yrok' instead of 'New York').
Python, especially with libraries like pandas, numpy, and scikit-learn, provides many
tools to identify, visualize, and handle missing data.
In pandas, missing values are represented as NaN (or None for object columns).
Common methods: df.isnull(), df.notnull(), df.isnull().sum().
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'Age': [25, np.nan, 30],        # illustrative values
    'City': ['NY', None, 'LA']
})
print(df.isnull().sum())
Other functions: df.isna(), df.notnull(), df.info().
There are multiple strategies depending on the problem and dataset size:
A. Removing Data
If the dataset is large and only a few rows have missing data, dropping them may be
acceptable.
df.dropna(inplace=True)
If a column has too many missing values (e.g., >70%), you might drop the column.
df.dropna(axis=1, inplace=True)
• Mean, Median, or Mode (most frequent value) are commonly used for numeric
data.
df['Age'].fillna(df['Age'].mean(), inplace=True)
df['City'].fillna(df['City'].mode()[0], inplace=True)
# Forward fill
df.fillna(method='ffill', inplace=True)
# Backward fill
df.fillna(method='bfill', inplace=True)
Removing Duplicates
df.drop_duplicates(inplace=True)
df.drop_duplicates(keep='first', inplace=True)
Correcting Data Types
• Correct data types are crucial: e.g., date columns should be datetime, numbers
should be int/float.
df['age'] = df['age'].astype(int)
df['date'] = pd.to_datetime(df['date'])
Handling Outliers
• Outliers are extreme values that differ greatly from most data points.
Q1 = df['column'].quantile(0.25)
Q3 = df['column'].quantile(0.75)
IQR = Q3 - Q1
# Keep only values within 1.5 * IQR of the quartiles
df = df[(df['column'] >= Q1 - 1.5 * IQR) & (df['column'] <= Q3 + 1.5 * IQR)]
# Z-score standardization
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df['normalized'] = scaler.fit_transform(df[['column']])
# Min-max scaling
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df['normalized'] = scaler.fit_transform(df[['column']])
• Formatting problems include inconsistent dates, extra spaces, wrong text case.
df['column'] = df['column'].str.strip()
• Text standardization:
df['name'] = df['name'].str.lower()
• Most ML algorithms can't work with raw text, so categories must be converted into numbers.
One-Hot Encoding:
df = pd.get_dummies(df, columns=['category'])
Label Encoding:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['category_encoded'] = le.fit_transform(df['category'])
• After cleaning, save the dataset for later use without losing progress.
df.to_csv('cleaned_data.csv', index=False)
df.to_excel('cleaned_data.xlsx', index=False)
# Strip stray whitespace
df['column_name'] = df['column_name'].str.strip()
# Drop rows with an irrelevant value
df = df[df['column_name'] != 'irrelevant_value']
• Equal Feature Contribution: Prevents features with larger scales from dominating
models (e.g., in distance-based algorithms like KNN or SVM).
Types of Normalization
Normalization techniques adjust data based on its distribution or range. The main methods
are described below.
What is Data Formatting?
• Compatibility: Prepares data for algorithms that require specific formats (e.g.,
numerical inputs).
Types of Formatting
Formatting applies to numerical, categorical, text, and temporal data, with techniques
tailored to each type.
• Formula: X_norm = (X − X_min) / (X_max − X_min)
import pandas as pd
df = pd.read_csv('data.csv')
# Pandas method (manual min-max; 'salary' is illustrative)
df['salary_norm'] = (df['salary'] - df['salary'].min()) / (df['salary'].max() - df['salary'].min())
Scikit-learn Method:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df['salary_norm'] = scaler.fit_transform(df[['salary']])
2. Z-Score Standardization
• Centers data around the mean (μ) with a standard deviation (σ) of 1.
• Formula: X_norm = (X − μ) / σ
• Assumes data follows a normal distribution but is less affected by outliers than min-
max.
Scikit-learn Method:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df['age_norm'] = scaler.fit_transform(df[['age']])
3. Robust Scaling
• Scales data using the median and interquartile range (IQR) instead of mean and
standard deviation.
Q1 = df['salary'].quantile(0.25)
Q3 = df['salary'].quantile(0.75)
IQR = Q3 - Q1
df['salary_norm'] = (df['salary'] - df['salary'].median()) / IQR   # manual robust scaling
Scikit-learn Method:
from sklearn.preprocessing import RobustScaler
scaler = RobustScaler()
df['salary_norm'] = scaler.fit_transform(df[['salary']])
Formatting ensures data is consistently represented across the dataset. Below are the key
methods, with theory, use cases, and concise code examples.
• Common tasks include converting case, removing whitespace, and cleaning special
characters.
Methods:
• Convert Case: df['name'] = df['name'].str.lower()
• Remove Whitespace: df['name'] = df['name'].str.strip()
• Replace Substrings: df['name'] = df['name'].str.replace('_', ' ')
Methods:
• Convert to Datetime: df['date'] = pd.to_datetime(df['date'])
• Standardize Format: df['date'] = df['date'].dt.strftime('%Y-%m-%d')
• Extract Components:
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
• Ensures numerical data is in the correct format (e.g., float, integer) and precision.
Methods:
• Convert to Numeric: df['salary'] = pd.to_numeric(df['salary'], errors='coerce')
• Round Numbers: df['salary'] = df['salary'].round(2)
• Convert to Integer: df['age'] = df['age'].astype(int)
• Standardize Categories:
df['department'] = df['department'].str.lower().str.strip()
• Label Encoding:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['department_code'] = le.fit_transform(df['department'])
• One-Hot Encoding:
df = pd.get_dummies(df, columns=['department'])
• Group Rare Categories:
threshold = 10
counts = df['category'].value_counts()
rare = counts[counts < threshold].index
df['category'] = df['category'].replace(rare, 'Other')   # lump rare categories together
File handling allows a program to persist data — saving outputs, reading inputs, managing
configurations, or exchanging information between systems.
In Python, files are handled with built-in functions and modules.
Key concepts:
• Opening a file.
• Reading from a file.
• Writing to a file.
• Closing the file.
Example:
file = open('example.txt', 'w')
file.write('Hello')
file.close()
Important: Always close the file after operations using .close() OR use with open(...) as ...:
to auto-close.
Mode | Meaning
'r' | Read (default); error if the file doesn't exist.
'w' | Write; truncates an existing file or creates a new one.
'a' | Append; writes at the end of the file.
'r+' | Read and write.
'b' | Binary mode (combined with others, e.g., 'rb', 'wb').
Methods:
• read(): Reads the whole file as one string.
• readline(): Reads a single line.
• readlines(): Reads all lines into a list.
Example:
with open('example.txt', 'r') as file:
    content = file.read()
    print(content)
# Read line by line
with open('example.txt', 'r') as file:
    for line in file:
        print(line.strip())
with open('output.txt', 'w') as file:
    file.write('Hello World\n')
lines = ['line 1\n', 'line 2\n']
with open('output.txt', 'a') as file:
    file.writelines(lines)
Example: file.seek(offset, whence), where whence is:
• 0 – beginning of file (default)
• 1 – current position
• 2 – end of file
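A minimal sketch, assuming example.txt exists (binary mode is required for seeking relative to positions other than the start):
with open('example.txt', 'rb') as f:
    print(f.tell())   # 0: position at the start
    f.seek(0, 2)      # jump to the end of the file
    print(f.tell())   # file size in bytes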
Many times, data is not just plain text but structured in formats like CSV or JSON.
import csv
# Writing
with open('people.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    writer.writerow(['Name', 'Age'])
    writer.writerow(['Alice', 25])
# Reading
with open('people.csv', 'r', newline='') as file:
    reader = csv.reader(file)
    for row in reader:
        print(row)
import json
data = {'name': 'Alice', 'age': 25}
# Writing JSON
with open('data.json', 'w') as f:
    json.dump(data, f)
# Reading JSON
with open('data.json', 'r') as f:
    data = json.load(f)
These techniques are essential when data is sourced from different files, databases, or
systems, and need to be consolidated to provide a comprehensive view. Python,
particularly with the Pandas library, offers powerful tools for combining and merging data
efficiently.
Merging data involves joining datasets based on common keys or indices, similar to SQL
joins, to integrate related information from different sources. It is used when datasets
share a common column or index (e.g., customer ID) but contain different attributes.
• Unified Analysis: Integrates data from multiple sources for comprehensive insights.
1. Combining (Concatenation):
2. Merging (Joining):
▪ Inner: Keeps only records with matching keys in both datasets.
▪ Left: Keeps all records from the left dataset, with NaNs for non-
matching right records.
▪ Right: Keeps all records from the right dataset, with NaNs for non-
matching left records.
▪ Outer (Full): Keeps all records, with NaNs where there’s no match.
• Requires datasets to have the same columns (or a subset) for alignment.
• Used when datasets represent different records of the same structure (e.g., sales
data from different months).
Code Example:
import pandas as pd
df1 = pd.DataFrame({'id': [1, 2], 'name': ['Alice', 'Bob']})
df2 = pd.DataFrame({'id': [3, 4], 'name': ['Cara', 'Dan']})
df_combined = pd.concat([df1, df2], ignore_index=True)
• join='outer': Include all columns (default); use 'inner' for common columns.
• Used when datasets contain different attributes for the same records (e.g.,
customer demographics and purchase amounts).
Code Example:
df1 = pd.DataFrame({'id': [1, 2], 'name': ['Alice', 'Bob']})
df2 = pd.DataFrame({'id': [1, 2], 'amount': [250, 400]})
df_merged = pd.merge(df1, df2, on='id')
Merging data involves joining datasets based on a common key or index, similar to SQL
joins. Below are the key methods, with theory, use cases, and concise code examples.
1. Inner Merge
• Keeps only records with matching keys in both datasets.
• Used when only complete matches are needed (e.g., customers with orders).
Code Example:
df_inner = pd.merge(df1, df2, on='id', how='inner')
Use Case:
• Analyzing only customers who actually placed orders.
Pros:
• No NaNs introduced; the result contains complete records only.
Cons:
• Non-matching records are dropped, which may lose information.
2. Left Merge
• Keeps all records from the left dataset, with NaNs for non-matching records from
the right dataset.
• Used when the left dataset is the primary source (e.g., all customers, even without
orders).
Code Example:
df_left = pd.merge(df1, df2, on='id', how='left')
Pros:
• Preserves every record from the primary (left) dataset.
Cons:
• Introduces NaNs for unmatched records from the right dataset.
3. Right Merge
• Keeps all records from the right dataset, with NaNs for non-matching records from
the left dataset.
• Used when the right dataset is the priority (e.g., all orders, even for unknown
customers).
Code Example:
df_right = pd.merge(df1, df2, on='id', how='right')
Use Case:
• Ensuring all order records are included, even for missing customers.
Pros:
• Preserves every record from the right dataset.
Cons:
• Introduces NaNs for unmatched records from the left dataset.
4. Outer (Full) Merge
• Keeps all records from both datasets, with NaNs where there's no match.
Code Example:
df_outer = pd.merge(df1, df2, on='id', how='outer')
Use Case:
• Auditing two datasets to see every record from both sides.
Pros:
• No data is lost from either dataset.
Cons:
• Can introduce many NaNs that need further handling.
5. Merging on Index
• Useful when datasets are indexed by a common identifier (e.g., time series data).
Code Example:
df1 = pd.DataFrame({'name': ['Alice', 'Bob']}, index=[1, 2])
df2 = pd.DataFrame({'age': [25, 30]}, index=[1, 2])
df_merged = pd.merge(df1, df2, left_index=True, right_index=True)
6. Merging on Multiple Keys
• Used when a single key is insufficient (e.g., matching by customer_id and date).
Code Example:
df1 = pd.DataFrame({'id': [1, 1], 'date': ['2023-01-01', '2023-01-02'], 'name': ['Alice', 'Alice']})
df2 = pd.DataFrame({'id': [1, 1], 'date': ['2023-01-01', '2023-01-03'], 'age': [25, 26]})
df_merged = pd.merge(df1, df2, on=['id', 'date'], how='inner')
1. Joining DataFrames
• join() combines DataFrames on their index by default.
Code Example:
df_joined = df1.join(df2)
These processes transform the structure of data, such as changing its layout from wide to
long format or creating summary tables.
Python’s Pandas library provides powerful tools like melt, pivot, pivot_table, and others to
perform these tasks efficiently.
Reshaping involves changing the structure or layout of a dataset without altering its
content. It typically transforms data between wide format (where each variable has its own
column) and long format (where variables are stacked into a single column with
corresponding values).
Pivoting is a specific type of reshaping that creates a summary table by spreading data
across rows and columns based on key variables. It is often used to aggregate data or
create cross-tabulations for easier interpretation.
• Flexibility: Converts data between formats (e.g., long for time-series, wide for
summaries).
• Data Compatibility: Prepares data for tools or algorithms that expect specific
structures.
o Wide Format: Each variable is a column, and each row represents a record
(e.g., columns for sales_2021, sales_2022).
o Long Format: Variables are stacked into a single column, with another
column indicating the variable name (e.g., columns for year and sales).
• Function: pd.melt.
Code Example:
import pandas as pd
df = pd.DataFrame({'id': [1, 2], 'sales_2021': [100, 150], 'sales_2022': [120, 160]})
df_long = pd.melt(df, id_vars='id', value_vars=['sales_2021', 'sales_2022'],
                  var_name='year', value_name='sales')
• Function: pd.pivot.
Code Example:
df = pd.DataFrame({'id': [1, 1, 2], 'year': [2021, 2022, 2021], 'sales': [100, 120, 150]})
df_wide = df.pivot(index='id', columns='year', values='sales')
• Similar to pivot, but allows aggregation of duplicate entries using a function (e.g.,
mean, sum).
• Function: pd.pivot_table.
• Used for summarizing data or handling datasets with multiple values for the same
index-column pair.
Code Example:
df = pd.DataFrame({'id': [1, 1, 2], 'year': [2021, 2021, 2022], 'sales': [100, 110, 150]})
df_summary = pd.pivot_table(df, index='id', columns='year', values='sales', aggfunc='mean')
• Stacking: Moves column labels into a row index, converting wide data to a long
format (multi-level index).
• Unstacking: Moves row index levels to columns, converting long data to a wide
format.
df = pd.DataFrame({'id': [1, 1, 2], 'year': [2021, 2022, 2021], 'sales': [100, 120, 150]})
df_stacked = df.set_index(['id', 'year']).stack()   # column labels → inner index level (long)
df_unstacked = df_stacked.unstack()                 # inner index level → columns (wide)
Python, particularly with the Pandas, NumPy, and Scikit-learn libraries, provides powerful
tools for data transformation.
Data transformation refers to the process of converting raw or unprocessed data into a
format that is more appropriate for analysis, modeling, or reporting.
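A minimal sketch of common transformations (the column name 'x' and its values are illustrative):
import numpy as np
import pandas as pd
df = pd.DataFrame({'x': [1, 10, 100, 1000]})
df['log_x'] = np.log1p(df['x'])    # log transform reduces right skew
df['sqrt_x'] = np.sqrt(df['x'])    # milder variance-stabilizing transform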
Introduction to Strings
• Python provides many methods for creating, transforming, and analyzing strings.
Operations that change how the string looks without changing the original.
Example:
s = "  hello  "
print(s.strip())   # 'hello'
print(s.upper())   # '  HELLO  '
• Concatenation: Using +
• Repetition: Using *
• Joining: .join(list)
Example:
s = "apple,banana,grape"
Example:
s = "hello world"
print(s.find("world")) # 6
print(s.isalpha())   # False (contains a space)
5. Formatting Strings
Example:
name = "Alice"
• Replacing (re.sub)
• Splitting (re.split)
Example:
import re
s = re.sub(r'\d+', '#', 'abc123def456')   # replace each run of digits
print(s)   # abc#def#
print(re.split(r',\s*', 'a, b,c'))        # ['a', 'b', 'c']
• Syntax: string[start:end:step].
Example:
s = "Python"
print(s[0])      # Output: P
print(s[-1])     # Output: n
print(s[0:3])    # Output: Pyt
print(s[::-1])   # Output: nohtyP (step -1 reverses)
Example:
s1 = "Hello"
s2 = "World"
print(s1 + " " + s2)   # Hello World (concatenation)
print(s1 * 3)          # HelloHelloHello (repetition)
Example:
s = "hello WORLD"
Example:
s = "hello world"
print(s.find('world')) # Output: 6
6. String Replacement
s = "bad dog"
7. Splitting and Joining
Example:
s = "apple,banana,grape"
fruits = s.split(",")
print(",".join(fruits)) # apple,banana,grape
8. String Validation
Example:
s = "Python3"
print(s.isalnum()) # True
9. String Formatting
• f-strings (f"...").
• .format() method.
• % old-style formatting.
Example:
name = "Alice"
print(f"Hello, {name}!")            # f-string
print("Hello, {}!".format(name))    # .format()
print("Hello, %s!" % name)          # % formatting