
MODULE – 3

Pandas Data Structures


What is Pandas?

• Pandas stands for "Python Data Analysis Library."

• It is an open-source library that provides high-performance, easy-to-use data structures and data analysis tools for Python.

• It is built on top of NumPy and is designed for practical, real-world data manipulation.

• Pandas offers two primary data structures — Series and DataFrame — optimized
for fast and flexible data operations.

Feature | Description
Easy Handling of Missing Data | Functions like .fillna(), .dropna()
Flexible Data Alignment | Automatic data alignment by labels
Powerful Grouping Operations | Group by, pivot tables, aggregation
Efficient Data Filtering | Boolean indexing, slicing, subsetting
High Performance | Optimized with C and NumPy under the hood
Rich IO Support | Read/write to CSV, Excel, SQL, JSON, HDF5, etc.
Time-Series Analysis | Built-in functions for date range generation, moving averages
Visualization | Integrated plotting with Matplotlib

Why Use Pandas?

• Real-world datasets are often messy, and pandas provides powerful tools to clean,
transform, and prepare data.
• Works seamlessly with big data when combined with Dask or Vaex.

• Essential for Machine Learning, Business Intelligence, Statistical Modeling, and ETL processes (Extract, Transform, Load).

• Reduces thousands of lines of manual coding to a few lines.

How Pandas Works Internally

• Built on NumPy arrays: pandas therefore inherits the speed and functionality of NumPy.

• Data is stored in blocks: Homogeneous types (like integers or floats) are grouped
together in memory for efficiency.

• Indexing is optimized for fast lookups and manipulations.

Important Concepts in Pandas

1. Indexing and Slicing

• Every Series or DataFrame has an Index for labeling.

• Indexing can be positional (iloc) or label-based (loc).

df.loc[0] # label-based

df.iloc[0] # position-based

2. Handling Missing Data

• Real-world data is often incomplete.

• Pandas makes it easy to detect (isnull()) and handle (fillna(), dropna()).
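A minimal sketch (the 'Age' column is an assumed example; any column containing NaNs works the same way):

df.isnull().sum()                               # count missing values per column
df['Age'] = df['Age'].fillna(df['Age'].mean())  # fill a numeric column with its mean
df = df.dropna()                                # drop any remaining incomplete rows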

3. GroupBy Operations

• Perform operations like sum, mean, count per group (e.g., total sales by city).

df.groupby('City')['Sales'].sum()

4. Merging and Joining DataFrames


• Combine multiple datasets like SQL JOINs (merge(), join(), concat()).
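For example, assuming two DataFrames df1 and df2 that share an 'ID' column:

merged = pd.merge(df1, df2, on='ID')   # SQL-style join on the common 'ID' key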

5. Pivot Tables and Reshaping

• Summarize data and rearrange tables easily.

df.pivot_table(values='Sales', index='Region', columns='Product', aggfunc='sum')

6. Time-Series Data

• Special tools for date indexing, resampling, moving averages, etc.

df.resample('M').mean() # monthly average

1. Series (1-Dimensional Data Structure)

• A Series is a one-dimensional labeled array that can hold any type of data
(integers, strings, floats, Python objects, etc.).

• It is like a column in a spreadsheet or a database table.

• It has two main components:

o Index (labels for rows)

o Values (actual data)

import pandas as pd

# Creating a Series

s = pd.Series([10, 20, 30, 40, 50])

print(s)
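A Series can also carry explicit labels. A small sketch showing the index/values split (the labels 'a', 'b', 'c' are illustrative):

s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

print(s.index)    # Index(['a', 'b', 'c'], dtype='object')
print(s.values)   # [10 20 30]
print(s['b'])     # label-based access -> 20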

2. DataFrame (2-Dimensional Data Structure)

• A DataFrame is a two-dimensional labeled data structure with columns of


potentially different types.

• You can think of it like an Excel spreadsheet or a SQL table or a dictionary of


Series objects.
• It has:

o Row Index (label for each row)

o Column Index (label for each column)

o Values (table of data)

import pandas as pd

# Creating a DataFrame

data = {

'Name': ['Alice', 'Bob', 'Charlie'],

'Age': [25, 30, 35],

'City': ['New York', 'Paris', 'London']
}

df = pd.DataFrame(data)

print(df)

Creating DataFrame from Lists:

df = pd.DataFrame([

[1, 'A'],

[2, 'B'],

[3, 'C']

], columns=['ID', 'Letter'])

print(df)

Pandas DataFrame and All Operations


What is a DataFrame?

• A DataFrame is a 2-dimensional, size-mutable, tabular data structure in pandas.


• It is made up of:

o Rows (horizontal entries)

o Columns (vertical entries)

o Indexes (labels for rows and columns)

• Columns can hold different types of data (int, float, string, boolean, object, etc.).

Creating a DataFrame

From Dictionary

import pandas as pd

data = {

'Name': ['Alice', 'Bob', 'Charlie'],

'Age': [25, 30, 35],

'City': ['New York', 'Paris', 'London']
}

df = pd.DataFrame(data)

print(df)

1. Viewing Data

Function | Description | Example
head() | Show first N rows | df.head(3)
tail() | Show last N rows | df.tail(2)
shape | Get (rows, columns) | df.shape
info() | Summary (non-null counts, dtypes) | df.info()
describe() | Statistical summary | df.describe()
columns | List of column names | df.columns

2. Selecting Data
Operation | Example | Output
Column selection | df['Name'] | Series (single column)
Multiple columns | df[['Name', 'City']] | DataFrame
Row by label | df.loc[0] | Row as Series
Row by position | df.iloc[1] | Row as Series
Slice rows | df[0:2] | First two rows

3. Filtering / Conditional Selection

# People aged above 25

df[df['Age'] > 25]

4. Adding and Modifying Columns

# Add new column

df['Country'] = 'USA'

# Modify existing column

df['Age'] = df['Age'] + 1

print(df)

5. Deleting Columns and Rows

# Delete column

df.drop('Country', axis=1, inplace=True)

# Delete row

df.drop(1, axis=0, inplace=True)

6. Renaming Columns
df.rename(columns={'Name': 'FullName'}, inplace=True)

7. Sorting

# Sort by column

df.sort_values(by='Age', ascending=False, inplace=True)

8. Handling Missing Data

# Detect missing

df.isnull()

# Drop missing rows

df.dropna(inplace=True)

# Fill missing

df.fillna('Unknown', inplace=True)

9. Aggregations

Function | Example
Sum | df['Age'].sum()
Mean | df['Age'].mean()
Count | df['City'].count()
Max/Min | df['Age'].max() / df['Age'].min()

10. GroupBy Operations

# Group by city and calculate average age

df.groupby('City')['Age'].mean()
11. Merging, Joining, and Concatenating

• Merging DataFrames (like SQL JOIN):

pd.merge(df1, df2, on='ID')

• Concatenating DataFrames (stacking vertically or horizontally):

pd.concat([df1, df2], axis=0) # vertically

pd.concat([df1, df2], axis=1) # horizontally

12. Pivot Tables

df.pivot_table(values='Age', index='City', aggfunc='mean')

13. Reshaping

• Melt: Convert wide DataFrame to long format.

pd.melt(df, id_vars=['City'], value_vars=['Age'])

• Stack/Unstack: Pivot the levels of the index.

df.stack()

df.unstack()

Data Preprocessing
What is Data Preprocessing?

Data Preprocessing is the process of cleaning, transforming, and organizing raw data
into a usable format for machine learning models.
Real-world data is often incomplete, inconsistent, noisy, or unstructured, and
preprocessing improves the quality of the data.
Without preprocessing, models can give poor predictions, lower accuracy, or even fail to
run.

Main Steps in Data Preprocessing:

• Handling Missing Data


• Removing Duplicates

• Normalizing Data

• Encoding Categorical Variables

• Data Transformation

1) Handling Missing Data

Missing values occur when some data points are not recorded.
If not handled properly, they can lead to biased models or errors during training.
Common ways to handle missing data:

• Removal: Drop rows or columns with missing data.

• Imputation: Fill missing values using mean, median, mode, or a constant.

🛠 Ways to Handle Missing Data

Method | When to Use
Remove rows/columns | If too much data is missing
Fill with mean/median | For numerical columns
Fill with mode | For categorical columns

import numpy as np

import pandas as pd

# Create data with missing values

data = {'Age': [25, np.nan, 30, 22, np.nan],

'Salary': [50000, 60000, np.nan, 52000, 58000]}

df = pd.DataFrame(data)

# Fill missing Age with mean

df['Age'].fillna(df['Age'].mean(), inplace=True)
# Fill missing Salary with median

df['Salary'].fillna(df['Salary'].median(), inplace=True)

print(df)

2) Handling Duplicates

Duplicate entries are repeated rows in the dataset.
They bias the model and misrepresent the actual distribution of data.
Duplicates should be identified and removed to maintain the integrity of the dataset.

🛠 Ways to Handle Duplicates

Method | Code
Find duplicates | df.duplicated()
Remove duplicates | df.drop_duplicates()

import pandas as pd

data = {'Name': ['John', 'Anna', 'John', 'Mike'],

'Age': [25, 22, 25, 30]}

df = pd.DataFrame(data)

# Remove duplicate rows

df = df.drop_duplicates()

print(df)

3) Normalizing Data

Normalization means scaling numerical data into a specific range (like 0–1).
It ensures that all features contribute equally to the result.
Especially important for models based on distance (e.g., KNN, SVM).

Two common types:


• Min-Max Normalization: Scale values between 0 and 1.

• Z-score Standardization: Center the data around mean 0, std 1.

🛠 Ways to Normalize Data

Method | Description
Min-Max Scaling | (value - min) / (max - min)
Standardization | (value - mean) / std deviation

from sklearn.preprocessing import MinMaxScaler

data = [[10], [20], [30]]

scaler = MinMaxScaler()

normalized_data = scaler.fit_transform(data)

print(normalized_data)
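For Z-score standardization, an analogous sketch uses scikit-learn's StandardScaler (the same toy data is assumed for illustration):

from sklearn.preprocessing import StandardScaler

data = [[10], [20], [30]]

scaler = StandardScaler()

standardized_data = scaler.fit_transform(data)   # mean 0, standard deviation 1

print(standardized_data)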

4) Encoding Categorical Data

Machine learning models only understand numbers, not text.
Categorical (text) data must be converted into numbers.
Techniques:

• Label Encoding: Assigns each category a unique number.

• One-Hot Encoding: Creates binary columns for each category.

🛠 Ways to Encode Categorical Data

Method | Use When
Label Encoding | Categories have order (e.g., Small, Medium, Large)
One-Hot Encoding | Categories are unordered (e.g., Red, Blue, Green)


import pandas as pd

from sklearn.preprocessing import LabelEncoder

data = {'Color': ['Red', 'Blue', 'Green', 'Blue']}

df = pd.DataFrame(data)

# Label Encoding

encoder = LabelEncoder()

df['Color_encoded'] = encoder.fit_transform(df['Color'])

print(df)
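One-Hot Encoding can be sketched with pandas' get_dummies on the same hypothetical 'Color' column:

df_onehot = pd.get_dummies(df, columns=['Color'])   # one binary column per category

print(df_onehot)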

5) Data Transformation

Transformation is changing the structure or distribution of data.
It makes the data more suitable for modeling by reducing skewness, improving linearity, etc.
Common transformations:

• Log Transformation (reduce right skew)

• Square Root Transformation

• Box-Cox Transformation

🛠 Ways to Transform Data

Transformation Purpose

Log Reduce large differences between values

Square Root Handle moderate skew

Box-Cox Make data more normal (Gaussian)

import numpy as np

data = np.array([1, 10, 100, 1000])


log_transformed = np.log(data)

print(log_transformed)
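Square Root and Box-Cox transformations can be sketched in the same way; the Box-Cox example assumes SciPy is installed and that the data is strictly positive (which the array above is):

from scipy import stats

sqrt_transformed = np.sqrt(data)              # handles moderate skew

boxcox_transformed, lam = stats.boxcox(data)  # returns transformed data and the fitted lambda

print(sqrt_transformed)

print(boxcox_transformed)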

Data Wrangling
Introduction to Data Wrangling:

Data wrangling, also known as data munging, is the process of collecting, organizing,
cleaning, enriching, validating, and publishing data for analysis.

It is a critical step in data science because real-world data is often messy, incomplete, and
inconsistent.

Effective data wrangling ensures that the data is usable, reliable, and insightful for further
analysis, reporting, or machine learning.

Steps of Data Wrangling:

1. Discovery (Data Gathering)

• This step involves identifying and collecting raw data from multiple sources such as
databases, CSV files, APIs, web scraping, or manual records.

• The goal is to bring all relevant data into a single working environment.

import pandas as pd

# Load data from a CSV file

data = pd.read_csv('sales_data.csv')

Ways to do:

• Identify data sources (internal databases, APIs, spreadsheets, etc.).

• Use pandas, requests, sqlalchemy in Python to import the data.

2. Organization

• Understand the dataset's structure, inspect columns, data types, null values, and
basic statistics.
• Helps in planning what cleaning and transformation are needed.

# Check the first 5 rows

print(data.head())

# Summary statistics

print(data.describe())

# Data types and null counts

print(data.info())

Ways to do:

• Use .head(), .describe(), .info() methods.

• Visual inspection or exploratory data analysis (EDA) using libraries like matplotlib,
seaborn.

3. Cleaning

• Handle missing values, duplicate records, incorrect formats, and inconsistent categories.

• This step makes the data reliable for analysis.

# Drop rows with missing values

data_cleaned = data.dropna()

# Remove duplicate rows

data_cleaned = data_cleaned.drop_duplicates()

# Fill missing values with mean

data['price'] = data['price'].fillna(data['price'].mean())

Ways to do:

• dropna(), fillna(), drop_duplicates() in pandas.

• Replace incorrect values manually or through conditional logic.

4. Data Enrichment
• Improve the dataset by adding new features, transforming variables, normalizing
data, and aggregating information.

# Normalize a column

data['normalized_sales'] = (data['sales'] - data['sales'].min()) / (data['sales'].max() - data['sales'].min())

# Create a new feature

data['revenue'] = data['sales'] * data['price']

# One-hot encode categorical data

data = pd.get_dummies(data, columns=['region'])

Ways to do:

• Feature engineering.

• Encoding categorical variables (LabelEncoder, OneHotEncoder).

• Data normalization (MinMaxScaler, StandardScaler from sklearn).

5. Validation

• Validate the data to ensure it meets business requirements, logical rules, and
statistical thresholds.

# Check if any sales value is negative

invalid_sales = data[data['sales'] < 0]

print(invalid_sales)

Ways to do:

• Logical tests (assert, conditional checks).

• Automated validation scripts.

• Use business rules and thresholds to test consistency.

6. Publish
• Save the clean, validated dataset in appropriate formats (CSV, JSON, database) and
document the wrangling process for reproducibility.

# Save cleaned data

data.to_csv('cleaned_sales_data.csv', index=False)

Ways to do:

• Export using to_csv(), to_json(), database exports.

• Maintain logs or scripts of data wrangling steps.

Data Wrangling Formats:

Format | Explanation
Transactional Data | Detailed records of business activities like invoices, receipts, customer interactions.
Analytical Base Table (ABT) | A structured table where each row is an entity (e.g., customer) and each column is a feature, used in AI/ML.
Time-Series Data | Data collected over time intervals (daily sales, stock prices) to detect trends and patterns.
Document Library Data | Text-based data stored in documents; needs techniques like text mining and NLP for analysis.

Data Wrangling Benefits:

• Data Consistency:
Uniform and error-free data enhances analysis accuracy.

• Improved Insights:
Clean and enriched data helps in generating actionable business insights.

• Cost Efficiency:
Reduces time and computational resources required for downstream data tasks.

Imagine you have a messy sales dataset:


Product,Price,Sales,Region

A,100,10,North

B,200,,South

C,150,20,North

A,100,10,North

D,300,15,

Step-by-step Wrangling in Python:

import pandas as pd

# Step 1: Discovery

data = pd.read_csv('sales_data.csv')

# Step 2: Organization

print(data.head())

print(data.info())

# Step 3: Cleaning

data = data.drop_duplicates()

data['Sales'] = data['Sales'].fillna(0)

data['Region'] = data['Region'].fillna('Unknown')

# Step 4: Data Enrichment

data['Revenue'] = data['Price'] * data['Sales']

# Step 5: Validation

assert data['Sales'].min() >= 0

# Step 6: Publish

data.to_csv('cleaned_sales_data.csv', index=False)

Data Acquisition with Python


Loading Data from CSV Files, Importing, Cleaning, and
Describing Data in Python
Loading data from CSV (Comma-Separated Values) files is one of the most common ways
to acquire data in Python, especially for structured, tabular datasets.

Once loaded, the data often requires cleaning to address issues like missing values,
duplicates, or inconsistencies.

Describing the data is then essential to understand its structure, content, and
characteristics. This process is foundational in data science and machine learning
workflows.

What is a CSV File?

A CSV file is a plain-text file that stores tabular data, where each row represents a record,
and columns are separated by commas (or other delimiters like tabs or semicolons). It is
widely used due to its simplicity, compatibility with tools like Excel, and support in
programming languages like Python.

Why Use CSV Files?

• Simplicity: Easy to read, write, and share.

• Compatibility: Supported by databases, spreadsheets, and programming libraries.

• Lightweight: Smaller file size compared to Excel or JSON for tabular data.

• Structured Data: Ideal for datasets with rows and columns (e.g., sales records,
customer data).

Why is This Process Important?

• Loading: Imports data into Python for analysis or modeling.

• Cleaning: Ensures data quality by fixing errors, missing values, and inconsistencies.

• Describing: Provides insights into the data’s structure, distribution, and potential
issues, guiding further preprocessing.

1. Loading and Importing Data from CSV Files


What is Loading/Importing?

Loading or importing involves reading a CSV file into a Python environment, typically as a
DataFrame (a tabular data structure) for easy manipulation. Python’s pandas library is the
most popular tool for this task due to its flexibility and functionality.

Methods to Load CSV Files

1. Using pandas.read_csv():

o The primary method to load CSV files into a Pandas DataFrame.

o Key Parameters:

▪ filepath: Path to the CSV file (e.g., 'data.csv').

▪ nrows: Number of rows to read (e.g., nrows=100).

▪ chunksize: Read large files in chunks (e.g., chunksize=1000).

o Example:

import pandas as pd

df = pd.read_csv('data.csv', sep=',', encoding='utf-8', index_col='id')

o Use Case: Loading a customer dataset with an ID column as the index.

2. Using Python’s Built-in csv Module:

o Reads CSV files as lists or dictionaries without creating a DataFrame.

o Useful for lightweight or custom processing.

o Methods:

▪ csv.reader: Reads rows as lists.

▪ csv.DictReader: Reads rows as dictionaries with column names as keys.

o Example:

import csv

with open('data.csv', 'r') as file:
    reader = csv.DictReader(file)
    data = [row for row in reader]  # List of dictionaries


o Use Case: Reading small CSV files for simple processing.

Handling Common Issues During Import

• File Not Found: Ensure the file path is correct (use absolute paths or check the
working directory with os.getcwd()).

• Encoding Errors: Specify the correct encoding (e.g., 'latin1' for non-UTF-8 files).

• Delimiter Mismatch: Use the correct sep parameter (e.g., sep=';' for semicolon-
separated files).

• Large Files: Use chunksize to read in chunks or nrows to limit rows.

• Malformed CSV: Skip bad lines with on_bad_lines='skip' in Pandas (older versions used error_bad_lines=False).
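A sketch combining several of these options (the file name, delimiter, and encoding are assumptions for illustration):

df = pd.read_csv('data.csv', sep=';', encoding='latin1', nrows=1000, on_bad_lines='skip')

# For very large files, read and process the data in chunks
for chunk in pd.read_csv('data.csv', chunksize=1000):
    print(chunk.shape)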

2. Cleaning Data Loaded from CSV Files

What is Data Cleaning?

Data cleaning involves fixing errors, inconsistencies, and missing values in the loaded
dataset to ensure it is accurate and usable. CSV files often contain issues like missing
data, duplicates, incorrect formats, or outliers that must be addressed.

Common Issues in CSV Data

• Missing values (e.g., empty cells, 'NA', or 'null').

• Duplicate rows or records.

• Inconsistent formats (e.g., mixed date formats, varying capitalization).

• Outliers or erroneous values (e.g., negative ages).

• Incorrect data types (e.g., numbers stored as strings).

Cleaning Techniques

1. Handling Missing Data:

o Identify Missing Values:

df.isnull().sum() # Count missing values per column

o Methods:
▪ Drop Missing Values: Remove rows or columns with missing data.

df.dropna() # Drop rows with any missing values

df.dropna(subset=['age'], inplace=True) # Drop rows where 'age' is missing

▪ Impute Missing Values: Replace missing values with a statistic or rule.

df['age'].fillna(df['age'].mean(), inplace=True) # Impute with mean

df['city'].fillna('Unknown', inplace=True) # Impute with a constant

▪ Forward/Backward Fill: Use previous or next values (for time-series data).

df['sales'].fillna(method='ffill', inplace=True) # Forward fill

2. Removing Duplicates:

o Identify Duplicates:

df.duplicated().sum() # Count duplicate rows

o Remove Duplicates:

df.drop_duplicates(inplace=True) # Remove all duplicate rows

df.drop_duplicates(subset=['customer_id'], keep='last', inplace=True)  # Keep last occurrence

3. Correcting Inconsistent Formats:

o Standardize Text:

df['name'] = df['name'].str.lower().str.strip() # Convert to lowercase, remove whitespace

o Standardize Dates:

df['date'] = pd.to_datetime(df['date'], errors='coerce') # Convert to datetime

o Fix Typos:

df['city'].replace({'New Yrok': 'New York', 'NY': 'New York'}, inplace=True)

3. Describing Data: Methods to Understand the Dataset

What is Data Description?


Describing data involves summarizing and exploring the dataset to understand its
structure, content, distribution, and potential issues. This step is critical for identifying
patterns, anomalies, or preprocessing needs.

Methods to Describe Data

1. Basic Information with df.info():

o Displays column names, data types, non-null counts, and memory usage.

o Example:

df.info()

o Use Case: Check for missing values and incorrect data types.

2. Summary Statistics with df.describe():

o Provides statistical measures for numerical columns (count, mean, std, min,
max, quartiles).

o Example:

df.describe()

3. Value Counts with df.value_counts():

o Counts unique values in a column, useful for categorical data.

o Example:

df['city'].value_counts()

4. Unique Values with df.nunique() and Series.unique():

o df.nunique(): Counts unique values per column; df['column'].unique() lists the distinct values of a single column.

df.nunique()

5. Missing Values with df.isnull():

o Quantifies missing data per column.

df.isnull().sum()

6. Correlation Matrix with df.corr():

o Measures pairwise correlations between numerical columns (e.g., Pearson correlation).
df.corr()

7. GroupBy Summaries with df.groupby():

o Aggregates data by categories to compute statistics.

df.groupby('city')['salary'].mean()

Accessing SQL Databases in Python

1⃣ Accessing SQL Database using sqlite3 (Built-in)

• sqlite3 is a built-in Python library to work with SQLite databases (no installation
needed).

• SQLite is a lightweight, disk-based database that doesn't require a separate server.

• You use SQL commands (like SELECT, INSERT, UPDATE) through Python to interact
with the database.

Steps to Access using sqlite3

1. Import the sqlite3 library.

2. Connect to a database (creates file if it doesn't exist).

3. Create a Cursor object to execute SQL queries.

4. Execute SQL commands.

5. Commit changes (if modifying data).

6. Close the connection.
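A minimal sketch of these six steps (the database file, table, and data are illustrative):

import sqlite3

conn = sqlite3.connect('example.db')            # 2. connect (creates the file if it doesn't exist)

cursor = conn.cursor()                          # 3. create a cursor

cursor.execute('CREATE TABLE IF NOT EXISTS users (id INTEGER PRIMARY KEY, name TEXT)')

cursor.execute('INSERT INTO users (name) VALUES (?)', ('Alice',))   # 4. execute SQL

conn.commit()                                   # 5. commit changes

conn.close()                                    # 6. close the connection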

2⃣ Accessing SQL Database using SQLAlchemy

• SQLAlchemy is a powerful ORM (Object-Relational Mapper) for Python.

• It abstracts away SQL complexity by allowing you to interact with databases using
Python classes.
• Supports many database engines: SQLite, PostgreSQL, MySQL, etc.

• Safer, more scalable than direct SQL queries.

Steps to Access using SQLAlchemy

1. Install SQLAlchemy (pip install sqlalchemy).

2. Import necessary modules.

3. Create an Engine (database connection).

4. Connect and execute queries.

5. Use ORM models or raw SQL.

6. Close the connection (optional).
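A minimal sketch of these steps with a SQLite engine (the connection string, table name, and query are illustrative):

from sqlalchemy import create_engine, text

engine = create_engine('sqlite:///example.db')   # 3. create an Engine

with engine.connect() as conn:                   # 4. connect and execute
    result = conn.execute(text('SELECT * FROM users'))
    for row in result:
        print(row)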

3⃣ Accessing SQL Database using pandas (read_sql_query, to_sql)

• pandas can directly query SQL databases.

• Useful for reading entire tables or custom queries into DataFrames.

• You can also write DataFrames back into SQL tables.

Steps to Access using pandas

1. Install pandas (pip install pandas).

2. Import pandas and sqlite3/SQLAlchemy.

3. Connect to the database.

4. Use read_sql_query to fetch data into a DataFrame.

5. Use to_sql to write DataFrame into database.
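A minimal sketch of these steps using sqlite3 for the connection (the file and table names are illustrative):

import sqlite3

import pandas as pd

conn = sqlite3.connect('example.db')

# Read a query result into a DataFrame
df = pd.read_sql_query('SELECT * FROM users', conn)

# Write a DataFrame back into a table
df.to_sql('users_copy', conn, if_exists='replace', index=False)

conn.close()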

Database Operations in Python (10 Marks)


When working with databases in Python (like SQLite, MySQL, PostgreSQL), the basic
operations are:

Operation | Meaning
connect | Establish a connection to the database.
create | Create new tables or structures.
insert | Insert records/data into tables.
execute | Run SQL queries (like create, insert, update, select, delete).
fetchall | Retrieve all rows from a query result.
fetchone | Retrieve one row at a time.
commit | Save (commit) changes made to the database.
close | Close the database connection properly.

1⃣ connect() — Connecting to the Database

• It is the first step.

• Using sqlite3.connect('database_name.db') creates or connects to an existing database.

import sqlite3

conn = sqlite3.connect('my_database.db') # Connects to database

If the file my_database.db doesn’t exist, it will be created automatically.

import sqlite3

conn = sqlite3.connect('example.db') # Creates/opens a database file

2⃣ execute() — Executing SQL Queries

• After connecting, we execute SQL queries using a cursor.

• A cursor allows you to send SQL commands to the database.

cursor = conn.cursor()
cursor.execute('''CREATE TABLE IF NOT EXISTS users (id INTEGER PRIMARY KEY, name
TEXT, age INTEGER)''')

The SQL inside execute() can be CREATE, INSERT, UPDATE, DELETE, SELECT, etc.

3⃣ create — Creating Tables

• Creating a table is done via an SQL CREATE TABLE statement.

• Make sure to use IF NOT EXISTS to avoid errors if the table already exists.

cursor.execute('''
CREATE TABLE IF NOT EXISTS employees (
    id INTEGER PRIMARY KEY,
    name TEXT,
    department TEXT
)
''')

Tables define the structure of how your data will be stored.

cursor = conn.cursor()

cursor.execute('''
CREATE TABLE IF NOT EXISTS employees (
    id INTEGER PRIMARY KEY,
    name TEXT,
    age INTEGER
)
''')

4⃣ insert — Inserting Data into Tables

• We use SQL INSERT INTO statements to add data into tables.


cursor.execute("INSERT INTO employees (name, department) VALUES (?, ?)", ("John Doe",
"Finance"))

Using ? and a tuple (("John Doe", "Finance")) protects against SQL Injection attacks
(safer practice).

cursor.execute('INSERT INTO employees (name, age) VALUES (?, ?)', ('Alice', 30))

conn.commit()

cursor.execute('SELECT * FROM employees')

rows = cursor.fetchall()

for row in rows:
    print(row)

5⃣ commit() — Saving Changes

• After making changes (inserting, updating, deleting), you need to commit them.

• If you forget commit(), changes will not be saved permanently.

conn.commit()

commit() finalizes your changes in the database.

6⃣ fetchall() — Fetching All Records

• After a SELECT query, you can fetch all rows at once into a list.

cursor.execute("SELECT * FROM employees")

rows = cursor.fetchall()

for row in rows:
    print(row)

fetchall() returns a list of tuples, each tuple being a row.


7⃣ fetchone() — Fetching One Record

• Instead of fetching all at once, you can fetch one row at a time.

cursor.execute("SELECT * FROM employees")

row = cursor.fetchone()

print(row)

Useful when you know the query returns only one or few rows.

8⃣ close() — Closing the Connection

• After finishing database operations, you must close the connection properly.

conn.close()

Prevents memory leaks and keeps your application efficient.

Cleaning Data in Python and Stripping Out Extraneous Information
Data cleaning is a critical step in the data science pipeline, involving the identification and
correction of errors, inconsistencies, and irrelevant information in a dataset to ensure it is
accurate, consistent, and suitable for analysis or modeling.

Stripping out extraneous information is a subset of data cleaning, focusing on removing unnecessary or redundant data that does not contribute to the analysis.

Python, with its powerful libraries like Pandas, NumPy, and others, offers numerous
methods to clean data and remove extraneous information.

What is Data Cleaning?

Data cleaning, also known as data cleansing or scrubbing, is the process of detecting and
correcting (or removing) errors, inconsistencies, and inaccuracies in a dataset. It
addresses issues such as missing values, duplicates, incorrect formats, outliers, and
irrelevant data to improve data quality. Clean data is essential for reliable analysis,
visualization, and machine learning model performance.
What is Stripping Out Extraneous Information?

Stripping out extraneous information involves removing unnecessary, redundant, or irrelevant data from a dataset. This could include:

• Unneeded columns or rows.

• Redundant text (e.g., extra spaces, prefixes, or suffixes).

• Metadata or formatting artifacts (e.g., HTML tags, comments).

• Data unrelated to the analysis goal (e.g., irrelevant features).

Why is Data Cleaning Important?

• Improves Data Quality: Ensures accuracy and consistency.

• Enhances Analysis: Clean data leads to reliable insights and model performance.

• Reduces Errors: Minimizes biases or incorrect conclusions.

• Saves Resources: Removes unnecessary data to improve processing efficiency.

• Meets Algorithm Requirements: Many machine learning algorithms require clean, numerical data.

Why Focus on Stripping Extraneous Information?

• Simplifies Datasets: Reduces complexity and improves interpretability.

• Optimizes Performance: Decreases memory usage and computation time.

• Focuses Analysis: Keeps only relevant data for the task at hand.

Common Data Issues Requiring Cleaning

Before diving into cleaning methods, it’s important to understand the typical issues found
in datasets:

1. Missing Values: Empty cells or placeholders (e.g., 'NA', 'null').

2. Duplicates: Identical or near-identical records.

3. Inconsistent Formats: Mixed date formats, varying capitalization, or units.

4. Outliers: Extreme values that skew analysis.

5. Incorrect Data Types: Numbers stored as strings or vice versa.


6. Extraneous Information: Irrelevant columns, redundant text, or formatting
artifacts.

7. Typos and Errors: Misspellings or invalid entries (e.g., 'New Yrok' instead of 'New
York').

8. Noise: Random or irrelevant data points.

Handling Missing Values in Python


Missing data is a common problem in real-world datasets. Before applying any data
analysis or machine learning models, it’s essential to understand and handle missing
values properly. Poor handling can lead to inaccurate models or misleading results.

Python, especially with libraries like pandas, numpy, and scikit-learn, provides many
tools to identify, visualize, and handle missing data.

1. Identifying Missing Values

First, we need to detect missing values.

In pandas:

• Missing values are usually represented as NaN (numpy.nan) or None.

Common methods:

import pandas as pd

# Create a simple DataFrame

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', None, 'Dave'],
    'Age': [25, None, 22, 23],
    'City': ['NY', 'LA', 'NY', None]
})

# Check for missing values


print(df.isnull())

# Check the number of missing values in each column

print(df.isnull().sum())

Other functions:

• df.info() → shows non-null counts.

• df.describe() → can reveal suspicious counts if some data is missing.

2. Ways to Handle Missing Values

There are multiple strategies depending on the problem and dataset size:

A. Removing Data

1. Remove rows with missing values

If the dataset is large and only a few rows have missing data, dropping them may be
acceptable.

# Drop rows with any missing value

df.dropna(inplace=True)

2. Remove columns with missing values

If a column has too many missing values (e.g., >70%), you might drop the column.

# Drop columns with missing data

df.dropna(axis=1, inplace=True)

Caution: Dropping data may lead to loss of important information.

B. Imputing Data (Filling Missing Values)

Instead of removing, we can fill missing values with plausible data.

1. Fill with a fixed value

# Fill NaN with a specified value


df.fillna('Unknown', inplace=True)

2. Fill with statistical values

• Mean, Median, or Mode (most frequent value) are commonly used for numeric
data.

# Fill missing 'Age' with the mean age

df['Age'].fillna(df['Age'].mean(), inplace=True)

# Fill missing 'City' with mode

df['City'].fillna(df['City'].mode()[0], inplace=True)

3. Forward Fill (ffill)

• Propagate the previous value forward.

# Forward fill

df.fillna(method='ffill', inplace=True)

4. Backward Fill (bfill)

• Use the next value to fill the missing one.

# Backward fill

df.fillna(method='bfill', inplace=True)

Removing Duplicates

• Duplicate rows are identical copies of data.

• Duplicates can affect analysis, cause bias, or inflate numbers.

• Use drop_duplicates() to remove them.

df.drop_duplicates(inplace=True)

• To remove duplicates based only on specific columns:

df.drop_duplicates(subset=['column1', 'column2'], inplace=True)

• Keep first occurrence:

df.drop_duplicates(keep='first', inplace=True)
Correcting Data Types

• Correct data types are crucial: e.g., date columns should be datetime, numbers
should be int/float.

• Use .astype() to convert types.

• Wrong types cause errors during analysis or modeling.

df['age'] = df['age'].astype(int)

df['date'] = pd.to_datetime(df['date'])

• Convert multiple columns:

df = df.astype({'age': 'float64', 'salary': 'int64'})

Handling Outliers

• Outliers are extreme values that differ greatly from most data points.

• Use statistical techniques to detect them (IQR, Z-Score).

• Remove or treat outliers to make models perform better.

Using IQR (Interquartile Range):

Q1 = df['column'].quantile(0.25)

Q3 = df['column'].quantile(0.75)

IQR = Q3 - Q1

lower_bound = Q1 - 1.5 * IQR

upper_bound = Q3 + 1.5 * IQR

df = df[(df['column'] > lower_bound) & (df['column'] < upper_bound)]

Standardizing and Normalizing Data

• Standardization: Center around mean (mean = 0, std = 1).

• Normalization: Rescale between 0 and 1.


• Useful for ML models sensitive to scale.

Standardization using StandardScaler:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

df['normalized'] = scaler.fit_transform(df[['column']])

Normalization using MinMaxScaler:

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()

df['normalized'] = scaler.fit_transform(df[['column']])

Fixing Formatting Issues

• Formatting problems include inconsistent dates, extra spaces, wrong text case.

• Fix them to avoid hidden issues in analysis.

# Remove extra whitespace

df['column'] = df['column'].str.strip()

# Standardize date format

df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d')

• Text standardization:

df['name'] = df['name'].str.lower()

Encoding Categorical Variables

• ML algorithms can't work with text. Categories must be converted into numbers.

• Techniques: Label Encoding, One-Hot Encoding.

df = pd.get_dummies(df, columns=['category'])

Label Encoding:

from sklearn.preprocessing import LabelEncoder


le = LabelEncoder()

df['category_encoded'] = le.fit_transform(df['category'])

Saving Cleaned Data

• After cleaning, save the dataset for later use without losing progress.

• Formats: CSV, Excel, SQL Database, etc.

df.to_csv('cleaned_data.csv', index=False)

df.to_excel('cleaned_data.xlsx', index=False)

Stripping Out Extraneous Information

Step 1: Removing Unnecessary Columns

df.drop(columns=['unnecessary_col1', 'unnecessary_col2'], inplace=True)

Step 2: Removing Extra Whitespace

df['column_name'] = df['column_name'].str.strip()

Step 3: Removing Duplicates

Already explained above with drop_duplicates().

Step 4: Filtering Irrelevant Data

df = df[df['column_name'] != 'irrelevant_value']

Can also filter based on multiple conditions.

df = df[(df['age'] > 18) & (df['status'] == 'active')]


Step 5: Fixing Inconsistent Data

df['column_name'] = df['column_name'].str.lower() # or str.upper()

Step 6: Removing Special Characters

df['column_name'] = df['column_name'].str.replace(r'[^\w\s]', '', regex=True)

Use regex=True for powerful text cleaning.

Normalizing and Formatting Data in Python


Normalizing and formatting data are essential steps in data preprocessing, ensuring that
data is structured, consistent, and suitable for analysis or machine learning.

Normalization involves scaling numerical data to a standard range or distribution to ensure fair comparisons across features, while formatting focuses on standardizing data representations (e.g., text, dates, or categorical data) to improve consistency and usability.

What is Data Normalization?

Normalization is the process of transforming numerical data to a common scale, typically [0, 1] or another standard range, to eliminate the influence of differing units or magnitudes. It ensures that features with larger ranges (e.g., salary in thousands vs. age in years) contribute equally to analysis or machine learning models.

Why Normalize Data?

• Equal Feature Contribution: Prevents features with larger scales from dominating
models (e.g., in distance-based algorithms like KNN or SVM).

• Improves Convergence: Speeds up optimization in gradient-based algorithms (e.g., neural networks, linear regression).

• Enhances Comparability: Allows fair comparisons across variables.

• Meets Algorithm Requirements: Some algorithms assume normalized data for optimal performance.

Types of Normalization

Normalization techniques adjust data based on its distribution or range. The main methods
are described below.
What is Data Formatting?

Formatting involves standardizing the representation of data to ensure consistency and compatibility with analysis or modeling tasks. It addresses issues like inconsistent text cases, date formats, or categorical encodings, making data easier to process and interpret.

Why Format Data?

• Consistency: Ensures uniform data representation (e.g., all dates in YYYY-MM-DD).

• Compatibility: Prepares data for algorithms that require specific formats (e.g.,
numerical inputs).

• Readability: Improves human interpretability for reporting or visualization.

• Error Prevention: Reduces errors caused by inconsistent or ambiguous data.

Types of Formatting

Formatting applies to numerical, categorical, text, and temporal data, with techniques
tailored to each type.

Python Libraries for Normalizing and Formatting Data

Python offers several libraries for these tasks:

• Pandas: For data manipulation, formatting, and basic normalization.

• NumPy: For numerical operations and advanced normalization.

• Scikit-learn: For machine learning-friendly normalization and encoding.

• re (Regular Expressions): For text formatting.

• datetime: For date and time formatting.

• String: For basic text manipulation.

Methods for Normalizing Data

Normalization techniques scale numerical data to a common range or distribution. Below are the key methods, with theory, use cases, and concise code examples.
1. Min-Max Normalization

• Scales data to a fixed range, typically [0, 1].

• Formula: X_norm = (X - X_min) / (X_max - X_min)

• Preserves the original data distribution but is sensitive to outliers.

import pandas as pd

df = pd.read_csv('data.csv')

df['salary_norm'] = (df['salary'] - df['salary'].min()) / (df['salary'].max() - df['salary'].min())

Scikit-learn Method:

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()

df['salary_norm'] = scaler.fit_transform(df[['salary']])

2. Z-Score Normalization (Standardization)

• Centers data around the mean (μ) with a standard deviation (σ) of 1.

• Formula: X_norm = (X - μ) / σ

• Assumes data follows a normal distribution but is less affected by outliers than min-max.

df['age_norm'] = (df['age'] - df['age'].mean()) / df['age'].std()

Scikit-learn Method:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

df['age_norm'] = scaler.fit_transform(df[['age']])

3. Robust Scaling

• Scales data using the median and interquartile range (IQR) instead of mean and
standard deviation.

• Formula: X_norm = (X - median) / IQR, where IQR = Q3 - Q1.


• Highly robust to outliers.

Q1 = df['salary'].quantile(0.25)

Q3 = df['salary'].quantile(0.75)

IQR = Q3 - Q1

df['salary_norm'] = (df['salary'] - df['salary'].median()) / IQR

Scikit-learn Method:

from sklearn.preprocessing import RobustScaler

scaler = RobustScaler()

df['salary_norm'] = scaler.fit_transform(df[['salary']])

Methods for Formatting Data

Formatting ensures data is consistently represented across the dataset. Below are the key
methods, with theory, use cases, and concise code examples.

1. Formatting Text Data

• Standardizes text to ensure consistency in case, spacing, or content.

• Common tasks include converting case, removing whitespace, and cleaning special
characters.

Methods:

• Convert Case:

df['name'] = df['name'].str.lower() # Lowercase

df['name'] = df['name'].str.title() # Title case

• Remove Whitespace:

df['name'] = df['name'].str.strip() # Remove leading/trailing spaces

• Replace Substrings:

df['city'] = df['city'].str.replace('NY', 'New York') # Standardize names

2. Formatting Dates and Times


• Converts date and time data to a consistent format (e.g., YYYY-MM-DD).

• Ensures compatibility with time-series analysis or database storage.

Methods:

• Convert to Datetime:

df['date'] = pd.to_datetime(df['date'], errors='coerce') # Handle invalid dates

• Standardize Format:

df['date'] = df['date'].dt.strftime('%Y-%m-%d') # Format as YYYY-MM-DD

• Extract Components:

df['year'] = df['date'].dt.year

df['month'] = df['date'].dt.month

3. Formatting Numerical Data

• Ensures numerical data is in the correct format (e.g., float, integer) and precision.

• Handles issues like strings stored as numbers or excessive decimal places.

Methods:

• Convert to Numeric:

df['price'] = pd.to_numeric(df['price'], errors='coerce') # Convert to float, NaN for invalid

• Round Numbers:

df['price'] = df['price'].round(2) # Round to 2 decimals

• Convert to Integer:

df['age'] = df['age'].astype(int) # Convert to integer

4. Formatting Categorical Data

• Standardizes categorical data by encoding or cleaning categories for analysis or modeling.

• Ensures categories are consistent and machine-readable.


Methods:

• Standardize Categories:

df['department'] = df['department'].str.lower().str.strip()

df['department'] = df['department'].replace({'hr': 'Human Resources'})

• Label Encoding:

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

df['department_code'] = le.fit_transform(df['department'])

• One-Hot Encoding:

df = pd.get_dummies(df, columns=['city'], prefix='city') # Create binary columns

• Handle Rare Categories:

threshold = 10

counts = df['category'].value_counts()

df['category'] = df['category'].apply(lambda x: x if counts[x] >= threshold else 'Other')

Reading and Writing Data in Text Format

1. Introduction to File Handling in Python

File handling allows a program to persist data — saving outputs, reading inputs, managing
configurations, or exchanging information between systems.
In Python, files are handled with built-in functions and modules.

Key concepts:

• Opening a file.

• Reading from a file.

• Writing to a file.

• Closing the file after use.


2. Opening a File

Use the built-in open() function:

open(file, mode='r', buffering=-1, encoding=None, errors=None, newline=None, closefd=True, opener=None)

• file: path to the file.

• mode: defines the operation — read, write, append, etc.

• encoding: specifies character encoding (like 'utf-8').

Example:

f = open('example.txt', 'r') # Open for reading

Important: Always close the file after operations using .close() OR use with open(...) as ...:
to auto-close.

3. Modes of Opening Files

Mode | Meaning
'r' | Read-only mode (default). File must exist.
'w' | Write mode. Creates or overwrites the file.
'a' | Append mode. Adds data to the end of file.
'r+' | Read and write mode. File must exist.
't' | Text mode (default).
'b' | Binary mode (e.g., images, executable files).

Modes can be combined: e.g., 'rb' (read binary).

4. Reading from a Text File

Once a file is opened for reading ('r' mode):

Methods:
• read(): Reads the whole file as one string.

• readline(): Reads one line at a time.

• readlines(): Reads all lines into a list.

Example:

with open('example.txt', 'r') as file:
    content = file.read()

print(content)

Looping through lines:

with open('example.txt', 'r') as file:
    for line in file:
        print(line.strip())

Tip: strip() removes \n (newline) characters at the end.

5. Writing to a Text File

Writing happens in 'w' or 'a' mode:

• 'w' will overwrite existing content.

• 'a' will append content to existing content.

Writing a single string:

with open('example.txt', 'w') as file:
    file.write('Hello World\n')

Writing multiple lines:

lines = ['Line 1\n', 'Line 2\n']

with open('example.txt', 'w') as file:
    file.writelines(lines)

Note: You need to add \n manually after each line!


8. Advanced: File Positioning

• file.tell(): Returns the current file cursor position.

• file.seek(offset, whence): Changes the file cursor position.

Example:

file.seek(0) # Move cursor to the start of the file

whence can be:

• 0 – beginning of file (default)

• 1 – current position

• 2 – end of file

Useful when doing partial reads or rewinds.
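A short sketch showing tell() and seek() together (example.txt is assumed to exist from the earlier examples):

with open('example.txt', 'r') as file:
    first = file.read(5)              # read the first 5 characters
    print(file.tell())                # current cursor position (e.g., 5)
    file.seek(0)                      # rewind to the beginning
    print(file.read(5) == first)      # True: the same characters are read again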

9. Reading and Writing Structured Text (CSV, JSON)

Many times, data is not just plain text but structured in formats like CSV or JSON.

Reading/Writing CSV files:

import csv

# Writing

with open('people.csv', 'w', newline='') as file:

writer = csv.writer(file)

writer.writerow(['Name', 'Age'])

writer.writerow(['Alice', 25])

# Reading

with open('people.csv', 'r') as file:

reader = csv.reader(file)

for row in reader:

print(row)

Reading/Writing JSON files:


import json

# Writing JSON
with open('data.json', 'w') as f:
    json.dump({'name': 'Alice', 'age': 25}, f)

# Reading JSON
with open('data.json', 'r') as f:
    data = json.load(f)

Combining and Merging Data in Python


Combining and merging data are critical processes in data preprocessing that involve
integrating multiple datasets into a unified structure for analysis or modeling.

These techniques are essential when data is sourced from different files, databases, or
systems, and need to be consolidated to provide a comprehensive view. Python,
particularly with the Pandas library, offers powerful tools for combining and merging data
efficiently.

What is Combining and Merging Data?

Combining data refers to stacking or concatenating datasets vertically (row-wise) or horizontally (column-wise) to create a single dataset. It is typically used when datasets have similar structures (e.g., same columns) or complementary information (e.g., additional columns).

Merging data involves joining datasets based on common keys or indices, similar to SQL
joins, to integrate related information from different sources. It is used when datasets
share a common column or index (e.g., customer ID) but contain different attributes.

Why Combine and Merge Data?

• Unified Analysis: Integrates data from multiple sources for comprehensive insights.

• Enriched Datasets: Combines complementary information (e.g., customer demographics with purchase history).
• Data Consolidation: Simplifies workflows by creating a single dataset from
fragmented sources.

• Improved Model Performance: Provides more features or records for machine learning.

• Real-World Relevance: Most real-world data is stored in multiple tables or files, requiring integration.

Key Concepts in Combining and Merging Data

1. Combining (Concatenation):

o Vertical Concatenation: Stacks datasets row-wise (appending rows).

o Horizontal Concatenation: Stacks datasets column-wise (adding columns).

o Requires datasets to have compatible structures (e.g., same columns for vertical, same rows for horizontal).

2. Merging (Joining):

o Combines datasets based on a common key (e.g., customer_id) or index.

o Types of merges (similar to SQL joins):

▪ Inner: Keeps only matching records.

▪ Left: Keeps all records from the left dataset, with NaNs for non-
matching right records.

▪ Right: Keeps all records from the right dataset, with NaNs for non-
matching left records.

▪ Outer (Full): Keeps all records, with NaNs where there’s no match.

Methods for Combining Data

Combining data involves stacking datasets vertically or horizontally using concatenation. Below are the key methods, with theory, use cases, and concise code examples.

1. Vertical Concatenation (Appending Rows)

• Stacks datasets row-wise, adding rows from one dataset to another.

• Requires datasets to have the same columns (or a subset) for alignment.
• Used when datasets represent different records of the same structure (e.g., sales
data from different months).

Code Example:

import pandas as pd

df1 = pd.DataFrame({'id': [1, 2], 'name': ['Alice', 'Bob']})

df2 = pd.DataFrame({'id': [3, 4], 'name': ['Charlie', 'David']})

df_combined = pd.concat([df1, df2], axis=0, ignore_index=True)

Key Parameters (pd.concat):

• axis=0: Concatenate vertically (rows).

• ignore_index=True: Reset index to avoid duplicate indices.

• join='outer': Include all columns (default); use 'inner' for common columns.

• keys: Add a hierarchical index to distinguish sources (e.g., keys=['df1', 'df2']).

2. Horizontal Concatenation (Appending Columns)

• Stacks datasets column-wise, adding columns from one dataset to another.

• Requires datasets to have the same number of rows or aligned indices.

• Used when datasets contain different attributes for the same records (e.g.,
customer demographics and purchase amounts).

Code Example:

df1 = pd.DataFrame({'id': [1, 2], 'name': ['Alice', 'Bob']})

df2 = pd.DataFrame({'age': [25, 30], 'city': ['New York', 'Chicago']})

df_combined = pd.concat([df1, df2], axis=1)

Key Parameters (pd.concat):

• axis=1: Concatenate horizontally (columns).

• join='outer': Include all rows; use 'inner' for matching rows.

• ignore_index=False: Retain column names (default).


Methods for Merging Data

Merging data involves joining datasets based on a common key or index, similar to SQL
joins. Below are the key methods, with theory, use cases, and concise code examples.

1. Inner Merge

• Keeps only records with matching keys in both datasets.

• Equivalent to SQL INNER JOIN.

• Used when only complete matches are needed (e.g., customers with orders).

Code Example:

df1 = pd.DataFrame({'id': [1, 2, 3], 'name': ['Alice', 'Bob', 'Charlie']})

df2 = pd.DataFrame({'id': [1, 2, 4], 'age': [25, 30, 35]})

df_merged = pd.merge(df1, df2, on='id', how='inner')

Use Case:

• Matching customer profiles with their purchase history.

Pros:

• Produces a clean dataset with no missing values.

• Efficient for focused analysis.

Cons:

• Excludes non-matching records, reducing dataset size.

2. Left Merge

• Keeps all records from the left dataset, with NaNs for non-matching records from
the right dataset.

• Equivalent to SQL LEFT OUTER JOIN.

• Used when the left dataset is the primary source (e.g., all customers, even without
orders).

Code Example:

df_merged = pd.merge(df1, df2, on='id', how='left')


Use Case:

• Including all customer records, with optional order data.

Pros:

• Preserves all left dataset records.

• Useful for comprehensive datasets.

Cons:

• Introduces NaNs for non-matches.

3. Right Merge

• Keeps all records from the right dataset, with NaNs for non-matching records from
the left dataset.

• Equivalent to SQL RIGHT OUTER JOIN.

• Used when the right dataset is the priority (e.g., all orders, even for unknown
customers).

Code Example:

df_merged = pd.merge(df1, df2, on='id', how='right')

Use Case:

• Ensuring all order records are included, even for missing customers.

Pros:

• Preserves all right dataset records.

• Useful for right-centric analysis.

Cons:

• Introduces NaNs for non-matches.

4. Outer (Full) Merge

• Keeps all records from both datasets, with NaNs where there’s no match.

• Equivalent to SQL FULL OUTER JOIN.


• Used when all data is needed, regardless of matches (e.g., complete customer and
order data).

Code Example:

df_merged = pd.merge(df1, df2, on='id', how='outer')

Use Case:

• Combining all available data for exploratory analysis.

Pros:

• Retains all information.

• Comprehensive for data integration.

Cons:

• Introduces many NaNs, requiring further cleaning.

5. Merging on Index

• Joins datasets based on their index instead of a column.

• Useful when datasets are indexed by a common identifier (e.g., time series data).

Code Example:

df1 = pd.DataFrame({'name': ['Alice', 'Bob']}, index=[1, 2])

df2 = pd.DataFrame({'age': [25, 30]}, index=[1, 2])

df_merged = df1.merge(df2, left_index=True, right_index=True)

6. Merging on Multiple Keys

• Joins datasets using multiple columns as keys for precise matching.

• Used when a single key is insufficient (e.g., matching by customer_id and date).

Code Example:

df1 = pd.DataFrame({'id': [1, 1], 'date': ['2023-01-01', '2023-01-02'], 'name': ['Alice', 'Alice']})

df2 = pd.DataFrame({'id': [1, 1], 'date': ['2023-01-01', '2023-01-03'], 'age': [25, 26]})
df_merged = pd.merge(df1, df2, on=['id', 'date'], how='inner')

Additional Methods for Combining and Merging

1. Joining DataFrames

• A Pandas-specific method to merge DataFrames based on indices or columns.

• Similar to merge but optimized for index-based joins.

Code Example:

df1 = pd.DataFrame({'name': ['Alice', 'Bob']}, index=[1, 2])

df2 = pd.DataFrame({'age': [25, 30]}, index=[1, 2])

df_joined = df1.join(df2)

2. Appending DataFrames (Deprecated)

• A legacy method to append rows (similar to vertical concatenation).

• Replaced by pd.concat in modern Pandas versions.

Code Example:

df1 = pd.DataFrame({'id': [1, 2], 'name': ['Alice', 'Bob']})

df2 = pd.DataFrame({'id': [3, 4], 'name': ['Charlie', 'David']})

df_appended = df1.append(df2, ignore_index=True) # Deprecated

# Use pd.concat instead:

df_appended = pd.concat([df1, df2], ignore_index=True)

Reshaping and Pivoting Data in Python


Reshaping and pivoting are essential data manipulation techniques in Python used to
reorganize datasets to make them more suitable for analysis, visualization, or modeling.

These processes transform the structure of data, such as changing its layout from wide to
long format or creating summary tables.
Python’s Pandas library provides powerful tools like melt, pivot, pivot_table, and others to
perform these tasks efficiently.

What is Reshaping and Pivoting?

Reshaping involves changing the structure or layout of a dataset without altering its
content. It typically transforms data between wide format (where each variable has its own
column) and long format (where variables are stacked into a single column with
corresponding values).

Pivoting is a specific type of reshaping that creates a summary table by spreading data
across rows and columns based on key variables. It is often used to aggregate data or
create cross-tabulations for easier interpretation.

Why Reshape and Pivot Data?

• Improved Analysis: Reorganizes data to match the requirements of statistical methods or visualizations.

• Flexibility: Converts data between formats (e.g., long for time-series, wide for summaries).

• Simplified Summaries: Pivoting creates concise tables for reporting or dashboards.

• Data Compatibility: Prepares data for tools or algorithms that expect specific structures.

• Real-World Relevance: Many datasets require reshaping to align with analytical goals (e.g., converting survey data for analysis).

Key Concepts in Reshaping and Pivoting

1. Wide vs. Long Format:

o Wide Format: Each variable is a column, and each row represents a record
(e.g., columns for sales_2021, sales_2022).

o Long Format: Variables are stacked into a single column, with another
column indicating the variable name (e.g., columns for year and sales).

Methods for Reshaping and Pivoting Data


Below are the key methods for reshaping and pivoting data, with theory, use cases, and
concise code examples.

1. Melting (Wide to Long Format)

• Converts a wide-format dataset to a long format by stacking multiple columns into a single column.

• Used when data needs to be reformatted for time-series analysis, visualization, or statistical modeling.

• Function: pd.melt.

Code Example:

import pandas as pd

df = pd.DataFrame({'id': [1, 2], 'sales_2021': [100, 150], 'sales_2022': [120, 180]})

df_melted = pd.melt(df, id_vars=['id'], value_vars=['sales_2021', 'sales_2022'],
                    var_name='year', value_name='sales')

Key Parameters (pd.melt):

• id_vars: Columns to keep as identifiers (not melted).

• value_vars: Columns to melt into a single column (default: all non-id_vars).

• var_name: Name of the column storing variable names (e.g., year).

• value_name: Name of the column storing values (e.g., sales).

2. Pivoting (Long to Wide Format)

• Converts a long-format dataset to a wide format by spreading a column’s values across multiple columns.

• Creates a table with specified rows, columns, and values.

• Function: pd.pivot.

• Assumes no duplicate entries for the index-column combination (otherwise, use pivot_table).

Code Example:
df = pd.DataFrame({'id': [1, 1, 2], 'year': [2021, 2022, 2021], 'sales': [100, 120, 150]})

df_pivoted = pd.pivot(df, index='id', columns='year', values='sales')

Key Parameters (pd.pivot):

• index: Column to use as rows.

• columns: Column to spread into new columns.

• values: Column containing values to populate the table.

3. Pivot Table (Aggregated Pivoting)

• Similar to pivot, but allows aggregation of duplicate entries using a function (e.g.,
mean, sum).

• Function: pd.pivot_table.

• Used for summarizing data or handling datasets with multiple values for the same
index-column pair.

Code Example:

df = pd.DataFrame({'id': [1, 1, 2], 'year': [2021, 2021, 2022], 'sales': [100, 110, 150]})

df_pivot_table = pd.pivot_table(df, index='id', columns='year', values='sales',
                                aggfunc='mean')

Key Parameters (pd.pivot_table):

• index: Column for rows.

• columns: Column for new columns.

• values: Column to aggregate.

• aggfunc: Aggregation function (e.g., 'mean', 'sum', 'count').

• fill_value: Replace NaNs with a value (e.g., fill_value=0).
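
A brief sketch of aggfunc='sum' combined with fill_value (the values are illustrative): id 1 has no 2022 entry and id 2 has no 2021 entry, so those cells become 0 instead of NaN.

df = pd.DataFrame({'id': [1, 1, 2], 'year': [2021, 2021, 2022], 'sales': [100, 110, 150]})

df_summary = pd.pivot_table(df, index='id', columns='year', values='sales',
                            aggfunc='sum', fill_value=0)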

4. Stacking and Unstacking

• Stacking: Moves column labels into a row index, converting wide data to a long
format (multi-level index).
• Unstacking: Moves row index levels to columns, converting long data to a wide
format.

• Functions: stack, unstack.

• Used for hierarchical indices or multi-level data.

Code Example (Stack):

df = pd.DataFrame({'id': [1, 2], 'sales_2021': [100, 150], 'sales_2022': [120, 180]})

df_stacked = df.set_index('id').stack().reset_index(name='sales')

Code Example (Unstack):

df = pd.DataFrame({'id': [1, 1, 2], 'year': [2021, 2022, 2021], 'sales': [100, 120, 150]})

df_unstacked = df.set_index(['id', 'year'])['sales'].unstack()
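
Stack and unstack are inverses of each other; a short round-trip sketch with a complete id/year grid (sample data):

df = pd.DataFrame({'id': [1, 1, 2, 2], 'year': [2021, 2022, 2021, 2022], 'sales': [100, 120, 150, 180]})

wide = df.set_index(['id', 'year'])['sales'].unstack()   # long -> wide: years become columns

long_again = wide.stack().reset_index(name='sales')      # wide -> long: stacking reverses the unstack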

Data Transformation in Python


Data transformation is a critical step in data preprocessing that involves modifying,
restructuring, or reformatting a dataset to make it more suitable for analysis, visualization,
or machine learning.

It encompasses a wide range of techniques, such as scaling, encoding, aggregating, reshaping, and deriving new features, to ensure data meets the requirements of the intended task.

Python, particularly with the Pandas, NumPy, and Scikit-learn libraries, provides powerful
tools for data transformation.

What is Data Transformation?

Data transformation refers to the process of converting raw or unprocessed data into a
format that is more appropriate for analysis, modeling, or reporting.

It involves applying mathematical, statistical, or logical operations to change the structure, scale, or representation of data. The goal is to enhance data quality, compatibility, and interpretability.

Why Transform Data?

• Improves Analysis: Transformed data is easier to analyze or visualize (e.g., normalized values for comparisons).

• Meets Algorithm Requirements: Many machine learning algorithms require specific data formats (e.g., numerical inputs, scaled features).

• Enhances Interpretability: Derived features or aggregated data provide clearer insights.

• Handles Inconsistencies: Standardizes data to resolve issues like mixed formats or scales.

• Optimizes Performance: Reduces computational complexity or improves model accuracy.
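
A minimal sketch of two common transformations in Pandas, assuming a small illustrative frame (the column names city and price are hypothetical):

import pandas as pd

df = pd.DataFrame({'city': ['A', 'B', 'C'], 'price': [100, 250, 400]})

# Min-max scaling: rescale price to the [0, 1] range
df['price_scaled'] = (df['price'] - df['price'].min()) / (df['price'].max() - df['price'].min())

# One-hot encoding: turn the categorical city column into numeric indicator columns
df_encoded = pd.get_dummies(df, columns=['city'])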

String Manipulation in Python

Introduction to Strings

• A string is a sequence of Unicode characters.

• Strings are immutable — once created, they cannot be changed.

• Python provides many methods for creating, transforming, and analyzing strings.

• String manipulation is essential in data cleaning, processing user inputs, text analysis, etc.

Categories of String Manipulation in Python


String Manipulation can be grouped into different functional categories based on the
kind of operation performed.

1. Accessing Characters and Substrings

Operations where you retrieve part(s) of a string.

• Indexing: Getting a specific character.

• Slicing: Getting a substring (a section of the string).

• Negative indexing: Access from the end.
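
Example (s is just a sample value):

s = "Python"

print(s[0])    # 'P'  (indexing)

print(s[2:5])  # 'tho' (slicing)

print(s[-1])   # 'n'  (negative indexing)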


2. Modifying String Content

Operations that change how the string looks without changing the original.

• Changing case: .lower(), .upper(), .title(), .capitalize(), .swapcase()

• Replacing substrings: .replace(old, new)

• Trimming whitespace: .strip(), .lstrip(), .rstrip()

• Padding strings: .zfill(width)

Example:

s = " hello "

print(s.strip()) # 'hello'

print(s.upper()) # ' HELLO '

3. Combining or Splitting Strings

Operations that join or break strings apart.

• Concatenation: Using +

• Repetition: Using *

• Splitting: .split(separator), .splitlines()

• Joining: .join(list)

Example:

s = "apple,banana,grape"

print(s.split(',')) # ['apple', 'banana', 'grape']

print("-".join(["a", "b", "c"])) # 'a-b-c'

4. Searching and Analyzing Strings

Finding parts of the string or checking properties.

• Searching for substrings: .find(), .rfind(), .index()

• Counting occurrences: .count(substring)

• Checking start/end: .startswith(), .endswith()


• Property checks: .isalpha(), .isdigit(), .isalnum(), .isspace(), etc.

Example:

s = "hello world"

print(s.find("world")) # 6

print(s.isalpha()) # False

5. Formatting Strings

Inserting variables or formatting numbers inside a string.

• Old style: "Hello %s" % name

• str.format() method: "Hello {}".format(name)

• f-strings (modern Python): f"Hello {name}"

Example:

name = "Alice"

print(f"Hello {name}") # 'Hello Alice'

6. Regular Expressions (Advanced String Matching)

Powerful pattern-based searching and transformations.

• Matching (re.match, re.search)

• Finding all matches (re.findall)

• Replacing (re.sub)

• Splitting (re.split)

Example:

import re

print(re.findall(r'\d+', 'abc123xyz456')) # ['123', '456']
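
Continuing the same import, the substitution and splitting helpers listed above work similarly; a short sketch:

print(re.sub(r'\d+', '#', 'abc123xyz456'))   # 'abc#xyz#'

print(re.split(r',\s*', 'a, b,c'))           # ['a', 'b', 'c']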

Basic String Operations


1. String Creation
s = "Hello World"

print(s)

Strings can be single or multi-line.
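
For instance, triple quotes create a multi-line string:

s2 = """Hello
World"""

print(s2)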

2. String Indexing and Slicing

• Strings are indexed (first character at index 0).

• Slicing extracts parts of a string.

• Syntax: string[start:end:step].

• Negative indexing starts from the end.

• Slicing does not modify the original string.

Example:

s = "Python"

print(s[0]) # Output: P

print(s[1:4]) # Output: yth

print(s[-1]) # Output: n

3. String Concatenation and Repetition

• + operator joins two strings.

• * operator repeats the string multiple times.

• Efficient for formatting dynamic content.

• Must ensure both are strings (use str() if needed).

• Because strings are immutable, each concatenation creates a new string; use "".join() when combining many pieces.

Example:

s1 = "Hello"

s2 = "World"

print(s1 + " " + s2) # Hello World

print(s1 * 3) # HelloHelloHello

4. String Methods for Changing Case


• lower() → converts to lowercase.

• upper() → converts to uppercase.

• capitalize() → first letter capital.

• title() → first letter of each word capital.

• swapcase() → switches case (lower ↔ upper).

Example:

s = "hello WORLD"

print(s.lower()) # hello world

print(s.title()) # Hello World

5. Searching for Substrings

• find() → returns first index of substring or -1.

• rfind() → last index from right.

• index() → like find() but raises error if not found.

• Helps in text searching and parsing.

• Efficient even when searching large texts.

Example:

s = "hello world"

print(s.find('world')) # Output: 6

6. String Replacement

• replace(old, new) replaces substrings.

• Returns a new string.

• Replaces all occurrences by default; an optional count argument limits the number of replacements.

• Useful for cleaning unwanted characters.

• Calls can be chained for multiple substitutions.


Example:

s = "bad dog"

print(s.replace('bad', 'good')) # good dog

7. Splitting and Joining Strings

• split(separator) → splits string into list.

• join(iterable) → joins list into string.

• Useful in parsing CSVs, logs, texts.

• splitlines() for line breaks.

• Powerful for preprocessing.

Example:

s = "apple,banana,grape"

fruits = s.split(",")

print(fruits) # ['apple', 'banana', 'grape']

print(",".join(fruits)) # apple,banana,grape

8. Checking String Properties

• isalpha() → all letters.

• isdigit() → all numbers.

• isalnum() → letters and numbers.

• isspace() → only whitespace.

• Good for data validation.

Example:

s = "Python3"

print(s.isalnum()) # True
9. String Formatting

• .format() method.

• f-strings (Python 3.6+).

• % old-style formatting.

• Powerful for inserting dynamic data.

• Readable, clean, and recommended.

Example:

name = "Alice"

print(f"Hello {name}") # Hello Alice

10. Trimming Strings

• strip() removes leading/trailing spaces.

• lstrip(), rstrip() remove from left or right.

• Important for cleaning messy input data.

• Returns a new string.

• Custom characters can also be trimmed.

Example:

s = " Hello World "

print(s.strip()) # Hello World
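
Custom characters can be passed to strip() as well; a short sketch:

s = "***menu***"

print(s.strip('*'))   # 'menu'

print(s.lstrip('*'))  # 'menu***'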

Regular Expressions in Python
