EXP 5 DE lab

The document provides a comprehensive guide on cleaning, transforming, and enriching data using Python's pandas library. It includes steps for exploratory data analysis, handling missing values, removing duplicates, standardizing text data, and managing outliers. The document features code snippets and outputs demonstrating these processes on a sample dataset.

5. Cleaning, Transforming and Enriching Data


a. Performing exploratory data analysis in Python

import pandas as pd

# Sample dataset
data = {
    "ID": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "Name": ["John Doe", "Jane Smith", "Bob Johnson", "Emily Davis", "Chris Lee",
             "Anna Brown", "David Wilson", "", "Jessica White", "Michael Green"],
    "Product": ["Phone", "Laptop", "Tablet", "Phone", "Laptop", "Tablet", "Phone", "Tablet", "Laptop", "Phone"],
    "Sales": [200.5, 1500, 400, None, 1200, None, 300, 700, 1700, 250],
    "Date": ["2024-01-01", None, "2024-01-03", "2024-01-04", "2024-01-05",
             "2024-01-06", "2024-01-07", "2024-01-08", "2024-01-09", None],
    "Region": ["North", "East", "South", "West", "North", "East", "South", "West", "North", "South"],
    "Discount": [0.1, 0.2, 0.15, 0, None, 0.1, 0.05, 0.2, 0.1, 0.1],
}

# Create DataFrame
df = pd.DataFrame(data)
print(df)

ID Name Product Sales Date Region Discount
0 1 John Doe Phone 200.5 2024-01-01 North 0.10
1 2 Jane Smith Laptop 1500.0 None East 0.20
2 3 Bob Johnson Tablet 400.0 2024-01-03 South 0.15
3 4 Emily Davis Phone NaN 2024-01-04 West 0.00
4 5 Chris Lee Laptop 1200.0 2024-01-05 North NaN
5 6 Anna Brown Tablet NaN 2024-01-06 East 0.10
6 7 David Wilson Phone 300.0 2024-01-07 South 0.05
7 8 Tablet 700.0 2024-01-08 West 0.20
8 9 Jessica White Laptop 1700.0 2024-01-09 North 0.10
9 10 Michael Green Phone 250.0 None South 0.10

1. Check the structure of the dataset:


print("Dataset Structure and First Rows:")
print(df.head())

Dataset Structure and First Rows:


ID Name Product Sales Date Region Discount
0 1 John Doe Phone 200.5 2024-01-01 North 0.10
1 2 Jane Smith Laptop 1500.0 None East 0.20
2 3 Bob Johnson Tablet 400.0 2024-01-03 South 0.15
3 4 Emily Davis Phone NaN 2024-01-04 West 0.00
4 5 Chris Lee Laptop 1200.0 2024-01-05 North NaN

2. Check for missing values:


# Checking for missing values in each column
df1 = df.isnull().sum()
print("\nMissing Values in Each Column:")
print(df1)

Missing Values in Each Column:


ID 0
Name 0
Product 0
Sales 2
Date 2
Region 0
Discount 1
dtype: int64
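
Beyond raw counts, it is often useful to see what fraction of each column is missing. A minimal sketch (not part of the original lab), reusing the same df:

# Percentage of missing values per column: the mean of the boolean mask, times 100
missing_pct = df.isnull().mean() * 100
print("\nPercentage of Missing Values in Each Column:")
print(missing_pct.round(1))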

3. Summary statistics:


# describe() provides a statistical summary of the numeric columns,
# such as mean, std, min, max, and percentiles.
df2 = df.describe()
print("\nSummary Statistics:")
print(df2)

Summary Statistics:
ID Sales Discount
count 10.00000 8.000000 9.000000
mean 5.50000 781.312500 0.111111
std 3.02765 602.270284 0.065085
min 1.00000 200.500000 0.000000
25% 3.25000 287.500000 0.100000
50% 5.50000 550.000000 0.100000
75% 7.75000 1275.000000 0.150000
max 10.00000 1700.000000 0.200000
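
By default, describe() covers only the numeric columns. Passing include='all' also summarizes the object columns (count, number of unique values, most frequent value and its frequency), with NaN wherever a statistic does not apply:

# Summarize every column, including categorical/object ones
print(df.describe(include='all'))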

4. Check for duplicate rows:


df3 = df.duplicated().sum()
print("\nDuplicate Rows in Dataset:")
print(df3)

Duplicate Rows in Dataset:


0
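
With no arguments, duplicated() only flags rows that are identical in every column. If duplicates should be judged on selected columns instead, pass subset=; a sketch, assuming (hypothetically) that a (Name, Product) pair identifies a record:

# Count rows that repeat an earlier (Name, Product) combination
dupes_by_key = df.duplicated(subset=['Name', 'Product']).sum()
print("Duplicate (Name, Product) pairs:", dupes_by_key)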

5. Data types of each column:


print("\nData Types of Each Column:")
print(df.dtypes)

Data Types of Each Column:


ID int64
Name object
Product object
Sales float64
Date object
Region object
Discount float64
dtype: object
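
A compact alternative that combines steps 1, 2 and 5 is df.info(), which reports the index, per-column dtypes, non-null counts, and memory usage in one call:

# One-call overview: dtypes, non-null counts, and memory usage
df.info()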

6. Unique values in categorical columns:


print("\nUnique Values in 'Product' Column:")
print(df['Product'].unique())

print("\nUnique Values in 'Region' Column:")


print(df['Region'].unique())

Unique Values in 'Product' Column:


['Phone' 'Laptop' 'Tablet']

Unique Values in 'Region' Column:


['North' 'East' 'South' 'West']
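
unique() lists the distinct values but not their frequencies; for categorical columns, value_counts() is often more informative:

# Frequency of each category in 'Product'
print(df['Product'].value_counts())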

7. Correlation matrix for numerical columns:


# Calculate and print the correlation matrix for numerical columns
correlation_matrix = df[['Sales', 'Discount']].corr()

print("\nCorrelation Matrix:")
print(correlation_matrix)

Correlation Matrix:
Sales Discount
Sales 1.000000 0.379981
Discount 0.379981 1.000000
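
Here the numeric columns were listed by hand. select_dtypes can pick them up automatically, though note that an identifier column such as 'ID' would then be included, and any correlation involving it is meaningless:

# Correlation across all numeric columns, selected automatically
print(df.select_dtypes(include='number').corr())
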
b. Handling common data issues using pandas

import pandas as pd
import numpy as np

# Step 1: Create the raw dataset
data = {
    "ID": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "Name": [
        "john Doe", "jane Smith", "bob Johnson", "emily Davis", "chris Lee",
        "Anna Brown", "David Wilson", "", "Jessica White", "Michael green"
    ],
    "Product": ["Phone", "Laptop", "Tablet", "Phone", "Laptop", "Tablet", "Phone", "Tablet", "Laptop", "Phone"],
    "Sales": [200.5, 1500, 400, None, 1200, None, 300, 700, 1700, 250],
    "Date": [
        "2024-01-01", None, "2024-01-03", "2024-01-04", "2024-01-05",
        "2024-01-06", "2024-01-07", "2024-01-08", "2024-01-09", None
    ],
    "Region": ["North", "East", "South", "West", "North", "East", "South", "West", "North", "South"],
    "Discount": [0.1, 0.2, 0.15, 0, None, 0.1, 0.05, 0.2, 0.1, 0.1],
}

df = pd.DataFrame(data)

print("Raw Dataset:")
print(df)

Raw Dataset:
ID Name Product Sales Date Region Discount
0 1 john Doe Phone 200.5 2024-01-01 North 0.10
1 2 jane Smith Laptop 1500.0 None East 0.20
2 3 bob Johnson Tablet 400.0 2024-01-03 South 0.15
3 4 emily Davis Phone NaN 2024-01-04 West 0.00
4 5 chris Lee Laptop 1200.0 2024-01-05 North NaN
5 6 Anna Brown Tablet NaN 2024-01-06 East 0.10
6 7 David Wilson Phone 300.0 2024-01-07 South 0.05
7 8 Tablet 700.0 2024-01-08 West 0.20
8 9 Jessica White Laptop 1700.0 2024-01-09 North 0.10
9 10 Michael green Phone 250.0 None South 0.10

1. Fill Missing Values


# Fill missing values in 'Sales' with the column's mean
df['Sales'] = df['Sales'].fillna(df['Sales'].mean())

# Fill missing values in 'Discount' with 0
df['Discount'] = df['Discount'].fillna(0)

# Replace empty strings in 'Name' with NaN, then fill with "Unknown"
df['Name'] = df['Name'].replace('', np.nan).fillna('Unknown')

print("\nDataset After Filling Missing Values:")
print(df)

Dataset After Filling Missing Values:


ID Name Product Sales Date Region Discount
0 1 john Doe Phone 200.5000 2024-01-01 North 0.10
1 2 jane Smith Laptop 1500.0000 None East 0.20
2 3 bob Johnson Tablet 400.0000 2024-01-03 South 0.15
3 4 emily Davis Phone 781.3125 2024-01-04 West 0.00
4 5 chris Lee Laptop 1200.0000 2024-01-05 North 0.00
5 6 Anna Brown Tablet 781.3125 2024-01-06 East 0.10
6 7 David Wilson Phone 300.0000 2024-01-07 South 0.05
7 8 Unknown Tablet 700.0000 2024-01-08 West 0.20
8 9 Jessica White Laptop 1700.0000 2024-01-09 North 0.10
9 10 Michael green Phone 250.0000 None South 0.10
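
Filling 'Sales' with the overall mean ignores that typical sales differ by product. A common refinement, sketched here as a hypothetical alternative to the fillna line above rather than part of the lab, is a group-wise fill:

# Alternative (illustrative): fill missing 'Sales' with the mean of the same Product;
# groupby(...).transform('mean') preserves the original row order and index
df['Sales'] = df['Sales'].fillna(df.groupby('Product')['Sales'].transform('mean'))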

2. Remove Duplicate Rows


# Remove duplicates (if any exist)
df = df.drop_duplicates()
print("\nDataset After Removing Duplicates:")
print(df)

Dataset After Removing Duplicates:


ID Name Product Sales Date Region Discount
0 1 john Doe Phone 200.5000 2024-01-01 North 0.10
1 2 jane Smith Laptop 1500.0000 None East 0.20
2 3 bob Johnson Tablet 400.0000 2024-01-03 South 0.15
3 4 emily Davis Phone 781.3125 2024-01-04 West 0.00
4 5 chris Lee Laptop 1200.0000 2024-01-05 North 0.00
5 6 Anna Brown Tablet 781.3125 2024-01-06 East 0.10
6 7 David Wilson Phone 300.0000 2024-01-07 South 0.05
7 8 Unknown Tablet 700.0000 2024-01-08 West 0.20
8 9 Jessica White Laptop 1700.0000 2024-01-09 North 0.10
9 10 Michael green Phone 250.0000 None South 0.10
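
drop_duplicates() likewise accepts subset= and keep= when rows should be deduplicated on selected columns only. A sketch, again under the assumption that a (Name, Product) pair identifies a record:

# Keep only the first occurrence of each (Name, Product) pair
deduped = df.drop_duplicates(subset=['Name', 'Product'], keep='first')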

3. Standardize Text Data


# Standardize 'Name' column: trim whitespace and apply title case
df['Name'] = df['Name'].str.strip().str.title()

print("\nDataset After Standardizing Text Data:")
print(df)

Dataset After Standardizing Text Data:


ID Name Product Sales Date Region Discount
0 1 John Doe Phone 200.5000 2024-01-01 North 0.10
1 2 Jane Smith Laptop 1500.0000 None East 0.20
2 3 Bob Johnson Tablet 400.0000 2024-01-03 South 0.15
3 4 Emily Davis Phone 781.3125 2024-01-04 West 0.00
4 5 Chris Lee Laptop 1200.0000 2024-01-05 North 0.00
5 6 Anna Brown Tablet 781.3125 2024-01-06 East 0.10
6 7 David Wilson Phone 300.0000 2024-01-07 South 0.05
7 8 Unknown Tablet 700.0000 2024-01-08 West 0.20
8 9 Jessica White Laptop 1700.0000 2024-01-09 North 0.10
9 10 Michael Green Phone 250.0000 None South 0.10
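
The same idea extends to the other text columns; trimming and normalizing case in 'Product' and 'Region' would guard against variants such as ' north' or 'PHONE'. None exist in this sample, so this is purely illustrative:

# Normalize whitespace and case in the remaining text columns
for col in ['Product', 'Region']:
    df[col] = df[col].str.strip().str.title()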

4. Handle Outliers


# Remove outliers from 'Sales' using the IQR method.
# Q1 is the value below which 25% of the data lies;
# Q3 is the value below which 75% of the data lies.
Q1 = df['Sales'].quantile(0.25)
Q3 = df['Sales'].quantile(0.75)
# IQR is the range between Q3 and Q1, representing the middle 50% of the data.
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Record the shape before filtering so the check below can detect dropped rows
original_shape = df.shape
df = df[(df['Sales'] >= lower_bound) & (df['Sales'] <= upper_bound)]

# Check if the dataset was modified
if df.shape != original_shape:
    print("Data has been modified. Outliers were removed.")
else:
    print("Data has not been modified. No outliers were found or removed.")

# Display the modified dataset
print("\nDataset After Removing Outliers:")
print(df)

Data has not been modified. No outliers were found or removed.

Dataset After Removing Outliers:


ID Name Product Sales Date Region Discount
0 1 John Doe Phone 200.5000 2024-01-01 North 0.10
1 2 Jane Smith Laptop 1500.0000 None East 0.20
2 3 Bob Johnson Tablet 400.0000 2024-01-03 South 0.15
3 4 Emily Davis Phone 781.3125 2024-01-04 West 0.00
4 5 Chris Lee Laptop 1200.0000 2024-01-05 North 0.00
5 6 Anna Brown Tablet 781.3125 2024-01-06 East 0.10
6 7 David Wilson Phone 300.0000 2024-01-07 South 0.05
7 8 Unknown Tablet 700.0000 2024-01-08 West 0.20
8 9 Jessica White Laptop 1700.0000 2024-01-09 North 0.10
9 10 Michael Green Phone 250.0000 None South 0.10
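
Dropping rows is not the only way to handle outliers. They can instead be capped ("winsorized") at the IQR fences, which preserves the row count; a sketch reusing the bounds computed above:

# Cap 'Sales' at the IQR fences instead of dropping rows
df['Sales'] = df['Sales'].clip(lower=lower_bound, upper=upper_bound)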

5. Add New Features


# Convert 'Date' to datetime so date components can be extracted
df['Date'] = pd.to_datetime(df['Date'], errors='coerce')

# Add a column for Total Revenue (sales net of discount)
df['Total_Revenue'] = df['Sales'] * (1 - df['Discount'])

# Extract year from the 'Date' column
df['Year'] = df['Date'].dt.year

print("\nFinal Dataset After Adding New Features:")
print(df)

Final Dataset After Adding New Features:


ID Name Product Sales Date Region Discount \
0 1 John Doe Phone 200.5000 2024-01-01 North 0.10
1 2 Jane Smith Laptop 1500.0000 NaT East 0.20
2 3 Bob Johnson Tablet 400.0000 2024-01-03 South 0.15
3 4 Emily Davis Phone 781.3125 2024-01-04 West 0.00
4 5 Chris Lee Laptop 1200.0000 2024-01-05 North 0.00
5 6 Anna Brown Tablet 781.3125 2024-01-06 East 0.10
6 7 David Wilson Phone 300.0000 2024-01-07 South 0.05
7 8 Unknown Tablet 700.0000 2024-01-08 West 0.20
8 9 Jessica White Laptop 1700.0000 2024-01-09 North 0.10
9 10 Michael Green Phone 250.0000 NaT South 0.10

Total_Revenue Year
0 180.45000 2024.0
1 1200.00000 NaN
2 340.00000 2024.0
3 781.31250 2024.0
4 1200.00000 2024.0
5 703.18125 2024.0
6 285.00000 2024.0
7 560.00000 2024.0
8 1530.00000 2024.0
9 225.00000 NaN
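
With Total_Revenue in place, the enriched table supports simple aggregations, for example total revenue per region:

# Sum the enriched Total_Revenue column by region
print(df.groupby('Region')['Total_Revenue'].sum())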
