EXP 5 DE lab

The document provides a comprehensive guide on cleaning, transforming, and enriching data using Python's pandas library. It includes steps for exploratory data analysis, handling missing values, removing duplicates, standardizing text data, and managing outliers. The document features code snippets and outputs demonstrating these processes on a sample dataset.

5. Cleaning, Transforming and Enriching Data


a. Performing exploratory data analysis in Python

import pandas as pd

# Sample dataset
data = {
    "ID": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "Name": ["John Doe", "Jane Smith", "Bob Johnson", "Emily Davis", "Chris Lee",
             "Anna Brown", "David Wilson", "", "Jessica White", "Michael Green"],
    "Product": ["Phone", "Laptop", "Tablet", "Phone", "Laptop", "Tablet", "Phone", "Tablet", "Laptop", "Phone"],
    "Sales": [200.5, 1500, 400, None, 1200, None, 300, 700, 1700, 250],
    "Date": ["2024-01-01", None, "2024-01-03", "2024-01-04", "2024-01-05",
             "2024-01-06", "2024-01-07", "2024-01-08", "2024-01-09", None],
    "Region": ["North", "East", "South", "West", "North", "East", "South", "West", "North", "South"],
    "Discount": [0.1, 0.2, 0.15, 0, None, 0.1, 0.05, 0.2, 0.1, 0.1],
}

# Create DataFrame
df = pd.DataFrame(data)
print(df)

ID Name Product Sales Date Region Discount
0 1 John Doe Phone 200.5 2024-01-01 North 0.10
1 2 Jane Smith Laptop 1500.0 None East 0.20
2 3 Bob Johnson Tablet 400.0 2024-01-03 South 0.15
3 4 Emily Davis Phone NaN 2024-01-04 West 0.00
4 5 Chris Lee Laptop 1200.0 2024-01-05 North NaN
5 6 Anna Brown Tablet NaN 2024-01-06 East 0.10
6 7 David Wilson Phone 300.0 2024-01-07 South 0.05
7 8 Tablet 700.0 2024-01-08 West 0.20
8 9 Jessica White Laptop 1700.0 2024-01-09 North 0.10
9 10 Michael Green Phone 250.0 None South 0.10

1. Check the structure of the dataset:


print("Dataset Structure and First Rows:")
print(df.head())

Dataset Structure and First Rows:


ID Name Product Sales Date Region Discount
0 1 John Doe Phone 200.5 2024-01-01 North 0.10
1 2 Jane Smith Laptop 1500.0 None East 0.20
2 3 Bob Johnson Tablet 400.0 2024-01-03 South 0.15
3 4 Emily Davis Phone NaN 2024-01-04 West 0.00
4 5 Chris Lee Laptop 1200.0 2024-01-05 North NaN

2. Check for missing values:


# Checking for missing values in each column
df1 = df.isnull().sum()
print("\nMissing Values in Each Column:")
print(df1)

Missing Values in Each Column:


ID 0
Name 0
Product 0
Sales 2
Date 2
Region 0
Discount 1
dtype: int64
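
Beyond raw counts, it is often useful to see what fraction of each column is missing. A minimal sketch (not part of the original lab), reusing the same df:

# Percentage of missing values per column: the mean of the boolean mask, times 100
missing_pct = df.isnull().mean() * 100
print("\nPercentage of Missing Values in Each Column:")
print(missing_pct.round(1))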

3. Summary statistics:


# describe() provides a statistical summary of the numeric columns,
# such as mean, std, min, max, and percentiles.
df2 = df.describe()
print("\nSummary Statistics:")
print(df2)

Summary Statistics:
ID Sales Discount
count 10.00000 8.000000 9.000000
mean 5.50000 781.312500 0.111111
std 3.02765 602.270284 0.065085
min 1.00000 200.500000 0.000000
25% 3.25000 287.500000 0.100000
50% 5.50000 550.000000 0.100000
75% 7.75000 1275.000000 0.150000
max 10.00000 1700.000000 0.200000
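
By default, describe() covers only the numeric columns. Passing include='all' also summarizes the object columns (count, number of unique values, most frequent value and its frequency), with NaN wherever a statistic does not apply:

# Summarize every column, including categorical/object ones
print(df.describe(include='all'))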

4. Check for duplicate rows:


df3 = df.duplicated().sum()
print("\nDuplicate Rows in Dataset:")
print(df3)

Duplicate Rows in Dataset:


0
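
With no arguments, duplicated() only flags rows that are identical in every column. If duplicates should be judged on selected columns instead, pass subset=; a sketch, assuming (hypothetically) that a (Name, Product) pair identifies a record:

# Count rows that repeat an earlier (Name, Product) combination
dupes_by_key = df.duplicated(subset=['Name', 'Product']).sum()
print("Duplicate (Name, Product) pairs:", dupes_by_key)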

5. Data types of each column:


print("\nData Types of Each Column:")
print(df.dtypes)

Data Types of Each Column:


ID int64
Name object
Product object
Sales float64
Date object
Region object
Discount float64
dtype: object
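
A compact alternative that combines steps 1, 2 and 5 is df.info(), which reports the index, per-column dtypes, non-null counts, and memory usage in one call:

# One-call overview: dtypes, non-null counts, and memory usage
df.info()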

6. Unique values in categorical columns:


print("\nUnique Values in 'Product' Column:")
print(df['Product'].unique())

print("\nUnique Values in 'Region' Column:")


print(df['Region'].unique())

Unique Values in 'Product' Column:


['Phone' 'Laptop' 'Tablet']

Unique Values in 'Region' Column:


['North' 'East' 'South' 'West']
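
unique() lists the distinct values but not their frequencies; for categorical columns, value_counts() is often more informative:

# Frequency of each category in 'Product'
print(df['Product'].value_counts())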

7. Correlation matrix for numerical columns:


# Calculate and print the correlation matrix for numerical columns
correlation_matrix = df[['Sales', 'Discount']].corr()

print("\nCorrelation Matrix:")
print(correlation_matrix)

Correlation Matrix:
Sales Discount
Sales 1.000000 0.379981
Discount 0.379981 1.000000
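
Here the numeric columns were listed by hand. select_dtypes can pick them up automatically, though note that an identifier column such as 'ID' would then be included, and any correlation involving it is meaningless:

# Correlation across all numeric columns, selected automatically
print(df.select_dtypes(include='number').corr())
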
b. Handling common data issues using pandas

import pandas as pd
import numpy as np

# Step 1: Create the raw dataset
data = {
    "ID": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "Name": [
        "john Doe", "jane Smith", "bob Johnson", "emily Davis", "chris Lee",
        "Anna Brown", "David Wilson", "", "Jessica White", "Michael green"
    ],
    "Product": ["Phone", "Laptop", "Tablet", "Phone", "Laptop", "Tablet", "Phone", "Tablet", "Laptop", "Phone"],
    "Sales": [200.5, 1500, 400, None, 1200, None, 300, 700, 1700, 250],
    "Date": [
        "2024-01-01", None, "2024-01-03", "2024-01-04", "2024-01-05",
        "2024-01-06", "2024-01-07", "2024-01-08", "2024-01-09", None
    ],
    "Region": ["North", "East", "South", "West", "North", "East", "South", "West", "North", "South"],
    "Discount": [0.1, 0.2, 0.15, 0, None, 0.1, 0.05, 0.2, 0.1, 0.1],
}

df = pd.DataFrame(data)

print("Raw Dataset:")
print(df)

Raw Dataset:
ID Name Product Sales Date Region Discount
0 1 john Doe Phone 200.5 2024-01-01 North 0.10
1 2 jane Smith Laptop 1500.0 None East 0.20
2 3 bob Johnson Tablet 400.0 2024-01-03 South 0.15
3 4 emily Davis Phone NaN 2024-01-04 West 0.00
4 5 chris Lee Laptop 1200.0 2024-01-05 North NaN
5 6 Anna Brown Tablet NaN 2024-01-06 East 0.10
6 7 David Wilson Phone 300.0 2024-01-07 South 0.05
7 8 Tablet 700.0 2024-01-08 West 0.20
8 9 Jessica White Laptop 1700.0 2024-01-09 North 0.10
9 10 Michael green Phone 250.0 None South 0.10

1. Fill Missing Values


# Fill missing values in 'Sales' with the column's mean
df['Sales'] = df['Sales'].fillna(df['Sales'].mean())

# Fill missing values in 'Discount' with 0
df['Discount'] = df['Discount'].fillna(0)

# Replace empty strings in 'Name' with NaN, then fill with "Unknown"
df['Name'] = df['Name'].replace('', np.nan).fillna('Unknown')

print("\nDataset After Filling Missing Values:")
print(df)

Dataset After Filling Missing Values:


ID Name Product Sales Date Region Discount
0 1 john Doe Phone 200.5000 2024-01-01 North 0.10
1 2 jane Smith Laptop 1500.0000 None East 0.20
2 3 bob Johnson Tablet 400.0000 2024-01-03 South 0.15
3 4 emily Davis Phone 781.3125 2024-01-04 West 0.00
4 5 chris Lee Laptop 1200.0000 2024-01-05 North 0.00
5 6 Anna Brown Tablet 781.3125 2024-01-06 East 0.10
6 7 David Wilson Phone 300.0000 2024-01-07 South 0.05
7 8 Unknown Tablet 700.0000 2024-01-08 West 0.20
8 9 Jessica White Laptop 1700.0000 2024-01-09 North 0.10
9 10 Michael green Phone 250.0000 None South 0.10
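
Filling 'Sales' with the overall mean ignores that typical sales differ by product. A common refinement, sketched here as a hypothetical alternative to the fillna line above rather than part of the lab, is a group-wise fill:

# Alternative (illustrative): fill missing 'Sales' with the mean of the same Product;
# groupby(...).transform('mean') preserves the original row order and index
df['Sales'] = df['Sales'].fillna(df.groupby('Product')['Sales'].transform('mean'))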

2. Remove Duplicate Rows


# Remove duplicates (if any exist)
df = df.drop_duplicates()
print("\nDataset After Removing Duplicates:")
print(df)

Dataset After Removing Duplicates:


ID Name Product Sales Date Region Discount
0 1 john Doe Phone 200.5000 2024-01-01 North 0.10
1 2 jane Smith Laptop 1500.0000 None East 0.20
2 3 bob Johnson Tablet 400.0000 2024-01-03 South 0.15
3 4 emily Davis Phone 781.3125 2024-01-04 West 0.00
4 5 chris Lee Laptop 1200.0000 2024-01-05 North 0.00
5 6 Anna Brown Tablet 781.3125 2024-01-06 East 0.10
6 7 David Wilson Phone 300.0000 2024-01-07 South 0.05
7 8 Unknown Tablet 700.0000 2024-01-08 West 0.20
8 9 Jessica White Laptop 1700.0000 2024-01-09 North 0.10
9 10 Michael green Phone 250.0000 None South 0.10
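
drop_duplicates() likewise accepts subset= and keep= when rows should be deduplicated on selected columns only. A sketch, again under the assumption that a (Name, Product) pair identifies a record:

# Keep only the first occurrence of each (Name, Product) pair
deduped = df.drop_duplicates(subset=['Name', 'Product'], keep='first')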

3. Standardize Text Data


# Standardize 'Name' column: trim whitespace and apply title case
df['Name'] = df['Name'].str.strip().str.title()

print("\nDataset After Standardizing Text Data:")
print(df)

Dataset After Standardizing Text Data:


ID Name Product Sales Date Region Discount
0 1 John Doe Phone 200.5000 2024-01-01 North 0.10
1 2 Jane Smith Laptop 1500.0000 None East 0.20
2 3 Bob Johnson Tablet 400.0000 2024-01-03 South 0.15
3 4 Emily Davis Phone 781.3125 2024-01-04 West 0.00
4 5 Chris Lee Laptop 1200.0000 2024-01-05 North 0.00
5 6 Anna Brown Tablet 781.3125 2024-01-06 East 0.10
6 7 David Wilson Phone 300.0000 2024-01-07 South 0.05
7 8 Unknown Tablet 700.0000 2024-01-08 West 0.20
8 9 Jessica White Laptop 1700.0000 2024-01-09 North 0.10
9 10 Michael Green Phone 250.0000 None South 0.10
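
The same idea extends to the other text columns; trimming and normalizing case in 'Product' and 'Region' would guard against variants such as ' north' or 'PHONE'. None exist in this sample, so this is purely illustrative:

# Normalize whitespace and case in the remaining text columns
for col in ['Product', 'Region']:
    df[col] = df[col].str.strip().str.title()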

4. Handle Outliers


# Remove outliers from 'Sales' using the IQR method.
# Q1 is the value below which 25% of the data lies;
# Q3 is the value below which 75% of the data lies.
Q1 = df['Sales'].quantile(0.25)
Q3 = df['Sales'].quantile(0.75)
# IQR is the range between Q3 and Q1, representing the middle 50% of the data.
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Record the shape before filtering so the check below can detect dropped rows
original_shape = df.shape
df = df[(df['Sales'] >= lower_bound) & (df['Sales'] <= upper_bound)]

# Check if the dataset was modified
if df.shape != original_shape:
    print("Data has been modified. Outliers were removed.")
else:
    print("Data has not been modified. No outliers were found or removed.")

# Display the modified dataset
print("\nDataset After Removing Outliers:")
print(df)

Data has not been modified. No outliers were found or removed.

Dataset After Removing Outliers:


ID Name Product Sales Date Region Discount
0 1 John Doe Phone 200.5000 2024-01-01 North 0.10
1 2 Jane Smith Laptop 1500.0000 None East 0.20
2 3 Bob Johnson Tablet 400.0000 2024-01-03 South 0.15
3 4 Emily Davis Phone 781.3125 2024-01-04 West 0.00
4 5 Chris Lee Laptop 1200.0000 2024-01-05 North 0.00
5 6 Anna Brown Tablet 781.3125 2024-01-06 East 0.10
6 7 David Wilson Phone 300.0000 2024-01-07 South 0.05
7 8 Unknown Tablet 700.0000 2024-01-08 West 0.20
8 9 Jessica White Laptop 1700.0000 2024-01-09 North 0.10
9 10 Michael Green Phone 250.0000 None South 0.10
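
Dropping rows is not the only way to handle outliers. They can instead be capped ("winsorized") at the IQR fences, which preserves the row count; a sketch reusing the bounds computed above:

# Cap 'Sales' at the IQR fences instead of dropping rows
df['Sales'] = df['Sales'].clip(lower=lower_bound, upper=upper_bound)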

5. Add New Features


# Convert 'Date' to datetime so date components can be extracted
df['Date'] = pd.to_datetime(df['Date'], errors='coerce')

# Add a column for Total Revenue (sales net of discount)
df['Total_Revenue'] = df['Sales'] * (1 - df['Discount'])

# Extract year from the 'Date' column
df['Year'] = df['Date'].dt.year

print("\nFinal Dataset After Adding New Features:")
print(df)

Final Dataset After Adding New Features:


ID Name Product Sales Date Region Discount \
0 1 John Doe Phone 200.5000 2024-01-01 North 0.10
1 2 Jane Smith Laptop 1500.0000 NaT East 0.20
2 3 Bob Johnson Tablet 400.0000 2024-01-03 South 0.15
3 4 Emily Davis Phone 781.3125 2024-01-04 West 0.00
4 5 Chris Lee Laptop 1200.0000 2024-01-05 North 0.00
5 6 Anna Brown Tablet 781.3125 2024-01-06 East 0.10
6 7 David Wilson Phone 300.0000 2024-01-07 South 0.05
7 8 Unknown Tablet 700.0000 2024-01-08 West 0.20
8 9 Jessica White Laptop 1700.0000 2024-01-09 North 0.10
9 10 Michael Green Phone 250.0000 NaT South 0.10

Total_Revenue Year
0 180.45000 2024.0
1 1200.00000 NaN
2 340.00000 2024.0
3 781.31250 2024.0
4 1200.00000 2024.0
5 703.18125 2024.0
6 285.00000 2024.0
7 560.00000 2024.0
8 1530.00000 2024.0
9 225.00000 NaN
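
With Total_Revenue in place, the enriched table supports simple aggregations, for example total revenue per region:

# Sum the enriched Total_Revenue column by region
print(df.groupby('Region')['Total_Revenue'].sum())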
