EXP 5 DE lab
import pandas as pd
# Sample dataset
data = {
"ID": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
"Name": ["John Doe", "Jane Smith", "Bob Johnson", "Emily Davis", "Chris Lee",
"Anna Brown", "David Wilson", "", "Jessica White", "Michael Green"],
"Product": ["Phone", "Laptop", "Tablet", "Phone", "Laptop", "Tablet", "Phone", "Tablet", "Laptop", "Phone"],
"Sales": [200.5, 1500, 400, None, 1200, None, 300, 700, 1700, 250],
"Date": ["2024-01-01", None, "2024-01-03", "2024-01-04", "2024-01-05",
"2024-01-06", "2024-01-07", "2024-01-08", "2024-01-09", None],
"Region": ["North", "East", "South", "West", "North", "East", "South", "West", "North", "South"],
"Discount": [0.1, 0.2, 0.15, 0, None, 0.1, 0.05, 0.2, 0.1, 0.1],
}
# Create DataFrame
df = pd.DataFrame(data)
print(df)
Summary Statistics:
             ID        Sales  Discount
count  10.00000     8.000000  9.000000
mean    5.50000   781.312500  0.111111
std     3.02765   602.270284  0.065085
min     1.00000   200.500000  0.000000
25%     3.25000   287.500000  0.100000
50%     5.50000   550.000000  0.100000
75%     7.75000  1275.000000  0.150000
max    10.00000  1700.000000  0.200000
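The summary table above has the shape of pandas `describe()` output, but the cell that produced it is not shown. A minimal sketch that reproduces it, assuming the lab simply called `describe()` on the DataFrame (only the numeric columns of the sample dataset are repeated here):

```python
import pandas as pd

# Numeric columns copied from the lab's sample dataset.
df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "Sales": [200.5, 1500, 400, None, 1200, None, 300, 700, 1700, 250],
    "Discount": [0.1, 0.2, 0.15, 0, None, 0.1, 0.05, 0.2, 0.1, 0.1],
})

# describe() ignores NaN values, so 'count' differs per column
# (8 for Sales, 9 for Discount).
summary = df.describe()
print("Summary Statistics:")
print(summary)
```

On the full DataFrame, `describe()` skips the text columns by default, which is why only ID, Sales and Discount appear.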
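# Pearson correlation between the numeric columns 'Sales' and 'Discount'
correlation_matrix = df[["Sales", "Discount"]].corr()
print("\nCorrelation Matrix:")
print(correlation_matrix)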
Correlation Matrix:
             Sales  Discount
Sales     1.000000  0.379981
Discount  0.379981  1.000000
b. Handling common data issues using pandas
import pandas as pd
import numpy as np
df = pd.DataFrame(data)
print("Raw Dataset:")
print(df)
Raw Dataset:
   ID           Name  Product   Sales        Date  Region  Discount
0   1       john Doe    Phone   200.5  2024-01-01   North      0.10
1   2     jane Smith   Laptop  1500.0        None    East      0.20
2   3    bob Johnson   Tablet   400.0  2024-01-03   South      0.15
3   4    emily Davis    Phone     NaN  2024-01-04    West      0.00
4   5      chris Lee   Laptop  1200.0  2024-01-05   North       NaN
5   6     Anna Brown   Tablet     NaN  2024-01-06    East      0.10
6   7   David Wilson    Phone   300.0  2024-01-07   South      0.05
7   8                  Tablet   700.0  2024-01-08    West      0.20
8   9  Jessica White   Laptop  1700.0  2024-01-09   North      0.10
9  10  Michael green    Phone   250.0        None   South      0.10
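Before fixing anything, it helps to count what is actually missing. The sketch below is an illustrative check, not part of the original lab cells; the variable names `missing_counts` and `empty_names` are made up for this example. It also shows that the blank name in row 7 is an empty string, not NaN, so `isna()` alone does not catch it.

```python
import pandas as pd

# Columns with data issues, copied from the lab's sample dataset.
df = pd.DataFrame({
    "Name": ["John Doe", "Jane Smith", "Bob Johnson", "Emily Davis", "Chris Lee",
             "Anna Brown", "David Wilson", "", "Jessica White", "Michael Green"],
    "Sales": [200.5, 1500, 400, None, 1200, None, 300, 700, 1700, 250],
    "Date": ["2024-01-01", None, "2024-01-03", "2024-01-04", "2024-01-05",
             "2024-01-06", "2024-01-07", "2024-01-08", "2024-01-09", None],
    "Discount": [0.1, 0.2, 0.15, 0, None, 0.1, 0.05, 0.2, 0.1, 0.1],
})

# Count true missing values (None/NaN) per column.
missing_counts = df.isna().sum()
print(missing_counts)

# Empty strings are not NaN, so they need a separate check.
empty_names = int((df["Name"].str.strip() == "").sum())
print("Empty names:", empty_names)
```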
# Replace empty strings in 'Name' with NaN, then fill with "Unknown"
df['Name'] = df['Name'].replace('', np.nan).fillna('Unknown')
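The raw output also shows inconsistent capitalization ("john Doe", "Michael green"). One possible extension of the step above, offered only as a sketch: strip whitespace first so that whitespace-only names are also treated as missing, then normalize case with `str.title()`. The whitespace-only entry below is a hypothetical value added for illustration; it does not occur in the lab data.

```python
import numpy as np
import pandas as pd

# A few names as they appear in the raw output, plus one hypothetical
# whitespace-only entry to show why strip() comes first.
names = pd.Series(["john Doe", "Anna Brown", "", "  ", "Michael green"])

cleaned = (
    names.str.strip()        # "  " becomes ""
    .replace("", np.nan)     # empty strings become missing values
    .fillna("Unknown")       # missing values get a placeholder
    .str.title()             # "john Doe" -> "John Doe"
)
print(cleaned.tolist())
```

Note that `str.title()` lowercases every letter after the first in each word, so a name such as "McDonald" would become "Mcdonald"; whether that is acceptable depends on the data.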
# Compute the IQR-based bounds used to flag outliers in 'Sales'
Q1 = df['Sales'].quantile(0.25)
Q3 = df['Sales'].quantile(0.75)
# IQR is the range between Q1 and Q3, covering the middle 50% of the data.
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
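The bounds above are computed but the cell that uses them is not shown. A minimal sketch of how they could be applied, assuming the quantiles are taken on the Sales values as given (before any missing values are filled); the mask name `is_outlier` is an assumption for this example:

```python
import pandas as pd

# 'Sales' values copied from the lab's sample dataset.
sales = pd.Series(
    [200.5, 1500, 400, None, 1200, None, 300, 700, 1700, 250], name="Sales"
)

# quantile() skips NaN values.
q1 = sales.quantile(0.25)
q3 = sales.quantile(0.75)
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Comparisons with NaN evaluate to False, so missing values are not flagged.
is_outlier = (sales < lower_bound) | (sales > upper_bound)

print("Bounds:", lower_bound, upper_bound)
print("Outliers found:", int(is_outlier.sum()))
```

With this dataset Q1 = 287.5 and Q3 = 1275.0 (matching the summary statistics), so the bounds come out at -1193.75 and 2756.25 and no Sales value is flagged.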
   Total_Revenue    Year
0      180.45000  2024.0
1     1200.00000     NaN
2      340.00000  2024.0
3      781.31250  2024.0
4     1200.00000  2024.0
5      703.18125  2024.0
6      285.00000  2024.0
7      560.00000  2024.0
8     1530.00000  2024.0
9      225.00000     NaN
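The cell that produced the Total_Revenue and Year columns is missing from this export. The values above are consistent with the following reconstruction, which is an inference from the numbers and not the confirmed lab code: missing Sales filled with the column mean (781.3125), missing Discount filled with 0, Total_Revenue computed as Sales * (1 - Discount), and Year taken from the parsed Date.

```python
import pandas as pd

# Relevant columns copied from the lab's sample dataset.
df = pd.DataFrame({
    "Sales": [200.5, 1500, 400, None, 1200, None, 300, 700, 1700, 250],
    "Date": ["2024-01-01", None, "2024-01-03", "2024-01-04", "2024-01-05",
             "2024-01-06", "2024-01-07", "2024-01-08", "2024-01-09", None],
    "Discount": [0.1, 0.2, 0.15, 0, None, 0.1, 0.05, 0.2, 0.1, 0.1],
})

# Assumption: missing Sales take the column mean; a missing Discount
# is treated as "no discount" (0).
df["Sales"] = df["Sales"].fillna(df["Sales"].mean())
df["Discount"] = df["Discount"].fillna(0)

# Parse dates; missing entries become NaT.
df["Date"] = pd.to_datetime(df["Date"], errors="coerce")

# Derived columns.
df["Total_Revenue"] = df["Sales"] * (1 - df["Discount"])
df["Year"] = df["Date"].dt.year  # float column, NaN where the date is missing

print(df[["Total_Revenue", "Year"]])
```

Rows 1 and 9 have no date, so their Year stays NaN, and the presence of NaN is what makes the whole column print as floats (2024.0).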