III-Unit
1. Finding Outliers
Outliers are data points that deviate significantly from the rest of the data. Identifying and handling them is crucial for accurate analysis.
Example:
In a dataset of student ages: [18, 19, 20, 21, 120], the value 120 is an outlier.
Handling:
Use statistical methods like the Z-score or the IQR method to identify outliers.
Decide to either remove or cap the outliers depending on the context.
import numpy as np
from scipy.stats import zscore
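Continuing from the imports above, a minimal sketch (the ages array mirrors the example, and the 1.5-IQR fences are the usual convention) showing both detection methods and both handling options:
ages = np.array([18, 19, 20, 21, 120])

# Z-score method: 120 has by far the largest |z| value
print("Z-scores:", zscore(ages))

# IQR method: values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] count as outliers
q1, q3 = np.percentile(ages, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

print("Outliers removed:", ages[(ages >= lower) & (ages <= upper)])
print("Outliers capped:", np.clip(ages, lower, upper))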
2. Finding Duplicates
Duplicates are multiple entries of the same record. They can skew analysis and should be
identified and handled.
Example:
Dataset:
ID Name Age
1 John 22
2 Alice 23
1 John 22
Handling:
import pandas as pd
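Continuing from the import above, a minimal sketch (the DataFrame literal simply mirrors the small table shown) that removes the repeated record:
df = pd.DataFrame({"ID": [1, 2, 1],
                   "Name": ["John", "Alice", "John"],
                   "Age": [22, 23, 22]})

# drop_duplicates() keeps the first occurrence of each fully identical row
print(df.drop_duplicates())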
3. Fuzzy Matching
Fuzzy matching identifies similar strings that may not be identical, which is useful for correcting typos or spelling variations.
Example:
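A minimal sketch using the standard-library difflib (the city list, the misspelled query, and the 0.7 cutoff are illustrative):
import difflib

cities = ["New York", "Los Angeles", "Chicago"]
# get_close_matches returns known entries that are similar enough to the query
print(difflib.get_close_matches("New Yrok", cities, n=1, cutoff=0.7))  # ['New York']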
4. RegEx Matching
Regular expressions (RegEx) help find patterns in data, such as phone numbers or email
addresses.
Example:
import re
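Continuing from the import above, a minimal sketch (the sample text and the pattern are illustrative, not a fully general email regex) that pulls email addresses out of a string:
text = "Contact support@example.com or sales@example.org for help."
emails = re.findall(r"[\w.+-]+@[\w-]+\.[\w.-]+", text)
print(emails)  # ['support@example.com', 'sales@example.org']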
5. Normalization and Standardization
Normalization scales data to a fixed range (e.g., 0 to 1), while standardization rescales data to have a mean of 0 and a standard deviation of 1.
Example:
from sklearn.preprocessing import StandardScaler

data = [[18], [19], [20], [21], [120]]  # one feature column (the ages example)
scaler = StandardScaler()
standardized = scaler.fit_transform(data)  # the column now has mean 0 and standard deviation 1
print("Standardized:", standardized)
After cleaning, save the data for further use in formats like CSV, JSON, or databases.
Example:
df_cleaned.to_csv("cleaned_data.csv", index=False)
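For the JSON case, a one-line sketch along the same lines (assuming the same df_cleaned DataFrame; the filename is illustrative):
df_cleaned.to_json("cleaned_data.json", orient="records")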
Analyze the dataset first to decide on the best cleaning strategies, such as how to handle missing values, duplicates, and outliers.
Example:
For a dataset with many missing values, you might fill them with the mean or median.
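A minimal sketch of that idea (the DataFrame and the Age column are illustrative):
import pandas as pd

df = pd.DataFrame({"Age": [18, None, 20, None, 21]})
df["Age"] = df["Age"].fillna(df["Age"].median())  # or .mean(), depending on the distribution
print(df)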
Finally, test the pipeline: run your cleaning script on a new dataset and check the results for correctness.
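A sketch of such checks (assuming the cleaned frame is called df_cleaned, as above):
# Simple sanity checks after cleaning
assert df_cleaned.isna().sum().sum() == 0, "missing values remain"
assert df_cleaned.duplicated().sum() == 0, "duplicate rows remain"
print("Cleaning checks passed")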
By implementing these concepts, you can systematically clean and prepare your dataset for
reliable analysis!