III-Unit

The document outlines methods for data cleaning, including identifying outliers and duplicates, applying fuzzy matching and RegEx for pattern recognition, normalizing and standardizing data, saving the cleaned data, and scripting the cleanup for automation. It also highlights the importance of testing the cleaning process on new data to ensure robustness.

1. Finding Outliers and Bad Data

Outliers are data points that deviate significantly from the rest of the data. Identifying and
handling them is crucial for accurate analysis.

Example:

In a dataset of student ages: [18, 19, 20, 21, 120], the value 120 is an outlier.

Handling:

• Use statistical methods such as the Z-score or the IQR method to identify outliers (both are sketched below).
• Decide whether to remove or cap the outliers, depending on the context.

import numpy as np
from scipy.stats import zscore

data = [18, 19, 20, 21, 120]

# Compute each value's distance from the mean in standard deviations.
z_scores = zscore(data)

# With only five points, the outlier itself inflates the standard deviation and
# pulls every z-score under 2 (120 scores just below 2.0 here), so a strict
# |z| > 2 cutoff would find nothing; a lower threshold suits this small sample.
outliers = np.where(np.abs(z_scores) > 1.5)[0]
print("Outliers:", [data[i] for i in outliers])

2. Finding Duplicates

Duplicates are multiple entries of the same record. They can skew analysis and should be
identified and handled.

Example:

Dataset:

ID  Name   Age
1   John   22
2   Alice  23
1   John   22

Handling:

• Remove duplicates using tools like pandas in Python.

import pandas as pd

# Three rows, where the first and third are exact duplicates.
df = pd.DataFrame({"ID": [1, 2, 1],
                   "Name": ["John", "Alice", "John"],
                   "Age": [22, 23, 22]})
df_cleaned = df.drop_duplicates()
print(df_cleaned)

3. Fuzzy Matching

Fuzzy matching identifies similar strings that may not be identical, useful for correcting typos or
variations.

Example:

Matching "Jonh" to "John".

Handling: Use libraries like fuzzywuzzy.

from fuzzywuzzy import fuzz, process

choices = ["John", "Alice", "Bob"]
name = "Jonh"

# extractOne returns the closest candidate and its similarity score (0-100).
best_match = process.extractOne(name, choices)
print("Best Match:", best_match)

4. RegEx Matching

Regular expressions (RegEx) help find patterns in data, such as phone numbers or email
addresses.

Example:

Finding emails in a text:

"Contact us at support@example.com or info@example.com"

Handling: Use Python's re module.

import re

text = "Contact us at support@example.com or info@example.com"

# \b word boundaries keep matches from bleeding into surrounding text; note
# [A-Za-z]{2,} for the top-level domain (a literal '|' inside a character
# class would wrongly match pipe characters).
emails = re.findall(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b', text)
print("Emails:", emails)

5. Normalizing and Standardizing the Data

Normalization scales data to a range (e.g., 0 to 1), while standardization scales data to have a
mean of 0 and a standard deviation of 1.

Example:

Normalize [10, 20, 30].

Handling:

from sklearn.preprocessing import MinMaxScaler, StandardScaler

# scikit-learn expects a 2-D array: here, one column of three values.
data = [[10], [20], [30]]

# Min-max normalization: rescale to the [0, 1] range.
scaler = MinMaxScaler()
normalized = scaler.fit_transform(data)
print("Normalized:", normalized)

# Standardization: rescale to mean 0 and standard deviation 1.
scaler = StandardScaler()
standardized = scaler.fit_transform(data)
print("Standardized:", standardized)

6. Saving the Data

After cleaning, save the data for further use in formats like CSV, JSON, or databases.

Example:

Save cleaned data to a CSV file.

# Write without the row index so the file round-trips cleanly.
df_cleaned.to_csv("cleaned_data.csv", index=False)
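
Since JSON is also mentioned above, here is an equally small sketch of saving the same frame as JSON (the file name is illustrative):

# One JSON object per row; the file name is illustrative.
df_cleaned.to_json("cleaned_data.json", orient="records")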

7. Determining Suitable Data Cleanup

This step involves analyzing the dataset to decide on the best cleaning strategies, such as:

• Handling missing values.
• Removing or imputing outliers.
• Correcting data entry errors.

Example:

For a dataset with many missing values, you might fill them with the mean or median.
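
A minimal sketch of that idea with pandas (the column names and values are hypothetical):

import pandas as pd
import numpy as np

# Toy frame with gaps; column names are hypothetical.
df = pd.DataFrame({"age": [22, np.nan, 25, 24],
                   "score": [88.0, 92.0, np.nan, 75.0]})

# The median is often preferred over the mean when outliers are present.
df["age"] = df["age"].fillna(df["age"].median())
df["score"] = df["score"].fillna(df["score"].mean())
print(df)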

8. Scripting the Cleanup

Write a reusable script to automate data cleaning.

Example:

A script to clean duplicates and normalize numeric data.
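
One possible shape for such a script, sketched as a reusable function (the name clean_data and the min-max normalization choice are assumptions, not a prescribed design):

import pandas as pd

def clean_data(df: pd.DataFrame) -> pd.DataFrame:
    """Drop duplicate rows and min-max normalize numeric columns (hypothetical helper)."""
    df = df.drop_duplicates().copy()
    for col in df.select_dtypes(include="number").columns:
        col_min, col_max = df[col].min(), df[col].max()
        if col_max > col_min:  # skip constant columns to avoid division by zero
            df[col] = (df[col] - col_min) / (col_max - col_min)
    return df

Keeping the steps inside one function means the same cleanup can be re-applied to any future dataset with a single call.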


9. Testing with New Data

Test the cleaning process on new data to ensure it works robustly.

Example:

Run your cleaning script on a new dataset and check results for correctness.
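
Reusing the hypothetical clean_data function from step 8, a test run might look like this (the file name and the specific checks are illustrative):

import pandas as pd

# Load a fresh dataset and push it through the same pipeline.
new_df = pd.read_csv("new_data.csv")
cleaned = clean_data(new_df)

# Basic sanity checks: no duplicate rows, numeric columns scaled into [0, 1].
assert not cleaned.duplicated().any()
numeric = cleaned.select_dtypes(include="number")
assert ((numeric >= 0) & (numeric <= 1)).all().all()
print("Cleaning script passed basic checks on the new data.")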

By implementing these concepts, you can systematically clean and prepare your dataset for
reliable analysis!
