III-Unit

The document outlines methods for data cleaning, including identifying outliers and duplicates, applying fuzzy matching and RegEx for pattern recognition, normalizing and standardizing data, saving the cleaned data, and scripting the cleanup for automation. It also highlights the importance of testing the cleaning process on new data to ensure robustness.

1. Finding Outliers and Bad Data

Outliers are data points that deviate significantly from the rest of the data. Identifying and
handling them is crucial for accurate analysis.

Example:

In a dataset of student ages: [18, 19, 20, 21, 120], the value 120 is an outlier.

Handling:

• Use statistical methods such as the Z-score or the IQR method to identify outliers (both are sketched below).
• Decide whether to remove or cap the outliers, depending on the context.

import numpy as np
from scipy.stats import zscore

data = [18, 19, 20, 21, 120]

# Compute each value's distance from the mean in standard deviations.
z_scores = zscore(data)

# With only five points, the outlier itself inflates the standard deviation and
# pulls every z-score under 2 (120 scores just below 2.0 here), so a strict
# |z| > 2 cutoff would find nothing; a lower threshold suits this small sample.
outliers = np.where(np.abs(z_scores) > 1.5)[0]
print("Outliers:", [data[i] for i in outliers])

2. Finding Duplicates

Duplicates are multiple entries of the same record. They can skew analysis and should be
identified and handled.

Example:

Dataset:

ID  Name   Age
1   John   22
2   Alice  23
1   John   22

Handling:

• Remove duplicates using tools like pandas in Python.

import pandas as pd

# Three rows, where the first and third are exact duplicates.
df = pd.DataFrame({"ID": [1, 2, 1],
                   "Name": ["John", "Alice", "John"],
                   "Age": [22, 23, 22]})
df_cleaned = df.drop_duplicates()
print(df_cleaned)

3. Fuzzy Matching

Fuzzy matching identifies similar strings that may not be identical, useful for correcting typos or
variations.

Example:

Matching "Jonh" to "John".

Handling: Use libraries like fuzzywuzzy.

from fuzzywuzzy import fuzz, process

choices = ["John", "Alice", "Bob"]
name = "Jonh"

# extractOne returns the closest candidate and its similarity score (0-100).
best_match = process.extractOne(name, choices)
print("Best Match:", best_match)

4. RegEx Matching

Regular expressions (RegEx) help find patterns in data, such as phone numbers or email
addresses.

Example:

Finding emails in a text:

"Contact us at support@example.com or info@example.com"

Handling: Use Python's re module.

import re

text = "Contact us at support@example.com or info@example.com"

# \b word boundaries keep matches from bleeding into surrounding text; note
# [A-Za-z]{2,} for the top-level domain (a literal '|' inside a character
# class would wrongly match pipe characters).
emails = re.findall(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b', text)
print("Emails:", emails)

5. Normalizing and Standardizing the Data

Normalization scales data to a range (e.g., 0 to 1), while standardization scales data to have a
mean of 0 and a standard deviation of 1.

Example:

Normalize [10, 20, 30].

Handling:

from sklearn.preprocessing import MinMaxScaler, StandardScaler

# scikit-learn expects a 2-D array: here, one column of three values.
data = [[10], [20], [30]]

# Min-max normalization: rescale to the [0, 1] range.
scaler = MinMaxScaler()
normalized = scaler.fit_transform(data)
print("Normalized:", normalized)

# Standardization: rescale to mean 0 and standard deviation 1.
scaler = StandardScaler()
standardized = scaler.fit_transform(data)
print("Standardized:", standardized)

6. Saving the Data

After cleaning, save the data for further use in formats like CSV, JSON, or databases.

Example:

Save cleaned data to a CSV file.

# Write without the row index so the file round-trips cleanly.
df_cleaned.to_csv("cleaned_data.csv", index=False)
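
Since JSON is also mentioned above, here is an equally small sketch of saving the same frame as JSON (the file name is illustrative):

# One JSON object per row; the file name is illustrative.
df_cleaned.to_json("cleaned_data.json", orient="records")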

7. Determining Suitable Data Cleanup

This step involves analyzing the dataset to decide on the best cleaning strategies, such as:

• Handling missing values.
• Removing or imputing outliers.
• Correcting data entry errors.

Example:

For a dataset with many missing values, you might fill them with the mean or median.
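
A minimal sketch of that idea with pandas (the column names and values are hypothetical):

import pandas as pd
import numpy as np

# Toy frame with gaps; column names are hypothetical.
df = pd.DataFrame({"age": [22, np.nan, 25, 24],
                   "score": [88.0, 92.0, np.nan, 75.0]})

# The median is often preferred over the mean when outliers are present.
df["age"] = df["age"].fillna(df["age"].median())
df["score"] = df["score"].fillna(df["score"].mean())
print(df)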

8. Scripting the Cleanup

Write a reusable script to automate data cleaning.

Example:

A script to clean duplicates and normalize numeric data.
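
One possible shape for such a script, sketched as a reusable function (the name clean_data and the min-max normalization choice are assumptions, not a prescribed design):

import pandas as pd

def clean_data(df: pd.DataFrame) -> pd.DataFrame:
    """Drop duplicate rows and min-max normalize numeric columns (hypothetical helper)."""
    df = df.drop_duplicates().copy()
    for col in df.select_dtypes(include="number").columns:
        col_min, col_max = df[col].min(), df[col].max()
        if col_max > col_min:  # skip constant columns to avoid division by zero
            df[col] = (df[col] - col_min) / (col_max - col_min)
    return df

Keeping the steps inside one function means the same cleanup can be re-applied to any future dataset with a single call.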


9. Testing with New Data

Test the cleaning process on new data to ensure it works robustly.

Example:

Run your cleaning script on a new dataset and check results for correctness.
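
Reusing the hypothetical clean_data function from step 8, a test run might look like this (the file name and the specific checks are illustrative):

import pandas as pd

# Load a fresh dataset and push it through the same pipeline.
new_df = pd.read_csv("new_data.csv")
cleaned = clean_data(new_df)

# Basic sanity checks: no duplicate rows, numeric columns scaled into [0, 1].
assert not cleaned.duplicated().any()
numeric = cleaned.select_dtypes(include="number")
assert ((numeric >= 0) & (numeric <= 1)).all().all()
print("Cleaning script passed basic checks on the new data.")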

By implementing these concepts, you can systematically clean and prepare your dataset for
reliable analysis!
