DSBDA Lab Assignment No 2

The assignment requires students to create an 'Academic performance' dataset and perform data wrangling operations using Python, including handling missing values, detecting outliers, and applying data transformations. Students must document their approaches and utilize techniques such as boxplots, Z-scores, and IQR for outlier detection, as well as methods for filling or removing null values. The objective is to enhance students' skills in data preprocessing and cleaning using Python on an open-source dataset.

Department of Computer Engineering
Subject: DSBDAL
Group A
Group A

Assignment No: 2

Title of the Assignment: Data Wrangling, II


Create an “Academic performance” dataset of students and perform the following
operations using Python.
1. Scan all variables for missing values and inconsistencies. If there are missing
values and/or inconsistencies, use any of the suitable techniques to deal with
them.
2. Scan all numeric variables for outliers. If there are outliers, use any of the suitable
techniques to deal with them.
3. Apply data transformations on at least one of the variables. The purpose of this
transformation should be one of the following reasons: to change the scale for
better understanding of the variable, to convert a non-linear relation into a linear
one, or to decrease the skewness and convert the distribution into a normal
distribution.
Reason and document your approach properly.

Objective of the Assignment: Students should be able to perform data wrangling operations using Python on any open-source dataset.

Prerequisite:
1. Basics of Python programming
2. Concepts of Data Preprocessing, Data Formatting, Data Normalization, and Data Cleaning.

Theory:
1. Creation of Dataset using Microsoft Excel
The dataset is created in CSV format.
● The name of the dataset is Students Performance.
● The features of the dataset are: Math_Score, Reading_Score, Writing_Score, Placement_Score, Club_Join_Date.
● Number of Instances: 10
● The response variable is: Placement_Offer_Count.
● Range of Values: Math_Score [60-80], Reading_Score [75-95], Writing_Score [60-80], Placement_Score [75-100], Club_Join_Date [2018-2021].

2. Identification and Handling of Null Values


In Pandas, missing data is represented by two values:

1. None: None is a Python singleton object that is often used for missing data in
Python code.
2. NaN: NaN (an acronym for Not a Number) is a special floating-point value
recognized by all systems that use the standard IEEE 754 floating-point
representation.

Pandas provides several useful functions for detecting, removing, and replacing null values in a DataFrame:

● isnull()

● notnull()

● dropna()

● fillna()

● replace()

1. Checking for missing values using isnull() and notnull()

● Checking for missing values using isnull()


df.isnull()
● Checking for missing values using notnull()
df.notnull()
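In practice, isnull() returns a Boolean mask, and chaining .sum() over it gives per-column null counts. A minimal sketch with a small hypothetical frame (the column names and values are illustrative, not the assignment's full dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the Academic performance data
df = pd.DataFrame({
    "math score": [65, np.nan, 78, 72],
    "reading score": [80, 85, np.nan, 90],
})

print(df.isnull())         # Boolean mask: True where a value is missing
print(df.isnull().sum())   # per-column count of missing values
print(df.notnull().sum())  # per-column count of present values
```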

2. Filling missing values using fillna() and replace()

To fill null values in a dataset, the fillna() and replace() functions are used.
These functions replace NaN values with a specified value.
● Filling null values with a single value
ndf = df.copy()  # copy first, so the original DataFrame is left untouched
ndf = ndf.fillna(0)  # fillna() returns a new DataFrame by default

data['math score'] = data['math score'].fillna(data['math score'].mean())


data['math score'] = data['math score'].fillna(data['math score'].median())

data['math score'] = data['math score'].fillna(data['math score'].std())

Replacing missing values in the 'math score' column with the minimum/maximum of that column:

data['math score'] = data['math score'].fillna(data['math score'].min())

data['math score'] = data['math score'].fillna(data['math score'].max())

● Filling null values using the replace() method

The following line replaces NaN values in the DataFrame with the value -99 (requires import numpy as np):

ndf.replace(to_replace = np.nan, value = -99)

● Deleting null values using the dropna() method

To drop null values from a DataFrame, the dropna() function is used. It drops rows/columns of the dataset containing null values in different ways.
df.dropna()
To drop rows only if all values in that row are missing:
df.dropna(how = 'all')
To drop columns with at least one null value:
df.dropna(axis = 1)

3. Identification and Handling of Outliers

3.1 Identification of Outliers
Detecting and treating outliers is one of the most important steps in data preprocessing, as outliers can negatively affect statistical analysis and the training process of a machine learning algorithm, resulting in lower accuracy.
1. What are Outliers?
We all have heard of the idiom 'odd one out', which means something unusual in
comparison to the others in a group.

Similarly, an Outlier is an observation in a given dataset that lies far from the rest
of the observations. That means an outlier is vastly larger or smaller than the remaining
values in the set.

2. Why do they occur?


An outlier may occur due to the variability in the data, or due to experimental
error/human error.

They may indicate an experimental error or heavy skewness in the data (heavy-tailed distribution).

3. What do they affect?


In statistics, we have three measures of central tendency: mean, median, and mode. They help us describe the data.

The mean is an accurate measure for describing the data when no outliers are present. The median is used if there is an outlier in the dataset. The mode is used if there is an outlier AND about half or more of the data values are the same.

The mean is the only measure of central tendency that is affected by outliers, which in turn impacts the standard deviation.
Example:
Consider a small dataset, sample = [15, 101, 18, 7, 13, 16, 11, 21, 5, 15, 10, 9]. By looking at it, one can quickly say '101' is an outlier, as it is much larger than the other values.

Fig. Computation with and without outlier

From the above calculations, we can clearly say the Mean is more affected than the
Median.
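The comparison in the figure can be reproduced with the standard library (a sketch; the exact values shown in the figure are not given in the text):

```python
from statistics import mean, median

sample = [15, 101, 18, 7, 13, 16, 11, 21, 5, 15, 10, 9]
without_outlier = [x for x in sample if x != 101]  # drop the outlier

# The mean shifts noticeably once 101 is removed; the median barely moves
print(round(mean(sample), 2), median(sample))                    # → 20.08 14.0
print(round(mean(without_outlier), 2), median(without_outlier))  # → 12.73 13
```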
4. Detecting Outliers
If our dataset is small, we can detect an outlier by just looking at the dataset. But
what if we have a huge dataset? How do we identify the outliers then? We need to use
visualization and mathematical techniques.
Below are some techniques for detecting outliers:
● Boxplots
● Scatterplots
● Z-score
● Interquartile Range (IQR)
col = ['math score', 'reading score', 'writing score', 'placement score']
df.boxplot(col)

4.1 Detecting outliers using Scatterplot:

A scatterplot is used when you have paired numerical data, when your dependent variable has multiple values for each value of the independent variable, or when you are trying to determine the relationship between two variables. In the process of utilizing the scatterplot, one can also use it for outlier detection.
To plot a scatterplot, two variables that are somehow related to each other are required. Here, the Placement_Score and Placement_Offer_Count features are used.

4.2 Detecting outliers using Z-Score:

The Z-score is also called a standard score. This value helps to understand how far a data point lies from the mean, measured in standard deviations. After setting a threshold value, one can use the z-scores of the data points to define the outliers.
Z-score = (data_point - mean) / standard deviation
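A sketch of z-score detection with pandas (the score values and the threshold of 2 are illustrative; the assignment's own data and threshold may differ):

```python
import pandas as pd

scores = pd.Series([65, 61, 71, 80, 73, 64, 2, 76, 68, 70], name="math score")

# Standardize: distance from the mean in units of standard deviation
z = (scores - scores.mean()) / scores.std()

# Flag points beyond the threshold (3 is common; 2 suits small samples)
outliers = scores[z.abs() > 2]
print(outliers.tolist())  # → [2]
```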

4.3 Detecting outliers using Interquartile Range (IQR):

The Interquartile Range (IQR) approach to finding outliers is the most commonly used and most trusted approach in the research field.
IQR = Quartile3 - Quartile1
To define outliers, base values are set above and below the dataset's normal range, namely the upper and lower bounds (a margin of 1.5*IQR is conventionally used):

upper = Q3 + 1.5*IQR
lower = Q1 - 1.5*IQR

In other words, the normal range is the interquartile range scaled up by half on each side (1.5*IQR = IQR + 0.5*IQR).
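The upper/lower-bound rule above can be sketched as follows (same illustrative series as in the z-score example):

```python
import pandas as pd

scores = pd.Series([65, 61, 71, 80, 73, 64, 2, 76, 68, 70])

q1, q3 = scores.quantile(0.25), scores.quantile(0.75)
iqr = q3 - q1              # IQR = Quartile3 - Quartile1
lower = q1 - 1.5 * iqr     # lower bound
upper = q3 + 1.5 * iqr     # upper bound

outliers = scores[(scores < lower) | (scores > upper)]
print(outliers.tolist())  # → [2]
```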

Handling of Outliers:
All of the detection methods above produce a list of the data items that satisfy the outlier definition according to the method used. To remove an outlier, one then removes that entry from the dataset using its exact position.
Below are some of the methods of treating outliers:
● Trimming/removing the outlier
● Quantile based flooring and capping
● Mean/Median imputation
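The three treatments can be sketched with pandas (bounds come from the IQR rule; the 10th/90th percentile choice for capping is illustrative):

```python
import pandas as pd

scores = pd.Series([65, 61, 71, 80, 73, 64, 2, 76, 68, 70])

q1, q3 = scores.quantile(0.25), scores.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# 1. Trimming: drop values falling outside the bounds
trimmed = scores[(scores >= lower) & (scores <= upper)]

# 2. Quantile-based flooring and capping: clip to the 10th/90th percentiles
capped = scores.clip(scores.quantile(0.10), scores.quantile(0.90))

# 3. Median imputation: overwrite outliers with the median
imputed = scores.mask((scores < lower) | (scores > upper), scores.median())
```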

Data Transformation: Data transformation is the process of converting raw data into a
format or structure that is more suitable for model building, and for data discovery in
general. Data transformation is the "T" in the extract/transform/load (ETL) process.
Data transformation involves the following steps:

● Smoothing: It is a process used to remove noise from the dataset using some
algorithm. It allows for highlighting important features present in the dataset and
helps in predicting patterns.
● Aggregation: Data collection or aggregation is the method of storing and presenting
data in a summary format. The data may be obtained from multiple data sources to
integrate these data sources into a data analysis description. This is a crucial step
since the accuracy of data analysis insights is highly dependent on the quantity and
quality of the data used.
● Generalization: It converts low-level data attributes to high-level data attributes
using a concept hierarchy. For example, age initially in numerical form (22, 25) is
converted into categorical values (young, old).
● Normalization: Data normalization involves converting all data variables into a
given range. Some of the techniques that are used for accomplishing normalization
are:
○ Min–max normalization: This transforms the original data linearly.
○ Z-score normalization: In z-score normalization (or zero-mean normalization)
the values of an attribute (A), are normalized based on the mean of A and its
standard deviation.
○ Normalization by decimal scaling: It normalizes the values of an attribute by
changing the position of their decimal points.
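The three normalization techniques can be sketched as follows (the series values are illustrative):

```python
import pandas as pd

x = pd.Series([60, 65, 70, 75, 80], name="math score")

# Min-max normalization: linear rescale into [0, 1]
min_max = (x - x.min()) / (x.max() - x.min())

# Z-score (zero-mean) normalization: subtract the mean, divide by std. deviation
z_norm = (x - x.mean()) / x.std()

# Decimal scaling: divide by 10**j, where j makes every |value| < 1
j = len(str(int(x.abs().max())))   # digits in the largest absolute value
decimal_scaled = x / (10 ** j)

print(min_max.tolist())         # → [0.0, 0.25, 0.5, 0.75, 1.0]
print(decimal_scaled.tolist())  # → [0.6, 0.65, 0.7, 0.75, 0.8]
```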

Conclusion: In this way, we have explored the functions of the Python Pandas library for identifying and handling null values and outliers. Data transformation techniques were explored with the purpose of creating new variables and reducing skewness in datasets.

Viva Questions:
1. Explain the methods to detect outliers.
2. Explain data transformation methods.
3. Write an algorithm to display the statistics of null values present in the dataset.
4. Write an algorithm to replace an outlier value with the mean of the variable.
