BDS306B_Module5

The document outlines Module 5 of a Python course, covering data reading and writing techniques, object serialization with pickling, data preparation, transformation, and aggregation. It explains concepts such as discretization, binning, permutation, random sampling, and outlier detection, providing examples using Python libraries like pandas. Key processes like pickling and unpickling, as well as methods for handling categorical and continuous variables, are also discussed.

Semester : III

Subject : Python Subject Code : BDS306B

Module 5

Contents
Reference - Textbook2 – Chapter 5 and Chapter 6

• Reading and writing data - CSV and textual files, HTML files, XML files,
  Microsoft Excel files, JSON data.
• Pickle - Python object serialization.
• Data preparation.
• Data transformation - discretization and binning, permutation, string manipulation.
• Data aggregation - group iteration.

Abbreviations
HTML – HyperText Markup Language
CSV – Comma-Separated Values
JSON – JavaScript Object Notation
XML – Extensible Markup Language

Pickling of Objects in Python

Serialization is the process of converting complex data or an object into a byte stream. In Python
this process is called pickling. The original data or object can be recreated from the byte stream
by deserializing it, which Python calls unpickling. The pickle library (or its C implementation,
_pickle) is used to pickle and unpickle in Python.

The pandas library can also be used to pickle and unpickle DataFrames.

Example:
import pickle

# pickling using the pickle library
d1 = {"USN01": "xyz", "USN02": "abc"}
file1 = open("student.dat", "wb")    # binary write mode
pickle.dump(d1, file1)               # the object comes first, then the open file
file1.close()

# unpickling using the pickle library
file2 = open("student.dat", "rb")    # same file name, binary read mode
d1 = pickle.load(file2)
print(d1)
file2.close()

# using the pandas library
import pandas as pd

d1 = {"USN01": ["xyz"], "USN03": ["abc"]}
df1 = pd.DataFrame(d1)
df1.to_pickle("student1.dat")           # pickling
df2 = pd.read_pickle("student1.dat")    # unpickling
print(df2)
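The byte stream described above can be inspected directly. The following is a minimal sketch, assuming a hypothetical dictionary named record and a hypothetical file name student_demo.dat. It uses pickle.dumps() and pickle.loads() to serialize in memory, and a with statement so that the file is closed automatically.

```python
import os
import pickle
import tempfile

record = {"USN01": "xyz", "USN02": "abc"}   # hypothetical sample data

# Serialize to an in-memory byte stream instead of a file
payload = pickle.dumps(record)
print(type(payload))                        # <class 'bytes'>

# Deserialize: the result is an equal, but separate, object
restored = pickle.loads(payload)
print(restored == record, restored is record)   # True False

# File-based round trip; "with" closes the file even if an error occurs
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "student_demo.dat")   # hypothetical file name
    with open(path, "wb") as fh:
        pickle.dump(record, fh)
    with open(path, "rb") as fh:
        from_file = pickle.load(fh)

print(from_file == record)                  # True
```

Only files from a trusted source should be unpickled, because loading a pickle can execute arbitrary code.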

Discretization and Binning

Discretization is the process of converting a continuous variable into a categorical variable. A
categorical variable stores values from a discrete, finite set. Examples - result (pass, fail), color
(RED, BLUE, ...), day (Sunday, Monday, ...), month (Jan, Feb, ...). A continuous variable stores
continuous numeric values such as percentage, height, weight, etc.

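Pandas provides two functions, cut() and qcut(), to perform discretization. cut() bins values by
value ranges (given bin edges, or a number of equal-width bins), while qcut() bins values by sample
quantiles.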

Example -

import pandas as pd

perc = [56.23, 67.23, 44.56, 89.99, 76.99, 99.9, 72.65, 45.34, 82.34]
bins = [40, 50, 60, 70, 80, 90, 100]          # bin edges: (40, 50], (50, 60], ...
bin_names = ["F", "E", "D", "C", "B", "A"]    # one label per bin
grade = pd.cut(perc, bins, labels=bin_names)
print(grade)
output :
['E', 'D', 'F', 'B', 'C', 'A', 'C', 'F', 'B']
Categories (6, object): ['F' < 'E' < 'D' < 'C' < 'B' < 'A']

# if the bin edges are not specified and only the number of bins is given,
# cut() creates equal-width bins
perc = [56.23, 67.23, 44.56, 89.99, 76.99, 99.9, 72.65, 45.34, 82.34]
bin_names = ["E", "D", "C", "B", "A"]
cat = pd.cut(perc, 5, labels=bin_names)   # value_counts are not equal
print(cat)
output :
['D', 'C', 'E', 'A', 'C', 'A', 'C', 'E', 'B']
Categories (5, object): ['E' < 'D' < 'C' < 'B' < 'A']

# using qcut() ---- value_counts are (nearly) equal but the bin widths vary
perc = [56.23, 67.23, 44.56, 89.99, 76.99, 99.9, 72.65, 45.34, 82.34]
bin_names = ["E", "D", "C", "B", "A"]
cat = pd.qcut(perc, 5, labels=bin_names)   # 9 values in 5 bins: counts are 2, 2, 1, 2, 2
print(cat)
output :
['D', 'D', 'E', 'A', 'B', 'A', 'C', 'E', 'B']
Categories (5, object): ['E' < 'D' < 'C' < 'B' < 'A']
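
The difference between cut() and qcut() becomes visible when the members of each bin are counted. The following is a minimal sketch using the same perc data; the variable names by_width, by_quantile, width_edges and quantile_edges are introduced here for illustration only.

```python
import pandas as pd

perc = [56.23, 67.23, 44.56, 89.99, 76.99, 99.9, 72.65, 45.34, 82.34]
bin_names = ["E", "D", "C", "B", "A"]

# Equal-width bins versus quantile-based bins
by_width = pd.Series(pd.cut(perc, 5, labels=bin_names))
by_quantile = pd.Series(pd.qcut(perc, 5, labels=bin_names))

# Count how many values fall into each labelled bin
width_counts = by_width.value_counts().to_dict()
quantile_counts = by_quantile.value_counts().to_dict()
print(width_counts)      # counts per label: E=2, D=1, C=3, B=1, A=2
print(quantile_counts)   # counts per label: E=2, D=2, C=1, B=2, A=2

# retbins=True additionally returns the bin edges that were used
_, width_edges = pd.cut(perc, 5, retbins=True)
_, quantile_edges = pd.qcut(perc, 5, retbins=True)
print(width_edges)       # evenly spaced edges
print(quantile_edges)    # edges placed at the 20%, 40%, 60%, 80% quantiles
```

With 9 values and 5 bins, qcut() cannot make the counts exactly equal, so one bin holds a single value.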

Permutation

Random reordering of a Series or of the rows of a DataFrame is called permutation.

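Example :
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(30).reshape(5,6))
print(df)
new_order = np.random.permutation(5)   # random ordering of the row positions 0..4
print(df.take(new_order))              # rows in the new order; differs on every run
output (of print(df)) :
    0   1   2   3   4   5
0   0   1   2   3   4   5
1   6   7   8   9  10  11
2  12  13  14  15  16  17
3  18  19  20  21  22  23
4  24  25  26  27  28  29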
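
Because a permutation differs on every run, a seeded generator helps when a reproducible result is needed. The following is a minimal sketch, assuming NumPy's default_rng() generator and an arbitrarily chosen seed of 42. It also checks that a permutation only reorders the rows and never drops or repeats one.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(30).reshape(5, 6))

rng = np.random.default_rng(seed=42)    # seed chosen arbitrarily for repeatability
new_order = rng.permutation(len(df))    # every row position appears exactly once
shuffled = df.take(new_order)
print(shuffled)

# Same rows in a different order: sorting the index restores the original
restored = shuffled.sort_index()
print(restored.equals(df))              # True
```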

A subset of the rows of a DataFrame can also be created by passing only some of the row positions to take().

Example :
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(30).reshape(5,6))
print(df)
new_order = [2,3,0]          # only three of the five row positions
print(df.take(new_order))
output (of df.take) :
    0   1   2   3   4   5
2  12  13  14  15  16  17
3  18  19  20  21  22  23
0   0   1   2   3   4   5

Random Sampling

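Random sampling is the extraction of a random subset of the rows of a DataFrame, for example by
generating row positions with the np.random.randint() function in NumPy. Because randint() draws
with replacement, the same row may be selected more than once.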

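Example :
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(30).reshape(6,5))
print(df)
sample = np.random.randint(0, len(df), size=3)   # 3 random row positions in 0..5
print(df.take(sample))
output (one possible result of df.take) :
    0   1   2   3   4
1   5   6   7   8   9
5  25  26  27  28  29
3  15  16  17  18  19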
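
When no row should appear twice, the sample has to be drawn without replacement. The following is a minimal sketch of two ways to do that, assuming NumPy's default_rng() generator and the pandas DataFrame.sample() method; the seed values are arbitrary.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(30).reshape(6, 5))
rng = np.random.default_rng(seed=0)     # arbitrary seed for repeatability

# With replacement: positions may repeat, as with np.random.randint()
with_repl = df.take(rng.integers(0, len(df), size=3))

# Without replacement: each position is chosen at most once
positions = rng.choice(len(df), size=3, replace=False)
without_repl = df.take(positions)

# pandas shortcut: sample() draws without replacement by default
sampled = df.sample(n=3, random_state=0)

print(with_repl)
print(without_repl)
print(sampled)
```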

Detecting and Filtering outlier

An outlier is an unusual value that is very high or very low compared with the rest of the data. A
common rule treats as outliers those values that deviate from the mean by more than 3 times the
standard deviation. It is important in data analysis to detect and remove outliers from a DataFrame
before model building, as their presence affects accuracy. The any() method can be applied to a
boolean condition to detect the rows of a DataFrame that contain outliers.
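
The 3-standard-deviation rule can be sketched as follows. This is a minimal illustration on randomly generated data, not a definitive method: the column names, the seed and the planted value 10.0 are assumptions made for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=1)               # arbitrary seed
df = pd.DataFrame(rng.standard_normal((1000, 3)),
                  columns=["a", "b", "c"])        # hypothetical column names
df.iloc[5, 0] = 10.0                              # plant one obvious outlier in row 5

# Distance of every value from its column mean
deviation = (df - df.mean()).abs()

# True where a value lies more than 3 standard deviations from the mean
is_outlier = deviation > 3 * df.std()

# any(axis=1) flags a row if at least one of its columns is an outlier
row_has_outlier = is_outlier.any(axis=1)

outliers = df[row_has_outlier]                    # detected rows
filtered = df[~row_has_outlier]                   # DataFrame with those rows removed
print(len(outliers), len(filtered))
```

Here df.mean() and df.std() are computed per column, so each column is judged against its own spread.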
