BDS306B_Module5
Module 5
Contents
Reference - Textbook2 – Chapter 5 and Chapter 6
Reading and writing data - CSV and textual files, HTML files, XML files,
Microsoft Excel files, JSON data.
Pickle - Python object serialization.
Data preparation.
Data transformation - discretization and binning, permutation, string manipulation.
Data aggregation - group iteration.
Abbreviations
HTML – HyperText Markup Language
CSV – Comma-Separated Values
JSON – JavaScript Object Notation
XML – Extensible Markup Language
Serialization is the process of converting a complex data structure or object into a byte stream. In
Python this process is called pickling. The original object can be recreated from the byte stream by
deserializing it, which Python calls unpickling. The standard library module pickle (whose C
implementation is _pickle) is used to pickle and unpickle objects in Python.
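Example:
# pickling using the pickle library
import pickle
d1 = {"USN01": "xyz", "USN02": "abc"}
file1 = open("student.dat", "wb")   # open in binary write mode
pickle.dump(d1, file1)              # object first, then the file; dump() returns None
file1.close()
# unpickling using the pickle library
file2 = open("student.dat", "rb")   # same file name, binary read mode
d1 = pickle.load(file2)
print(d1)
file2.close()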
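The pickling steps above can also be sketched with pickle.dumps() and pickle.loads(), which make the byte stream visible, and with a with-statement, which closes the file automatically. This is a minimal sketch, not the only way to do it; the variable names and the file name student_demo.dat are assumptions made for illustration.

```python
import os
import pickle
import tempfile

record = {"USN01": "xyz", "USN02": "abc"}

# In-memory round trip: dumps() returns the byte stream itself.
payload = pickle.dumps(record)
restored_from_bytes = pickle.loads(payload)

# File round trip: the with-statement closes the file even if an error occurs.
# "student_demo.dat" is a hypothetical file name placed in the temp directory.
path = os.path.join(tempfile.gettempdir(), "student_demo.dat")
with open(path, "wb") as fh:
    pickle.dump(record, fh)
with open(path, "rb") as fh:
    restored_from_file = pickle.load(fh)

print(type(payload).__name__)
print(restored_from_bytes == record, restored_from_file == record)
```

Both round trips give back a dictionary equal to the original, which is what unpickling is expected to do.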
Example :
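import numpy as np
import pandas as pd
df = pd.DataFrame(np.arange(30).reshape(5,6))
print(df)
new_order = np.random.permutation(5)   # random ordering of the row positions 0..4
print(df.take(new_order))              # rows rearranged by position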
output (of print(df); the row order printed by df.take(new_order) changes on every run):
    0   1   2   3   4   5
0   0   1   2   3   4   5
1   6   7   8   9  10  11
2  12  13  14  15  16  17
3  18  19  20  21  22  23
4  24  25  26  27  28  29
Example :
import numpy as np
import pandas as pd
df = pd.DataFrame(np.arange(30).reshape(5,6))
print(df)
new_order = [2,3,0]         # explicit row positions, in the order wanted
print(df.take(new_order))
output (of df.take(new_order)):
    0   1   2   3   4   5
2  12  13  14  15  16  17
3  18  19  20  21  22  23
0   0   1   2   3   4   5
Random Sampling
Example :
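import numpy as np
import pandas as pd
df = pd.DataFrame(np.arange(30).reshape(6,5))
print(df)
# randint draws positions with replacement, so a row may be picked more than once
sample = np.random.randint(0,len(df), size= 3)
print(df.take(sample))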
output (of df.take(sample); one possible result, since the rows are drawn at random):
    0   1   2   3   4
1   5   6   7   8   9
5  25  26  27  28  29
3  15  16  17  18  19
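The randint-based example draws row positions with replacement. The sketch below contrasts that with sampling without replacement, done by slicing a random permutation, and shows the pandas DataFrame.sample method as an alternative. The seed value 42 and the variable names are assumptions chosen only to make the run repeatable.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(30).reshape(6, 5))

rng = np.random.default_rng(42)  # seed is an arbitrary choice for repeatability

# Without replacement: the first 3 positions of a random permutation never repeat.
positions = rng.permutation(len(df))[:3]
without_replacement = df.take(positions)

# With replacement: asking for more rows than exist forces some rows to repeat.
with_replacement = df.sample(n=10, replace=True, random_state=42)

print(without_replacement)
print(with_replacement.index.is_unique)
```

Sampling without replacement suits tasks such as splitting data into train and test sets, while sampling with replacement is what bootstrap-style resampling needs.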
An outlier is an unusual value that is much higher or much lower than the rest of the data. A common
rule of thumb treats values lying more than 3 standard deviations from the mean as outliers. It is
important to detect and remove outliers from a DataFrame before model building, because their
presence can reduce accuracy. The any() method can be applied to a boolean mask to find the rows
of a DataFrame that contain outliers.
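The 3-standard-deviation rule and the use of any() can be sketched as follows. The data here is synthetic (random normal values with two planted extreme values), and the threshold of 3, the seed, and the variable names are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)                      # arbitrary seed for repeatability
df = pd.DataFrame(rng.standard_normal((1000, 4)))   # synthetic data, 4 columns
df.iloc[10, 2] = 12.0                               # planted high outlier
df.iloc[500, 0] = -9.0                              # planted low outlier

# Distance of each value from its column mean, in units of standard deviation.
z = (df - df.mean()) / df.std()

# True where a value lies more than 3 standard deviations from the mean.
mask = z.abs() > 3

# any(axis=1) flags a row if at least one of its columns holds an outlier.
outlier_rows = df[mask.any(axis=1)]
cleaned = df[~mask.any(axis=1)]

print(len(outlier_rows), len(cleaned))
```

Here any(axis=1) reduces the boolean mask row by row, so a single extreme value in any column is enough to flag the whole row for removal.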