Data Wrangling & Visualization - II
Dr Tilottama Goswami
Professor
Department of Artificial Intelligence
Anurag University
AGENDA
HTML is the markup language used to create and design web
content.
It has a variety of tags and attributes for defining the layout and structure of
the web document.
It is designed to display data in a formatted manner.
An HTML document has the extension .htm or .html.
HTML uses a pre-defined set of markup symbols (short codes) that describe
the format of content on a web page. For example, the following simple
HTML code uses tags to make some words bold and some italic:
This is how you make <b>bold text</b> and this is how you make <i>italic text</i>
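To illustrate how tags mark up content, the following is a minimal sketch that uses Python's standard-library html.parser to pull out the text wrapped in each tag from the line above. The class name TagCollector and its attributes are hypothetical names chosen for this illustration.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collect (tag, text) pairs for text that sits inside a tag."""

    def __init__(self):
        super().__init__()
        self.styled = []   # (tag name, enclosed text) pairs
        self._open = []    # stack of tags that are currently open

    def handle_starttag(self, tag, attrs):
        self._open.append(tag)

    def handle_endtag(self, tag):
        if self._open and self._open[-1] == tag:
            self._open.pop()

    def handle_data(self, data):
        # Text outside any tag is ignored; text inside is recorded.
        if self._open:
            self.styled.append((self._open[-1], data))


html = ("This is how you make <b>bold text</b> "
        "and this is how you make <i>italic text</i>")
collector = TagCollector()
collector.feed(html)
collector.close()
print(collector.styled)  # [('b', 'bold text'), ('i', 'italic text')]
```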
WHAT IS XML ?
Definition
A file with the .xml file extension is an Extensible Markup Language (XML)
file. These are really just plain text files that use custom tags to describe the
structure and other features of the document.
A metalanguage which allows users to define their own customized markup
languages, especially in order to display documents on the internet.
Example: Performance_MNR.xml
<INDICATOR>
<INDICATOR_SEQ>373889</INDICATOR_SEQ>
<PARENT_SEQ></PARENT_SEQ>
<AGENCY_NAME>Metro-North Railroad</AGENCY_NAME>
<INDICATOR_NAME>Escalator Availability</INDICATOR_NAME>
<DESCRIPTION>... the morning of regular business days only. This is a new indicator the agency ...</DESCRIPTION>
<PERIOD_YEAR>2011</PERIOD_YEAR>
<PERIOD_MONTH>12</PERIOD_MONTH>
<CATEGORY>Service Indicators</CATEGORY>
<FREQUENCY>M</FREQUENCY>
<DESIRED_CHANGE>U</DESIRED_CHANGE>
<INDICATOR_UNIT>%</INDICATOR_UNIT>
<DECIMAL_PLACES>1</DECIMAL_PLACES>
<YTD_TARGET>97.00</YTD_TARGET>
<YTD_ACTUAL></YTD_ACTUAL>
<MONTHLY_TARGET>97.00</MONTHLY_TARGET>
<MONTHLY_ACTUAL></MONTHLY_ACTUAL>
</INDICATOR>
Where is XML Used?
RSS and Atom are XML-based formats that describe how reader apps handle web feeds.
Microsoft .NET uses XML for its configuration files.
Microsoft Office 2007 and later use XML as the basis for document structure.
That’s what the “X” means in the .DOCX Word document format, for
example, and it’s also used in Excel (XLSX files) and PowerPoint (PPTX files).
Use of XML
INSTALLATION
pip install lxml
pip install beautifulsoup4 html5lib
pip install bs4
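from lxml import objectify
path = "Performance_MNR.xml"  # assumed location of the example file
parsed = objectify.parse(open(path))
root = parsed.getroot()  # root element of the parsed document
data = [ ]
for elt in root.INDICATOR:  # one <INDICATOR> element per record
    el_data = {child.tag: child.pyval for child in elt.getchildren()}
    data.append(el_data)  # each record becomes a dict of tag -> value
print(data)
print(type(el_data))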
Get Access to the Root Node of XML When the
Root Node Is Not Known to the Programmer
from bs4 import BeautifulSoup
Books.xml
<?xml version="1.0" ?>
<books>
<book>
<title>Data Wrangling and Visualization</title>
<author>Wayne</author>
<price>699</price>
</book>
<book>
<title>Database Management Systems</title>
<author>Ullman</author>
<price>899</price>
</book>
<book>
<title>Computer Networks</title>
<author>Stallings</author>
<price>935</price>
</book>
</books>
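A minimal sketch of finding the root node without knowing its name in advance. To stay dependency-free it uses the standard-library xml.etree.ElementTree rather than BeautifulSoup, so it is a stand-in for the approach named in the heading, not a reproduction of it. The variable names are chosen for this illustration.

```python
import xml.etree.ElementTree as ET

books_xml = """<?xml version="1.0" ?>
<books>
<book>
<title>Data Wrangling and Visualization</title>
<author>Wayne</author>
<price>699</price>
</book>
<book>
<title>Database Management Systems</title>
<author>Ullman</author>
<price>899</price>
</book>
<book>
<title>Computer Networks</title>
<author>Stallings</author>
<price>935</price>
</book>
</books>"""

# fromstring() returns the root element, whatever its tag is called.
root = ET.fromstring(books_xml)
print(root.tag)

# Each child of the root is one <book>; turn it into a dict of tag -> text.
records = [{field.tag: field.text for field in book} for book in root]
for record in records:
    print(record)
```

The resulting list of dictionaries can be passed directly to pandas.DataFrame() if a tabular view is needed.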
Extract HTML Tables in DataFrame
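A minimal sketch of reading an HTML table into a DataFrame with pandas.read_html(). The inline table is illustrative sample data reusing the book titles from Books.xml. read_html() relies on one of the parsers from the installation step (lxml, or beautifulsoup4 with html5lib).

```python
from io import StringIO

import pandas as pd

# Illustrative HTML; in practice this would come from a file or a web page.
html = """<table>
<tr><th>title</th><th>price</th></tr>
<tr><td>Data Wrangling and Visualization</td><td>699</td></tr>
<tr><td>Database Management Systems</td><td>899</td></tr>
<tr><td>Computer Networks</td><td>935</td></tr>
</table>"""

# read_html() returns a list with one DataFrame per <table> found.
tables = pd.read_html(StringIO(html))
df = tables[0]
print(df)
```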
HTML Link Tag : Get the Text
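A minimal sketch of extracting the text and target of each link (the a tag) with BeautifulSoup. The HTML string is made-up sample markup, and Python's built-in "html.parser" backend is used so that no extra parser is required.

```python
from bs4 import BeautifulSoup

# Illustrative markup containing two links.
html = (
    '<p>Visit <a href="https://pandas.pydata.org/">pandas documentation</a> '
    'and <a href="https://www.python.org/">Python</a>.</p>'
)

soup = BeautifulSoup(html, "html.parser")

# find_all("a") returns every link tag; get_text() gives the visible text
# and get("href") gives the URL the link points to.
links = [(a.get_text(strip=True), a.get("href")) for a in soup.find_all("a")]
for text, url in links:
    print(text, "->", url)
```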
Pickle
HDF5
Message Pack
Convert to Binary Format: Serialization
using pickle
One of the easiest ways to store data efficiently in binary format (a process
known as serialization) is to use Python’s built-in pickle module.
All pandas objects have a to_pickle() method that writes the data to disk in
pickle format.
You can read any “pickled” object stored in a file by using the built-in
pickle module directly, or even more conveniently with pandas.read_pickle().
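A minimal sketch of the round trip described above: write a DataFrame with to_pickle(), then read it back both with pandas.read_pickle() and with the built-in pickle module. The sample data and the file name frame.pkl are assumptions for illustration, and a temporary directory is used so nothing is left on disk.

```python
import os
import pickle
import tempfile

import pandas as pd

frame = pd.DataFrame(
    {
        "title": ["Data Wrangling and Visualization", "Computer Networks"],
        "price": [699, 935],
    }
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "frame.pkl")

    # Serialize the DataFrame to disk in pickle format.
    frame.to_pickle(path)

    # Option 1: the pandas convenience function.
    restored = pd.read_pickle(path)

    # Option 2: the built-in pickle module reads the same file.
    with open(path, "rb") as fh:
        also_restored = pickle.load(fh)

print(restored.equals(frame))
print(also_restored.equals(frame))
```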
Why pickle is considered a short-term
storage format
The problem is that it is hard to guarantee that the format will be stable over
time; an object pickled today may not unpickle with a later version of a library.
pandas has tried to maintain backward compatibility when possible, but at some
point in the future it may be necessary to “break” the pickle format.
II) pandas has built-in support for binary
data formats: HDF5
In general, I/O-bound work is slower than memory-bound work, which is slower than
cache-bound work, which is slower than CPU-bound work.
Is HDF5 a Database?
pandas.read_hdf()
pandas.DataFrame.to_hdf()
pandas.HDFStore(), add multiple datasets and print the stores
pandas.HDFStore.put()
pandas.HDFStore.get()
pandas.HDFStore.info()
pandas.HDFStore.keys()
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_hdf.html
Syntax: HDFStore.put(key, value, format=None, index=True, append=False, complib=None,
complevel=None, min_itemsize=None, nan_rep=None, data_columns=None, encoding=None, errors='strict',
track_times=True, dropna=False)
Where
key is the identifier for the data object.
value is the DataFrame or Series to store.
format is the format used to store the data object. Values can be 'fixed' (default) or 'table'.
append appends the input data to the existing data. It forces table format.
data_columns is the list of columns to be used as indexed columns. To use all columns, specify True.
encoding provides an encoding for strings.
track_times governs the recording of times associated with an object. If set to True, time data is recorded.
dropna specifies whether null data values are dropped.
Get all the data stored using HDFStore. The get method in the HDFStore class can be used to read
the file. mode='r' has to be specified to open the file in read mode.
HDFStore supports two storage schemas, 'fixed' and 'table'.
'table' format writes a PyTables Table structure. It is generally slower, but it allows more
flexible operations, such as searching and selecting subsets of the data with a special query syntax.
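A minimal sketch of the HDFStore calls listed above, under a few assumptions. It needs the PyTables package (installed as tables), which the installation step does not list. The file name store.h5, the keys, and the sample data are made up for illustration. It also uses HDFStore.select(), which is not in the list above, to show a query against a 'table' format object.

```python
import os
import tempfile

import pandas as pd

frame = pd.DataFrame({"a": range(5), "b": [x * 0.5 for x in range(5)]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "store.h5")

    # Write two datasets into one HDF5 file.
    with pd.HDFStore(path, mode="w") as store:
        store.put("fixed_frame", frame)  # default 'fixed' format
        store.put("table_frame", frame, format="table", data_columns=True)
        keys = store.keys()   # names of the stored objects
        print(store.info())   # summary of the store contents

    # Re-open in read mode and fetch the data back.
    with pd.HDFStore(path, mode="r") as store:
        whole = store.get("fixed_frame")
        # Queries are only supported for 'table' format objects.
        subset = store.select("table_frame", where="a > 2")

    # read_hdf() is a shortcut for reading a single object.
    via_read_hdf = pd.read_hdf(path, key="table_frame")

print(sorted(keys))
print(subset)
```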
What is an API?
API stands for Application Programming Interface. Don’t worry about the
AP, just focus on the I. An API is an interface. You use interfaces all the time.
A computer operating system is an interface. Buttons in an elevator are an
interface. A gas pedal in a car is an interface.
An interface sits on top of a complicated system and simplifies certain
tasks, a middleman that saves you from needing to know all the details of
what’s happening under the hood. A web API is the same sort of thing. It
sits on top of a web service, like Twitter or YouTube, and simplifies certain
tasks for you. It translates your actions into the technical details for the
computer system on the other end.
What is a Web API?
GeeksforGeeks
Wes McKinney. Python for Data Analysis: Data Wrangling with Pandas,
NumPy, and IPython, 2nd Edition, O'Reilly, 2017