
Data Wrangling & Visualization - II

The document discusses various data formats and techniques for data wrangling and visualization. It covers loading and storing data in XML, HTML, and binary formats like HDF5. It discusses using Beautiful Soup to extract data from XML and HTML documents, and describes how to read and write data to HDF5 files using pandas for efficient storage of large scientific datasets. Web scraping and interacting with APIs and databases are also covered.

Uploaded by

Ujwal mudhiraj

Data Wrangling & VISUALIZATION - II

Dr Tilottama Goswami
Professor
Department of Artificial Intelligence
Anurag University
AGENDA

 Data Loading, Storage, and File Formats: XML
 HTML - Web Scraping
 Binary Data Formats - Using HDF5 Format
 Reading Microsoft Excel Files
 Interacting with Web APIs
 Interacting with Databases
To Acquire Data from XML & HTML

 BeautifulSoup is a class in the bs4 module of Python. Its basic purpose is to parse HTML or XML documents.
What is HTML?

 HTML is the markup language that helps you create and design web content.
 It has a variety of tags and attributes for defining the layout and structure of the web document.
 It is designed to display data in a formatted manner.
 An HTML document has the extension .htm or .html.
 HTML uses a pre-defined set of markup symbols (short codes) that describe the format of content on a web page. For example, the following simple HTML code uses tags to make some words bold and some italic:

This is how you make <b>bold text</b> and this is how you make <i>italic text</i>
WHAT IS XML ?

 Definition
 A file with the .xml file extension is an Extensible Markup Language (XML)
file. These are really just plain text files that use custom tags to describe the
structure and other features of the document.
 A metalanguage which allows users to define their own customized markup
languages, especially in order to display documents on the internet.

 What is the advantage?
XML stores data in a plain-text format that can be searched and shared. XML is designed to store and transport data.
Example: Performance_MNR.xml
<INDICATOR>
  <INDICATOR_SEQ>373889</INDICATOR_SEQ>
  <PARENT_SEQ></PARENT_SEQ>
  <AGENCY_NAME>Metro-North Railroad</AGENCY_NAME>
  <INDICATOR_NAME>Escalator Availability</INDICATOR_NAME>
  <DESCRIPTION>Percent of the time that escalators are operational systemwide. The availability rate is based on physical observations performed the morning of regular business days only. This is a new indicator the agency began reporting in 2009.</DESCRIPTION>
  <PERIOD_YEAR>2011</PERIOD_YEAR>
  <PERIOD_MONTH>12</PERIOD_MONTH>
  <CATEGORY>Service Indicators</CATEGORY>
  <FREQUENCY>M</FREQUENCY>
  <DESIRED_CHANGE>U</DESIRED_CHANGE>
  <INDICATOR_UNIT>%</INDICATOR_UNIT>
  <DECIMAL_PLACES>1</DECIMAL_PLACES>
  <YTD_TARGET>97.00</YTD_TARGET>
  <YTD_ACTUAL></YTD_ACTUAL>
  <MONTHLY_TARGET>97.00</MONTHLY_TARGET>
  <MONTHLY_ACTUAL></MONTHLY_ACTUAL>
</INDICATOR>
Where Is XML Used?
 RSS and Atom, both XML-based formats, describe how reader apps handle web feeds.
 Microsoft .NET uses XML for its configuration files.
 Microsoft Office 2007 and later use XML as the basis for document structure.
That’s what the “X” means in the .DOCX Word document format, for
example, and it’s also used in Excel (XLSX files) and PowerPoint (PPTX files).
Use of XML

DOCX is a better choice than the older DOC format for just about every situation. The format creates smaller, lighter files that are easier to read and transfer. The open nature of the Office Open XML standard means that it can be read by just about any full-featured word processor, including online tools like Google Docs.

An XLSX file is a compressed (ZIP) archive of XML files. If you have an .xls file on your computer and save it as .xlsx, you will usually see that the size has been significantly reduced. The reduction comes from this compression.

A file with the .pptx file extension is a Microsoft PowerPoint Open XML (PPTX) file created by Microsoft PowerPoint. You can also open this type of file with other presentation apps, like OpenOffice Impress, Google Slides, or Apple Keynote.
Difference between HTML & XML

 XML stands for Extensible Markup Language, whereas HTML stands for Hypertext Markup Language.
 XML mainly focuses on the transfer of data, while HTML is focused on the presentation of the data.
 XML is content driven, whereas HTML is format driven.
 XML is case sensitive, while HTML is case insensitive.
 XML doesn’t have a predefined set of tags, as HTML does. XML allows users to create their own markup symbols to describe content, making an unlimited and self-defining symbol set.
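
The case-sensitivity difference listed above can be checked with a small sketch. It uses only Python's standard library (html.parser and xml.etree.ElementTree) rather than the Beautiful Soup and lxml packages used elsewhere in these slides. The class name TagCollector and the sample markup are made up for illustration.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Record the name of every start tag the HTML parser reports."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


# HTML: <B> and <b> are the same tag; the parser reports both in lowercase.
collector = TagCollector()
collector.feed("<B>bold</B> and <b>bold again</b>")
html_tags = collector.tags

# XML: <Title> and <title> are two different, user-defined elements.
xml_root = ET.fromstring("<Book><Title>XML</Title><title>xml</title></Book>")
xml_tags = [child.tag for child in xml_root]

# XML: a closing tag must match the opening tag exactly, including case.
try:
    ET.fromstring("<title>mismatch</Title>")
    mismatch_accepted = True
except ET.ParseError:
    mismatch_accepted = False

print(html_tags)
print(xml_tags)
print(mismatch_accepted)
```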
WEB SCRAPING

 INSTALLATION
 pip install lxml
 pip install beautifulsoup4 html5lib
 pip install bs4

 The pandas.read_html function has a number of options, but by default it searches for and attempts to parse all tabular data contained within <table> tags.
Get Access to Root Node of XML, when root
node is known to programmer
Using lxml.objectify, we parse the file and get a reference to the root node of the XML file with getroot

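from lxml import objectify

path = r"C:\Users\Tilottama\OneDrive\Data Wrangling\Lab\books.xml"

# Parse the file and get a reference to the root node (<books>)
with open(path) as f:
    parsed = objectify.parse(f)
root = parsed.getroot()

data = []

# root.book iterates over every <book> element under the root
for elt in root.book:
    el_data = {}
    for child in elt.getchildren():
        # child.pyval is the element text converted to a Python value
        el_data[child.tag] = child.pyval
    data.append(el_data)  # one dict per <book>

print(data)
print(type(el_data))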
Get Access to Root Node of XML, when the
root node is not known to programmer
from bs4 import BeautifulSoup
Books.xml
<?xml version="1.0" ?>
<books>
<book>
<title>Data Wrangling and Visualization</title>
<author>Wayne</author>
<price>699</price>
</book>
<book>
<title>Database Management Systems</title>
<author>Ullman</author>
<price>899</price>
</book>
<book>
<title>Computer Networks</title>
<author>Stallings</author>
<price>935</price>
</book>
</books>
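
When the root tag is not known in advance, it can be discovered from the parsed document itself. The sketch below illustrates the idea with the standard library's xml.etree.ElementTree, which is a substitution made here so the snippet runs without installing Beautiful Soup. The Books.xml content is inlined as a string, and the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# Same content as Books.xml, inlined so the sketch is self-contained.
BOOKS_XML = """<?xml version="1.0" ?>
<books>
  <book>
    <title>Data Wrangling and Visualization</title>
    <author>Wayne</author>
    <price>699</price>
  </book>
  <book>
    <title>Database Management Systems</title>
    <author>Ullman</author>
    <price>899</price>
  </book>
  <book>
    <title>Computer Networks</title>
    <author>Stallings</author>
    <price>935</price>
  </book>
</books>"""

root = ET.fromstring(BOOKS_XML)

# Discover the structure instead of hard-coding tag names.
root_tag = root.tag
child_tags = sorted({child.tag for child in root})

# Build one dict per child record; values stay as strings here.
records = [{field.tag: field.text for field in book} for book in root]

print(root_tag, child_tags)
print(records)
```

A list of dicts like records can be passed directly to pandas.DataFrame to get a table with one row per book.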
Extract HTML Tables in DataFrame
HTML Link Tag : Get the Text

 Consider an HTML link tag, which is also valid XML:
from io import StringIO
from lxml import objectify
tag = '<a href="http://www.google.com">Google</a>'
root = objectify.parse(StringIO(tag)).getroot()
 You can now access any of the fields (like href) in the tag or the link text, for example with root.get('href') and root.text.
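
The attribute and text access can also be tried without lxml. The following is a minimal sketch using the standard library's xml.etree.ElementTree, chosen here only to avoid a third-party dependency. Its elements offer the same .get() and .text accessors that lxml elements do.

```python
import xml.etree.ElementTree as ET

tag = '<a href="http://www.google.com">Google</a>'

# Parse the single link element; it becomes the root of the tree.
root = ET.fromstring(tag)

href = root.get("href")   # attribute value
link_text = root.text     # text between <a> and </a>

print(root.tag, href, link_text)
```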
Binary Data Formats

 Pickle
 HDF5
 MessagePack
Convert to Binary Format: Serialization
using pickle
 One of the easiest ways to store data (also known as serialization) efficiently
in binary format is using Python’s built-in pickle serialization
 pandas objects all have a to_pickle() method that writes the data to disk in
pickle format
 You can read any “pickled” object stored in a file by using the built-in
pickle directly, or even more conveniently using pandas.read_pickle():
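
A minimal round trip with the built-in pickle module might look like the sketch below. The record and the file name record.pkl are made up for illustration. For a pandas object, the equivalent calls named above are to_pickle() on the object and pandas.read_pickle() on the file.

```python
import os
import pickle
import tempfile

# Illustrative data; any picklable Python object works the same way.
record = {"title": "Computer Networks", "author": "Stallings", "price": 935}

# Write to a temporary directory so the sketch leaves no files behind
# in the working directory.
path = os.path.join(tempfile.mkdtemp(), "record.pkl")

# Serialize: pickle files must be opened in binary mode.
with open(path, "wb") as f:
    pickle.dump(record, f)

# Deserialize: the result is a new, equal object.
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == record)
```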
Why pickle is considered a short-term storage format
pickle is recommended only as a short-term storage format. The problem is that it is hard to guarantee that the format will be stable over time; an object pickled today may not unpickle with a later version of a library. pandas has tried to maintain backward compatibility when possible, but at some point in the future it may be necessary to “break” the pickle format.
II) pandas has built-in support for binary
data formats: HDF5

Hierarchical Data Format (HDF)

 HDF5 is a well-regarded file format intended for storing large quantities of scientific array data.
 Advantages:
 It is available as a C library, and it has interfaces available in many other languages, including Java, Julia, MATLAB, and Python
 HDF5 supports on-the-fly compression with a variety of compression modes,
enabling data with repeated patterns to be stored more efficiently
 HDF5 can be a good choice for working with very large datasets that don’t fit
into memory, as you can efficiently read and write small sections of much larger
arrays
 Since many data analysis problems are I/O-bound (rather than CPU-bound),
using a tool like HDF5 can massively accelerate your applications
Concepts
 CPU bound means the rate at which a process progresses is limited by the speed of the CPU. A task that performs calculations on a small set of numbers, for example multiplying small matrices, is likely to be CPU bound.
 I/O bound means the rate at which a process progresses is limited by the speed of the I/O subsystem. A task that processes data from disk, for example counting the number of lines in a file, is likely to be I/O bound.
 Memory bound means the rate at which a process progresses is limited by the amount of memory available and the speed of memory access. A task that processes large amounts of in-memory data, for example multiplying large matrices, is likely to be memory bound.
 Cache bound means the rate at which a process progresses is limited by the amount and speed of the cache available. A task that simply processes more data than fits in the cache will be cache bound.

In general, an I/O-bound task is slower than a memory-bound task, which is slower than a cache-bound task, which in turn is slower than a CPU-bound task.
Is HDF5 a Database?

HDF5 is not a database. It is best suited for write-once, read-many datasets. While data can be added to a file at any time, if multiple writers do so simultaneously, the file can become corrupted.
Functions

 pandas.read_hdf()
 pandas.DataFrame.to_hdf()
 pandas.HDFStore(), add multiple datasets and print the stores
 pandas.HDFStore.put()
 pandas.HDFStore.get()
 pandas.HDFStore.info()
 pandas.HDFStore.keys()

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_hdf.html
Syntax: HDFStore.put(key, value, format=None, index=True, append=False, complib=None, complevel=None, min_itemsize=None, nan_rep=None, data_columns=None, encoding=None, errors='strict', track_times=True, dropna=False)
where
key is the identifier for the data object.
value is the DataFrame or Series.
format is the format used to store data objects. Values can be 'fixed' (default) or 'table'.
append appends the input data to the existing data. It forces the table format.
data_columns is the list of columns to be used as indexed columns. To use all columns, specify True.
encoding provides an encoding for strings.
track_times governs the recording of times associated with an object. If set to True, time data is recorded.
dropna specifies whether null data values are dropped.
Get all the data stored using HDFStore. The get method in the HDFStore class can be used to read the file. Specify mode='r' to open the file read-only.

HDFStore supports two storage schemas, 'fixed' and 'table'.

format {'fixed', 'table', None}, default 'fixed'

'table': Table format. Written as a PyTables Table structure, which is generally slower but allows more flexible operations, such as searching or selecting subsets of the data with a special query syntax.

'fixed': Fixed format. Fast writing and reading. Neither appendable nor searchable.

pandas.DataFrame.to_hdf(path_or_buf, key, mode='a', complevel=None, complib=None, append=False, format=None, index=True, min_itemsize=None, nan_rep=None, dropna=None, data_columns=None, errors='strict', encoding='UTF-8')
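
The functions listed above can be combined as in the following sketch. It assumes pandas and NumPy are installed together with the PyTables package (tables), which pandas needs for HDF5 support. The file name mydata.h5, the keys, and the sample frame are made up for illustration.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Illustrative data: 100 rows with a default integer index.
frame = pd.DataFrame({"a": np.arange(100, dtype="float64")})
path = os.path.join(tempfile.mkdtemp(), "mydata.h5")

with pd.HDFStore(path, mode="w") as store:
    # 'fixed' is the default: fast, but not appendable or searchable.
    store.put("obj_fixed", frame)
    # 'table' is slower but supports queries on the stored data.
    store.put("obj_table", frame, format="table")

    keys = sorted(store.keys())
    everything = store.get("obj_fixed")
    subset = store.select("obj_table", where="index < 5")

# The DataFrame-level shortcuts open and close the file for you.
# mode defaults to 'a', so this adds a third object to the same file.
frame.to_hdf(path, key="obj3", format="table")
tail = pd.read_hdf(path, "obj3", where="index >= 95")

print(keys)
print(len(everything), len(subset), len(tail))
```

Only the objects written with format="table" accept a where condition; selecting from the fixed-format object this way raises an error.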
Reading Microsoft Excel Files
Interacting with Web APIs

 What is an API?
 API stands for Application Programming Interface. Don’t worry about the
AP, just focus on the I. An API is an interface. You use interfaces all the time.
A computer operating system is an interface. Buttons in an elevator are an
interface. A gas pedal in a car is an interface.
 An interface sits on top of a complicated system and simplifies certain
tasks, a middleman that saves you from needing to know all the details of
what’s happening under the hood. A web API is the same sort of thing. It
sits on top of a web service, like Twitter or YouTube, and simplifies certain
tasks for you. It translates your actions into the technical details for the
computer system on the other end
What is a Web API?

 There are lots of different flavors of web API. One of the most common, and
most accessible to non-programmers, is called a REST, or RESTful, API. From
now on, when I say “web API” I mean a REST API.
 A web API is an interface with URLs as the controls. In that respect, the
entire web is a sort of API. You try to access a URL in your browser (also
known as a request), and a web server somewhere makes a bunch of
complicated decisions based on that and sends you back some content
(also known as a response). A standard web API works the same way.
 The key difference between an ordinary URL and a URL that’s part of a web
API is that an ordinary URL sends back something pretty designed to look
good in your browser, whereas a web API URL sends back something ugly
designed to be useful to a computer.
Example: http://twitter.com/

Web Page: When you request the URL http://twitter.com/ in a browser you get back a nice-looking webpage with a bunch of colors and pictures and buttons. It’s designed for a human to look at and for a browser to draw on a screen. But it sucks if what you want is to gather and analyze data.

Web API: When you request this web API URL instead, you get back an ugly-looking chunk of plain text with no decorations. Web APIs are a way to strip away all the extraneous visual interface that you don’t care about and get at the data.
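
The request and response cycle of a web API can be sketched offline as follows. The endpoint https://api.example.com/v1/posts, its parameters, and the JSON payload are all hypothetical and stand in for a real service. Only the standard library is used.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint and query parameters, for illustration only.
BASE_URL = "https://api.example.com/v1/posts"
params = {"user": "alice", "limit": 2}

# The "controls" of a web API are URLs: parameters go in the query string.
request_url = f"{BASE_URL}?{urlencode(params)}"

# A made-up response body of the kind a REST API typically returns.
# In a live script this text would come from the server, for example via
# urllib.request.urlopen(request_url).read().
response_text = '[{"id": 1, "text": "hello"}, {"id": 2, "text": "world"}]'

# The plain-text response is easy for a program to turn into data.
posts = json.loads(response_text)
texts = [post["text"] for post in posts]

print(request_url)
print(texts)
```

A list of dicts like posts can be passed directly to pandas.DataFrame for further analysis.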
Interacting with Databases
Reference

 GeeksforGeeks
 Wes McKinney. Python for Data Analysis: Data Wrangling with pandas, NumPy, and IPython, 2nd Edition. O'Reilly, 2017.
