Data Wrangling & Visualization - II

The document discusses various data formats and techniques for data wrangling and visualization. It covers loading and storing data in XML, HTML, and binary formats like HDF5. It discusses using Beautiful Soup to extract data from XML and HTML documents, and describes how to read and write data to HDF5 files using pandas for efficient storage of large scientific datasets. Web scraping and interacting with APIs and databases are also covered.

Uploaded by

Ujwal mudhiraj

Data Wrangling & Visualization - II

Dr Tilottama Goswami
Professor
Department of Artificial Intelligence
Anurag University
AGENDA

 Data Loading, Storage, and File Formats: XML

 HTML - Web Scraping
 Binary Data Formats - Using HDF5 Format
 Reading Microsoft Excel Files
 Interacting with Web APIs
 Interacting with Databases
To Acquire Data from XML & HTML

 BeautifulSoup is a class in Python’s bs4 package (installed as beautifulsoup4). Its basic purpose is to parse HTML and XML documents.
What is HTML?

 HTML is the markup language used to create and design web content.
 It has a variety of tags and attributes for defining the layout and structure of the web document.
 It is designed to display data in a formatted manner.
 An HTML document has the extension .htm or .html.
 HTML uses a predefined set of markup symbols (short codes) that describe the format of content on a web page. For example, the following simple HTML code uses tags to make some words bold and some italic:

This is how you make <b>bold text</b> and this is how you make <i>italic text</i>
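As a minimal sketch of how BeautifulSoup reads such markup, the snippet below parses the bold/italic example above. It assumes the parser bundled with Python ("html.parser"); the variable names are illustrative and not taken from the slides.

```python
from bs4 import BeautifulSoup

html = "This is how you make <b>bold text</b> and this is how you make <i>italic text</i>"

# "html.parser" ships with Python; lxml or html5lib could be used instead
soup = BeautifulSoup(html, "html.parser")

bold = soup.b.get_text()     # text inside the first <b> tag
italic = soup.i.get_text()   # text inside the first <i> tag
plain = soup.get_text()      # all text with the tags stripped

print(bold)
print(italic)
print(plain)
```

The tags carry only formatting here, so stripping them leaves the sentence itself.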
WHAT IS XML ?

 Definition
 A file with the .xml file extension is an Extensible Markup Language (XML)
file. These are really just plain text files that use custom tags to describe the
structure and other features of the document.
 A metalanguage which allows users to define their own customized markup
languages, especially in order to display documents on the internet.

 What is the advantage?

XML keeps data in a plain-text format that can be stored, searched, and shared. XML is designed to store and transport data.
Example: Performance_MNR.xml

<INDICATOR>
  <INDICATOR_SEQ>373889</INDICATOR_SEQ>
  <PARENT_SEQ></PARENT_SEQ>
  <AGENCY_NAME>Metro-North Railroad</AGENCY_NAME>
  <INDICATOR_NAME>Escalator Availability</INDICATOR_NAME>
  <DESCRIPTION>Percent of the time that escalators are operational systemwide. The availability rate is based on physical observations performed the morning of regular business days only. This is a new indicator the agency began reporting in 2009.</DESCRIPTION>
  <PERIOD_YEAR>2011</PERIOD_YEAR>
  <PERIOD_MONTH>12</PERIOD_MONTH>
  <CATEGORY>Service Indicators</CATEGORY>
  <DESIRED_CHANGE>U</DESIRED_CHANGE>
  <INDICATOR_UNIT>%</INDICATOR_UNIT>
  <DECIMAL_PLACES>1</DECIMAL_PLACES>
  <YTD_TARGET>97.00</YTD_TARGET>
  <YTD_ACTUAL></YTD_ACTUAL>
  <MONTHLY_TARGET>97.00</MONTHLY_TARGET>
  <MONTHLY_ACTUAL></MONTHLY_ACTUAL>
</INDICATOR>
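A record like this one can be turned into a Python dictionary in a few lines. The sketch below uses the standard library's xml.etree.ElementTree rather than the tools named in the slides, and it reproduces only a subset of the fields to keep the example short.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the <INDICATOR> record (subset of fields only)
xml_text = """
<INDICATOR>
  <INDICATOR_SEQ>373889</INDICATOR_SEQ>
  <PARENT_SEQ></PARENT_SEQ>
  <AGENCY_NAME>Metro-North Railroad</AGENCY_NAME>
  <INDICATOR_NAME>Escalator Availability</INDICATOR_NAME>
  <PERIOD_YEAR>2011</PERIOD_YEAR>
  <PERIOD_MONTH>12</PERIOD_MONTH>
  <YTD_TARGET>97.00</YTD_TARGET>
  <YTD_ACTUAL></YTD_ACTUAL>
</INDICATOR>
""".strip()

indicator = ET.fromstring(xml_text)

# Map each child tag to its text; empty elements come back as None
record = {child.tag: child.text for child in indicator}

print(record["AGENCY_NAME"])
print(record["PARENT_SEQ"])
print(float(record["YTD_TARGET"]))
```

All values arrive as strings (or None for empty elements), so numeric fields such as YTD_TARGET need an explicit conversion.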
Where is XML Used?
 RSS and Atom are XML-based formats that describe web feeds for reader apps.
 Microsoft .NET uses XML for its configuration files.
 Microsoft Office 2007 and later use XML as the basis for document structure. That’s what the “X” means in the .DOCX Word document format, for example, and it’s also used in Excel (XLSX files) and PowerPoint (PPTX files).
Use of XML

DOCX is a better choice than the older DOC format in just about every situation. The format creates smaller, lighter files that are easier to read and transfer. The open nature of the Office Open XML standard means that it can be read by just about any full-featured word processor, including online tools like Google Docs.

XLSX is a compressed file format: the workbook is stored as XML parts inside a ZIP archive. If you have an XLS file on your computer and you save it as XLSX, you will usually see that the size is significantly reduced. The ZIP compression, not the XML itself, is what shrinks the file.

A file with the .pptx file extension is a Microsoft PowerPoint Open XML (PPTX) file created by Microsoft PowerPoint. You can also open this type of file with other presentation apps, like OpenOffice Impress, Google Slides, or Apple Keynote.
Difference between HTML & XML

 XML stands for Extensible Markup Language whereas HTML stands for Hypertext Markup Language.
 XML mainly focuses on the transfer of data while HTML is focused on the presentation of the data.
 XML is content driven whereas HTML is format driven.
 XML is case sensitive while HTML is case insensitive.
 XML doesn’t have a predefined set of tags, like HTML does. XML allows users to create their own markup symbols to describe content, making an unlimited and self-defining symbol set.
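The case-sensitivity difference can be seen with the standard library alone. In this sketch, the same snippet with mismatched tag case is rejected by an XML parser but accepted by an HTML parser; the snippet and the TagCollector helper are made up for illustration.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

snippet = "<Price>699</price>"   # opening and closing tags differ only in case

# XML: <Price> and </price> are different names, so this is not well-formed
try:
    ET.fromstring(snippet)
    xml_ok = True
except ET.ParseError:
    xml_ok = False


class TagCollector(HTMLParser):
    """Hypothetical helper that records the tag names the HTML parser reports."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_endtag(self, tag):
        self.tags.append("/" + tag)


# HTML: tag names are case-insensitive, so the parser lowercases them
collector = TagCollector()
collector.feed(snippet)
collector.close()

print(xml_ok)
print(collector.tags)
```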
WEB SCRAPING

 INSTALLATION
 pip install lxml
 pip install beautifulsoup4 html5lib
 pip install bs4

 The pandas.read_html function has a number of options, but by default it searches for and attempts to parse all tabular data contained within <table> tags.
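A small sketch of pandas.read_html in action, assuming one of the parser packages from the installation list is available. The HTML string is a made-up table that reuses the book titles shown later in these slides.

```python
from io import StringIO

import pandas as pd

# Illustrative page fragment with a single <table>
html = """
<table>
  <thead>
    <tr><th>title</th><th>price</th></tr>
  </thead>
  <tbody>
    <tr><td>Data Wrangling and Visualization</td><td>699</td></tr>
    <tr><td>Database Management Systems</td><td>899</td></tr>
    <tr><td>Computer Networks</td><td>935</td></tr>
  </tbody>
</table>
"""

# read_html returns a list with one DataFrame per <table> found
tables = pd.read_html(StringIO(html))
books = tables[0]

print(len(tables))
print(books)
```

Because the result is always a list, a page with several tables needs an index (or the match argument) to pick the right one.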
Get Access to Root Node of XML, when root
node is known to programmer
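Using lxml.objectify, we parse the file and get a reference to the root node of the XML file with getroot.

from lxml import objectify

path = r"C:\Users\Tilottama\OneDrive\Data Wrangling\Lab\books.xml"

# Parse the file and get a reference to the root node (<books>)
with open(path, "rb") as f:
    parsed = objectify.parse(f)
root = parsed.getroot()

data = []

# root.book iterates over every <book> child of the root
for elt in root.book:
    el_data = {}
    for child in elt.getchildren():
        # pyval converts the element text to a Python value (str, int, ...)
        el_data[child.tag] = child.pyval
    # Store the whole record, one dict per book
    data.append(el_data)

print(data)
print(type(el_data))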
Get Access to Root Node of XML, when the
root node is not known to programmer
from bs4 import BeautifulSoup
Books.xml
<?xml version="1.0" ?>
<books>
<book>
<title>Data Wrangling and Visualization</title>
<author>Wayne</author>
<price>699</price>
</book>
<book>
<title>Database Management Systems</title>
<author>Ullman</author>
<price>899</price>
</book>
<book>
<title>Computer Networks</title>
<author>Stallings</author>
<price>935</price>
</book>
</books>
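When the root tag is not known in advance, BeautifulSoup can discover it. The following is a sketch rather than the code from the original slide: it assumes lxml is installed (needed for the "xml" parser) and embeds the books.xml content as a string so it runs on its own.

```python
from bs4 import BeautifulSoup

xml_text = """<?xml version="1.0" ?>
<books>
  <book><title>Data Wrangling and Visualization</title><author>Wayne</author><price>699</price></book>
  <book><title>Database Management Systems</title><author>Ullman</author><price>899</price></book>
  <book><title>Computer Networks</title><author>Stallings</author><price>935</price></book>
</books>"""

soup = BeautifulSoup(xml_text, "xml")

# find(True) returns the first tag in the document, which is the root element
root = soup.find(True)

records = []
# Direct child tags of the root are the individual records
for item in root.find_all(True, recursive=False):
    fields = {child.name: child.get_text() for child in item.find_all(True, recursive=False)}
    records.append(fields)

print(root.name)
print(records)
```

Nothing in the loop mentions "books" or "book", so the same code works for any two-level XML file of this shape.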
Extract HTML Tables in DataFrame
HTML Link Tag : Get the Text

 Consider an HTML link tag, which is also valid XML:

 from lxml import objectify
 from io import StringIO
 tag = '<a href="http://www.google.com">Google</a>'
 root = objectify.parse(StringIO(tag)).getroot()
 You can now access any of the fields (like href) in the tag or the link text:
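A self-contained version of the link example, showing one way to read the attribute and the link text. It uses the get method and the text attribute of the lxml element; the variable names href and text are illustrative.

```python
from io import StringIO

from lxml import objectify

tag = '<a href="http://www.google.com">Google</a>'
root = objectify.parse(StringIO(tag)).getroot()

href = root.get("href")   # attribute lookup on the <a> element
text = root.text          # text between the opening and closing tags

print(href)
print(text)
```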
Binary Data Formats

 Pickle
 HDF5
 MessagePack
Convert to Binary Format: Serialization
using pickle
 One of the easiest ways to store data efficiently in binary format (a process known as serialization) is to use Python’s built-in pickle module.
 All pandas objects have a to_pickle() method that writes the data to disk in pickle format.
 You can read any “pickled” object stored in a file by using the built-in pickle module directly, or even more conveniently by using pandas.read_pickle().
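A minimal round trip through pickle might look like the sketch below. The DataFrame contents and the file name books.pkl are made up for illustration, and a temporary directory is used so nothing is left on disk.

```python
import os
import pickle
import tempfile

import pandas as pd

frame = pd.DataFrame(
    {"title": ["Computer Networks", "Database Management Systems"], "price": [935, 899]}
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "books.pkl")

    frame.to_pickle(path)                  # write the DataFrame in pickle format
    restored = pd.read_pickle(path)        # read it back with pandas

    with open(path, "rb") as f:            # or read it with the built-in module
        restored_builtin = pickle.load(f)

print(restored.equals(frame))
print(restored_builtin.equals(frame))
```

Only unpickle files from sources you trust, since loading a pickle can execute arbitrary code.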
Why pickle is considered a short-term storage format
The problem is that it is hard to guarantee that the format will be stable over time; an object pickled today may not unpickle with a later version of a library. pandas has tried to maintain backward compatibility when possible, but at some point in the future it may be necessary to “break” the pickle format.
II) pandas has built-in support for binary
data formats: HDF5

Hierarchical Data Format (HDF)

 HDF5 is a well-regarded file format intended for storing large quantities of
scientific array data.
 Advantages:
 available as a C library, and it has interfaces available in many other languages,
including Java, Julia, MATLAB, and Python
 HDF5 supports on-the-fly compression with a variety of compression modes,
enabling data with repeated patterns to be stored more efficiently
 HDF5 can be a good choice for working with very large datasets that don’t fit
into memory, as you can efficiently read and write small sections of much larger
arrays
 Since many data analysis problems are I/O-bound (rather than CPU-bound),
using a tool like HDF5 can massively accelerate your applications
Concepts
 CPU bound means the rate at which a process progresses is limited by the speed of the CPU. A task that performs calculations on a small set of numbers, for example multiplying small matrices, is likely to be CPU bound.
 I/O bound means the rate at which a process progresses is limited by the speed of the I/O subsystem. A task that processes data from disk, for example counting the number of lines in a file, is likely to be I/O bound.
 Memory bound means the rate at which a process progresses is limited by the amount of memory available and the speed of memory access. A task that processes large amounts of in-memory data, for example multiplying large matrices, is likely to be memory bound.
 Cache bound means the rate at which a process progresses is limited by the amount and speed of the cache available. A task that simply processes more data than fits in the cache will be cache bound.

In general, an I/O-bound task is slower than a memory-bound task, which is slower than a cache-bound task, which is slower than a CPU-bound task.
Is HDF5 a Database?

HDF5 is not a database. It is best suited for write-once, read-many datasets.

While data can be added to a file at any time, if multiple writers do so
simultaneously, the file can become corrupted
Functions

 pandas.read_hdf()
 pandas.DataFrame.to_hdf()
 pandas.HDFStore(), add multiple datasets and print the stores
 pandas.HDFStore.put()
 pandas.HDFStore.get()
 pandas.HDFStore.info()
 pandas.HDFStore.keys()
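The functions listed above can be combined as in the sketch below. It assumes the optional PyTables package (imported as tables) is installed, which pandas needs for HDF5 support; the guard simply skips the demo when it is missing. The key names, file name, and data are illustrative.

```python
import importlib.util
import os
import tempfile

import pandas as pd

frame = pd.DataFrame({"a": [1, 2, 3], "b": [0.1, 0.2, 0.3]})

# pandas relies on PyTables for HDF5; skip the demo if it is not installed
HAVE_PYTABLES = importlib.util.find_spec("tables") is not None
keys, from_get, from_read = [], None, None

if HAVE_PYTABLES:
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.h5")

        with pd.HDFStore(path, mode="w") as store:
            store.put("frame", frame)        # 'fixed' format by default
            store.put("col_a", frame["a"])   # a Series can be stored as well
            keys = store.keys()              # names of the stored objects
            print(store.info())              # summary of the store contents
            from_get = store.get("frame")    # read one object back

        # read_hdf opens the file, reads one key, and closes it again
        from_read = pd.read_hdf(path, key="frame")

print(sorted(keys))
```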
HDF5

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_hdf.html
HDFStore
Syntax: HDFStore.put(key, value, format=None, index=True, append=False, complib=None, complevel=None, min_itemsize=None, nan_rep=None, data_columns=None, encoding=None, errors='strict', track_times=True, dropna=False)
Where
key is the identifier for the data object.
value is the DataFrame or Series.
format is the format used to store data objects. Values can be 'fixed' (default) or 'table'.
append appends the input data to the existing data. It forces table format.
data_columns is the list of columns to be used as indexed columns. To use all columns, specify True.
encoding provides an encoding for strings.
track_times governs the recording of times associated with an object. If set to True, time data is recorded.
dropna specifies whether rows with null data values are dropped.
Get all the data stored using HDFStore: the get method of the HDFStore class reads a stored object back from the file. Pass mode='r' to open the file in read-only mode.

HDFStore supports two storage schemas, 'fixed' and 'table'.

format: {'fixed', 'table', None}, default 'fixed'

'fixed': Fixed format. Fast writing and reading. Neither appendable nor searchable.
'table': Table format. Written as a PyTables Table structure, which is generally slower but allows more flexible operations such as searching and selecting subsets of the data using a special query syntax.
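To illustrate what the 'table' format adds, the sketch below appends rows to an existing key and then selects a subset with a where query. It again assumes PyTables is installed and skips the demo otherwise; the key "obs", the column names, and the data are made up.

```python
import importlib.util
import os
import tempfile

import pandas as pd

first = pd.DataFrame({"a": [0, 1, 2], "b": [0.0, 0.1, 0.2]})
second = pd.DataFrame({"a": [3, 4], "b": [0.3, 0.4]}, index=[3, 4])

HAVE_PYTABLES = importlib.util.find_spec("tables") is not None
subset = None

if HAVE_PYTABLES:
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "table_demo.h5")

        with pd.HDFStore(path, mode="w") as store:
            # 'table' format with column "a" indexed so it can be queried
            store.put("obs", first, format="table", data_columns=["a"])
            # Appending is only possible in 'table' format
            store.append("obs", second, data_columns=["a"])
            # Read back only the rows that satisfy the condition
            subset = store.select("obs", where="a >= 2")

    print(subset)
```

With the 'fixed' format, both the append and the where query would fail, which is the trade-off for its faster reads and writes.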

pandas.DataFrame.to_hdf(path_or_buf, key, mode='a', complevel=None, complib=None, append=False, format=None, index=True, min_itemsize=None, nan_rep=None, dropna=None, data_columns=None, errors='strict', encoding='UTF-8')
Reading Microsoft Excel Files
Interacting with Web APIs

 What is an API?
 API stands for Application Programming Interface. Don’t worry about the
AP, just focus on the I. An API is an interface. You use interfaces all the time.
A computer operating system is an interface. Buttons in an elevator are an
interface. A gas pedal in a car is an interface.
 An interface sits on top of a complicated system and simplifies certain
tasks, a middleman that saves you from needing to know all the details of
what’s happening under the hood. A web API is the same sort of thing. It
sits on top of a web service, like Twitter or YouTube, and simplifies certain
tasks for you. It translates your actions into the technical details for the
computer system on the other end
What is a Web API?

 There are lots of different flavors of web API. One of the most common, and
most accessible to non-programmers, is called a REST, or RESTful, API. From
now on, when I say “web API” I mean a REST API.
 A web API is an interface with URLs as the controls. In that respect, the
entire web is a sort of API. You try to access a URL in your browser (also
known as a request), and a web server somewhere makes a bunch of
complicated decisions based on that and sends you back some content
(also known as a response). A standard web API works the same way.
 The key difference between an ordinary URL and a URL that’s part of a web
API is that an ordinary URL sends back something pretty designed to look
good in your browser, whereas a web API URL sends back something ugly
designed to be useful to a computer.
Example: http://twitter.com/

Web Page: When you request the URL http://twitter.com/ in a browser you get back a nice-looking webpage with a bunch of colors and pictures and buttons. It’s designed for a human to look at and for a browser to draw on a screen. But it sucks if what you want is to gather and analyze data.

Web API: When you request this web API URL instead, you get back an ugly-looking chunk of plain text with no decorations. Web APIs are a way to strip away all the extraneous visual interface that you don’t care about and get at the data.
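The request/response cycle can be tried without touching a real service. The sketch below starts a tiny stand-in web API on the local machine using only the standard library, requests its URL, and decodes the JSON it returns. The DemoAPI class, the /books path, and the returned record are all hypothetical; a real API such as Twitter's would need its own URL and usually authentication.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import ProxyHandler, build_opener


class DemoAPI(BaseHTTPRequestHandler):
    """Hypothetical stand-in for a web API: every GET returns the same JSON."""

    def do_GET(self):
        body = json.dumps([{"title": "Computer Networks", "price": 935}]).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


# Port 0 asks the operating system for any free port
server = HTTPServer(("127.0.0.1", 0), DemoAPI)
threading.Thread(target=server.serve_forever, daemon=True).start()

url = f"http://127.0.0.1:{server.server_port}/books"

# Bypass any configured proxy, since the server runs on this machine
opener = build_opener(ProxyHandler({}))
with opener.open(url, timeout=5) as response:
    status = response.status
    records = json.load(response)   # plain text in, Python objects out

server.shutdown()
server.server_close()

print(status)
print(records)
```

The response body is the "ugly" plain text described above; json.load turns it into lists and dictionaries that are easy to hand to pandas.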
Interacting with Databases
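One common way to load database rows into pandas is sketched below. It is not taken from the slides: it assumes Python's built-in sqlite3 module with an in-memory database, and the books table and its rows are made up, reusing the titles from books.xml.

```python
import sqlite3

import pandas as pd

rows = [
    ("Data Wrangling and Visualization", "Wayne", 699),
    ("Database Management Systems", "Ullman", 899),
    ("Computer Networks", "Stallings", 935),
]

# ":memory:" creates a temporary database that disappears when closed
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE books (title TEXT, author TEXT, price INTEGER)")
con.executemany("INSERT INTO books VALUES (?, ?, ?)", rows)
con.commit()

# Run a parameterized query and receive the result as a DataFrame
expensive = pd.read_sql_query(
    "SELECT title, price FROM books WHERE price > ? ORDER BY price",
    con,
    params=(700,),
)
con.close()

print(expensive)
```

The "?" placeholder keeps the value out of the SQL string, which avoids quoting problems and SQL injection.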
Reference

 GeeksforGeeks
 Wes McKinney. Python for Data Analysis: Data Wrangling with pandas, NumPy, and IPython, 2nd Edition, O'Reilly, 2017
