Data Science Unit 2 Second Half Notes

UNIT – II (2nd Half)

Reshaping with Hierarchical Indexing

Hierarchical indexing provides a consistent way to rearrange data in a DataFrame. There are two primary actions:

stack

This “rotates” or pivots from the columns in the data to the rows

unstack

This pivots from the rows into the columns

Consider a small DataFrame with string arrays as row and column indexes:

In [120]: data = pd.DataFrame(np.arange(6).reshape((2, 3)),
   .....:                      index=pd.Index(['Ohio', 'Colorado'], name='state'),
   .....:                      columns=pd.Index(['one', 'two', 'three'], name='number'))

In [121]: data

Out[121]:

number one two three

state

Ohio 0 1 2

Colorado 3 4 5

Using the stack method on this data pivots the columns into the rows, producing a Series:

In [122]: result = data.stack()

In [123]: result

Out[123]:

state number

Ohio one 0

two 1

three 2
Colorado one 3

two 4

three 5

dtype: int64

From a hierarchically indexed Series, you can rearrange the data back into a DataFrame with unstack:

In [124]: result.unstack()

Out[124]:

number one two three

state

Ohio 0 1 2

Colorado 3 4 5
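As a brief aside (standard pandas behavior, not shown in the original notes): unstack pivots the innermost level by default, but you can pass a level number or name to unstack a different level. A minimal sketch, reusing the result Series from above (the variable name by_state is just illustrative):

# Unstack the outer 'state' level instead of the default innermost level,
# putting the states into the columns and keeping 'number' as the row index.
by_state = result.unstack('state')   # equivalent to result.unstack(0)
print(by_state)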

PIVOTING

Pivoting is a fundamental data transformation technique used in data analysis, especially when working with tabular data. It involves rearranging or rotating data from one format to another to make it more readable, meaningful, or suitable for analysis. Pivoting is typically used to summarize data, organize information more effectively, and prepare it for reporting or visualization.

The two main types of pivoting are:

1. Pivoting (long to wide): Converting rows into columns.

2. Unpivoting (wide to long): Converting columns back into rows.

Why Pivot Data?

Raw data is often collected in a “long” format, which is excellent for storing and
processing, especially in databases. However, long format data can be harder for humans
to interpret or analyze directly.

Pivoting helps:

● Summarize large datasets


● Compare values across different categories

● Create dashboards and reports

● Facilitate data visualization

● Prepare data for machine learning or statistical models

The Long vs. Wide Format


Understanding these two data formats is essential before diving into pivoting.

Long Format:

● Each row is a single observation.


Product Month Sales
A Jan 100
A Feb 150
B Jan 200
B Feb 180

Wide Format:

● Data is spread across columns, often with months (or another variable) as
columns and values aggregated under them.

Product Jan Feb
A 100 150
B 200 180

Real-World Example: Sales Data

Suppose you are a data analyst working for a retail company, and you are given monthly
sales data in the long format (as shown above). You want to create a report comparing
sales figures across products and months. Looking at the long table isn’t very intuitive,
especially as the number of rows grows.

Pivoting the Data

By pivoting the data, you restructure it so each Product becomes a row, and each Month
becomes a column, with Sales as the values. This format is far easier to read, especially
when comparing figures side by side.
How Pivoting Works (Step-by-Step)

Let’s take the earlier long format data:

Product Month Sales

A Jan 100

A Feb 150

B Jan 200

B Feb 180

Step 1: Identify the Pivot Components

● Index (rows): Product

● Columns: Month

● Values: Sales

Step 2: Apply the Pivot

Product Jan Feb

A 100 150

B 200 180

This gives a cleaner, side-by-side view of each product’s performance over different
months.

Tools That Use Pivoting

1. Excel Pivot Tables

Excel offers a built-in feature called a Pivot Table that allows users to drag and drop
fields to create pivoted views of their data.

Steps in Excel:

1. Select your dataset.

2. Go to Insert → Pivot Table.


3. Choose your fields: place "Product" in the Rows area, "Month" in Columns, and
"Sales" in Values.

4. Excel generates the pivoted table for you automatically.

2. Pandas in Python

Python’s pandas library provides powerful pivot functionality through the pivot() or
pivot_table() methods.

import pandas as pd

# Sample data
data = {
    'Product': ['A', 'A', 'B', 'B'],
    'Month': ['Jan', 'Feb', 'Jan', 'Feb'],
    'Sales': [100, 150, 200, 180]
}
df = pd.DataFrame(data)

# Pivot the data
pivoted = df.pivot(index='Product', columns='Month', values='Sales')
print(pivoted)

OUTPUT
Month Feb Jan
Product
A 150 100
B 180 200
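The reverse operation, unpivoting back to long format, can be sketched with pandas.melt. This is an illustrative addition rather than part of the original example; it assumes the pivoted DataFrame created above, and long_again is just a hypothetical name:

# Turn the 'Product' index back into an ordinary column, then unpivot
# the month columns into (Product, Month, Sales) rows.
long_again = pd.melt(pivoted.reset_index(),
                     id_vars='Product',
                     var_name='Month',
                     value_name='Sales')
print(long_again)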
Benefits of Pivoting

● Data Clarity: Makes complex data easier to understand.

● Quick Summaries: Helps summarize trends and totals quickly.

● Better Visualization: Useful for plotting charts like heatmaps, bar charts, etc.

● Reporting: Essential for business dashboards and KPI tracking.


Handling Missing Data

Missing data occurs commonly in many data analysis applications. One of the goals of
pandas is to make working with missing data as painless as possible. For example, all of
the descriptive statistics on pandas objects exclude missing data by default. The way that
missing data is represented in pandas objects is somewhat imperfect, but it is functional
for a lot of users. For numeric data, pandas uses the floating-point value NaN (Not a
Number) to represent missing data. We call this a sentinel value that can be easily
detected:

In [10]: string_data = pd.Series(['aardvark', 'artichoke', np.nan, 'avocado'])

In [11]: string_data

Out[11]:

0 aardvark

1 artichoke

2 NaN

3 avocado

dtype: object

In [12]: string_data.isnull()

Out[12]:

0 False

1 False

2 True

3 False

dtype: bool

pandas adopts a convention used in the R programming language, referring to missing data as NA, which stands for not available.

Filtering Out Missing Data

There are a few ways to filter out missing data. While you always have the option to do it by hand using pandas.isnull and boolean indexing, the dropna method can be helpful. On a Series, it returns the Series with only the non-null data and index values:

In [15]: from numpy import nan as NA

In [16]: data = pd.Series([1, NA, 3.5, NA, 7])

In [17]: data.dropna()

Out[17]:

0 1.0

2 3.5

4 7.0

dtype: float64

This is equivalent to:

In [18]: data[data.notnull()]

Out[18]:

0 1.0

2 3.5

4 7.0

dtype: float64

With DataFrame objects, things are a bit more complex. You may want to drop rows or
columns that are all NA or only those containing any NAs. dropna by default drops any
row containing a missing value:

In [19]: data = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA],
   ....:                       [NA, NA, NA], [NA, 6.5, 3.]])

In [20]: cleaned = data.dropna()

In [21]: data
Out[21]:

0 1 2

0 1.0 6.5 3.0

1 1.0 NaN NaN

2 NaN NaN NaN

3 NaN 6.5 3.0

In [22]: cleaned

Out[22]:

0 1 2

0 1.0 6.5 3.0
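To drop only rows or columns that are entirely NA, as mentioned above, dropna accepts a how argument and an axis argument. A minimal sketch using the same data DataFrame (standard pandas options, added here for illustration):

# Keep a row unless every value in it is NA.
data.dropna(how='all')

# Apply the same rule to columns instead of rows.
data.dropna(axis=1, how='all')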

Filling In Missing Data

Rather than filtering out missing data (and potentially discarding other data along with
it), you may want to fill in the “holes” in any number of ways. For most purposes, the fillna
method is the workhorse function to use.

Calling fillna with a constant replaces missing values with that value:

In [33]: df.fillna(0)

Out[33]:

0 1 2

0 -0.204708 0.000000 0.000000

1 -0.555730 0.000000 0.000000

2 0.092908 0.000000 0.769023

3 1.246435 0.000000 -1.296221

4 0.274992 0.228913 1.352917

5 0.886429 -2.001637 -0.371843

6 1.669025 -0.438570 -0.539741
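A few other common fillna patterns, sketched as an added illustration using the same df as above (standard pandas usage):

# Fill each column with a different value by passing a dict keyed by column label.
df.fillna({1: 0.5, 2: 0})

# Propagate the last valid observation forward to fill gaps
# (newer pandas versions also provide df.ffill() for this).
df.fillna(method='ffill')

# fillna returns a new object by default; pass inplace=True to modify df itself.
df.fillna(0, inplace=True)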


Data Transformation

Filtering, cleaning, and other transformations are another class of important operations.

Removing Duplicates

Duplicate rows may be found in a DataFrame for any number of reasons.

Here is an example:

In [45]: data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
   ....:                       'k2': [1, 1, 2, 3, 3, 4, 4]})

In [46]: data

Out[46]:

k1 k2

0 one 1

1 two 1

2 one 2

3 two 3

4 one 3

5 two 4

6 two 4

The DataFrame method duplicated returns a boolean Series indicating whether each row
is a duplicate (has been observed in a previous row) or not:

In [47]: data.duplicated()

Out[47]:

0 False

1 False

2 False

3 False

4 False
5 False

6 True

dtype: bool
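Relatedly, drop_duplicates returns a DataFrame with the duplicated rows removed. The sketch below is standard pandas usage added for illustration, using the same data DataFrame:

# Drop rows that exactly duplicate an earlier row.
data.drop_duplicates()

# Consider only the 'k1' column when deciding what counts as a duplicate.
data.drop_duplicates(['k1'])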

Transforming Data Using a Function or Mapping

For many datasets, you may wish to perform some transformation based on the values in an array, Series, or column in a DataFrame.

Consider the following hypothetical data collected about various kinds of meat:

In [52]: data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon',
   ....:                                'Pastrami', 'corned beef', 'Bacon',
   ....:                                'pastrami', 'honey ham', 'nova lox'],
   ....:                       'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})

In [53]: data

Out[53]:

food ounces

0 bacon 4.0

1 pulled pork 3.0

2 bacon 12.0

3 Pastrami 6.0

4 corned beef 7.5

5 Bacon 8.0

6 pastrami 3.0

7 honey ham 5.0

8 nova lox 6.0

Suppose you wanted to add a column indicating the type of animal that each food came from. Let’s write down a mapping of each distinct meat type to the kind of animal:

meat_to_animal = {
    'bacon': 'pig',
    'pulled pork': 'pig',
    'pastrami': 'cow',
    'corned beef': 'cow',
    'honey ham': 'pig',
    'nova lox': 'salmon'
}

The map method on a Series accepts a function or dict-like object containing a mapping, but here we have a small problem in that some of the meats are capitalized and others are not. Thus, we need to convert each value to lowercase using the str.lower Series method:

In [55]: lowercased = data['food'].str.lower()

In [56]: lowercased

Out[56]:

0 bacon

1 pulled pork

2 bacon

3 pastrami

4 corned beef

5 bacon

6 pastrami

7 honey ham

8 nova lox

Name: food, dtype: object

In [57]: data['animal'] = lowercased.map(meat_to_animal)

In [58]: data

Out[58]:

food ounces animal

0 bacon 4.0 pig

1 pulled pork 3.0 pig


2 bacon 12.0 pig

3 Pastrami 6.0 cow

4 corned beef 7.5 cow

5 Bacon 8.0 pig

6 pastrami 3.0 cow

7 honey ham 5.0 pig

8 nova lox 6.0 salmon

We could also have passed a function that does all the work:

In [59]: data['food'].map(lambda x: meat_to_animal[x.lower()])

Out[59]:

0 pig

1 pig

2 pig

3 cow

4 cow

5 pig

6 cow

7 pig

8 salmon

Name: food, dtype: object

Using map is a convenient way to perform element-wise transformations and other data
cleaning–related operations.

String Manipulation

Python has long been a popular raw data manipulation language in part due to its ease of use for string and text processing. Most text operations are made simple with the string object’s built-in methods. For more complex pattern matching and text manipulations, regular expressions may be needed. pandas adds to the mix by enabling you to apply string and regular expressions concisely on whole arrays of data, additionally handling the annoyance of missing data.
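A minimal sketch of that vectorized behavior (standard pandas Series.str usage, added for illustration; the example values are hypothetical):

import numpy as np
import pandas as pd

# A Series of strings containing a missing value.
emails = pd.Series({'Dave': 'dave@google.com',
                    'Steve': 'steve@gmail.com',
                    'Wes': np.nan})

# Vectorized string method: applied element-wise, with NaN passed through.
print(emails.str.contains('gmail'))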

String Object Methods

In many string munging and scripting applications, built-in string methods are sufficient.
As an example, a comma-separated string can be broken into pieces with split:

In [134]: val = 'a,b, guido'

In [135]: val.split(',')

Out[135]: ['a', 'b', ' guido']

split is often combined with strip to trim whitespace (including line breaks):

In [136]: pieces = [x.strip() for x in val.split(',')]

In [137]: pieces

Out[137]: ['a', 'b', 'guido']

These substrings could be concatenated together with a two-colon delimiter using addition:

In [138]: first, second, third = pieces

In [139]: first + '::' + second + '::' + third

Out[139]: 'a::b::guido'

A faster and more Pythonic way is to pass a list or tuple to the join method on the string
'::':

In [140]: '::'.join(pieces)

Out[140]: 'a::b::guido'

Discretization and Binning

Continuous data is often discretized or otherwise separated into “bins” for analysis. Suppose you have data about a group of people in a study and need to group them into discrete age buckets:

In [75]: ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]

Let’s divide these into bins of 18 to 25, 26 to 35, 36 to 60, and finally 61 and older.
To do so, you have to use cut, a function in pandas:

In [76]: bins = [18, 25, 35, 60, 100]

In [77]: cats = pd.cut(ages, bins)

In [78]: cats

Out[78]:

[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25,
35]]

Length: 12

Categories (4, interval[int64]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]

The object pandas returns is a special Categorical object. The output you see describes
the bins computed by pandas.cut. You can treat it like an array of strings indicating the
bin name; internally it contains a categories array specifying the distinct category names
along with a labeling for the ages data in the codes attribute:

In [79]: cats.codes

Out[79]: array([0, 0, 0, 1, 0, 0, 2, 1, 3, 2, 2, 1], dtype=int8)

In [80]: cats.categories

Out[80]:

IntervalIndex([(18, 25], (25, 35], (35, 60], (60, 100]]

closed='right',

dtype='interval[int64]')

In [81]: pd.value_counts(cats)

Out[81]:

(18, 25] 5

(35, 60] 3

(25, 35] 3

(60, 100] 1
dtype: int64

Note that pd.value_counts(cats) gives the bin counts for the result of pandas.cut.

Consistent with mathematical notation for intervals, a parenthesis means that the side is
open, while the square bracket means it is closed (inclusive). You can change which side
is closed by passing right=False:

In [82]: pd.cut(ages, [18, 26, 36, 61, 100], right=False)

Out[82]:

[[18, 26), [18, 26), [18, 26), [26, 36), [18, 26), ..., [26, 36), [61, 100), [36, 61), [36, 61), [26,
36)]

Length: 12

Categories (4, interval[int64]): [[18, 26) < [26, 36) < [36, 61) < [61, 100)]
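You can also attach your own bin names by passing a list to the labels option of pandas.cut. The following is a standard usage sketch added for illustration, reusing the ages and bins defined above (group_names is a hypothetical name):

# Human-readable names for the four age bins.
group_names = ['Youth', 'YoungAdult', 'MiddleAged', 'Senior']

# Each age is labeled with the name of the bin it falls into.
labeled = pd.cut(ages, bins, labels=group_names)
print(pd.value_counts(labeled))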

Outliers

An outlier is a data object that deviates significantly from the rest of the objects, as if it
were generated by a different mechanism.

For ease of presentation we may refer to data objects that are not outliers as “normal” or
expected data. Similarly, we may refer to outliers as “abnormal” data.

As an example, suppose most objects in a data set follow a roughly Gaussian distribution, while the objects in a small region R are significantly different. It is unlikely that they follow the same distribution as the other objects in the data set. Thus, the objects in R are outliers in the data set.

Outliers are different from noisy data.


Noise is a random error or variance in a measured variable. In general, noise is not
interesting in data analysis, including outlier detection.

For example, in credit card fraud detection, a customer’s purchase behavior can be
modeled as a random variable. A customer may generate some “noise transactions” that
may seem like “random errors” or “variance,” such as by buying a bigger lunch one day,
or having one more cup of coffee than usual. Such transactions should not be treated as
outliers; otherwise, the credit card company would incur heavy costs from verifying that
many transactions. The company may also lose customers by bothering them with
multiple false alarms.

As in many other data analysis and data mining tasks, noise should be removed before
outlier detection. Outliers are interesting because they are suspected of not being
generated by the same mechanisms as the rest of the data. Therefore, in outlier detection,
it is important to justify why the outliers detected are generated by some other
mechanisms. This is often achieved by making various assumptions on the rest of the data
and showing that the outliers detected violate those assumptions significantly.

Outlier detection is also related to novelty detection in evolving data sets. For example,
by monitoring a social media web site where new content is incoming, novelty detection
may identify new topics and trends in a timely manner. Novel topics may initially appear
as outliers. To this extent, outlier detection and novelty detection share some similarity
in modeling and detection methods. However, a critical difference between the two is that
in novelty detection, once new topics are confirmed, they are usually incorporated into
the model of normal behavior so that follow-up instances are not treated as outliers
anymore.

Types of Outliers

In general, outliers can be classified into three categories, namely

● Global outliers
● Contextual (or conditional) outliers, and
● Collective outliers.
Global Outliers

In a given data set, a data object is a global outlier if it deviates significantly from the rest
of the data set. Global outliers are sometimes called point anomalies, and are the simplest
type of outliers. Most outlier detection methods are aimed at finding global outliers.

Consider the points in region R from the earlier Gaussian example again. They significantly deviate from the rest of the data set, and hence are examples of global outliers. To detect global
outliers, a critical issue is to find an appropriate measurement of deviation with respect
to the application in question. Various measurements are proposed, and, based on these,
outlier detection methods are partitioned into different categories. We will come to this
issue in detail later. Global outlier detection is important in many applications. Consider
intrusion detection in computer networks, for example. If the communication behavior of
a computer is very different from the normal patterns (e.g., a large number of packages is
broadcast in a short time), this behavior may be considered as a global outlier and the
corresponding computer is a suspected victim of hacking. As another example, in trading
transaction auditing systems, transactions that do not follow the regulations are
considered as global outliers and should be held for further examination.
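One very simple deviation measurement, offered here only as an illustrative sketch (the notes do not prescribe any particular method, and the values below are hypothetical), is a z-score threshold:

import pandas as pd

# Hypothetical one-dimensional data with one extreme value.
values = pd.Series([10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 12, 10, 11, 12, 100])

# Flag points more than three standard deviations from the mean
# as candidate global outliers.
z_scores = (values - values.mean()) / values.std()
print(values[z_scores.abs() > 3])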

Contextual Outliers

“The temperature today is 28 °C. Is it exceptional (i.e., an outlier)?” It depends, for example, on the time and location! If it is winter in Toronto, yes, it is an outlier. If it is a summer day in Toronto, then it is normal. Unlike global outlier detection, in this case, whether or not today’s temperature value is an outlier depends on the context—the date, the location, and possibly some other factors.

In a given data set, a data object is a contextual outlier if it deviates significantly with respect to a specific context of the object. Contextual outliers are also known as conditional outliers because they are conditional on the selected context. Therefore, in contextual outlier detection, the context has to be specified as part of the problem definition. Generally, in contextual outlier detection, the attributes of the data objects in question are divided into two groups:

Contextual attributes: The contextual attributes of a data object define the object’s context. In the temperature example, the contextual attributes may be date and location.

Behavioral attributes: These define the object’s characteristics, and are used to evaluate whether the object is an outlier in the context to which it belongs. In the temperature example, the behavioral attributes may be the temperature, humidity, and pressure.

Unlike global outlier detection, in contextual outlier detection, whether a data object is an outlier depends not only on the behavioral attributes but also on the contextual attributes. A configuration of behavioral attribute values may be considered an outlier in one context (e.g., 28 °C is an outlier for a Toronto winter), but not an outlier in another context (e.g., 28 °C is not an outlier for a Toronto summer).
Contextual outliers are a generalization of local outliers, a notion introduced in density-
based outlier analysis approaches. An object in a data set is a local outlier if its density
significantly deviates from the local area in which it occurs.

Collective Outliers

Suppose you are a supply-chain manager of AllElectronics. You handle thousands of orders and shipments every day. If the shipment of an order is delayed, it may not be
considered an outlier because, statistically, delays occur from time to time. However, you
have to pay attention if 100 orders are delayed on a single day. Those 100 orders as a
whole form an outlier, although each of them may not be regarded as an outlier if
considered individually. You may have to take a close look at those orders collectively to
understand the shipment problem.

Given a data set, a subset of data objects forms a collective outlier if the
objects as a whole deviate significantly from the entire data set. Importantly, the
individual data objects may not be outliers.

For example, picture a scatter plot in which a dense cluster of black objects as a whole forms a collective outlier because the density of those objects is much higher than in the rest of the data set, even though no black object individually is an outlier with respect to the whole data set.
