Data Science Unit 2 Second Half Notes[1]
Data Science Unit 2 Second Half Notes[1]
stack
This “rotates” or pivots from the columns in the data to the rows
unstack
Consider a small Data‐ Frame with string arrays as row and column indexes:
In [121]: data
Out[121]:
state
Ohio 0 1 2
Colorado 3 4 5
Using the stack method on this data pivots the columns into the rows, producing a Series:
In [123]: result
Out[123]:
state number
Ohio one 0
two 1
three 2
Colorado one 3
two 4
three 5
dtype: int64
From a hierarchically indexed Series, you can rearrange the data back into a Data‐ Frame
with unstack:
In [124]: result.unstack()
Out[124]:
state
Ohio 0 1 2
Colorado 3 4 5
PIVOTING
Raw data is often collected in a “long” format, which is excellent for storing and
processing, especially in databases. However, long format data can be harder for humans
to interpret or analyze directly.
Pivoting helps:
● Data is spread across columns, often with months (or another variable) as
columns and values aggregated under them.
Product Jan Feb
A 100 150
B 200 180
Suppose you are a data analyst working for a retail company, and you are given monthly
sales data in the long format (as shown above). You want to create a report comparing
sales figures across products and months. Looking at the long table isn’t very intuitive,
especially as the number of rows grows.
By pivoting the data, you restructure it so each Product becomes a row, and each Month
becomes a column, with Sales as the values. This format is far easier to read, especially
when comparing figures side by side.
How Pivoting Works (Step-by-Step)
A Jan 100
A Feb 150
B Jan 200
B Feb 180
● Columns: Month
● Values: Sales
A 100 150
B 200 180
This gives a cleaner, side-by-side view of each product’s performance over different
months.
Excel offers a built-in feature called a Pivot Table that allows users to drag and drop
fields to create pivoted views of their data.
Steps in Excel:
2. Pandas in Python
Python’s pandas library provides powerful pivot functionality through the pivot() or
pivot_table() methods.
import pandas as pd
# Sample data
data = {
'Product': ['A', 'A', 'B', 'B'],
'Month': ['Jan', 'Feb', 'Jan', 'Feb'],
'Sales': [100, 150, 200, 180]
}
df = pd.DataFrame(data)
# Pivot the data
pivoted = df.pivot(index='Product', columns='Month', values='Sales')
print(pivoted)
OUTPUT
Month Feb Jan
Product
A 150 100
B 180 200
Benefits of Pivoting
● Better Visualization: Useful for plotting charts like heatmaps, bar charts, etc.
Missing data occurs commonly in many data analysis applications. One of the goals of
pandas is to make working with missing data as painless as possible. For example, all of
the descriptive statistics on pandas objects exclude missing data by default. The way that
missing data is represented in pandas objects is somewhat imperfect, but it is functional
for a lot of users. For numeric data, pandas uses the floating-point value NaN (Not a
Number) to represent missing data. We call this a sentinel value that can be easily
detected:
In [11]: string_data
Out[11]:
0 aardvark
1 artichoke
2 NaN
3 avocado
dtype: object
In [12]: string_data.isnull()
Out[12]:
0 False
1 False
2 True
3 False
dtype: bool
There are a few ways to filter out missing data. While you always have the option to do it
by hand using pandas.isnull and boolean indexing, the dropna can be helpful. On a Series,
it returns the Series with only the non-null data and index values:
In [17]: data.dropna()
Out[17]:
0 1.0
2 3.5
4 7.0
dtype: float64
In [18]: data[data.notnull()]
Out[18]:
0 1.0
2 3.5
4 7.0
dtype: float64
With DataFrame objects, things are a bit more complex. You may want to drop rows or
columns that are all NA or only those containing any NAs. dropna by default drops any
row containing a missing value:
In [19]:
data = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA], ....: [NA, NA, NA], [NA, 6.5, 3.]])
In [21]: data
Out[21]:
0 1 2
In [22]: cleaned
Out[22]:
0 1 2
Rather than filtering out missing data (and potentially discarding other data along with
it), you may want to fill in the “holes” in any number of ways. For most purposes, the fillna
method is the workhorse function to use.
Calling fillna with a constant replaces missing values with that value:
In [33]: df.fillna(0)
Out[33]:
0 1 2
Filtering, cleaning, and other transformations are another class of important operations.
Removing Duplicates
Here is an example:
In [45]: data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'], ....: 'k2': [1, 1, 2, 3, 3, 4, 4]})
In [46]: data
Out[46]:
k1 k2
0 one 1
1 two 1
2 one 2
3 two 3
4 one 3
5 two 4
6 two 4
The DataFrame method duplicated returns a boolean Series indicating whether each row
is a duplicate (has been observed in a previous row) or not:
In [47]: data.duplicated()
Out[47]:
0 False
1 False
2 False
3 False
4 False
5 False
6 True
dtype: bool
For many datasets, some transformations based on the values in an array, Series, or
column in a DataFrame are performed.
Consider the following hypotheti‐ cal data collected about various kinds of meat:
In [53]: data
Out[53]:
food ounces
0 bacon 4.0
2 bacon 12.0
3 Pastrami 6.0
5 Bacon 8.0
6 pastrami 3.0
Suppose you wanted to add a column indicating the type of animal that each food came
from. Let’s write down a mapping of each distinct meat type to the kind of animal:
meat_to_animal = { 'bacon': 'pig', 'pulled pork': 'pig', 'pastrami': 'cow', 'corned beef': 'cow',
'honey ham': 'pig', 'nova lox': 'salmon' }
The map method on a Series accepts a function or dict-like object containing a map‐ ping,
but here we have a small problem in that some of the meats are capitalized and others
are not. Thus, we need to convert each value to lowercase using the str.lower Series
method:
In [56]: lowercased
Out[56]:
0 bacon
1 pulled pork
2 bacon
3 pastrami
4 corned beef
5 bacon
6 pastrami
7 honey ham
8 nova lox
In [58]: data
Out[58]:
We could also have passed a function that does all the work:
Out[59]:
0 pig
1 pig
2 pig
3 cow
4 cow
5 pig
6 cow
7 pig
8 salmon
Using map is a convenient way to perform element-wise transformations and other data
cleaning–related operations.
String Manipulation
Python has long been a popular raw data manipulation language in part due to its ease of
use for string and text processing. Most text operations are made simple with the string
object’s built-in methods. For more complex pattern matching and text manipulations,
regular expressions may be needed. pandas adds to the mix by ena‐ bling you to apply
string and regular expressions concisely on whole arrays of data, additionally handling
the annoyance of missing data.
In many string munging and scripting applications, built-in string methods are sufficient.
As an example, a comma-separated string can be broken into pieces with split:
In [135]: val.split(',')
split is often combined with strip to trim whitespace (including line breaks):
In [137]: pieces
Out[139]: 'a::b::guido'
A faster and more Pythonic way is to pass a list or tuple to the join method on the string
'::':
Continuous data is often discretized or otherwise separated into “bins” for analysis.In a
data about a group of people in a study, and need to group them into discrete age buckets:
In [75]: ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
Let’s divide these into bins of 18 to 25, 26 to 35, 36 to 60, and finally 61 and older.
To do so, you have to use cut, a function in pandas:
In [78]: cats
Out[78]:
[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25,
35]]
Length: 12
Categories (4, interval[int64]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]
The object pandas returns is a special Categorical object. The output you see describes
the bins computed by pandas.cut. You can treat it like an array of strings indicating the
bin name; internally it contains a categories array specifying the distinct category names
along with a labeling for the ages data in the codes attribute:
In [79]: cats.codes
In [80]: cats.categories
Out[80]:
closed='right',
dtype='interval[int64]')
In [81]: pd.value_counts(cats)
Out[81]:
(18, 25] 5
(35, 60] 3
(25, 35] 3
(60, 100] 1
dtype: int64
Note that pd.value_counts(cats) are the bin counts for the result of pandas.cut.
Consistent with mathematical notation for intervals, a parenthesis means that the side is
open, while the square bracket means it is closed (inclusive). You can change which side
is closed by passing right=False:
Out[82]:
[[18, 26), [18, 26), [18, 26), [26, 36), [18, 26), ..., [26, 36), [61, 100), [36, 61), [36, 61), [26,
36)]
Length: 12
Categories (4, interval[int64]): [[18, 26) < [26, 36) < [36, 61) < [61, 100)]
Outliers
An outlier is a data object that deviates significantly from the rest of the objects, as if it
were generated by a different mechanism.
For ease of presentation we may refer to data objects that are not outliers as “normal” or
expected data. Similarly, we may refer to outliers as “abnormal” data.
In the below figure, most objects follow a roughly Gaussian distribution. However, the
objects in region R are significantly different. It is unlikely that they follow the same
distribution as the other objects in the data set. Thus, the objects in R are outliers in the
data set.
For example, in credit card fraud detection, a customer’s purchase behavior can be
modeled as a random variable. A customer may generate some “noise transactions” that
may seem like “random errors” or “variance,” such as by buying a bigger lunch one day,
or having one more cup of coffee than usual. Such transactions should not be treated as
outliers; otherwise, the credit card company would incur heavy costs from verifying that
many transactions. The company may also lose customers by bothering them with
multiple false alarms.
As in many other data analysis and data mining tasks, noise should be removed before
outlier detection. Outliers are interesting because they are suspected of not being
generated by the same mechanisms as the rest of the data. Therefore, in outlier detection,
it is important to justify why the outliers detected are generated by some other
mechanisms. This is often achieved by making various assumptions on the rest of the data
and showing that the outliers detected violate those assumptions significantly.
Outlier detection is also related to novelty detection in evolving data sets. For example,
by monitoring a social media web site where new content is incoming, novelty detection
may identify new topics and trends in a timely manner. Novel topics may initially appear
as outliers. To this extent, outlier detection and novelty detection share some similarity
in modeling and detection methods. However, a critical difference between the two is that
in novelty detection, once new topics are confirmed, they are usually incorporated into
the model of normal behavior so that follow-up instances are not treated as outliers
anymore.
Types of Outliers
● Global outliers
● Contextual (or conditional) outliers, and
● Collective outliers.
Global Outliers
In a given data set, a data object is a global outlier if it deviates significantly from the rest
of the data set. Global outliers are sometimes called point anomalies, and are the simplest
type of outliers. Most outlier detection methods are aimed at finding global outliers.
Consider the points in the above figure again. The points in region R significantly deviate
from the rest of the data set, and hence are examples of global outliers. To detect global
outliers, a critical issue is to find an appropriate measurement of deviation with respect
to the application in question. Various measurements are proposed, and, based on these,
outlier detection methods are partitioned into different categories. We will come to this
issue in detail later. Global outlier detection is important in many applications. Consider
intrusion detection in computer networks, for example. If the communication behavior of
a computer is very different from the normal patterns (e.g., a large number of packages is
broadcast in a short time), this behavior may be considered as a global outlier and the
corresponding computer is a suspected victim of hacking. As another example, in trading
transaction auditing systems, transactions that do not follow the regulations are
considered as global outliers and should be held for further examination.
Contextual Outliers
Contextual attributes: The contextual attributes of a data object define the object’s
context. In the temperature example, the contextual attributes may be date and location.
Behavioral attributes: These define the object’s characteristics, and are used to evaluate
whether the object is an outlier in the context to which it belongs. In the temperature
example, the behavioral attributes may be the temperature, humidity, and pressure.
Collective Outliers
Given a data set, a subset of data objects forms a collective outlier if the
objects as a whole deviate significantly from the entire data set. Importantly, the
individual data objects may not be outliers.
In the below figure, the black objects as a whole form a collective outlier because the
density of those objects is much higher than the rest in the data set. However, every black
object individually is not an outlier with respect to the whole data set.