Pandas Cheat Sheet

This document provides a cheat sheet on data wrangling with pandas. It discusses tidy data, which organizes data such that each variable is in its own column and each observation is in its own row. This format works best with pandas' vectorized operations. The cheat sheet also covers syntax for creating DataFrames, reshaping data by gathering/spreading rows and columns, subsetting observations and variables, and method chaining in pandas.

Data Wrangling with pandas
Cheat Sheet
http://pandas.pydata.org

Tidy Data – A foundation for wrangling in pandas

In a tidy data set, each variable is saved in its own column and each observation is saved in its own row.

Tidy data complements pandas’s vectorized operations. pandas will automatically preserve observations as you manipulate variables. No other format works as intuitively with pandas.
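As a quick illustration of this point (a minimal sketch; the country/year/cases columns below are invented for the example, not taken from the sheet), moving a wide table into tidy form lets vectorized operations act on whole columns:

import pandas as pd

# Hypothetical wide table: one column per year, so 'year' is not its own variable.
wide = pd.DataFrame({"country": ["A", "B"],
                     "1999": [745, 37737],
                     "2000": [2666, 80488]})

# Tidy form: each variable (country, year, cases) gets its own column,
# and each observation (one country in one year) gets its own row.
tidy = pd.melt(wide, id_vars="country", var_name="year", value_name="cases")

# Vectorized operations now apply to an entire column at once.
tidy["cases_in_thousands"] = tidy["cases"] / 1000
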
Syntax – Creating DataFrames

df = pd.DataFrame(
    {"a" : [4, 5, 6],
     "b" : [7, 8, 9],
     "c" : [10, 11, 12]},
    index = [1, 2, 3])
Specify values for each column.

df = pd.DataFrame(
    [[4, 7, 10],
     [5, 8, 11],
     [6, 9, 12]],
    index=[1, 2, 3],
    columns=['a', 'b', 'c'])
Specify values for each row.

df = pd.DataFrame(
    {"a" : [4, 5, 6],
     "b" : [7, 8, 9],
     "c" : [10, 11, 12]},
    index = pd.MultiIndex.from_tuples(
        [('d', 1), ('d', 2), ('e', 2)],
        names=['n', 'v']))
Create DataFrame with a MultiIndex.

Reshaping Data – Change the layout of a data set

pd.melt(df)
Gather columns into rows.

df.pivot(columns='var', values='val')
Spread rows into columns.

pd.concat([df1, df2])
Append rows of DataFrames.

pd.concat([df1, df2], axis=1)
Append columns of DataFrames.

df.sort_values('mpg')
Order rows by values of a column (low to high).

df.sort_values('mpg', ascending=False)
Order rows by values of a column (high to low).

df.rename(columns = {'y': 'year'})
Rename the columns of a DataFrame.

df.sort_index()
Sort the index of a DataFrame.

df.reset_index()
Reset index of DataFrame to row numbers, moving index to columns.

df.drop(columns=['Length', 'Height'])
Drop columns from DataFrame.
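A brief, runnable sketch of the reshaping calls above (the frames and column names are illustrative, not from the sheet):

import pandas as pd

df1 = pd.DataFrame({"a": [4, 5, 6], "b": [7, 8, 9]}, index=[1, 2, 3])
df2 = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=[4, 5])

long = pd.melt(df1)                              # gather columns into rows ('variable'/'value')
stacked = pd.concat([df1, df2])                  # append rows of DataFrames
side_by_side = pd.concat([df1, df2], axis=1)     # append columns, aligned on the index
ordered = df1.sort_values("a", ascending=False)  # order rows by a column, high to low
renamed = df1.rename(columns={"a": "alpha"})     # rename a column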

Subset Observations (Rows)

df[df.Length > 7]
Extract rows that meet logical criteria.

df.drop_duplicates()
Remove duplicate rows (only considers columns).

df.head(n)
Select first n rows.

df.tail(n)
Select last n rows.

df.sample(frac=0.5)
Randomly select fraction of rows.

df.sample(n=10)
Randomly select n rows.

df.iloc[10:20]
Select rows by position.

df.nlargest(n, 'value')
Select and order top n entries.

df.nsmallest(n, 'value')
Select and order bottom n entries.

Subset Variables (Columns)

df[['width', 'length', 'species']]
Select multiple columns with specific names.

df['width'] or df.width
Select single column with specific name.

df.filter(regex='regex')
Select columns whose name matches regular expression regex.

df.loc[:, 'x2':'x4']
Select all columns between x2 and x4 (inclusive).

df.iloc[:, [1, 2, 5]]
Select columns in positions 1, 2 and 5 (first column is 0).

df.loc[df['a'] > 10, ['a', 'c']]
Select rows meeting logical condition, and only the specific columns.

regex (Regular Expressions) Examples

'\.'                Matches strings containing a period '.'
'Length$'           Matches strings ending with the word 'Length'
'^Sepal'            Matches strings beginning with the word 'Sepal'
'^x[1-5]$'          Matches strings beginning with 'x' and ending with 1, 2, 3, 4 or 5
'^(?!Species$).*'   Matches strings except the string 'Species'
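A small subsetting sketch reusing the sheet's width/length/species column names (the values are made up):

import pandas as pd

df = pd.DataFrame({"width": [3.5, 7.1, 8.0],
                   "length": [6.0, 7.5, 9.2],
                   "species": ["setosa", "virginica", "virginica"]})

tall = df[df.length > 7]                                  # rows that meet a logical criterion
some_cols = df[["width", "species"]]                      # columns selected by name
len_cols = df.filter(regex="^len")                        # columns whose names match a regex
subset = df.loc[df["width"] > 5, ["length", "species"]]   # rows and columns together
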
Logic in Python (and pandas)

<      Less than
>      Greater than
==     Equals
<=     Less than or equals
>=     Greater than or equals
!=     Not equal to
df.column.isin(values)            Group membership
pd.isnull(obj)                    Is NaN
pd.notnull(obj)                   Is not NaN
&, |, ~, ^, df.any(), df.all()    Logical and, or, not, xor, any, all

Method Chaining

Most pandas methods return a DataFrame so that another pandas method can be applied to the result. This improves readability of code.

df = (pd.melt(df)
        .rename(columns={
            'variable' : 'var',
            'value' : 'val'})
        .query('val >= 200')
     )
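A hedged sketch that combines the operators above with boolean indexing, plus an equivalent query-based chain (the column names and the 200 threshold are illustrative):

import pandas as pd

df = pd.DataFrame({"val": [150, 220, 90, 310],
                   "species": ["setosa", "virginica", "setosa", "versicolor"]})

# Combine conditions with & (and), | (or), ~ (not); wrap each condition in parentheses.
mask = (df.val >= 200) & (df.species.isin(["virginica", "versicolor"]))
selected = df[mask]

# The same filter expressed as a method chain.
selected = df.query('val >= 200 and species in ["virginica", "versicolor"]')
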
Summarize Data

df['w'].value_counts()
Count number of rows with each unique value of variable.

len(df)
# of rows in DataFrame.

df['w'].nunique()
# of distinct values in a column.

df.describe()
Basic descriptive statistics for each column (or GroupBy).

pandas provides a large set of summary functions that operate on different kinds of pandas objects (DataFrame columns, Series, GroupBy, Expanding and Rolling (see below)) and produce single values for each of the groups. When applied to a DataFrame, the result is returned as a pandas Series for each column. Examples:

sum()                    Sum values of each object.
count()                  Count non-NA/null values of each object.
median()                 Median value of each object.
quantile([0.25, 0.75])   Quantiles of each object.
apply(function)          Apply function to each object.
min()                    Minimum value in each object.
max()                    Maximum value in each object.
mean()                   Mean value of each object.
var()                    Variance of each object.
std()                    Standard deviation of each object.
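A minimal summary sketch, assuming a small frame with a categorical column w and a numeric column val:

import pandas as pd

df = pd.DataFrame({"w": ["x", "y", "x", "x"], "val": [1.0, 2.0, 3.0, 4.0]})

counts = df["w"].value_counts()    # rows per unique value of w
n_rows = len(df)                   # number of rows
n_distinct = df["w"].nunique()     # distinct values in w
stats = df.describe()              # count, mean, std, quartiles for numeric columns
column_means = df[["val"]].mean()  # one value per column, returned as a Series
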
Handling Missing Data

df.dropna()
Drop rows with any column having NA/null data.

df.fillna(value)
Replace all NA/null data with value.
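A short missing-data sketch (the Length/Height names echo the sheet; the NaN placement is arbitrary):

import numpy as np
import pandas as pd

df = pd.DataFrame({"Length": [6.0, np.nan, 9.2],
                   "Height": [2.0, 3.5, np.nan]})

complete_rows = df.dropna()   # keep only rows with no missing values
filled = df.fillna(0)         # replace every NA/null with the given value
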
Make New Columns

df.assign(Area=lambda df: df.Length*df.Height)
Compute and append one or more new columns.

df['Volume'] = df.Length*df.Height*df.Depth
Add single column.

pd.qcut(df.col, n, labels=False)
Bin column into n buckets.

pandas provides a large set of vector functions that operate on all columns of a DataFrame or a single selected column (a pandas Series). These functions produce vectors of values for each of the columns, or a single Series for the individual Series. Examples:

max(axis=1)                 Element-wise max.
min(axis=1)                 Element-wise min.
clip(lower=-10, upper=10)   Trim values at input thresholds.
abs()                       Absolute value.
shift(1)                    Copy with values shifted by 1.
shift(-1)                   Copy with values lagged by 1.
rank(method='dense')        Ranks with no gaps.
rank(method='min')          Ranks. Ties get min rank.
rank(pct=True)              Ranks rescaled to interval [0, 1].
rank(method='first')        Ranks. Ties go to first value.
cumsum()                    Cumulative sum.
cummax()                    Cumulative max.
cummin()                    Cumulative min.
cumprod()                   Cumulative product.
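A sketch tying new-column creation to the vector functions above (Length/Height/Depth follow the sheet; the values and the choice of 3 buckets are illustrative):

import pandas as pd

df = pd.DataFrame({"Length": [6.0, 7.5, 9.2],
                   "Height": [2.0, 3.5, 1.1],
                   "Depth":  [1.0, 2.0, 3.0]})

df = df.assign(Area=lambda d: d.Length * d.Height)     # append a computed column
df["Volume"] = df.Length * df.Height * df.Depth        # add a single column in place
df["size_bin"] = pd.qcut(df.Volume, 3, labels=False)   # bin into 3 quantile buckets
df["running_total"] = df.Volume.cumsum()               # cumulative sum down the column
df["volume_rank"] = df.Volume.rank(method="dense")     # dense ranks, no gaps
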
Group Data

df.groupby(by="col")
Return a GroupBy object, grouped by values in column named "col".

df.groupby(level="ind")
Return a GroupBy object, grouped by values in index level named "ind".

All of the summary functions listed above can be applied to a group. Additional GroupBy functions:

size()           Size of each group.
agg(function)    Aggregate group using function.

The vector functions listed above can also be applied to groups. In this case, the function is applied on a per-group basis, and the returned vectors are of the length of the original DataFrame.
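A brief GroupBy sketch (the column names and the custom aggregate are illustrative):

import pandas as pd

df = pd.DataFrame({"col": ["a", "a", "b"], "val": [1, 2, 3]})

by_col = df.groupby(by="col")
sizes = by_col.size()                                     # rows per group
totals = by_col["val"].sum()                              # summary function, one value per group
spread = by_col["val"].agg(lambda s: s.max() - s.min())   # custom aggregate per group
within = by_col["val"].cumsum()                           # vector function per group, same length as df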

Combine Data Sets

adf                bdf
x1  x2             x1  x3
A   1              A   T
B   2              B   F
C   3              D   T

Standard Joins

pd.merge(adf, bdf, how='left', on='x1')
Join matching rows from bdf to adf.
    x1  x2  x3
    A   1   T
    B   2   F
    C   3   NaN

pd.merge(adf, bdf, how='right', on='x1')
Join matching rows from adf to bdf.
    x1  x2   x3
    A   1.0  T
    B   2.0  F
    D   NaN  T

pd.merge(adf, bdf, how='inner', on='x1')
Join data. Retain only rows in both sets.
    x1  x2  x3
    A   1   T
    B   2   F

pd.merge(adf, bdf, how='outer', on='x1')
Join data. Retain all values, all rows.
    x1  x2   x3
    A   1    T
    B   2    F
    C   3    NaN
    D   NaN  T

Filtering Joins

adf[adf.x1.isin(bdf.x1)]
All rows in adf that have a match in bdf.
    x1  x2
    A   1
    B   2

adf[~adf.x1.isin(bdf.x1)]
All rows in adf that do not have a match in bdf.
    x1  x2
    C   3

ydf                zdf
x1  x2             x1  x2
A   1              B   2
B   2              C   3
C   3              D   4

Set-like Operations

pd.merge(ydf, zdf)
Rows that appear in both ydf and zdf (Intersection).
    x1  x2
    B   2
    C   3

pd.merge(ydf, zdf, how='outer')
Rows that appear in either or both ydf and zdf (Union).
    x1  x2
    A   1
    B   2
    C   3
    D   4

(pd.merge(ydf, zdf, how='outer', indicator=True)
    .query('_merge == "left_only"')
    .drop(columns=['_merge']))
Rows that appear in ydf but not zdf (Setdiff).
    x1  x2
    A   1
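A runnable sketch that builds adf and bdf as defined above and applies the joins:

import pandas as pd

adf = pd.DataFrame({"x1": ["A", "B", "C"], "x2": [1, 2, 3]})
bdf = pd.DataFrame({"x1": ["A", "B", "D"], "x3": ["T", "F", "T"]})

left = pd.merge(adf, bdf, how='left', on='x1')    # keep every row of adf
inner = pd.merge(adf, bdf, how='inner', on='x1')  # keep only matching keys
outer = pd.merge(adf, bdf, how='outer', on='x1')  # keep all keys from both frames
semi = adf[adf.x1.isin(bdf.x1)]                   # filtering join: rows of adf with a match
anti = adf[~adf.x1.isin(bdf.x1)]                  # filtering join: rows of adf without a match
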
Windows

df.expanding()
Return an Expanding object allowing summary functions to be applied cumulatively.

df.rolling(n)
Return a Rolling object allowing summary functions to be applied to windows of length n.

Plotting

df.plot.hist()
Histogram for each column.

df.plot.scatter(x='w', y='h')
Scatter chart using pairs of points.
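A short window sketch (the five values and the window length of 3 are arbitrary):

import pandas as pd

df = pd.DataFrame({"val": [1.0, 2.0, 4.0, 8.0, 16.0]})

running_mean = df.expanding().mean()   # mean of all rows seen so far
windowed_sum = df.rolling(3).sum()     # sum over each 3-row window (NaN until the window fills)
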
http://pandas.pydata.org/ This cheat sheet was inspired by the RStudio Data Wrangling Cheatsheet (https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf). Written by Irv Lustig, Princeton Consultants.
