
Pandas Cheat Sheet CN

This document provides a cheat sheet on data wrangling with pandas. It discusses tidy data, which is a format that complements pandas' vectorized operations by having each variable in its own column and each observation in its own row. It then covers syntax for creating DataFrames, reshaping data by gathering/spreading rows and columns, subsetting observations and variables, and method chaining to apply multiple pandas methods sequentially. Finally, it includes examples of using regular expressions to select columns.


Data Wrangling with pandas Cheat Sheet
http://pandas.pydata.org

Tidy Data – A foundation for wrangling in pandas

In a tidy data set:
- Each variable is saved in its own column.
- Each observation is saved in its own row.

Tidy data complements pandas' vectorized operations. pandas will automatically preserve observations as you manipulate variables. No other format works as intuitively with pandas.
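The tidy-data idea is easiest to see with a tiny before-and-after sketch. The table and column names here are invented for illustration; only `pd.melt` comes from the cheat sheet itself:

```python
import pandas as pd

# A "messy" table: the year variable is spread across column headers.
messy = pd.DataFrame({
    "name": ["A", "B"],
    "2019": [1, 3],
    "2020": [2, 4],
})

# Tidy it with melt: each variable (name, year, value) gets its own
# column, and each observation gets its own row.
tidy = pd.melt(messy, id_vars=["name"], var_name="year", value_name="value")
print(tidy)
```

Once the data are tidy, row-wise vectorized operations such as `tidy.query('year == "2019"')` apply uniformly, which is what the remark above about preserving observations is pointing at.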
Syntax – Creating DataFrames

df = pd.DataFrame(
    {"a" : [4, 5, 6],
     "b" : [7, 8, 9],
     "c" : [10, 11, 12]},
    index = [1, 2, 3])
Specify values for each column.

df = pd.DataFrame(
    [[4, 7, 10],
     [5, 8, 11],
     [6, 9, 12]],
    index=[1, 2, 3],
    columns=['a', 'b', 'c'])
Specify values for each row.

df = pd.DataFrame(
    {"a" : [4, 5, 6],
     "b" : [7, 8, 9],
     "c" : [10, 11, 12]},
    index = pd.MultiIndex.from_tuples(
        [('d', 1), ('d', 2), ('e', 2)],
        names=['n', 'v']))
Create DataFrame with a MultiIndex.

Reshaping Data – Change the layout of a data set

pd.melt(df)
Gather columns into rows.

df.pivot(columns='var', values='val')
Spread rows into columns.

pd.concat([df1, df2])
Append rows of DataFrames.

pd.concat([df1, df2], axis=1)
Append columns of DataFrames.

df = df.sort_values('mpg')
Order rows by values of a column (low to high).

df = df.sort_values('mpg', ascending=False)
Order rows by values of a column (high to low).

df = df.rename(columns={'y': 'year'})
Rename the columns of a DataFrame.

df = df.sort_index()
Sort the index of a DataFrame.

df = df.reset_index()
Reset index of DataFrame to row numbers, moving index to columns.

df = df.drop(['Length', 'Height'], axis=1)
Drop columns from DataFrame.

Subset Observations (Rows)

df[df.Length > 7]
Extract rows that meet logical criteria.

df.drop_duplicates()
Remove duplicate rows (only considers columns).

df.sample(frac=0.5)
Randomly select fraction of rows.

df.sample(n=10)
Randomly select n rows.

df.iloc[10:20]
Select rows by position.

df.head(n)
Select first n rows.

df.tail(n)
Select last n rows.

df.nlargest(n, 'value')
Select and order top n entries.

df.nsmallest(n, 'value')
Select and order bottom n entries.

Subset Variables (Columns)

df[['width', 'length', 'species']]
Select multiple columns with specific names.

df['width'] or df.width
Select single column with specific name.

df.filter(regex='regex')
Select columns whose name matches regular expression regex.

df.loc[:, 'x2':'x4']
Select all columns between x2 and x4 (inclusive).

df.iloc[:, [1, 2, 5]]
Select columns in positions 1, 2 and 5 (first column is 0).

df.loc[df['a'] > 10, ['a', 'c']]
Select rows meeting logical condition, and only the specific columns.

regex (Regular Expressions) Examples

'\.'              Matches strings containing a period '.'
'Length$'         Matches strings ending with the word 'Length'
'^Sepal'          Matches strings beginning with the word 'Sepal'
'^x[1-5]$'        Matches strings beginning with 'x' and ending with 1, 2, 3, 4 or 5
'^(?!Species$).*' Matches strings except the string 'Species'

Logic in Python (and pandas)

<    Less than
>    Greater than
==   Equals
<=   Less than or equals
>=   Greater than or equals
!=   Not equal to
df.column.isin(values)            Group membership
pd.isnull(obj)                    Is NaN
pd.notnull(obj)                   Is not NaN
&, |, ~, ^, df.any(), df.all()    Logical and, or, not, xor, any, all

Method Chaining

Most pandas methods return a DataFrame, so another pandas method can be applied to the result. This improves readability of code.

df = (pd.melt(df)
        .rename(columns={
            'variable': 'var',
            'value': 'val'})
        .query('val >= 200')
     )

http://pandas.pydata.org/ This cheat sheet was inspired by the RStudio Data Wrangling Cheatsheet (https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf). Written by Irv Lustig, Princeton Consultants.
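The method chain shown in the Method Chaining section can be run end to end with a small invented frame. The column names and the `val >= 200` threshold come from the cheat sheet; the data are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({"x1": [100, 250], "x2": [300, 150]})

# melt -> rename -> query, each step feeding the next; wrapping the
# chain in parentheses lets each method sit on its own line.
result = (pd.melt(df)
            .rename(columns={'variable': 'var', 'value': 'val'})
            .query('val >= 200'))
print(result)
```

The same pipeline without chaining would need an intermediate variable after every step; chaining keeps the transformation readable as a single expression.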
Summarize Data

df['Length'].value_counts()
Count number of rows with each unique value of variable.

len(df)
# of rows in DataFrame.

len(df['w'].unique())
# of distinct values in a column.

df.describe()
Basic descriptive statistics for each column (or GroupBy).

pandas provides a large set of summary functions that operate on different kinds of pandas objects (DataFrame columns, Series, GroupBy, Expanding and Rolling; see below) and produce single values for each of the groups. When applied to a DataFrame, the result is returned as a pandas Series for each column. Examples:

sum()                    Sum values of each object.
count()                  Count non-NA/null values of each object.
mean()                   Mean value of each object.
median()                 Median value of each object.
var()                    Variance of each object.
std()                    Standard deviation of each object.
min()                    Minimum value in each object.
max()                    Maximum value in each object.
quantile([0.25, 0.75])   Quantiles of each object.
apply(function)          Apply function to each object.

Handling Missing Data

df = df.dropna()
Drop rows with any column having NA/null data.

df = df.fillna(value)
Replace all NA/null data with value.

Make New Variables

df = df.assign(Area=lambda df: df.Length * df.Height)
Compute and append one or more new columns.

df['Volume'] = df.Length * df.Height * df.Depth
Add single column.

pd.qcut(df.col, n, labels=False)
Bin column into n buckets.

pandas provides a large set of vector functions that operate on all columns of a DataFrame or a single selected column (a pandas Series). These functions produce vectors of values for each of the columns, or a single Series for the individual Series. Examples:

max(axis=1)                 Element-wise max.
min(axis=1)                 Element-wise min.
clip(lower=-10, upper=10)   Trim values at input thresholds.
abs()                       Absolute value.

The examples below can also be applied to groups. In this case, the function is applied on a per-group basis, and the returned vectors are of the length of the original DataFrame.

shift(1)               Copy with values shifted by 1.
shift(-1)              Copy with values lagged by 1.
rank(method='dense')   Ranks with no gaps.
rank(method='min')     Ranks; ties get min rank.
rank(pct=True)         Ranks rescaled to the interval [0, 1].
rank(method='first')   Ranks; ties go to first value.
cumsum()               Cumulative sum.
cummax()               Cumulative max.
cummin()               Cumulative min.
cumprod()              Cumulative product.

Group Data

df.groupby(by="col")
Return a GroupBy object, grouped by values in column named "col".

df.groupby(level="ind")
Return a GroupBy object, grouped by values in index level named "ind".

All of the summary functions listed above can be applied to a group. Additional GroupBy functions:

size()          Size of each group.
agg(function)   Aggregate group using function.

Windows

df.expanding()
Return an Expanding object allowing summary functions to be applied cumulatively.

df.rolling(n)
Return a Rolling object allowing summary functions to be applied to windows of length n.

Plotting

df.plot.hist()
Histogram for each column.

df.plot.scatter(x='w', y='h')
Scatter chart using pairs of points.

Combine Data Sets

Example DataFrames used below:

adf:          bdf:
x1  x2        x1  x3
A   1         A   T
B   2         B   F
C   3         D   T

Standard Joins

pd.merge(adf, bdf, how='left', on='x1')
Join matching rows from bdf to adf.

pd.merge(adf, bdf, how='right', on='x1')
Join matching rows from adf to bdf.

pd.merge(adf, bdf, how='inner', on='x1')
Join data. Retain only rows in both sets.

pd.merge(adf, bdf, how='outer', on='x1')
Join data. Retain all values, all rows.

Filtering Joins

adf[adf.x1.isin(bdf.x1)]
All rows in adf that have a match in bdf.

adf[~adf.x1.isin(bdf.x1)]
All rows in adf that do not have a match in bdf.

Set-like Operations

Example DataFrames used below:

ydf:          zdf:
x1  x2        x1  x2
A   1         B   2
B   2         C   3
C   3         D   4

pd.merge(ydf, zdf)
Rows that appear in both ydf and zdf (intersection).

pd.merge(ydf, zdf, how='outer')
Rows that appear in either or both ydf and zdf (union).

(pd.merge(ydf, zdf, how='outer', indicator=True)
   .query('_merge == "left_only"')
   .drop(['_merge'], axis=1))
Rows that appear in ydf but not zdf (setdiff).
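The standard and filtering joins above can be checked end to end by rebuilding the adf and bdf example frames from the Combine Data Sets section:

```python
import pandas as pd

adf = pd.DataFrame({"x1": ["A", "B", "C"], "x2": [1, 2, 3]})
bdf = pd.DataFrame({"x1": ["A", "B", "D"], "x3": ["T", "F", "T"]})

left = pd.merge(adf, bdf, how='left', on='x1')    # all rows of adf; x3 is NaN for C
inner = pd.merge(adf, bdf, how='inner', on='x1')  # only keys present in both (A, B)
outer = pd.merge(adf, bdf, how='outer', on='x1')  # all keys from both (A, B, C, D)

# Filtering join: rows of adf whose key also appears in bdf.
semi = adf[adf.x1.isin(bdf.x1)]
print(outer)
```

Choosing between a merge and a filtering join comes down to whether you want bdf's columns in the result: `isin` filters adf without adding columns, while `merge` brings x3 along.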
