
Data Manipulation with Pandas

Transforming DataFrames
1. Introducing DataFrames

Hi, I'm Richie. I'll be your tour guide through the world of pandas.

2. What's the point of pandas?

pandas is a Python package for data manipulation. It can also be used for data
visualization; we'll get to that in Chapter 4.

3. Course outline

We'll start by talking about DataFrames, which form the core of pandas. In chapter 2,
we'll discuss aggregating data to gather insights. In chapter 3, you'll learn all about
slicing and indexing to subset DataFrames. Finally, you'll visualize your data, deal with
missing data, and read data into a DataFrame. Let's dive in.

4. pandas is built on NumPy and Matplotlib

pandas is built on top of two essential Python packages, NumPy and Matplotlib. NumPy
provides the multidimensional array objects that pandas uses to store data, and
Matplotlib has powerful data visualization capabilities that pandas takes advantage of.

5. pandas is popular

pandas has millions of users, with PyPI recording about 14 million downloads in
December 2019. This represents almost the entire Python data science community!

https://pypistats.org/packages/pandas

6. Rectangular data

There are several ways to store data for analysis, but rectangular data, sometimes
called "tabular data" is the most common form. In this example, with dogs, each
observation, or each dog, is a row, and each variable, or each dog property, is a
column. pandas is designed to work with rectangular data like this.

7. pandas DataFrames

In pandas, rectangular data is represented as a DataFrame object. Every programming
language used for data analysis has something similar to this. R also has DataFrames,
while SQL has database tables. Every value within a column has the same data type,
either text or numeric, but different columns can contain different data types.

8. Exploring a DataFrame: .head()


When you first receive a new dataset, you want to quickly explore it and get a sense of
its contents. pandas has several methods for this. The first is head, which returns the
first few rows of the DataFrame. We only had seven rows to begin with, so it's not super
exciting, but this becomes very useful if you have many rows.

9. Exploring a DataFrame: .info()

The info method displays the names of columns, the data types they contain, and
whether they have any missing values.

10. Exploring a DataFrame: .shape

A DataFrame's shape attribute contains a tuple that holds the number of rows followed
by the number of columns. Since this is an attribute instead of a method, you write it
without parentheses.

11. Exploring a DataFrame: .describe()

The describe method computes some summary statistics for numerical columns, like
mean and median. "count" is the number of non-missing values in each column.
describe is good for a quick overview of numeric variables, but if you want more control,
you'll see how to perform more specific calculations later in the course.

12. Components of a DataFrame: .values

DataFrames consist of three different components, accessible using attributes. The
values attribute, as you might expect, contains the data values in a 2-dimensional
NumPy array.

13. Components of a DataFrame: .columns and .index

The other two components of a DataFrame are labels for columns and rows. The
columns attribute contains column names, and the index attribute contains row numbers
or row names. Be careful, since row labels are stored in dot-index, not in dot-rows.
Notice that these are Index objects, which we'll cover in Chapter 3. This allows for
flexibility in labels. For example, the dogs data uses row numbers, but row names are
also possible.

14. pandas Philosophy

Python has a semi-official philosophy on how to write good code called The Zen of
Python. One suggestion is that given a programming problem, there should only be one
obvious solution. As you go through this course, bear in mind that pandas deliberately
doesn't follow this philosophy. Instead, there are often multiple ways to solve a problem,
leaving you to choose the best. In this respect, pandas is like a Swiss Army Knife, giving
you a variety of tools, making it incredibly powerful, but more difficult to learn. In this
course, we aim for a more streamlined approach to pandas, only covering the most
important ways of doing things.

https://www.python.org/dev/peps/pep-0020/
Inspecting a DataFrame
When you get a new DataFrame to work with, the first thing you need to do is explore it
and see what it contains. There are several useful methods and attributes for this.

 .head() returns the first few rows (the “head” of the DataFrame).
 .info() shows information on each of the columns, such as the data type and number
of missing values.
 .shape returns the number of rows and columns of the DataFrame.
 .describe() calculates a few summary statistics for each column.

homelessness is a DataFrame containing estimates of homelessness in each U.S. state
in 2018. The individuals column is the number of homeless individuals not part of a
family with children. The family_members column is the number of homeless individuals
part of a family with children. The state_pop column is the state's total population.
pandas is imported for you.
Instructions 1/4

 Print the head of the homelessness DataFrame.


 Print information about the column types and missing values in homelessness.
 Print the number of rows and columns in homelessness.
 Print some summary statistics that describe the homelessness DataFrame.

# Print the head of the homelessness data


print(homelessness.head())

# Print information about homelessness


print(homelessness.info())

# Print the shape of homelessness


print(homelessness.shape)

# Print a description of homelessness


print(homelessness.describe())
Parts of a DataFrame
To better understand DataFrame objects, it's useful to know that they consist of three
components, stored as attributes:

 .values: A two-dimensional NumPy array of values.


 .columns: An index of columns: the column names.
 .index: An index for the rows: either row numbers or row names.

You can usually think of indexes as a list of strings or numbers, though the
pandas Index data type allows for more sophisticated options. (These will be covered
later in the course.)
homelessness is available.
Instructions

 Import pandas using the alias pd.


 Print a 2D NumPy array of the values in homelessness.
 Print the column names of homelessness.
 Print the index of homelessness.

# Import pandas using the alias pd


import pandas as pd

# Print the values of homelessness


print(homelessness.values)

# Print the column index of homelessness


print(homelessness.columns)

# Print the row index of homelessness


print(homelessness.index)

Sorting and Subsetting


1. Sorting and subsetting

In this video, we'll cover the two simplest and possibly most important ways to find
interesting parts of your DataFrame.
2. Sorting

The first thing you can do is change the order of the rows by sorting them so that the
most interesting data is at the top of the DataFrame. You can sort rows using the
sort_values method, passing in a column name that you want to sort by. For example,
when we apply sort_values on the weight_kg column of the dogs DataFrame, we get
the lightest dog at the top, Stella the Chihuahua, and the heaviest dog at the bottom,
Bernie the Saint Bernard.

3. Sorting in descending order

Setting the ascending argument to False will sort the data the other way around, from
heaviest dog to lightest dog.

4. Sorting by multiple variables

We can sort by multiple variables by passing a list of column names to sort_values.
Here, we sort first by weight, then by height. Now, Charlie, Lucy, and Bella are ordered
from shortest to tallest, even though they all weigh the same.

5. Sorting by multiple variables

To change the direction values are sorted in, pass a list to the ascending argument to
specify which direction sorting should be done for each variable. Now, Charlie, Lucy,
and Bella are ordered from tallest to shortest.
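
The calls described in the last few slides fit in a short sketch. The dogs data here is a minimal, made-up stand-in; only the column names (name, breed, weight_kg, height_cm) follow the transcript's examples.

import pandas as pd

dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy", "Stella"],
    "breed": ["Labrador", "Poodle", "Chow Chow", "Chihuahua"],
    "weight_kg": [24, 24, 24, 2],
    "height_cm": [56, 43, 46, 18],
})

# Sort by one column, lightest dog first
print(dogs.sort_values("weight_kg"))

# Sort by weight, breaking ties by height, tallest first
print(dogs.sort_values(["weight_kg", "height_cm"], ascending=[True, False]))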

6. Subsetting columns

We may want to zoom in on just one column. We can do this using the name of the
DataFrame, followed by square brackets with a column name inside. Here, we can look
at just the name column.

7. Subsetting multiple columns

To select multiple columns, you need two pairs of square brackets. In this code, the
inner and outer square brackets are performing different tasks. The outer square
brackets are responsible for subsetting the DataFrame, and the inner square brackets
are creating a list of column names to subset. This means you could provide a separate
list of column names as a variable and then use that list to perform the same subsetting.
Usually, it's easier to do in one line.

8. Subsetting rows

There are lots of different ways to subset rows. The most common way to do this is by
creating a logical condition to filter against. For example, let's find all the dogs whose
height is greater than 50 centimeters. Now we have a True or False value for every row.

9. Subsetting rows

We can use the logical condition inside of square brackets to subset the rows we're
interested in to get all of the dogs taller than 50 centimeters.
10. Subsetting based on text data

We can also subset rows based on text data. Here, we use the double equal sign in the
logical condition to filter the dogs that are Labradors.

11. Subsetting based on dates

We can also subset based on dates. Here, we filter all the dogs born before 2015.
Notice that the dates are in quotes and are written as year then month, then day. This is
the international standard date format.

12. Subsetting based on multiple conditions

To subset the rows that meet multiple conditions, you can combine conditions using
logical operators, such as the "and" operator seen here. This means that only rows that
meet both of these conditions will be subsetted. You could also do this in one line of
code, but you'll also need to add parentheses around each condition.

13. Subsetting using .isin()

If you want to filter on multiple values of a categorical variable, the easiest way is to use
the isin method. This takes in a list of values to filter for. Here, we check if the color of a
dog is black or brown, and use this condition to subset the data.

Sorting rows
Finding interesting bits of data in a DataFrame is often easier if you change the order of
the rows. You can sort the rows by passing a column name to .sort_values().

In cases where rows have the same value (this is common if you sort on a categorical
variable), you may wish to break the ties by sorting on another column. You can sort on
multiple columns in this way by passing a list of column names.

Sort on … Syntax

one column df.sort_values("breed")

multiple columns df.sort_values(["breed", "weight_kg"])


By combining .sort_values() with .head(), you can answer questions in the form,
"What are the top cases where…?".
homelessness is available and pandas is loaded as pd.
Instructions 1/3

 1
o Sort homelessness by the number of homeless individuals, from smallest
to largest, and save this as homelessness_ind.
o Print the head of the sorted DataFrame.

# Sort homelessness by individuals


homelessness_ind = homelessness.sort_values("individuals")

# Print the top few rows


print(homelessness_ind.head())

 2

o Sort homelessness by the number of homeless family_members in descending order, and save this as homelessness_fam.
o Print the head of the sorted DataFrame.

# Sort homelessness by descending family members
homelessness_fam = homelessness.sort_values("family_members", ascending=False)

# Print the top few rows
print(homelessness_fam.head())

 3

o Sort homelessness first by region (ascending), and then by number of family members (descending). Save this as homelessness_reg_fam.
o Print the head of the sorted DataFrame.

# Sort homelessness by region, then descending family members
homelessness_reg_fam = homelessness.sort_values(["region", "family_members"], ascending=[True, False])

# Print the top few rows
print(homelessness_reg_fam.head())
Subsetting columns
When working with data, you may not need all of the variables in your dataset. Square
brackets ([]) can be used to select only the columns that matter to you in an order that
makes sense to you. To select only "col_a" of the DataFrame df, use
df["col_a"]
To select "col_a" and "col_b" of df, use
df[["col_a", "col_b"]]
homelessness is available and pandas is loaded as pd.

o Create a DataFrame called individuals that contains only the individuals column of homelessness.
o Print the head of the result.

# Select the individuals column


individuals = homelessness["individuals"]

# Print the head of the result


print(individuals.head())

 2

o Create a DataFrame called state_fam that contains only the state and family_members columns of homelessness, in that order.
o Print the head of the result.

Hint
Use double square-brackets with column names in quotes to select multiple columns.
# Select the state and family_members columns
state_fam = homelessness[["state", "family_members"]]

# Print the head of the result


print(state_fam.head())
 3

o Create a DataFrame called ind_state that contains the individuals and state columns of homelessness, in that order.
o Print the head of the result.

# Select only the individuals and state columns, in that order


ind_state = homelessness[["individuals", "state"]]

# Print the head of the result


print(ind_state.head())

Subsetting rows
A large part of data science is about finding which bits of your dataset are interesting.
One of the simplest techniques for this is to find a subset of rows that match some
criteria. This is sometimes known as filtering rows or selecting rows.

There are many ways to subset a DataFrame, perhaps the most common is to use
relational operators to return True or False for each row, then pass that inside square
brackets.
dogs[dogs["height_cm"] > 60]
dogs[dogs["color"] == "tan"]
You can filter for multiple conditions at once by using the "bitwise and" operator, &.
dogs[(dogs["height_cm"] > 60) & (dogs["color"] == "tan")]
homelessness is available and pandas is loaded as pd.
Instructions 1/3

 1

Filter homelessness for cases where the number of individuals is greater than ten
thousand, assigning to ind_gt_10k. View the printed result.
# Filter for rows where individuals is greater than 10000
ind_gt_10k = homelessness[homelessness["individuals"] > 10000]
# See the result
print(ind_gt_10k)

 2

Filter homelessness for cases where the USA Census region is "Mountain",
assigning to mountain_reg. View the printed result.

# Filter for rows where region is Mountain


mountain_reg = homelessness[homelessness["region"] == "Mountain"]

# See the result


print(mountain_reg)

 3

Filter homelessness for cases where the number of family_members is less than
one thousand and the region is "Pacific", assigning to fam_lt_1k_pac. View the
printed result.

# Filter for rows where family_members is less than 1000
# and region is Pacific
fam_lt_1k_pac = homelessness[(homelessness["family_members"] < 1000) & (homelessness["region"] == "Pacific")]

# See the result


print(fam_lt_1k_pac)
Subsetting rows by categorical variables
Subsetting data based on a categorical variable often involves using the "or" operator
(|) to select rows from multiple categories. This can get tedious when you want all
states in one of three different regions, for example. Instead, use the .isin() method,
which will allow you to tackle this problem by writing one condition instead of three
separate ones.
colors = ["brown", "black", "tan"]
condition = dogs["color"].isin(colors)
dogs[condition]
homelessness is available and pandas is loaded as pd.
Instructions 1/2
 Filter homelessness for cases where the USA census region is "South Atlantic" or
it is "Mid-Atlantic", assigning to south_mid_atlantic. View the printed result.

Hint
The solution takes the form df[(df["col"] == "value_1") | (df["col"] == "value_2")].

# Subset for rows in South Atlantic or Mid-Atlantic regions
south_mid_atlantic = homelessness[(homelessness["region"] == "South Atlantic") | (homelessness["region"] == "Mid-Atlantic")]

# See the result
print(south_mid_atlantic)

 2

Filter homelessness for cases where the USA census state is in the list of Mojave
states, canu, assigning to mojave_homelessness. View the printed result.
Hint
The solution takes the form df[df["col"].isin(["value_1", "value_2"])].

# The Mojave Desert states


canu = ["California", "Arizona", "Nevada", "Utah"]

# Filter for rows in the Mojave Desert states
mojave_homelessness = homelessness[homelessness["state"].isin(canu)]

# See the result


print(mojave_homelessness)

New Columns
1. New columns

In the last lesson, you saw how to subset and sort a DataFrame to extract interesting
bits. However, often when you first receive a DataFrame, the contents aren't exactly
what you want. You may have to add new columns derived from existing columns.

2. Adding a new column

Creating and adding new columns can go by many names, including mutating a
DataFrame, transforming a DataFrame, and feature engineering. Let's say we want to
add a new column to our DataFrame that has each dog's height in meters instead of
centimeters. On the left-hand side of the equals, we use square brackets with the name
of the new column we want to create. On the right-hand side, we have the calculation.
Notice that both the existing column and the new column we just created are in the
DataFrame.

3. Doggy mass index

Let's see what the results are if we calculate the body mass index, or BMI, of these
dogs. BMI is usually calculated by taking a person's weight in kilograms and dividing it
by their height in meters, squared. Instead of doing this with people, we'll try it out with
dogs. Again, the new column is on the left-hand side of the equals, but this time, our
calculation involves two columns.
4. Multiple manipulations

The real power of pandas comes in when you combine all the skills you've learned so
far. Let's figure out the names of skinny, tall dogs. First, to define the skinny dogs, we
take the subset of the dogs who have a BMI of under 100. Next, we sort the result in
descending order of height to get the tallest skinny dogs at the top. Finally, we keep
only the columns we're interested in. Here, you can see that Max is the tallest dog with
a BMI of under 100.
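
As a rough sketch of those steps (the dogs data here is made up; only the column names follow the transcript's examples):

import pandas as pd

dogs = pd.DataFrame({
    "name": ["Bella", "Max", "Stella"],
    "weight_kg": [24, 29, 2],
    "height_cm": [56, 59, 18],
})

# New column: height in meters, then a doggy BMI
dogs["height_m"] = dogs["height_cm"] / 100
dogs["bmi"] = dogs["weight_kg"] / dogs["height_m"] ** 2

# Skinny dogs (BMI under 100), tallest first, selected columns only
skinny_tall = dogs[dogs["bmi"] < 100].sort_values("height_cm", ascending=False)
print(skinny_tall[["name", "height_cm", "bmi"]])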

Adding new columns


You aren't stuck with just the data you are given. Instead, you can add new columns to
a DataFrame. This has many names, such as transforming, mutating, and feature
engineering.

You can create new columns from scratch, but it is also common to derive them from
other columns, for example, by adding columns together or by changing their units.

homelessness is available and pandas is loaded as pd.


Instructions

 Add a new column to homelessness, named total, containing the sum of the individuals and family_members columns.
 Add another column to homelessness, named p_individuals, containing the proportion of homeless people in each state who are individuals.

# Add total col as sum of individuals and family_members
homelessness["total"] = homelessness["individuals"] + homelessness["family_members"]

# Add p_individuals col as proportion of total that are individuals
homelessness["p_individuals"] = homelessness["individuals"] / homelessness["total"]

# See the result


print(homelessness)
Combo-attack!
You've seen the four most common types of data manipulation: sorting rows, subsetting
columns, subsetting rows, and adding new columns. In a real-life data analysis, you can
mix and match these four manipulations to answer a multitude of questions.

In this exercise, you'll answer the question, "Which state has the highest number of
homeless individuals per 10,000 people in the state?" Combine your new pandas skills
to find out.
Instructions

 Add a column to homelessness, indiv_per_10k, containing the number of homeless individuals per ten thousand people in each state.
 Subset rows where indiv_per_10k is higher than 20, assigning to high_homelessness.
 Sort high_homelessness by descending indiv_per_10k, assigning to high_homelessness_srt.
 Select only the state and indiv_per_10k columns of high_homelessness_srt and save as result. Look at the result.

Hint

 To add a column, use syntax like df["new_col"] = df["col_a"] / df["col_b"].


 To filter rows, use syntax like df[df["col"] > n].
 To sort rows, use syntax like df.sort_values("col", ascending=False).
 To select columns, use syntax like df[["col_a", "col_b"]].

# Create indiv_per_10k col as homeless individuals per 10k state pop
homelessness["indiv_per_10k"] = 10000 * homelessness["individuals"] / homelessness["state_pop"]

# Subset rows for indiv_per_10k greater than 20
high_homelessness = homelessness[homelessness["indiv_per_10k"] > 20]

# Sort high_homelessness by descending indiv_per_10k
high_homelessness_srt = high_homelessness.sort_values("indiv_per_10k", ascending=False)

# From high_homelessness_srt, select the state and indiv_per_10k cols
result = high_homelessness_srt[["state", "indiv_per_10k"]]

# See the result
print(result)

Aggregating DataFrames
Summary statistics
1. Summary statistics

Hi, I'm Maggie, and I'll be the other instructor for this course. In the first chapter, you
learned about DataFrames, how to sort and subset them, and how to add new columns
to them. In this chapter, we'll talk about aggregating data, starting with summary
statistics. Summary statistics, as follows from their name, are numbers that summarize
and tell you about your dataset.

2. Summarizing numerical data

One of the most common summary statistics for numeric data is the mean, which is one
way of telling you where the "center" of your data is. You can calculate the mean of a
column by selecting the column with square brackets and calling dot-mean. There are
lots of other summary statistics that you can compute on columns, like median and
mode, minimum and maximum, and variance and standard deviation. You can also take
sums and calculate quantiles.

3. Summarizing dates

You can also get summary statistics for date columns. For example, we can find the
oldest dog's date of birth by taking the minimum of the date of birth column. Similarly,
we can take the maximum to see that the youngest dog was born in 2018.

4. The .agg() method

The aggregate, or agg, method allows you to compute custom summary statistics. Here,
we create a function called pct30 that computes the thirtieth percentile of a DataFrame
column. Don't worry if this code doesn't make sense to you -- just know that the function
takes in a column and spits out the column's thirtieth percentile. Now we can subset the
weight column and call dot-agg, passing in the name of our function, pct30. It gives us
the thirtieth percentile of the dogs' weights.
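
A minimal sketch of that pattern, assuming pct30 is implemented with quantile (the weights here are made up):

import pandas as pd

dogs = pd.DataFrame({"weight_kg": [2, 24, 24, 24, 29, 74]})

# A custom function: the 30th percentile of a column
def pct30(column):
    return column.quantile(0.3)

# Pass the function itself (not a call) to .agg()
print(dogs["weight_kg"].agg(pct30))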

5. Summaries on multiple columns

agg can also be used on more than one column. By selecting the weight and height
columns before calling agg, we get the thirtieth percentile for both columns.

6. Multiple summaries

We can also use agg to get multiple summary statistics at once. Here's another function
that computes the fortieth percentile called pct40. We can pass a list of functions into
agg, in this case, pct30 and pct40, which will return the thirtieth and fortieth percentiles
of the dogs' weights.

7. Cumulative sum

pandas also has methods for computing cumulative statistics, for example, the
cumulative sum. Calling cumsum on a column returns not just one number, but a
number for each row of the DataFrame. The first number returned, or the number in the
zeroth index, is the first dog's weight. The next number is the sum of the first and
second dogs' weights. The third number is the sum of the first, second, and third dogs'
weights, and so on. The last number is the sum of all the dogs' weights.
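
For example, a short sketch with made-up weights:

import pandas as pd

weights = pd.Series([2, 24, 24, 29, 74])

print(weights.cumsum())  # running total: 2, 26, 50, 79, 153
print(weights.cummax())  # largest value seen so far: 2, 24, 24, 29, 74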

8. Cumulative statistics

pandas also has methods for other cumulative statistics, such as the cumulative
maximum, cumulative minimum, and the cumulative product. These all return an entire
column of a DataFrame, rather than a single number.

9. Walmart

In this chapter, you'll be working with data on Walmart stores, which is a chain of
department stores in the US. The dataset contains weekly sales in US dollars in various
stores. Each store has an ID number and a specific store type. The sales are also
separated by department ID. Along with weekly sales, there is information about
whether it was a holiday week or not, the average temperature during the week in that
location, the average fuel price in dollars per liter that week, and the national
unemployment rate that week.

Mean and median


Summary statistics are exactly what they sound like - they summarize many numbers in
one statistic. For example, mean, median, minimum, maximum, and standard deviation
are summary statistics. Calculating summary statistics allows you to get a better sense
of your data, even if there's a lot of it.

sales is available and pandas is loaded as pd.


Instructions

 Explore your new DataFrame first by printing the first few rows of the sales DataFrame.
 Print information about the columns in sales.
 Print the mean of the weekly_sales column.
 Print the median of the weekly_sales column.
# Print the head of the sales DataFrame
print(sales.head())

# Print the info about the sales DataFrame


print(sales.info())

# Print the mean of weekly_sales


print(sales["weekly_sales"].mean())

# Print the median of weekly_sales


print(sales["weekly_sales"].median())

Summarizing dates
Summary statistics can also be calculated on date columns that have values with the
data type datetime64. Some summary statistics — like mean — don't make a ton of
sense on dates, but others are super helpful, for example, minimum and maximum,
which allow you to see what time range your data covers.
sales is available and pandas is loaded as pd.
Instructions

 Print the maximum of the date column.


 Print the minimum of the date column.

# Print the maximum of the date column


print(sales["date"].max())

# Print the minimum of the date column


print(sales["date"].min())

Efficient summaries
While pandas and NumPy have tons of functions, sometimes, you may need a different
function to summarize your data.

The .agg() method allows you to apply your own custom functions to a DataFrame, as
well as apply functions to more than one column of a DataFrame at once, making your
aggregations super-efficient. For example,
df['column'].agg(function)
In the custom function for this exercise, "IQR" is short for inter-quartile range, which is
the 75th percentile minus the 25th percentile. It's an alternative to standard deviation
that is helpful if your data contains outliers.

sales is available and pandas is loaded as pd.


Instructions 1/3

 1

o Use the custom iqr function defined for you along with .agg() to print the
IQR of the temperature_c column of sales.
# A custom IQR function
def iqr(column):
    return column.quantile(0.75) - column.quantile(0.25)

# Print IQR of the temperature_c column
print(sales["temperature_c"].agg(iqr))

 2

o Update the column selection to use the custom iqr function with .agg() to
print the IQR of temperature_c, fuel_price_usd_per_l,
and unemployment, in that order.

# A custom IQR function
def iqr(column):
    return column.quantile(0.75) - column.quantile(0.25)

# Update to print IQR of temperature_c, fuel_price_usd_per_l, & unemployment
print(sales[["temperature_c", "fuel_price_usd_per_l", "unemployment"]].agg(iqr))

 3

o Update the aggregation functions called by .agg():


include iqr and np.median in that order.
o Remember that the summary functions should be passed to .agg() as a
list (meaning they are enclosed in brackets)
# Import NumPy and create custom IQR function
import numpy as np

def iqr(column):
    return column.quantile(0.75) - column.quantile(0.25)

# Update to print IQR and median of temperature_c, fuel_price_usd_per_l, & unemployment
print(sales[["temperature_c", "fuel_price_usd_per_l", "unemployment"]].agg([iqr, np.median]))

Cumulative statistics
Cumulative statistics can also be helpful in tracking summary statistics over time. In this
exercise, you'll calculate the cumulative sum and cumulative max of a department's
weekly sales, which will allow you to identify what the total sales were so far as well as
what the highest weekly sales were so far.

A DataFrame called sales_1_1 has been created for you, which contains the sales data
for department 1 of store 1. pandas is loaded as pd.
Instructions

 Sort the rows of sales_1_1 by the date column in ascending order.


 Get the cumulative sum of weekly_sales and add it as a new column
of sales_1_1 called cum_weekly_sales.
 Get the cumulative maximum of weekly_sales, and add it as a column
called cum_max_sales.
 Print the date, weekly_sales, cum_weekly_sales, and cum_max_sales columns.

# Sort sales_1_1 by date
sales_1_1 = sales_1_1.sort_values("date", ascending=True)

# Get the cumulative sum of weekly_sales, add as cum_weekly_sales col
sales_1_1["cum_weekly_sales"] = sales_1_1["weekly_sales"].cumsum()

# Get the cumulative max of weekly_sales, add as cum_max_sales col
sales_1_1["cum_max_sales"] = sales_1_1["weekly_sales"].cummax()

# See the columns you calculated
print(sales_1_1[["date", "weekly_sales", "cum_weekly_sales", "cum_max_sales"]])

Counting
1. Counting

So far, in this chapter, you've learned how to summarize numeric variables. In this
video, you'll learn how to summarize categorical data using counting.

2. Avoiding double counting

Counting dogs is no easy task when they're running around the park. It's hard to keep
track of who you have and haven't counted!

3. Vet visits

Here's a DataFrame that contains vet visits. The vet's office wants to know how many
dogs of each breed have visited their office. However, some dogs have been to the vet
more than once, like Max and Stella, so we can't just count the number of each breed in
the breed column.

4. Dropping duplicate names

Let's try to fix this by removing rows that contain a dog name already listed earlier in the
dataset, or in other words; we'll extract a dog with each name from the dataset once.
We can do this using the drop_duplicates method. It takes an argument, subset, which
is the column we want to find our duplicates based on - in this case, we want all the
unique names. Now we have a list of dogs where each one appears once. We have
Max the Chow Chow, but where did Max the Labrador go? Because we have two
different dogs with the same name, we'll need to consider more than just name when
dropping duplicates.

5. Dropping duplicate pairs

Since Max and Max are different breeds, we can drop the rows with pairs of name and
breed listed earlier in the dataset. To base our duplicate dropping on multiple columns,
we can pass a list of column names to the subset argument, in this case, name and
breed. Now both Maxes have been included, and we can start counting.

6. Easy as 1, 2, 3

To count the dogs of each breed, we'll subset the breed column and use the
value_counts method. We can also use the sort argument to get the breeds with the
biggest counts on top.

7. Proportions

The normalize argument can be used to turn the counts into proportions of the total.
25% of the dogs that go to this vet are Labradors.
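
A hedged sketch of the vet-visit counting described above (the rows here are made up; only the idea follows the transcript):

import pandas as pd

vet_visits = pd.DataFrame({
    "name": ["Max", "Max", "Stella", "Max", "Lucy"],
    "breed": ["Chow Chow", "Labrador", "Chihuahua", "Chow Chow", "Labrador"],
})

# Drop duplicate name/breed pairs so each dog is counted once
unique_dogs = vet_visits.drop_duplicates(subset=["name", "breed"])

# Count dogs per breed, biggest counts first, then as proportions
print(unique_dogs["breed"].value_counts(sort=True))
print(unique_dogs["breed"].value_counts(sort=True, normalize=True))
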
Dropping duplicates
Removing duplicates is an essential skill to get accurate counts because often, you
don't want to count the same thing multiple times. In this exercise, you'll create some
new DataFrames using unique values from sales.
sales is available and pandas is imported as pd.
Instructions

 Remove rows of sales with duplicate pairs of store and type and save
as store_types and print the head.
 Remove rows of sales with duplicate pairs of store and department and save
as store_depts and print the head.
 Subset the rows that are holiday weeks using the is_holiday column, and drop the
duplicate dates, saving as holiday_dates.
 Select the date column of holiday_dates, and print.

# Drop duplicate store/type combinations
store_types = sales.drop_duplicates(subset=["store", "type"])
print(store_types.head())

# Drop duplicate store/department combinations
store_depts = sales.drop_duplicates(subset=["store", "department"])
print(store_depts.head())

# Subset the rows where is_holiday is True and drop duplicate dates
holiday_dates = sales[sales["is_holiday"] == True].drop_duplicates("date")

# Print date col of holiday_dates
print(holiday_dates["date"])

Counting categorical variables


Counting is a great way to get an overview of your data and to spot curiosities that you
might not notice otherwise. In this exercise, you'll count the number of each type of
store and the number of each department number using the DataFrames you created in
the previous exercise:

# Drop duplicate store/type combinations


store_types = sales.drop_duplicates(subset=["store", "type"])

# Drop duplicate store/department combinations


store_depts = sales.drop_duplicates(subset=["store", "department"])
The store_types and store_depts DataFrames you created in the last exercise are
available, and pandas is imported as pd.
Instructions

 Count the number of stores of each store type in store_types.


 Count the proportion of stores of each store type in store_types.
 Count the number of different departments in store_depts, sorting the counts in
descending order.
 Count the proportion of different departments in store_depts, sorting the proportions
in descending order.

# Count the number of stores of each type


store_counts = store_types["type"].value_counts()
print(store_counts)

# Get the proportion of stores of each type


store_props = store_types["type"].value_counts(normalize=True)
print(store_props)

# Count the number of each department number and sort
dept_counts_sorted = store_depts["department"].value_counts(sort=True)
print(dept_counts_sorted)

# Get the proportion of departments of each number and sort
dept_props_sorted = store_depts["department"].value_counts(sort=True, normalize=True)
print(dept_props_sorted)
Grouped Summary Statistics
1. Grouped summary statistics

So far, you've been calculating summary statistics for all rows of a dataset, but
summary statistics can be useful to compare different groups.

2. Summaries by group

While computing summary statistics of entire columns may be useful, you can gain
many insights from summaries of individual groups. For example, does one color of dog
weigh more than another on average? Are female dogs taller than males? You can
already answer these questions with what you've learned so far! We can subset the
dogs into groups based on their color, and take the mean of each. But that's a lot of
work, and the duplicated code means you can easily introduce copy and paste bugs.

3. Grouped summaries

That's where the groupby method comes in. We can group by the color variable, select
the weight column, and take the mean. This will give us the mean weight for each dog
color. This was just one line of code compared to the five we had to write before to get
the same results.

4. Multiple grouped summaries

Just like with ungrouped summary statistics, we can use the agg method to get multiple
statistics. Here, we pass a list of functions into agg after grouping by color. This gives us
the minimum, maximum, and sum of the different colored dogs' weights.

5. Grouping by multiple variables

You can also group by multiple columns and calculate summary statistics. Here, we
group by color and breed, select the weight column and take the mean. This gives us
the mean weight of each breed of each color.

6. Many groups, many summaries

You can also group by multiple columns and aggregate by multiple columns.
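
A minimal sketch of these grouped summaries (column names follow the dogs examples; the values are made up):

import pandas as pd

dogs = pd.DataFrame({
    "color": ["Brown", "Black", "Brown", "Black", "Gray"],
    "breed": ["Labrador", "Poodle", "Labrador", "Labrador", "Chihuahua"],
    "weight_kg": [24, 23, 29, 30, 2],
})

# Mean weight for each color
print(dogs.groupby("color")["weight_kg"].mean())

# Several statistics at once with .agg()
print(dogs.groupby("color")["weight_kg"].agg(["min", "max", "sum"]))

# Group by two variables
print(dogs.groupby(["color", "breed"])["weight_kg"].mean())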

What percent of sales occurred at each store type?


While .groupby() is useful, you can calculate grouped summary statistics without it.
Walmart distinguishes three types of stores: "supercenters," "discount stores," and
"neighborhood markets," encoded in this dataset as type "A," "B," and "C." In this
exercise, you'll calculate the total sales made at each store type, without
using .groupby(). You can then use these numbers to see what proportion of Walmart's
total sales were made at each type.
sales is available and pandas is imported as pd.
Instructions

 Calculate the total weekly_sales over the whole dataset.


 Subset for type "A" stores, and calculate their total weekly sales.
 Do the same for type "B" and type "C" stores.
 Combine the A/B/C results into a list, and divide by sales_all to get the proportion of
sales by type.

# Calc total weekly sales


sales_all = sales["weekly_sales"].sum()

# Subset for type A stores, calc total weekly sales


sales_A = sales[sales["type"] == "A"]["weekly_sales"].sum()

# Subset for type B stores, calc total weekly sales


sales_B = sales[sales["type"] == "B"]["weekly_sales"].sum()

# Subset for type C stores, calc total weekly sales


sales_C = sales[sales["type"] == "C"]["weekly_sales"].sum()

# Get proportion for each type


sales_propn_by_type = [sales_A, sales_B, sales_C] / sales_all
print(sales_propn_by_type)

Calculations with .groupby()


The .groupby() method makes life much easier. In this exercise, you'll perform the
same calculations as last time, except you'll use the .groupby() method. You'll also
perform calculations on data grouped by two variables to see if sales differ by store type
depending on if it's a holiday week or not.
sales is available and pandas is loaded as pd.
Instructions 1/2

 1

 2

 Group sales by "type", take the sum of "weekly_sales", and store


as sales_by_type.
 Calculate the proportion of sales at each store type by dividing by the sum
of sales_by_type. Assign to sales_propn_by_type.

# Group by type; calc total weekly sales


sales_by_type = sales.groupby("type")["weekly_sales"].sum()

# Get proportion for each type


sales_propn_by_type = sales_by_type / sum(sales_by_type)
print(sales_propn_by_type)

 Group sales by "type" and "is_holiday", take the sum of weekly_sales, and
store as sales_by_type_is_holiday.

# From previous step


sales_by_type = sales.groupby("type")["weekly_sales"].sum()

# Group by type and is_holiday; calc total weekly sales
sales_by_type_is_holiday = sales.groupby(["type", "is_holiday"])["weekly_sales"].sum()
print(sales_by_type_is_holiday)

Multiple grouped summaries


Earlier in this chapter, you saw that the .agg() method is useful to compute multiple
statistics on multiple variables. It also works with grouped data. NumPy, which is
imported as np, has many different summary statistics functions,
including: np.min, np.max, np.mean, and np.median.
sales is available and pandas is imported as pd.
Instructions

 Import numpy with the alias np.


 Get the min, max, mean, and median of weekly_sales for each store type
using .groupby() and .agg(). Store this as sales_stats. Make sure to
use numpy functions!
 Get the min, max, mean, and median of unemployment and fuel_price_usd_per_l for
each store type. Store this as unemp_fuel_stats.
# Import numpy with the alias np
import numpy as np

# For each store type, aggregate weekly_sales: get min, max, mean, and median
sales_stats = sales.groupby("type")["weekly_sales"].agg([min, max, np.mean, np.median])

# Print sales_stats
print(sales_stats)

# For each store type, aggregate unemployment and fuel_price_usd_per_l: get min, max, mean, and median
unemp_fuel_stats = sales.groupby("type")[["unemployment", "fuel_price_usd_per_l"]].agg([min, max, np.mean, np.median])

# Print unemp_fuel_stats
print(unemp_fuel_stats)

Pivot Tables
1. Pivot tables

Pivot tables are another way of calculating grouped summary statistics. If you've ever
used a spreadsheet, chances are you've used a pivot table. Let's see how to create
pivot tables in pandas.

2. Group by to pivot table

In the last lesson, we grouped the dogs by color and calculated their mean weights. We
can do the same thing using the pivot_table method. The "values" argument is the
column that you want to summarize, and the index column is the column that you want
to group by. By default, pivot_table takes the mean value for each group.

3. Different statistics

If we want a different summary statistic, we can use the aggfunc argument and pass it a
function. Here, we take the median for each dog color using NumPy's median function.

4. Multiple statistics

To get multiple summary statistics at a time, we can pass a list of functions to the
aggfunc argument. Here, we get the mean and median for each dog color.
5. Pivot on two variables

You also previously computed the mean weight grouped by two variables: color and
breed. We can also do this using the pivot_table method. To group by two variables, we
can pass a second variable name into the columns argument. While the result looks a
little different than what we had before, it contains the same numbers. There are NaNs,
or missing values, because there are no black Chihuahuas or gray Labradors in our
dataset, for example.

dogs.pivot_table(values="weight_kg", index="color", columns="breed", fill_value=0, margins=True)

6. Filling missing values in pivot tables

Instead of having lots of missing values in our pivot table, we can have them filled in
using the fill_value argument. Here, all of the NaNs get filled in with zeros.

7. Summing with pivot tables

If we set the margins argument to True, the last row and last column of the pivot table
contain the mean of all the values in the column or row, not including the missing values
that were filled in with 0s. For example, in the last row of the Labrador column, we can
see that the mean weight of the Labradors is 26 kilograms. In the last column of the
Brown row, the mean weight of the Brown dogs is 24 kilograms. The value in the bottom
right, in the last row and last column, is the mean weight of all the dogs in the dataset.
Using margins equals True allows us to see a summary statistic for multiple levels of the
dataset: the entire dataset, grouped by one variable, by another variable, and by two
variables.

Pivoting on one variable


Pivot tables are the standard way of aggregating data in spreadsheets.

In pandas, pivot tables are essentially another way of performing grouped calculations.
That is, the .pivot_table() method is an alternative to .groupby().
In this exercise, you'll perform calculations using .pivot_table() to replicate the
calculations you performed in the last lesson using .groupby().
sales is available and pandas is imported as pd.
Instructions 1/3

 1

o Get the mean weekly_sales by type using .pivot_table() and store as mean_sales_by_type.

# Pivot for mean weekly_sales for each store type
mean_sales_by_type = sales.pivot_table(values="weekly_sales", index="type")

# Print mean_sales_by_type
print(mean_sales_by_type)

 Get the mean and median (using NumPy functions) of weekly_sales by type using .pivot_table() and store as mean_med_sales_by_type.

# Import NumPy as np
import numpy as np

# Pivot for mean and median weekly_sales for each store type
mean_med_sales_by_type = sales.pivot_table(values="weekly_sales", index="type", aggfunc=[np.mean, np.median])

# Print mean_med_sales_by_type
print(mean_med_sales_by_type)

 Get the mean of weekly_sales by type and is_holiday using .pivot_table() and store as mean_sales_by_type_holiday.

# Pivot for mean weekly_sales by store type and holiday
mean_sales_by_type_holiday = sales.pivot_table(values="weekly_sales", index="type", columns="is_holiday")

# Print mean_sales_by_type_holiday
print(mean_sales_by_type_holiday)

Fill in missing values and sum values with pivot tables


The .pivot_table() method has several useful arguments,
including fill_value and margins.
 fill_value replaces missing values with a real value (known as imputation). What to
replace missing values with is a topic big enough to have its own course (Dealing with
Missing Data in Python), but the simplest thing to do is to substitute a dummy value.
 margins is a shortcut for when you pivoted by two variables, but also wanted to pivot by
each of those variables separately: it gives the row and column totals of the pivot table
contents.

In this exercise, you'll practice using these arguments to up your pivot table skills, which
will help you crunch numbers more efficiently!

sales is available and pandas is imported as pd.


Instructions 1/2

 1

o Print the mean weekly_sales by department and type, filling in any missing values with 0.

# Print mean weekly_sales by department and type; fill missing values with 0
print(sales.pivot_table(values="weekly_sales", index="department", columns="type", fill_value=0))

 Print the mean weekly_sales by department and type, filling in any missing
values with 0 and summing all rows and columns.

# Print the mean weekly_sales by department and type; fill missing values with 0s; sum all rows and cols
print(sales.pivot_table(values="weekly_sales", index="department", columns="type", fill_value=0, margins=True))

Slicing and Indexing DataFrames


Explicit Indexes
1. Explicit indexes

In chapter one, you saw that DataFrames are composed of three parts: a NumPy array
for the data, and two indexes to store the row and column details.
2. The dog dataset, revisited

Here's the dog dataset again.

3. .columns and .index

Recall that dot-columns contains an Index object of column names, and dot-index
contains an Index object of row numbers.

4. Setting a column as the index

You can move a column from the body of the DataFrame to the index. This is called
"setting an index," and it uses the set_index method. Notice that the output has
changed slightly; in particular, a quick visual clue that name is now in the index is that
the index values are left-aligned rather than right-aligned.

5. Removing an index

To undo what you just did, you can reset the index - that is, you remove it. This is done
via .reset_index().

6. Dropping an index

reset_index has a drop argument that allows you to discard an index. Here, setting drop
to True entirely removes the dog names.

dogs.reset_index(drop=True)

7. Indexes make subsetting simpler

You may be wondering why you should bother with indexes. The answer is that it
makes subsetting code cleaner. Consider this example of subsetting for the rows where
the dog is called Bella or Stella. It's a fairly tricky line of code for such a simple task.
Now, look at the equivalent when the names are in the index. DataFrames have a
subsetting method called "loc," which filters on index values. Here you simply pass the
dog names to loc as a list. Much easier!
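
A small sketch of the comparison (made-up dogs data; column names follow the transcript):

import pandas as pd

dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy", "Stella"],
    "breed": ["Labrador", "Poodle", "Chow Chow", "Chihuahua"],
})

# Without an index: filter on a column
print(dogs[dogs["name"].isin(["Bella", "Stella"])])

# With name as the index: pass the labels straight to .loc
dogs_ind = dogs.set_index("name")
print(dogs_ind.loc[["Bella", "Stella"]])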

8. Index values don't need to be unique

The values in the index don't need to be unique. Here, there are two Labradors in the
index.

9. Subsetting on duplicated index values

Now, if you subset on "Labrador" using loc, all the Labrador data is returned.

10. Multi-level indexes a.k.a. hierarchical indexes

You can include multiple columns in the index by passing a list of column names to
set_index. Here, breed and color are included. These are called multi-level indexes, or
hierarchical indexes: the terms are synonymous. There is an implication here that the
inner level of index, in this case, color, is nested inside the outer level, breed.
11. Subset the outer level with a list

To take a subset of rows at the outer level index, you pass a list of index values to loc.
Here, the list contains Labrador and Chihuahua, and the resulting subset contains all
dogs from both breeds.

12. Subset inner levels with a list of tuples

To subset on inner levels, you need to pass a list of tuples. Here, the first tuple specifies
Labrador at the outer level and Brown at the inner level. The resulting rows have to
match all conditions from a tuple. For example, the black Labrador wasn't returned
because the brown condition wasn't matched.

13. Sorting by index values

In chapter 1, you saw how to sort the rows of a DataFrame using sort_values. You can
also sort by index values using sort_index. By default, it sorts all index levels from outer
to inner, in ascending order.

14. Controlling sort_index

You can control the sorting by passing lists to the level and ascending arguments.

15. Now you have two problems

Indexes are controversial. Although they simplify subsetting code, there are some
downsides. Index values are just data. Storing data in multiple forms makes it harder to
think about. There is a concept called "tidy data," where data is stored in tabular form -
like a DataFrame. Each row contains a single observation, and each variable is stored
in its own column. Indexes violate the last rule since index values don't get their own
column. In pandas, the syntax for working with indexes is different from the syntax for
working with columns. By using two syntaxes, your code is more complicated, which
can result in more bugs. If you decide you don't want to use indexes, that's perfectly
reasonable. However, it's useful to know how they work for cases when you need to
read other people's code.

16. Temperature dataset

In this chapter, you'll work with a monthly time series of air temperatures in cities around
the world.

Setting and removing indexes


pandas allows you to designate columns as an index. This enables cleaner code when
taking subsets (as well as providing more efficient lookup under some circumstances).

In this chapter, you'll be exploring temperatures, a DataFrame of average temperatures
in cities around the world. pandas is loaded as pd.
Instructions

 Look at temperatures.
 Set the index of temperatures to "city", assigning to temperatures_ind.
 Look at temperatures_ind. How is it different from temperatures?
 Reset the index of temperatures_ind, keeping its contents.
 Reset the index of temperatures_ind, dropping its contents.

# Look at temperatures
print(temperatures)

# Set the index of temperatures to city


temperatures_ind = temperatures.set_index(["city"])

# Look at temperatures_ind
print(temperatures_ind)

# Reset the temperatures_ind index, keeping its contents


print(temperatures_ind.reset_index())

# Reset the temperatures_ind index, dropping its contents


print(temperatures_ind.reset_index(drop=True))

Subsetting with .loc[]


The killer feature for indexes is .loc[]: a subsetting method that accepts index values.
When you pass it a single argument, it will take a subset of rows.
The code for subsetting using .loc[] can be easier to read than standard square
bracket subsetting, which can make your code less burdensome to maintain.
pandas is loaded as pd. temperatures and temperatures_ind are available; the latter is
indexed by city.
Instructions

 Create a list called cities that contains "Moscow" and "Saint Petersburg".
 Use [] subsetting to filter temperatures for rows where the city column takes a value
in the cities list.
 Use .loc[] subsetting to filter temperatures_ind for rows where the city is in
the cities list.



# Make a list of cities to subset on
cities = ["Moscow", "Saint Petersburg"]

# Subset temperatures using square brackets


print(temperatures[temperatures["city"].isin(cities)])

# Subset temperatures_ind using .loc[]


print(temperatures_ind.loc[cities])

Setting multi-level indexes


Indexes can also be made out of multiple columns, forming a multi-level
index (sometimes called a hierarchical index). There is a trade-off to using these.

The benefit is that multi-level indexes make it more natural to reason about nested
categorical variables. For example, in a clinical trial, you might have control and
treatment groups. Then each test subject belongs to one or another group, and we can
say that a test subject is nested inside the treatment group. Similarly, in the temperature
dataset, the city is located in the country, so we can say a city is nested inside the
country.

The main downside is that the code for manipulating indexes is different from the code
for manipulating columns, so you have to learn two syntaxes and keep track of how
your data is represented.

pandas is loaded as pd. temperatures is available.


Instructions

 Set the index of temperatures to the "country" and "city" columns, and assign this
to temperatures_ind.
 Specify two country/city pairs to keep: "Brazil"/"Rio De
Janeiro" and "Pakistan"/"Lahore", assigning to rows_to_keep.
 Print and subset temperatures_ind for rows_to_keep using .loc[].


# Index temperatures by country & city


temperatures_ind = temperatures.set_index(["country", "city"])

# List of tuples: Brazil, Rio De Janeiro & Pakistan, Lahore
rows_to_keep = [("Brazil", "Rio De Janeiro"), ("Pakistan", "Lahore")]

# Subset for rows to keep


print(temperatures_ind.loc[rows_to_keep])

Sorting by index values


Previously, you changed the order of the rows in a DataFrame by
calling .sort_values(). It's also useful to be able to sort by elements in the index. For
this, you need to use .sort_index().
pandas is loaded as pd. temperatures_ind has a multi-level index of country and city,
and is available.
Instructions

 Sort temperatures_ind by the index values.


 Sort temperatures_ind by the index values at the "city" level.
 Sort temperatures_ind by ascending country then descending city.

# Sort temperatures_ind by index values


print(temperatures_ind.sort_index())

# Sort temperatures_ind by index values at the city level


print(temperatures_ind.sort_index(level="city"))

# Sort temperatures_ind by country then descending city
print(temperatures_ind.sort_index(level=["country", "city"], ascending=[True, False]))

Slicing and subsetting with .loc and .iloc


1. Slicing and subsetting with .loc and .iloc

Slicing is a technique for selecting consecutive elements from objects.

2. Slicing lists

Here are the dog breeds, this time as a list. To slice the list, you pass first and last
positions separated by a colon into square brackets. Remember that Python positions
start from zero, so 2 refers to the third element, Chow Chow. Also remember that the
last position, 5, is not included in the slice, so we finish at Labrador, not Chihuahua. If
you want the slice to start from the beginning of the list, you can omit the zero. Here,
using colon-3 returns the first three elements. Slicing with colon on its own returns the
whole list.

3. Sort the index before you slice

You can also slice DataFrames, but first, you need to sort the index. Here, the dogs
dataset has been given a multi-level index of breed and color; then, the index is sorted
with sort_index.

4. Slicing the outer index level

To slice rows at the outer level of an index, you call loc, passing the first and last values
separated by a colon. The full dataset is shown on the right for comparison. There are
two differences compared to slicing lists. Rather than specifying row numbers, you
specify index values. Secondly, notice that the final value is included. Here, Poodle is
included in the results.

5. Slicing the inner index levels badly

The same technique doesn't work on inner index levels. Here, trying to slice from Tan to
Grey returns an empty DataFrame instead of the six dogs we wanted. It's important to
understand the danger here. pandas doesn't throw an error to let you know that there is
a problem, so be careful when coding.

6. Slicing the inner index levels correctly

The correct approach to slicing at inner index levels is to pass the first and last positions
as tuples. Here, the first element to include is a tuple of Labrador and Brown.

7. Slicing columns

Since DataFrames are two-dimensional objects, you can also slice columns. You do this
by passing two arguments to loc.

The simplest case involves subsetting columns but keeping all rows. To do this, pass a
colon as the first argument to loc. As with slicing lists, a colon by itself means "keep
everything." The second argument takes column names as the first and last positions to
slice on.

8. Slice twice

You can slice on rows and columns at the same time: simply pass the appropriate slice
to each argument. Here, you see the previous two slices being performed in the same
line of code.

9. Dog days

An important use case of slicing is to subset DataFrames by a range of dates. To
demonstrate this, let's set the date_of_birth column as the index and sort by this index.
10. Slicing by dates

You slice dates with the same syntax as other types. The first and last dates are passed
as strings.

11. Slicing by partial dates

One helpful feature is that you can slice by partial dates. Here, the first and last
positions are only specified as 2014 and 2016, with no month or day parts. pandas
interprets this as slicing from the start of 2014 to the end of 2016; that is, all dates in
2014, 2015, and 2016.
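
A minimal sketch of date slicing (the dates here are made up; remember to sort the index before slicing):

import pandas as pd

dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy"],
    "date_of_birth": pd.to_datetime(["2013-07-01", "2014-12-09", "2016-09-16"]),
}).set_index("date_of_birth").sort_index()

# Full dates as strings...
print(dogs.loc["2014-08-25":"2016-09-16"])

# ...or partial dates: all of 2014 through 2016
print(dogs.loc["2014":"2016"])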

12. Subsetting by row/column number

You can also slice DataFrames by row or column number using the iloc method. This
uses a similar syntax to slicing lists, except that there are two arguments: one for rows
and one for columns. Notice that, like list slicing but unlike loc, the final values aren't
included in the slice. In this case, the fifth row and fourth column aren't included.
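
A short sketch of .iloc; like list slicing, the end positions are excluded (the data here is a made-up stand-in):

import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [5, 6, 7, 8], "c": [9, 10, 11, 12]})

print(df.iloc[2:5])        # rows at positions 2 through 4 (here, rows 2 and 3 exist)
print(df.iloc[:, 1:3])     # columns 1 and 2, all rows
print(df.iloc[2:5, 1:4])   # rows and columns at once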

Slicing index values


Slicing lets you select consecutive elements of an object using first:last syntax.
DataFrames can be sliced by index values or by row/column number; we'll start with the
first case. This involves slicing inside the .loc[] method.

Compared to slicing lists, there are a few things to remember.

 You can only slice an index if the index is sorted (using .sort_index()).
 To slice at the outer level, first and last can be strings.
 To slice at inner levels, first and last should be tuples.
 If you pass a single slice to .loc[], it will slice the rows.

pandas is loaded as pd. temperatures_ind has country and city in the index, and is
available.
Instructions

 Sort the index of temperatures_ind.


 Use slicing with .loc[] to get these subsets:
o from Pakistan to Russia.
o from Lahore to Moscow. (This will return nonsense.)

o from Pakistan, Lahore to Russia, Moscow.


# Sort the index of temperatures_ind
temperatures_srt = temperatures_ind.sort_index()

# Subset rows from Pakistan to Russia


print(temperatures_srt.loc["Pakistan":"Russia"])

# Try to subset rows from Lahore to Moscow


print(temperatures_srt.loc["Lahore":"Moscow"])

# Subset rows from Pakistan, Lahore to Russia, Moscow
print(temperatures_srt.loc[("Pakistan", "Lahore"):("Russia", "Moscow")])

Slicing in both directions


You've seen slicing DataFrames by rows and by columns, but since DataFrames are
two-dimensional objects, it is often natural to slice both dimensions at once. That is, by
passing two arguments to .loc[], you can subset by rows and columns in one go.
pandas is loaded as pd. temperatures_srt is indexed by country and city, has a sorted
index, and is available.
Instructions

 Use .loc[] slicing to subset rows from India, Hyderabad to Iraq, Baghdad.
 Use .loc[] slicing to subset columns from date to avg_temp_c.
 Slice in both directions at once from Hyderabad to Baghdad, and date to avg_temp_c.

# Subset rows from India, Hyderabad to Iraq, Baghdad
print(temperatures_srt.loc[("India", "Hyderabad"):("Iraq", "Baghdad")])

# Subset columns from date to avg_temp_c


print(temperatures_srt.loc[:, "date":"avg_temp_c"])

# Subset in both directions at once
print(temperatures_srt.loc[("India", "Hyderabad"):("Iraq", "Baghdad"), "date":"avg_temp_c"])
Slicing time series
Slicing is particularly useful for time series since it's a common thing to want to filter for
data within a date range. Add the date column to the index, then use .loc[] to perform
the subsetting. The important thing to remember is to keep your dates in ISO 8601
format, that is, "yyyy-mm-dd" for year-month-day, "yyyy-mm" for year-month,
and "yyyy" for year.
Recall from Chapter 1 that you can combine multiple Boolean conditions using logical
operators, such as &. To do so in one line of code, you'll need to add
parentheses () around each condition.
pandas is loaded as pd and temperatures, with no index, is available.
Instructions

 Use Boolean conditions, not .isin() or .loc[], and the full date "yyyy-mm-dd", to
subset temperatures for rows in 2010 and 2011 and print the results.
 Set the index of temperatures to the date column and sort it.
 Use .loc[] to subset temperatures_ind for rows in 2010 and 2011.
 Use .loc[] to subset temperatures_ind for rows from Aug 2010 to Feb 2011.
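
A hedged sketch of these steps, using a tiny made-up stand-in for temperatures (in the exercise the DataFrame is provided for you):

import pandas as pd

temperatures = pd.DataFrame({
    "date": pd.to_datetime(["2009-06-01", "2010-09-01", "2011-01-01", "2012-03-01"]),
    "city": ["Abidjan"] * 4,
    "avg_temp_c": [27.0, 26.2, 27.5, 28.1],
})

# Use Boolean conditions to subset for rows in 2010 and 2011
temperatures_bool = temperatures[(temperatures["date"] >= "2010-01-01") &
                                 (temperatures["date"] <= "2011-12-31")]
print(temperatures_bool)

# Set date as the index and sort it
temperatures_ind = temperatures.set_index("date").sort_index()

# Use .loc[] to subset rows in 2010 and 2011
print(temperatures_ind.loc["2010":"2011"])

# Use .loc[] to subset rows from Aug 2010 to Feb 2011
print(temperatures_ind.loc["2010-08":"2011-02"])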
