

Data wrangling, grouping and aggregation
Contents
Cleaning data while reading
Renaming columns
Using functions with pandas
String slicing
Grouping and aggregating data
Case study: Detecting warm months
Automating the analysis
Footnotes

Next, we will continue working with weather data, but expand our analysis to cover longer
periods of data from Finland. In the following, you will learn various useful techniques in pandas
to manipulate, group and aggregate the data in different ways that are useful when extracting
insights from your data. In the end, you will learn how to create an automated data analysis
workflow that can be repeated with multiple input files that share a similar structure. As a case
study, we will investigate whether January 2020 was also the warmest January on record in
Finland, as it was the warmest one on record globally [1].

Cleaning data while reading


In this section we are using weather observation data from Finland that was downloaded from
NOAA (see the Datasets chapter for further details). The input data is separated by a varying
number of spaces (i.e., it is fixed width). The first lines and columns of the data look like the following:


USAF WBAN YR--MODAHRMN DIR SPD GUS CLG SKC L M H VSB MW MW MW MW AW ...
029440 99999 190601010600 090 7 *** *** OVC * * * 0.0 ** ** ** ** ** ...
029440 99999 190601011300 *** 0 *** *** OVC * * * 0.0 ** ** ** ** ** ...
029440 99999 190601012000 *** 0 *** *** OVC * * * 0.0 ** ** ** ** ** ...
029440 99999 190601020600 *** 0 *** *** CLR * * * 0.0 ** ** ** ** ** ...

By looking at the data, we can notice a few things that we need to consider when reading the
data:

1. Delimiter: The columns are separated by a varying number of spaces, which requires
some special tricks when reading the data with the pandas read_csv() function.
2. NoData values: NaN values in the NOAA data are coded with a varying number of *
characters; hence, we need to instruct pandas to interpret those as NaNs.
3. Many columns: The input data contains many columns (33 altogether). Many of those do
not contain any meaningful data for our needs, so we should ignore the unnecessary
columns already at this stage.

Handling and cleaning heterogeneous input data (such as our example here) can be done after
reading in the data. However, in many cases it is actually useful to do some cleaning and
preprocessing already when reading the data; in fact, that is often much easier. In our case, we
can read the data with a varying number of spaces between the columns (1) by using the
parameter delim_whitespace=True (alternatively, specifying sep='\s+' would work). For
handling the NoData values (2), we can tell pandas to consider the * characters as NaNs by
using the parameter na_values and specifying a list of characters that should be converted to
NaNs. Hence, in this case we can specify na_values=['*', '**', '***', '****', '*****',
'******'], which will convert the varying number of * characters into NaN values. Finally,
we can limit the number of columns that we read (3) by using the usecols parameter, which
we already used previously. In our case, we are interested in columns that might be useful
for our analysis, including the station name, timestamp, and data about temperatures:
'USAF', 'YR--MODAHRMN', 'TEMP', 'MAX', 'MIN'. Achieving all these things is pretty
straightforward using the read_csv() function:


import pandas as pd

# Define relative path to the file
fp = "data/029820.txt"

# Read data using varying amount of spaces as separator,
# specifying '*' characters as NoData values,
# and selecting only specific columns from the data
data = pd.read_csv(
    fp,
    delim_whitespace=True,
    na_values=["*", "**", "***", "****", "*****", "******"],
    usecols=["USAF", "YR--MODAHRMN", "TEMP", "MAX", "MIN"],
)

Let's now see what the data looks like by printing the first five rows with the head() function:

data.head()

USAF YR--MODAHRMN TEMP MAX MIN

0 29820 190601010600 34.0 NaN NaN

1 29820 190601011300 32.0 NaN NaN

2 29820 190601012000 30.0 NaN NaN

3 29820 190601020600 33.0 NaN NaN

4 29820 190601021300 35.0 NaN NaN

Perfect, looks good. We have skipped a bunch of unnecessary columns and also the asterisk (*)
characters have been correctly converted to NaN values.

Renaming columns
Let’s take a closer look at the column names of our DataFrame:

print(data.columns)

Index(['USAF', 'YR--MODAHRMN', 'TEMP', 'MAX', 'MIN'], dtype='object')


As we see, some of the column names are a bit awkward and difficult to interpret (a description
for the columns is available in the metadata data/3505doc.txt). Luckily, it is easy to alter labels in
a pandas DataFrame using the rename() function. In order to change the column names, we
need to tell pandas how we want to rename the columns using a dictionary that converts the
old names to new ones. As you probably remember from Chapter 1, a dictionary is a specific
data structure in Python for storing key-value pairs. We can define the new column names using
a dictionary where we list "key: value" pairs in the following manner:

USAF : STATION_ID

YR--MODAHRMN : TIME

TEMP : TEMP_F

Hence, the original column name (e.g. YR--MODAHRMN) is the dictionary key, which will be
converted to the new column name TIME (the value). The temperature values in our data file
are again in Fahrenheit, and we will soon convert them to Celsius. Hence, to avoid confusion
between the columns, let's rename the column TEMP to TEMP_F. Also, the station number USAF
is much more intuitive if we call it STATION_ID. Let's create a dictionary for the new column
names:

new_names = {
    "USAF": "STATION_ID",
    "YR--MODAHRMN": "TIME",
    "TEMP": "TEMP_F",
}
new_names

{'USAF': 'STATION_ID', 'YR--MODAHRMN': 'TIME', 'TEMP': 'TEMP_F'}

Our dictionary looks correct, so now we can change the column names by passing that
dictionary using the parameter columns in the rename() function:

data = data.rename(columns=new_names)
data.columns

Index(['STATION_ID', 'TIME', 'TEMP_F', 'MAX', 'MIN'], dtype='object')

Perfect, now our column names are easier to understand and use.


Using functions with pandas


Now it's time to convert those temperatures from Fahrenheit to Celsius. We have done this
many times before, but this time we will learn how to apply our own functions to data in a
pandas DataFrame. We will define a function for the temperature conversion and apply it to
the Fahrenheit value on each row of the DataFrame. The output Celsius values will be stored
in a new column called TEMP_C. But first, it is a good idea to check some basic properties of
our new input data before proceeding with the analysis:

# First rows
data.head(2)

STATION_ID TIME TEMP_F MAX MIN

0 29820 190601010600 34.0 NaN NaN

1 29820 190601011300 32.0 NaN NaN

# Last rows
data.tail(2)

STATION_ID TIME TEMP_F MAX MIN

198332 29820 201910012200 47.0 NaN NaN

198333 29820 201910012300 46.0 NaN NaN

# Data types
data.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 198334 entries, 0 to 198333
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 STATION_ID 198334 non-null int64
1 TIME 198334 non-null int64
2 TEMP_F 197916 non-null float64
3 MAX 29868 non-null float64
4 MIN 29536 non-null float64
dtypes: float64(3), int64(2)
memory usage: 7.6 MB

Nothing suspicious for the first and last rows, but with info() we can see that the number
of observations per column varies if you compare the Non-Null Count information to the
number of entries in the data (N=198334). Only the station number and time seem to have
data on every row; all other columns have some missing values. This is not necessarily a
problem, but good to keep in mind. Let's still look at the descriptive statistics:

# Descriptive stats
data.describe()

STATION_ID TIME TEMP_F MAX MIN

count 198334.0 1.983340e+05 197916.000000 29868.000000 29536.000000

mean 29820.0 1.990974e+11 43.717845 46.405852 40.537446

std 0.0 2.691914e+09 14.306138 14.446547 14.350235

min 29820.0 1.906010e+11 -28.000000 -20.000000 -28.000000

25% 29820.0 1.979011e+11 34.000000 36.000000 31.000000

50% 29820.0 1.997061e+11 43.000000 45.000000 40.000000

75% 29820.0 2.013113e+11 55.000000 59.000000 52.000000

max 29820.0 2.019100e+11 90.000000 102.000000 77.000000

By looking at the TEMP_F values (Fahrenheit temperatures), we can confirm that our
measurements seem more or less valid because the value range of the temperatures makes
sense, i.e. there are no outliers such as extremely high MAX values or low MIN values. It is
always good practice to critically check your data before doing any analysis, as it is possible
that your data includes incorrect values, e.g. due to a sensor malfunction or human error.
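If you want to screen for such values programmatically, a quick illustrative filter can flag readings outside a plausible range. Note that the bounds used here are our own assumptions for Finland, not something defined in the dataset documentation:

# Flag readings outside an assumed plausible Fahrenheit range for Finland
suspect = data[(data["TEMP_F"] < -60) | (data["TEMP_F"] > 110)]
print("Suspicious TEMP_F readings:", len(suspect))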

Defining a function
Now that we are confident the data looks okay, we can start the temperature conversion by
first defining a conversion function from Fahrenheit to Celsius. Pandas can use regular
functions, so you can define functions for pandas exactly as you normally would (as we
learned in Chapter 1). Let's define a function that converts Fahrenheit to Celsius:

def fahr_to_celsius(temp_fahrenheit):
    """Convert Fahrenheit temperature into Celsius.

    Parameters
    ----------
    temp_fahrenheit: int | float
        Input temperature in Fahrenheit (should be a number)

    Returns
    -------
    Temperature in Celsius (float)
    """
    # Convert the Fahrenheit into Celsius
    converted_temp = (temp_fahrenheit - 32) / 1.8

    return converted_temp

Now we have the function defined and stored in memory. At this point it is good to test the
function with some known value:

fahr_to_celsius(32)

0.0

32 degrees Fahrenheit is indeed 0 degrees Celsius, so our function seems to be working correctly.
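If you want a couple of extra spot checks, the following assertions (our addition, not part of the original workflow) test the function against two other well-known reference points:

# 212 °F is the boiling point of water (100 °C)
assert fahr_to_celsius(212) == 100.0

# The Fahrenheit and Celsius scales cross at -40 degrees
assert abs(fahr_to_celsius(-40) - (-40)) < 1e-9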


Using a function by iterating over rows


Next we will learn how to use our function with data stored in a pandas DataFrame. We will first
apply the function row-by-row using a for loop and then we will learn a more efficient way of
applying the function to all rows at once.

Looping over rows in a DataFrame can be done in a couple of different ways. A common
approach is to use the iterrows() method, which loops over the rows as index-Series pairs. In
other words, we can use iterrows() together with a for loop to repeat a process for each row
in a pandas DataFrame. Please note that iterating over rows this way is a rather inefficient
approach, but it is still useful for understanding the logic behind it. When using iterrows(),
it is important to understand that it yields not only the values of one row but also the index
of the row. Let's start with a simple for loop that goes through each row in our DataFrame:

# Iterate over the rows
for idx, row in data.iterrows():
    # Print the index value
    print("Index:", idx)

    # Print the temperature from the row
    print("Temp F:", row["TEMP_F"], "\n")

    break

Index: 0
Temp F: 34.0

We can see that the idx variable indeed contains the index value at position 0 (the first row)
and the row variable contains all the data from that given row stored as a pandas Series. Also,
notice that when developing a for loop you do not always need to iterate through the entire
loop if you just want to test things out. Using the break statement in Python terminates a loop
whenever it is placed inside the loop. Here we used it to check out the values on the first row of
the DataFrame. This allows us to test the code logic without printing thousands of values to the
screen!

Next, let's create an empty column TEMP_C for the Celsius temperatures and update the values
in that column using the fahr_to_celsius() function we defined earlier. For updating the
value in the DataFrame, we can use the at indexer that we already used earlier in this chapter.
This time, however, we will use the itertuples() method to access the rows in the DataFrame.
The itertuples() method works similarly to iterrows(), except that it returns each row as a
single named tuple (whose first field, Index, holds the row index) instead of an (index, Series)
pair, so accessing the row values needs to be done a bit differently. A tuple is like a list (but
immutable, i.e. you cannot change it), and a "named tuple" is a special kind of tuple object
that adds the ability to access the values by name instead of by position index. Hence, we can
access the TEMP_F value in a given row using row.TEMP_F (in contrast to how we accessed the
value in the previous code above). We will not work with named tuples in the rest of the book,
but more information can be found in the Python documentation for named tuples [2].
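As a quick plain-Python illustration of named tuples (independent of pandas), the following sketch defines a hypothetical Observation type; the field names here are made up for demonstration:

from collections import namedtuple

# Define a simple named tuple type with two fields
Observation = namedtuple("Observation", ["TEMP_F", "TIME"])
obs = Observation(TEMP_F=34.0, TIME=190601010600)

print(obs.TEMP_F)  # access by name
print(obs[0])      # access by position also works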

Let’s see an example of how to use the itertuples() method.

# Create an empty column for the output values
data["TEMP_C"] = 0.0

# Iterate over the rows
for row in data.itertuples():
    # Convert the Fahrenheit to Celsius
    # Notice how we access the row value
    celsius = fahr_to_celsius(row.TEMP_F)

    # Update the value in the 'TEMP_C' column with the converted value
    # Notice how we can access the Index value
    data.at[row.Index, "TEMP_C"] = celsius

# Check the result
data.head()

STATION_ID TIME TEMP_F MAX MIN TEMP_C

0 29820 190601010600 34.0 NaN NaN 1.111111

1 29820 190601011300 32.0 NaN NaN 0.000000

2 29820 190601012000 30.0 NaN NaN -1.111111

3 29820 190601020600 33.0 NaN NaN 0.555556

4 29820 190601021300 35.0 NaN NaN 1.666667

# What does our row look like?
row._asdict()


{'Index': 198333,
'STATION_ID': 29820,
'TIME': 201910012300,
'TEMP_F': 46.0,
'MAX': nan,
'MIN': nan,
'TEMP_C': 0.0}

Okay, now we have iterated over our data and written the Celsius temperatures into the TEMP_C
column using our fahr_to_celsius() function. The values look correct, as 32 degrees
Fahrenheit is indeed 0 degrees Celsius, as can be seen on the second row. The output above
also shows the last row of our DataFrame as a named tuple converted to the more familiar
dictionary data type using the _asdict() method for named tuples.

Before moving on to more efficient ways of using functions with pandas DataFrames, we
should note a few things about the approaches above. We demonstrated the itertuples()
method for looping over the values because it is significantly faster than iterrows() (it can
be around 100x faster). We also used .at to assign the value to the DataFrame because it is
designed to access single values more efficiently than .loc, which can also access groups of
rows and columns. That said, you could also simply have used
data.loc[idx, new_column] = celsius to achieve the same result as both examples above; it is
just slower.
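If you want to verify these performance differences yourself, a rough timing sketch like the one below can be used. The exact numbers depend on your machine and pandas version; this comparison is our addition for illustration:

import time

# Time iterrows(): yields (index, Series) pairs
start = time.time()
for idx, row in data.iterrows():
    pass
print(f"iterrows(): {time.time() - start:.2f} s")

# Time itertuples(): yields named tuples, typically much faster
start = time.time()
for row in data.itertuples():
    pass
print(f"itertuples(): {time.time() - start:.2f} s")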

Using a function with apply


Although using a for loop with itertuples() can be fairly efficient, pandas DataFrames and
Series have a dedicated method called apply() for applying functions on columns (or rows).
apply() is typically faster than itertuples(), especially if you have a large number of rows,
such as in our case. When using apply(), we pass the function that we want to use as an
argument. Let's start by applying the function to the TEMP_F column that contains the
temperature values in Fahrenheit:

data["TEMP_F"].apply(fahr_to_celsius)


0 1.111111
1 0.000000
2 -1.111111
3 0.555556
4 1.666667
...
198329 8.333333
198330 8.333333
198331 8.333333
198332 8.333333
198333 7.777778
Name: TEMP_F, Length: 198334, dtype: float64

The results look logical. Notice how we passed the fahr_to_celsius function without the
parentheses () after its name. When using apply(), you should always leave out the
parentheses, i.e. use apply(fahr_to_celsius) instead of apply(fahr_to_celsius()). Why?
Because the apply() method executes the function itself in the background as it operates on
the data. If we passed the function with parentheses, fahr_to_celsius() would be executed
once before apply() even starts, and apply() would receive the function's return value rather
than the function itself, which is not what we want. Our previous command only returned the
Series of temperatures to the screen, but naturally we can also store them permanently in a new
column (overwriting the old values):

data["TEMP_C"] = data["TEMP_F"].apply(fahr_to_celsius)

A nice thing about apply() is that we can also apply the function to several columns at once.
Below, we also sort the values in descending order based on the MIN column to verify that
applying our function really works:

cols = ["TEMP_F", "MIN", "MAX"]

result = data[cols].apply(fahr_to_celsius)
result.sort_values(by="MIN", ascending=False).head()


TEMP_F MIN MAX

196775 25.000000 25.000000 29.444444

154531 25.555556 24.444444 27.777778

188167 25.000000 24.444444 27.777778

188407 23.888889 23.888889 27.777778

188143 24.444444 23.888889 28.888889

You can also store the outputs directly back into the columns themselves; here we overwrite the original TEMP_F, MIN and MAX values with their Celsius equivalents:

cols = ["TEMP_F", "MIN", "MAX"]

data[cols] = data[cols].apply(fahr_to_celsius)
data.head()

STATION_ID TIME TEMP_F MAX MIN TEMP_C

0 29820 190601010600 1.111111 NaN NaN 1.111111

1 29820 190601011300 0.000000 NaN NaN 0.000000

2 29820 190601012000 -1.111111 NaN NaN -1.111111

3 29820 190601020600 0.555556 NaN NaN 0.555556

4 29820 190601021300 1.666667 NaN NaN 1.666667

In this section, we showed you a few different ways to iterate over rows in pandas and apply
functions. The most important thing is that you understand the logic of how loops work and
how you can use your own functions to modify the values in a pandas DataFrame. Whenever
you need to loop over your data, we recommend using .apply() as it is typically the most
efficient one in terms of execution time. However, remember that in most cases you do not
actually need to use loops, but you can do calculations in a “vectorized manner” (which is the
fastest way) as we learned previously when doing basic calculations in pandas.
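For completeness, here is a sketch of the vectorized equivalent of our conversion, using the same formula as fahr_to_celsius(). Note that at this exact point in the workflow the TEMP_F, MIN and MAX columns were just overwritten with Celsius values, so this line is shown for the idiom, assuming a column that still holds Fahrenheit values:

# Vectorized conversion: the arithmetic is applied to the whole column at
# once, with no per-row Python function calls
data["TEMP_C"] = (data["TEMP_F"] - 32) / 1.8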

String slicing
We will eventually want to group our data by month in order to see if the January
temperatures in 2020 were higher than average (which is the goal of our analysis, as you
might recall). Currently, the date and time information is stored in the column TIME, which has
the structure yyyyMMddhhmm. This is a typical timestamp format in which yyyy is the year (four
digits), MM the month (two digits), dd the day, hh the hour and mm the minute. Let's have a
closer look at the date and time information we have by checking the values in that column,
and their data type:

data["TIME"].head()

0 190601010600
1 190601011300
2 190601012000
3 190601020600
4 190601021300
Name: TIME, dtype: int64

data["TIME"].tail()

198329 201910011900
198330 201910012000
198331 201910012100
198332 201910012200
198333 201910012300
Name: TIME, dtype: int64

The TIME column contains several observations per day (and even several observations per
hour). The timestamp for the first observation is 190601010600, i.e. from the 1st of January
1906 (way back!), and the timestamp for the latest observation is 201910012300. As we can see,
the data type (dtype) of our column is int64, i.e. the information is stored as integer values.

We want to aggregate this data on a monthly level. In order to do so, we need to "label" each
row of data based on the month when the record was observed. Hence, we need to somehow
separate the year and month information for each row. In practice, we can create a new
column (or an index) containing information about the month (including the year, but excluding
days, hours and minutes). There are different ways of achieving this, but here we will take
advantage of string slicing, which means that we convert the date and time information into
character strings and "cut" the needed information from the string objects. The other option
would be to convert the timestamp values into something called datetime objects, but we will
learn about those a bit later. Before further processing, we first want to convert the TIME
column to character strings for convenience, stored in a new column TIME_STR:

data["TIME_STR"] = data["TIME"].astype(str)

If we look at the latest timestamp in the data (201910012300), you can see that there is a
systematic pattern: YEAR-MONTH-DAY-HOUR-MINUTE. The first four characters represent the year,
and the following two characters represent the month. Because we are interested in
monthly averages for different years, we want to slice the year and month values from the
timestamp (the first 6 characters), like this:

date = "201910012300"
date[0:6]

'201910'

Based on this information, we can slice the correct range of characters from the TIME_STR
column using a specific pandas function designed for Series, called .str.slice() . As
parameters, the function has start and stop which you can use to specify the positions
where the slicing should start and end:

data["YEAR_MONTH"] = data["TIME_STR"].str.slice(start=0, stop=6)


data.head()

STATION_ID TIME TEMP_F MAX MIN TEMP_C TIME_STR YE

0 29820 190601010600 1.111111 NaN NaN 1.111111 190601010600

1 29820 190601011300 0.000000 NaN NaN 0.000000 190601011300

2 29820 190601012000 -1.111111 NaN NaN -1.111111 190601012000

3 29820 190601020600 0.555556 NaN NaN 0.555556 190601020600

4 29820 190601021300 1.666667 NaN NaN 1.666667 190601021300

Nice! Now we have "labeled" the rows based on the year and month when each observation was
recorded.


Question 3.5
Create a new column 'MONTH' with information about the month without the year.

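One possible solution, using the same string slicing approach as above (characters 4-6 of the timestamp hold the month):

data["MONTH"] = data["TIME_STR"].str.slice(start=4, stop=6)
data["MONTH"].head()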

Grouping and aggregating data

Basic logic of grouping a DataFrame using .groupby()


In the following sections, we want to calculate the average temperature for each month in our
dataset. Here, we will learn how to use the .groupby() method, which is a handy tool for
compressing large amounts of data and computing statistics for subgroups. We will use
groupby to calculate the average temperatures for each month through these three main steps:

1. group the data based on year and month using groupby()
2. calculate the average temperature for each month (i.e. each group)
3. store the resulting rows in a DataFrame called monthly_data

We have quite a few rows of weather data (N=198334), and several observations per day. Our
goal is to create an aggregated DataFrame with only one row per month. The .groupby()
method takes as a parameter the name of the column (or a list of columns) that you want
to use as the basis for the grouping. Let's start by grouping our data based on the unique year
and month combination:

grouped = data.groupby("YEAR_MONTH")

Notice that it would also be possible to create the year and month combinations "on the fly"
if you had them in separate columns; in such a case, grouping the data could be done as
grouped = data.groupby(['YEAR', 'MONTH']). Let's explore the new variable grouped:

print(type(grouped))
print(len(grouped))


<class 'pandas.core.groupby.generic.DataFrameGroupBy'>
826

We have a new object with type DataFrameGroupBy with 826 groups. In order to understand
what just happened, let’s also check the number of unique year and month combinations in our
data:

data["YEAR_MONTH"].nunique()

826

The length of the grouped object should be the same as the number of unique values in the
column we used for grouping (YEAR_MONTH); for each unique value, there is a group of data.
Let's explore our grouped data further by checking the "names" of the first five groups.
Here, we access the keys of the groups and convert them to a list so that we can slice and
print only a few of them to the screen:

list(grouped.groups.keys())[:5]

['190601', '190602', '190603', '190604', '190605']

Let’s check the contents for a group representing January 1906. We can get the values for that
month from the grouped object using the get_group() method:

# Specify a month (as character string)
month = "190601"

# Select the group
group1 = grouped.get_group(month)
group1


STATION_ID TIME TEMP_F MAX MIN TEMP_C TIME_STR Y

0 29820 190601010600 1.111111 NaN NaN 1.111111 190601010600

1 29820 190601011300 0.000000 NaN NaN 0.000000 190601011300

2 29820 190601012000 -1.111111 NaN NaN -1.111111 190601012000

3 29820 190601020600 0.555556 NaN NaN 0.555556 190601020600

4 29820 190601021300 1.666667 NaN NaN 1.666667 190601021300

... ... ... ... ... ... ... ...

88 29820 190601301300 -2.222222 NaN NaN -2.222222 190601301300

89 29820 190601302000 -6.111111 NaN NaN -6.111111 190601302000

90 29820 190601310600 -7.777778 NaN NaN -7.777778 190601310600

91 29820 190601311300 -1.111111 NaN NaN -1.111111 190601311300

92 29820 190601312000 -0.555556 NaN NaN -0.555556 190601312000


93 rows × 9 columns

As we can see, a single group contains a DataFrame with values only for that specific month.
Let's check the data type of this group:

type(group1)

pandas.core.frame.DataFrame

So, one group is a pandas DataFrame, which is really useful because it allows us to use all the
familiar DataFrame methods for calculating statistics etc. for this specific group, as we will
see shortly. It is also possible to iterate over the groups in our DataFrameGroupBy object,
which can be useful if you need to apply some more complicated operations to each group.
When doing so, it is important to understand that a single group in our DataFrameGroupBy
contains not only the actual values, but also information about the key that was used for
the grouping. Hence, when iterating we need to assign the key and the values (i.e. the group)
to separate variables. Let's see how we can iterate over the groups
and print the key and the data from a single group (again using break to only see what is
happening):

# Iterate over groups
for key, group in grouped:
    # Print key and group
    print("Key:\n", key)
    print("\nFirst rows of data in this group:\n", group.head())

    # Stop iteration with break command
    break

Key:
190601

First rows of data in this group:
 STATION_ID TIME TEMP_F MAX MIN TEMP_C TIME_STR \
0 29820 190601010600 1.111111 NaN NaN 1.111111 190601010600
1 29820 190601011300 0.000000 NaN NaN 0.000000 190601011300
2 29820 190601012000 -1.111111 NaN NaN -1.111111 190601012000
3 29820 190601020600 0.555556 NaN NaN 0.555556 190601020600
4 29820 190601021300 1.666667 NaN NaN 1.666667 190601021300

YEAR_MONTH MONTH
0 190601 01
1 190601 01
2 190601 01
3 190601 01
4 190601 01

Here, we can see that the key contains the name of the group (i.e. the unique value from
YEAR_MONTH ).

Aggregating data with groupby()


We can, for example, calculate the average values of all variables using the statistical functions
we have already seen (e.g. mean, std, min, max, median). To calculate the average
temperature for each month, we can use the mean() function. Let's calculate the mean for
the temperature columns in our group at once:


# Specify the columns that will be part of the calculation
mean_cols = ["TEMP_F", "TEMP_C"]

# Calculate the mean values all at one go
mean_values = group1[mean_cols].mean()
mean_values

TEMP_F -0.537634
TEMP_C -0.537634
dtype: float64

As a result, we get a pandas Series with the mean values calculated for the selected columns
of the group. Notice that if you want to convert this Series back into a DataFrame (which can
be useful if you e.g. want to merge multiple groups), you can use the command .to_frame().T,
which first converts the Series into a DataFrame and then transposes the order of the axes
(the label names become the column names):

# Convert to DataFrame
mean_values.to_frame().T

TEMP_F TEMP_C

0 -0.537634 -0.537634

To do a similar aggregation for all the groups in our data, we can combine the
groupby() function with the aggregation step (such as taking the mean, median etc. of given
columns) and finally restructure the resulting DataFrame a bit. This can be a bit harder to
grasp at first, but this is how you would group and aggregate the values:

# The columns that we want to aggregate
mean_cols = ["TEMP_F", "TEMP_C"]

# Group and aggregate the data with one line
monthly_data = data.groupby("YEAR_MONTH")[mean_cols].mean().reset_index()
monthly_data


YEAR_MONTH TEMP_F TEMP_C

0 190601 -0.537634 -0.537634

1 190602 -1.044974 -1.044974

2 190603 -2.485066 -2.485066

3 190604 2.740741 2.740741

4 190605 10.722820 10.722820

... ... ... ...

821 201906 14.990715 14.990715

822 201907 17.288769 17.288769

823 201908 17.747080 17.747080

824 201909 13.132371 13.132371

825 201910 8.750000 8.750000


826 rows × 3 columns

As we can see, aggregating the data in this way is a fairly straightforward and fast process
requiring merely a single command. So what did we actually do here? We i) grouped the data,
ii) selected specific columns from the result (mean_cols), iii) calculated the mean for all of the
selected columns of the groups, and finally iv) reset the index. Resetting the index at the end is
not necessary, but by doing it we turn the YEAR_MONTH values into a dedicated column in our
data (otherwise they would be stored as the index).
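As a side note, groupby objects also support computing several statistics in one go via the .agg() method. This is not needed for our analysis, but a minimal sketch looks like this:

# Compute mean, min and max of the Celsius temperatures per month
stats = data.groupby("YEAR_MONTH")["TEMP_C"].agg(["mean", "min", "max"]).reset_index()
stats.head()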

What might not be obvious from this example is that, hidden in the background, each group is
iterated over and the aggregation step is repeated for each group. To better understand what
happens, we will next repeat the same process by iterating over the groups and eventually
creating a DataFrame that contains the mean values for all the weather attributes we are
interested in. In this approach, we first iterate over the groups, then calculate the mean
values, store the results in a list, and finally merge the aggregated data into a DataFrame
called monthly_data.


# Create an empty list for storing the aggregated rows/DataFrames
data_container = []

# The columns that we want to aggregate
mean_cols = ["TEMP_F", "TEMP_C"]

# Iterate over the groups
for key, group in grouped:
    # Calculate mean
    mean_values = group[mean_cols].mean()

    # Add the `key` (i.e. the date+time information) into the Series
    mean_values["YEAR_MONTH"] = key

    # Convert the pd.Series into DataFrame and
    # append the aggregated values into a list as a DataFrame
    data_container.append(mean_values.to_frame().T)

# After iterating all groups, merge the list of DataFrames
monthly_data = pd.concat(data_container)
monthly_data

TEMP_F TEMP_C YEAR_MONTH

0 -0.537634 -0.537634 190601

0 -1.044974 -1.044974 190602

0 -2.485066 -2.485066 190603

0 2.740741 2.740741 190604

0 10.72282 10.72282 190605

... ... ... ...

0 14.990715 14.990715 201906

0 17.288769 17.288769 201907

0 17.74708 17.74708 201908

0 13.132371 13.132371 201909

0 8.75 8.75 201910


826 rows × 3 columns

As a result, we get results identical to the earlier approach that was done with a single line
of code (except for the position of the YEAR_MONTH column).

So which approach should you use? From the performance point of view, we recommend the
first approach (i.e. chaining), which does not require a separate for loop and is highly
performant. However, it might be a bit more difficult to read and comprehend (the loop might
be easier). Also, sometimes you want to include additional processing steps inside the loop,
which can be hard to accomplish by chaining everything into a single command. Hence, it is
useful to know both approaches for doing aggregations with the data.
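If you want to convince yourself that the two approaches really agree, here is a hedged verification sketch (our addition). It rebuilds the chained result and aligns the loop-based monthly_data, whose values carry object dtype after the named-tuple round trip, before comparing:

# Rebuild the chained result for comparison
chained = data.groupby("YEAR_MONTH")[mean_cols].mean().reset_index()

# Align column order and index of the loop-based result, and restore
# numeric dtypes for the temperature columns
looped = monthly_data[["YEAR_MONTH", "TEMP_F", "TEMP_C"]].reset_index(drop=True)
looped[["TEMP_F", "TEMP_C"]] = looped[["TEMP_F", "TEMP_C"]].astype(float)

# Raises an AssertionError if the two differ (within floating-point tolerance)
pd.testing.assert_frame_equal(chained, looped)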

Case study: Detecting warm months


Now we have aggregated our data on a monthly level, and all we need to do is check which
years had the warmest January temperatures. A simple approach is to select all January values
from the data and check which group(s) have the highest mean value. Before doing this, let's
separate the month information from our timestamp, following the same approach as we used
previously when slicing the year-month combination:

monthly_data["MONTH"] = monthly_data["YEAR_MONTH"].str.slice(start=4, stop=6)
monthly_data.head()

TEMP_F TEMP_C YEAR_MONTH MONTH

0 -0.537634 -0.537634 190601 01

0 -1.044974 -1.044974 190602 02

0 -2.485066 -2.485066 190603 03

0 2.740741 2.740741 190604 04

0 10.72282 10.72282 190605 05

Now we can select the values for January from our data and store them in a new variable
january_data. We will also check the highest temperature values by sorting the DataFrame in
descending order:

january_data = monthly_data.loc[monthly_data["MONTH"] == "01"]
january_data.sort_values(by="TEMP_C", ascending=False).head()


TEMP_F TEMP_C YEAR_MONTH MONTH

0 1.302294 1.302294 200801 01

0 1.293952 1.293952 197501 01

0 1.273434 1.273434 198301 01

0 1.15491 1.15491 199201 01

0 1.069869 1.069869 198901 01

By looking at the order of the YEAR_MONTH column, we can see which Januaries have been the
warmest at this station: January 2008 tops the list. Note that because the observations in this
file end in October 2019, January 2020 does not appear in these results; answering our original
question would require input data that extends into 2020.

Automating the analysis


Now we have learned how to aggregate data using pandas, calculating average temperatures for
each month based on hourly weather observations. One of the most useful aspects of
programming is the ability to automate processes and repeat analyses such as these for any
number of weather stations (assuming the data structure is the same).

Hence, let's now see how we can repeat the previous data analysis steps for 15 weather stations
located in different parts of Finland, each containing data for five years (2015-2019). The idea
is that we will repeat the process for each input file using a (rather long) for loop. We will use
the most efficient of the previously presented approaches and finally store the results in a
single DataFrame for all stations. We will learn how to manipulate filepaths in Python using
the pathlib module and see how we can list our input files in the data directory
data/finnish_stations. We will store those paths in a variable file_list so that we can use
the file paths easily in the later steps.

Managing and listing filesystem paths


In Python there are two commonly used approaches to manage and manipulate filepaths,
namely the os.path sub-module and the newer pathlib module (available since Python 3.4),
which we will demonstrate here. The built-in pathlib module provides many useful functions
for interacting with and manipulating filepaths on your operating system. In the following, we
have data in different sub-folders and we will learn how to use the Path class from the pathlib
library to construct filepaths. Let's import the Path class and see how we can
construct a filepath by joining a folder path and a file name:

from pathlib import Path

# Initialize the Path
input_folder = Path("data/finnish_stations")

# Join folder path and filename
fp = input_folder / "028360.txt"
fp

PosixPath('data/finnish_stations/028360.txt')

Here, we first initialized the Path object and stored it in the variable input_folder by passing
a relative path (as a string) to the directory where all our files are located. Then we created the
full filepath to the file 028360.txt by adding a forward slash (/) character between the folder
and the filename, which joins them together (easy!). In this case, our end result is something
called a PosixPath, which is a filesystem path to a given file on Linux or macOS operating
systems. If you ran the same commands on a Windows machine, the end result would be a
WindowsPath. Hence, the output depends on which operating system you are using. However, you
do not need to worry about this, because both types of paths work exactly the same no matter
which operating system you use.

Both the Path object that we stored in input_folder variable and the PosixPath object that
we stored in variable fp are actually quite versatile creatures, and we can do many useful
things with them. For instance, we can find the parent folder where the file is located, extract
the filename from the full path, test whether the file or directory actually exists, find various
properties of the file (such as size of the file or creation time), and so on:

fp.parent

PosixPath('data/finnish_stations')

fp.name


'028360.txt'

fp.exists()

True

# File properties
size_in_bytes = fp.stat().st_size
creation_time = fp.stat().st_ctime
modified_time = fp.stat().st_mtime
print(
    f"Size (bytes): {size_in_bytes}\n"
    f"Created (seconds since Epoch): {creation_time}\n"
    f"Modified (seconds since Epoch): {modified_time}"
)

Size (bytes): 1347907
Created (seconds since Epoch): 1649535015.1847472
Modified (seconds since Epoch): 1649535015.1847472

There are also various other methods available in pathlib, such as renaming files
(.rename()) or creating folders (.mkdir()). You can see all available methods in the pathlib
documentation [3]. One of the most useful tools in pathlib is the ability to list all files within a
given folder using the .glob() method, which also allows you to add specific search criteria
for listing only particular files from the directory:

file_list = list(input_folder.glob("0*txt"))

Here, the result is stored in the variable file_list as a list. By default, .glob() produces
something called a generator, which is a "lazy iterator", i.e. a special kind of function that
allows you to iterate over items like a list, but without actually storing the data in memory. By
enclosing the .glob() search functionality in list(), we convert this generator into a
normal Python list. Note that we are using the * character as a wildcard, so any filename that
starts with 0 and ends with txt will be added to the list of files. We specifically use the
number 0 as the starting part of the search criteria to avoid having metadata files included in
the list. Let's take a look at what we got as a result:


print("Number of files in the list:", len(file_list))


file_list

Number of files in the list: 15

[PosixPath('data/finnish_stations/029170.txt'),
PosixPath('data/finnish_stations/028690.txt'),
PosixPath('data/finnish_stations/029820.txt'),
PosixPath('data/finnish_stations/029700.txt'),
PosixPath('data/finnish_stations/028970.txt'),
PosixPath('data/finnish_stations/029070.txt'),
PosixPath('data/finnish_stations/029500.txt'),
PosixPath('data/finnish_stations/029110.txt'),
PosixPath('data/finnish_stations/028750.txt'),
PosixPath('data/finnish_stations/029720.txt'),
PosixPath('data/finnish_stations/029440.txt'),
PosixPath('data/finnish_stations/028360.txt'),
PosixPath('data/finnish_stations/029810.txt'),
PosixPath('data/finnish_stations/029740.txt'),
PosixPath('data/finnish_stations/029350.txt')]
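One practical caveat: .glob() returns paths in an arbitrary, filesystem-dependent order (as the unsorted listing above suggests). If you want the stations processed in a deterministic order, a small optional tweak is to sort the list first:

# Sort the paths alphabetically for a reproducible processing order
file_list = sorted(input_folder.glob("0*txt"))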

Iterate over input files and repeat the analysis


Now we should have all the relevant file paths in file_list, and we can loop over the list
using a for loop (again breaking the loop after the first iteration):

for fp in file_list:
    print(fp)
    break

data/finnish_stations/029170.txt

The data that we have sampled is in regular CSV format, which we can read easily with the
pd.read_csv() function:

data = pd.read_csv(fp)
data.head()


USAF YR--MODAHRMN DIR SPD GUS TEMP MAX MIN

0 29170 201501010050 240.0 7.0 NaN 34.0 NaN NaN

1 29170 201501010120 260.0 5.0 NaN 36.0 NaN NaN

2 29170 201501010150 250.0 8.0 NaN 34.0 NaN NaN

3 29170 201501010220 250.0 8.0 NaN 36.0 NaN NaN

4 29170 201501010250 240.0 8.0 NaN 36.0 NaN NaN

Now we have all the file paths to our weather observation datasets in a list, and we can start
iterating over them, repeating the analysis steps for each file separately. We keep all the
analytical steps inside the loop so that they are repeated for each station. Finally, we
store the warmest January for each station in a list called results using Python's regular
append() method and merge the list of DataFrames into one using the pd.concat() function:


# A list for storing the result
results = []

# Repeat the analysis steps for each input file:
for fp in file_list:
    # Read the data from CSV file
    data = pd.read_csv(fp)

    # Rename the columns
    new_names = {
        "USAF": "STATION_NUMBER",
        "YR--MODAHRMN": "TIME",
        "TEMP": "TEMP_F",
    }
    data = data.rename(columns=new_names)

    # Print info about the current input file
    # This is useful to understand how the process proceeds
    print(
        f"STATION NUMBER: {data.at[0, 'STATION_NUMBER']}\tNUMBER OF OBSERVATIONS: {len(data)}"
    )

    # Create column
    col_name = "TEMP_C"
    data[col_name] = None

    # Convert temperatures from Fahrenheit to Celsius
    data["TEMP_C"] = data["TEMP_F"].apply(fahr_to_celsius)

    # Convert TIME to string
    data["TIME_STR"] = data["TIME"].astype(str)

    # Parse year and month and convert them to numbers
    # (the month occupies characters 4-6 of the yyyyMMddhhmm timestamp)
    data["MONTH"] = data["TIME_STR"].str.slice(start=4, stop=6).astype(int)
    data["YEAR"] = data["TIME_STR"].str.slice(start=0, stop=4).astype(int)

    # Extract observations for the months of January
    january = data[data["MONTH"] == 1]

    # Aggregate the data and get mean values
    columns = ["TEMP_F", "TEMP_C", "STATION_NUMBER"]
    monthly_mean = january.groupby(by=["YEAR", "MONTH"])[columns].mean().reset_index()

    # Sort the values and take the warmest January
    warmest = monthly_mean.sort_values(by="TEMP_C", ascending=False).head(1)

    # Add to results
    results.append(warmest)

# Merge all the results into a single DataFrame
results = pd.concat(results)


STATION NUMBER: 29170 NUMBER OF OBSERVATIONS: 120211
STATION NUMBER: 28690 NUMBER OF OBSERVATIONS: 119674
STATION NUMBER: 29820 NUMBER OF OBSERVATIONS: 40264
STATION NUMBER: 29700 NUMBER OF OBSERVATIONS: 120618
STATION NUMBER: 28970 NUMBER OF OBSERVATIONS: 120891
STATION NUMBER: 29070 NUMBER OF OBSERVATIONS: 40473
STATION NUMBER: 29500 NUMBER OF OBSERVATIONS: 40405
STATION NUMBER: 29110 NUMBER OF OBSERVATIONS: 81164
STATION NUMBER: 28750 NUMBER OF OBSERVATIONS: 81127
STATION NUMBER: 29720 NUMBER OF OBSERVATIONS: 81127
STATION NUMBER: 29440 NUMBER OF OBSERVATIONS: 120947
STATION NUMBER: 28360 NUMBER OF OBSERVATIONS: 35442
STATION NUMBER: 29810 NUMBER OF OBSERVATIONS: 35377
STATION NUMBER: 29740 NUMBER OF OBSERVATIONS: 121654
STATION NUMBER: 29350 NUMBER OF OBSERVATIONS: 116220

Awesome! Now we have conducted the same analysis for 15 weather stations in Finland, and it
did not take too many lines of code! We were able to follow how the process advanced from the
printed lines of information, i.e. we did some simple logging of the operations. Let's finally
investigate our results:

results


YEAR MONTH TEMP_F TEMP_C STATION_NUMBER

3 2018 1 27.705512 -2.385827 29170.0

3 2018 1 20.855503 -6.191387 28690.0

0 2015 1 38.708724 3.727069 29820.0

0 2015 1 38.052815 3.362675 29700.0

3 2018 1 25.359090 -3.689395 28970.0

2 2017 1 33.014374 0.563541 29070.0

0 2015 1 39.166185 3.981214 29500.0

0 2015 1 32.306713 0.170396 29110.0

0 2015 1 27.514236 -2.492091 28750.0

0 2015 1 34.983254 1.657363 29720.0

0 2015 1 32.062323 0.034624 29440.0

3 2018 1 19.644055 -6.864414 28360.0

0 2015 1 39.864476 4.369154 29810.0

0 2015 1 34.405518 1.336399 29740.0

0 2015 1 28.617084 -1.879398 29350.0

Each row in the results represents the warmest January at a given STATION_NUMBER between the
years 2015 and 2019. Based on the YEAR column, the warmest January at most of Finland's
weather stations during this five-year period was in 2015. We can confirm this by checking the
value counts of the YEAR column:

results["YEAR"].value_counts()

YEAR
2015 10
2018 4
2017 1
Name: count, dtype: int64


Footnotes
[1] https://www.noaa.gov/news/january-2020-was-earth-s-hottest-january-on-record

[2] https://docs.python.org/3/library/collections.html#collections.namedtuple

[3] https://docs.python.org/3/library/pathlib.html
