A walkthrough of the fundamental mechanics for manipulating the data contained in pandas Series and DataFrame objects.

Learning objectives

  • Understand the main steps involved in exploratory data analysis
  • Visualise geographical data with seaborn
  • Slice, mask, and index pandas.Series and pandas.DataFrame objects
  • Merge pandas.DataFrame objects together on a common key
  • Apply the DataFrame.groupby() operation to aggregate data across different groups of interest

References

This lesson draws heavily on the following textbook chapters:

  • Chapter 2 of Hands-On Machine Learning with Scikit-Learn and TensorFlow by Aurélien Géron
  • Chapter 3 (pp. 146-170) of Python Data Science Handbook by Jake Vanderplas

Homework

  • Solve the exercises included in this notebook
  • Work through lessons 2 and 4 from Kaggle Learn's introduction to pandas

If you get stuck understanding a particular pandas technique, you might find the pandas documentation helpful.

What is exploratory data analysis?

In data science we apply the scientific method to data with the goal of gaining insights. This means that we state a hypothesis about the data, test it, and refine it if necessary. In this framework, exploratory data analysis (EDA) is the step where we explore the data before actually building models. This helps us understand what information is actually contained in the data and what insights could be gained from it.

Formally, the goals of EDA are:

  • Suggest hypotheses about the phenomena of interest
  • Check if necessary data is available to test these hypotheses
  • Make a selection of appropriate methods and models to achieve the goal
  • Suggest what data should be gathered for further investigation

This exploratory phase lays out the path for the rest of a data science project and is therefore a crucial part of the process.

The data

In this lesson we will analyse two datasets:

  • housing.csv
  • housing_addresses.csv

The first is the California housing dataset we saw in lesson 1, while the second provides information about the neighbourhoods associated with each house. This auxiliary data was generated using the reverse geocoding functionality from Google Maps, where the latitude and longitude coordinates for each house are converted into the closest, human-readable address.
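For the curious, a minimal sketch of such a reverse geocoding call with the googlemaps Python client might look as follows. This is illustrative only: the exact pipeline used to generate housing_addresses.csv is not shown here, and YOUR_API_KEY is a placeholder for a valid Google Maps API key.

import googlemaps

# hypothetical setup: YOUR_API_KEY is a placeholder, not a real key
gmaps = googlemaps.Client(key='YOUR_API_KEY')

# reverse geocode the coordinates of the first house in housing.csv
results = gmaps.reverse_geocode((37.88, -122.23))
# each result is a dict; the closest match carries the human-readable address
print(results[0]['formatted_address'])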

The types of questions we will try to answer include:

  • Which cities have the most houses?
  • Which cities have the most expensive houses?
  • What is the breakdown of the house prices by proximity to the ocean?

Import libraries

As in lesson 1, we will be making use of the pandas and seaborn libraries. It is often a good idea to import all your libraries in a single cell block near the top of your notebooks so that your collaborators can quickly see whether they need to install new libraries or not.

# reload modules before executing user code
%load_ext autoreload
# reload all modules every time before executing the Python code
%autoreload 2
# render plots in notebook
%matplotlib inline
# data wrangling
import pandas as pd
import numpy as np
from pathlib import Path
from dslectures.core import *

# data viz
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.image as mpimg

# these commands define the color scheme
sns.set(color_codes=True)
sns.set_palette(sns.color_palette('muted'))

Load the data

As usual, we can download our datasets using our helper function get_dataset:

get_dataset('housing.csv')
get_dataset('housing_addresses.csv')
Download of housing dataset complete.
Download of housing_addresses dataset complete.

We also make use of the pathlib library to handle our filepaths:

DATA = Path('../data/')
!ls {DATA}
housing (1).csv                housing_gmaps_data_raw (1).csv
housing.csv                    housing_gmaps_data_raw.csv
housing_addresses (1).csv      imdb.csv
housing_addresses.csv          uc
housing_backup.csv
housing_data = pd.read_csv(DATA/'housing.csv'); housing_data.head()
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value ocean_proximity
0 -122.23 37.88 41.0 880.0 129.0 322.0 126.0 8.3252 452600.0 NEAR BAY
1 -122.22 37.86 21.0 7099.0 1106.0 2401.0 1138.0 8.3014 358500.0 NEAR BAY
2 -122.24 37.85 52.0 1467.0 190.0 496.0 177.0 7.2574 352100.0 NEAR BAY
3 -122.25 37.85 52.0 1274.0 235.0 558.0 219.0 5.6431 341300.0 NEAR BAY
4 -122.25 37.85 52.0 1627.0 280.0 565.0 259.0 3.8462 342200.0 NEAR BAY
housing_addresses = pd.read_csv(DATA/'housing_addresses.csv'); housing_addresses.head()
street_number route locality-political postal_code latitude_longitude
0 3130 Grizzly Peak Boulevard Berkeley 94705.0 37.88,-122.23
1 2005 Tunnel Road Oakland 94611.0 37.86,-122.22
2 6886 Chabot Road Oakland 94618.0 37.85,-122.24
3 6365 Florio Street Oakland 94618.0 37.85,-122.25
4 5407 Bryant Avenue Oakland 94618.0 37.84,-122.25

Inspect the data

The shape of the data

Whenever we have a new dataset it is handy to begin by getting an idea of how large the DataFrame is. This can be done with either the built-in len function or the DataFrame.shape attribute:

# get number of rows
len(housing_data), len(housing_addresses)
(20640, 12590)
# get tuples of (n_rows, n_columns)
housing_data.shape, housing_addresses.shape
((20640, 10), (12590, 5))

Rename columns

Usually one finds that the column headers in the raw data are either ambiguous or that the same quantity appears under different names across DataFrame objects, in which case it is handy to give such columns a common name. Although it's obvious from the DataFrame.head() method what the column headers are for our housing and address data, in most cases one has tens or hundreds of columns and the fastest way to see their names is as follows:

housing_addresses.columns
Index(['street_number', 'route', 'locality-political', 'postal_code',
       'latitude_longitude'],
      dtype='object')

Let's rename the route and locality-political columns to something more transparent:

housing_addresses.rename(columns={'route':'street_name', 'locality-political':'city'}, inplace=True)
# check that the renaming worked
housing_addresses.head()
street_number street_name city postal_code latitude_longitude
0 3130 Grizzly Peak Boulevard Berkeley 94705.0 37.88,-122.23
1 2005 Tunnel Road Oakland 94611.0 37.86,-122.22
2 6886 Chabot Road Oakland 94618.0 37.85,-122.24
3 6365 Florio Street Oakland 94618.0 37.85,-122.25
4 5407 Bryant Avenue Oakland 94618.0 37.84,-122.25

Unique values

Since we are dealing with data about California, we should check that the city column contains a reasonable number of unique entries. In pandas we can check this with the Series.nunique() method:

housing_addresses['city'].nunique()
989

Exercise #1

Does the above number make sense to you? What additional data could you find to determine if it does or does not?


Visualising geographical data

In this lesson we will be focusing on how the house location affects its price, so let's make a scatterplot of the latitude and longitude values to see if we can identify any interesting patterns:

sns.scatterplot(x="longitude", y="latitude", data=housing_data);

Although the points look like the shape of California, we see that many of them overlap, which obscures potentially interesting substructure. We can fix this by configuring the transparency of the points with the alpha argument:

sns.scatterplot(x="longitude", y="latitude", data=housing_data, alpha=0.1);

This is much better as we can now see distinct clusters of houses. To make this plot even more informative, let's colour the points according to the median house value; we will use the viridis colourmap (palette) as this has been carefully designed for data that has a sequential nature (i.e. low to high values):

fig = sns.scatterplot(
    x="longitude",
    y="latitude",
    data=housing_data,
    alpha=0.1,
    hue="median_house_value",
    palette="viridis",
    size=housing_data["population"] / 100
)

# place legend outside of figure
fig.legend(loc="center left", bbox_to_anchor=(1.01, 0.6), ncol=1);

Exercise #2

What does the figure above tell us about the relationship between house prices and location or population density?


Finally, to make our visualisation a little more intuitive, we can also overlay the scatter plot on an actual map of California:

fig = sns.scatterplot(
    x="longitude",
    y="latitude",
    data=housing_data,
    alpha=0.1,
    hue="median_house_value",
    palette="viridis",
    size=housing_data["population"] / 100
)

# place legend outside of figure
fig.legend(loc="center left", bbox_to_anchor=(1.01, 0.6), ncol=1);

# ref - www.kaggle.com/camnugent/geospatial-feature-engineering-and-visualization/
california_img=mpimg.imread('images/california.png')
plt.imshow(california_img, extent=[-124.55, -113.80, 32.45, 42.05], alpha=0.5);

Exercise #3

Can you explain the light green and yellow hotspots in the above figure?


Merging DataFrames

Although the housing_data and housing_addresses DataFrames contain interesting information, it would be nice if there were a way to join the two tables.

More generally, one of the most common operations in pandas (or data science for that matter) is the combination of data contained in various objects. In particular, merge or join operations combine datasets by linking rows using one or more keys. These operations are central to relational databases (e.g. SQL-based). The pandas.merge() function is the main entry point for using these algorithms on your data.

Let's use this idea to combine the housing_data and housing_addresses pandas.DataFrame objects via their common latitude and longitude coordinates. First we need to combine the latitude and longitude columns of housing_data into the same lat,lon format as our housing_addresses pandas.DataFrame. To do so, we will use the Series.astype() function to convert the numerical column values to strings, then use string concatenation to create a new latitude_longitude column with the desired format:

# create new column
housing_data['latitude_longitude'] = housing_data['latitude'].astype(str) + ',' + housing_data['longitude'].astype(str)
# check the column was created
housing_data.head()
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value ocean_proximity latitude_longitude
0 -122.23 37.88 41.0 880.0 129.0 322.0 126.0 8.3252 452600.0 NEAR BAY 37.88,-122.23
1 -122.22 37.86 21.0 7099.0 1106.0 2401.0 1138.0 8.3014 358500.0 NEAR BAY 37.86,-122.22
2 -122.24 37.85 52.0 1467.0 190.0 496.0 177.0 7.2574 352100.0 NEAR BAY 37.85,-122.24
3 -122.25 37.85 52.0 1274.0 235.0 558.0 219.0 5.6431 341300.0 NEAR BAY 37.85,-122.25
4 -122.25 37.85 52.0 1627.0 280.0 565.0 259.0 3.8462 342200.0 NEAR BAY 37.85,-122.25

Exercise #4

Calculate the number of unique values in the latitude_longitude column of both housing_data and housing_addresses.

  • What can you conclude from this comparison?
  • Why do you think there might be fewer unique coordinate pairs than the total number of houses?

Now that we have latitude_longitude present in both pandas.DataFrame objects we can merge the two tables together in pandas as follows:

# create a new DataFrame from the merge
housing_merged = pd.merge(
    housing_data, housing_addresses, how="left", on="latitude_longitude"
)
# check merge worked
housing_merged.head()
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value ocean_proximity latitude_longitude street_number street_name city postal_code
0 -122.23 37.88 41.0 880.0 129.0 322.0 126.0 8.3252 452600.0 NEAR BAY 37.88,-122.23 3130 Grizzly Peak Boulevard Berkeley 94705.0
1 -122.22 37.86 21.0 7099.0 1106.0 2401.0 1138.0 8.3014 358500.0 NEAR BAY 37.86,-122.22 2005 Tunnel Road Oakland 94611.0
2 -122.24 37.85 52.0 1467.0 190.0 496.0 177.0 7.2574 352100.0 NEAR BAY 37.85,-122.24 6886 Chabot Road Oakland 94618.0
3 -122.25 37.85 52.0 1274.0 235.0 558.0 219.0 5.6431 341300.0 NEAR BAY 37.85,-122.25 6365 Florio Street Oakland 94618.0
4 -122.25 37.85 52.0 1627.0 280.0 565.0 259.0 3.8462 342200.0 NEAR BAY 37.85,-122.25 6365 Florio Street Oakland 94618.0

Boom! We now have a single pandas.DataFrame that links each house's attributes and price with its address.

In general the 'how' argument of pandas.merge() allows four different join types:

  • 'left': Use all key combinations found in the left table
  • 'right': Use all key combinations found in the right table
  • 'inner': Use only the key combinations observed in both tables
  • 'outer': Use all key combinations observed in both tables together

A visual example of these different merges in action is shown in the figure below.


Figure: Graphical representation of the different types of merges between two DataFrames df1 and df2 that are possible in pandas.
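As a minimal sketch of the difference, consider the following toy DataFrames (df1 and df2 here are illustrative, not our housing data):

df1 = pd.DataFrame({'key': ['a', 'b', 'c'], 'x': [1, 2, 3]})
df2 = pd.DataFrame({'key': ['b', 'c', 'd'], 'y': [4, 5, 6]})

# inner merge keeps only the keys observed in both tables: b and c
pd.merge(df1, df2, how='inner', on='key')
# outer merge keeps all keys a, b, c, d and fills missing values with NaN
pd.merge(df1, df2, how='outer', on='key')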

Dropping columns

It is not uncommon for a dataset to have tens or hundreds of columns, and in practice we may only want to focus our attention on a smaller subset. One way to remove unwanted columns is via the DataFrame.drop() method. Since the latitude and longitude information is now duplicated in the coordinate columns, we may as well drop the latitude_longitude column from our merged pandas.DataFrame:

# inplace=True manipulates an object without returning a new object
housing_merged.drop(['latitude_longitude'], axis=1, inplace=True); housing_merged.head()
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value ocean_proximity street_number street_name city postal_code
0 -122.23 37.88 41.0 880.0 129.0 322.0 126.0 8.3252 452600.0 NEAR BAY 3130 Grizzly Peak Boulevard Berkeley 94705.0
1 -122.22 37.86 21.0 7099.0 1106.0 2401.0 1138.0 8.3014 358500.0 NEAR BAY 2005 Tunnel Road Oakland 94611.0
2 -122.24 37.85 52.0 1467.0 190.0 496.0 177.0 7.2574 352100.0 NEAR BAY 6886 Chabot Road Oakland 94618.0
3 -122.25 37.85 52.0 1274.0 235.0 558.0 219.0 5.6431 341300.0 NEAR BAY 6365 Florio Street Oakland 94618.0
4 -122.25 37.85 52.0 1627.0 280.0 565.0 259.0 3.8462 342200.0 NEAR BAY 6365 Florio Street Oakland 94618.0


Figure: Visualisation of the axis parameter in pandas.

Saving a DataFrame to disk

At this point, we have a unified table with the housing data and addresses. This is usually a good point to save the intermediate results to disk so that we can reload them without having to run all the preprocessing steps again. To do so in pandas, we can make use of the DataFrame.to_csv() function:

# use index=False to avoid adding an extra column for the index
housing_merged.to_csv(path_or_buf=DATA/'housing_merged.csv', index=False)
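Later we can pick up where we left off by simply reloading the file (assuming the cell above ran successfully):

# reload the merged table without redoing the preprocessing
housing_merged = pd.read_csv(DATA/'housing_merged.csv')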

Indexing, selection, and filtering

Now that we have a tidy pandas.DataFrame, let's apply some of the most common pandas methods to make queries on the data.

Series

A pandas.Series object provides array-style item selection that allows one to perform slicing, masking, and fancy indexing. For example, we can study the housing_median_age column from a variety of angles:

# define a new Series object 
age = housing_merged['housing_median_age']
age
0        41.0
1        21.0
2        52.0
3        52.0
4        52.0
         ... 
20635    25.0
20636    18.0
20637    17.0
20638    18.0
20639    16.0
Name: housing_median_age, Length: 20640, dtype: float64
# get first entry
age[0]
41.0
# slicing by index
age[2:5]
2    52.0
3    52.0
4    52.0
Name: housing_median_age, dtype: float64
# slice and just get values
age[2:5].values
array([52., 52., 52.])
age.describe()
count    20640.000000
mean        28.639486
std         12.585558
min          1.000000
25%         18.000000
50%         29.000000
75%         37.000000
max         52.000000
Name: housing_median_age, dtype: float64
# masking (filtering)
age[(age > 10.0) & (age < 20.0)]
75       17.0
77       19.0
80       17.0
90       16.0
131      18.0
         ... 
20632    15.0
20636    18.0
20637    17.0
20638    18.0
20639    16.0
Name: housing_median_age, Length: 4259, dtype: float64
# fancy indexing
age[[0, 4]]
0    41.0
4    52.0
Name: housing_median_age, dtype: float64

DataFrame

Since a pandas.DataFrame acts like a two-dimensional array, pandas provides special indexing operators called DataFrame.iloc[] and DataFrame.loc[] to slice and dice our data. As an example, let's select a single row and multiple columns by label:

housing_merged.loc[2, ['city', 'population']]
city          Oakland
population        496
Name: 2, dtype: object

We can perform similar selections with integers using DataFrame.iloc[]:

housing_merged.iloc[2, [3, 11]]
total_rooms           1467
street_name    Chabot Road
Name: 2, dtype: object
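Since DataFrame.iloc[] accepts both a row and a column specifier, we can also take two-dimensional slices, for example:

# select rows 2-4 and the first three columns
housing_merged.iloc[2:5, 0:3]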

Masks or filters are especially common to use; for example, let's select the subset of expensive houses:

housing_merged.loc[housing_merged['median_house_value'] > 450000]
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value ocean_proximity street_number street_name city postal_code
0 -122.23 37.88 41.0 880.0 129.0 322.0 126.0 8.3252 452600.0 NEAR BAY 3130 Grizzly Peak Boulevard Berkeley 94705.0
89 -122.27 37.80 52.0 249.0 78.0 396.0 85.0 1.2434 500001.0 NEAR BAY 321 10th Street Oakland 94607.0
140 -122.18 37.81 30.0 292.0 38.0 126.0 52.0 6.3624 483300.0 NEAR BAY NaN NaN Oakland 94611.0
459 -122.25 37.87 52.0 609.0 236.0 1349.0 250.0 1.1696 500001.0 NEAR BAY 15 Canyon Road Berkeley 94704.0
489 -122.25 37.86 48.0 2153.0 517.0 1656.0 459.0 3.0417 489600.0 NEAR BAY 2805 Kelsey Street Berkeley 94705.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
20422 -118.90 34.14 35.0 1503.0 263.0 576.0 216.0 5.1457 500001.0 <1H OCEAN 890 West Potrero Road Thousand Oaks 91361.0
20426 -118.69 34.18 11.0 1177.0 138.0 415.0 119.0 10.0472 500001.0 <1H OCEAN NaN East Las Virgenes Canyon Road NaN 91307.0
20427 -118.80 34.19 4.0 15572.0 2222.0 5495.0 2152.0 8.6499 500001.0 <1H OCEAN 5135 Island Forest Place Westlake Village 91362.0
20436 -118.69 34.21 10.0 3663.0 409.0 1179.0 371.0 12.5420 500001.0 <1H OCEAN 6 Ranchero Road Bell Canyon 91307.0
20443 -118.85 34.27 50.0 187.0 33.0 130.0 35.0 3.3438 500001.0 <1H OCEAN 2182 Tierra Rejada Road Moorpark 93021.0

1257 rows × 14 columns

We can also use pandas' string methods to filter rows, for example to select the entries whose ocean_proximity value contains the letter 'O':

housing_merged.loc[housing_merged['ocean_proximity'].str.contains('O')]
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value ocean_proximity street_number street_name city postal_code
701 -121.97 37.64 32.0 1283.0 194.0 485.0 171.0 6.0574 431000.0 <1H OCEAN 33803 Palomares Road Castro Valley 94552.0
830 -121.99 37.61 9.0 3666.0 711.0 2341.0 703.0 4.6458 217000.0 <1H OCEAN NaN South Fork Trail Castro Valley 94552.0
859 -121.97 37.57 21.0 4342.0 783.0 2172.0 789.0 4.6146 247600.0 <1H OCEAN 121 Overacker Terrace Fremont 94536.0
860 -121.96 37.58 15.0 3575.0 597.0 1777.0 559.0 5.7192 283500.0 <1H OCEAN NaN Deer Gulch Loop Trail Fremont 94536.0
861 -121.98 37.58 20.0 4126.0 1031.0 2079.0 975.0 3.6832 216900.0 <1H OCEAN 37296 Mission Boulevard Fremont 94536.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
20502 -118.68 34.33 45.0 121.0 25.0 67.0 27.0 2.9821 325000.0 <1H OCEAN NaN Windmill Canyon Road Simi Valley 93063.0
20503 -118.75 34.33 27.0 534.0 85.0 243.0 77.0 8.2787 330000.0 <1H OCEAN NaN Middle Ridge Fire Road Simi Valley 93065.0
20504 -118.73 34.29 11.0 5451.0 736.0 2526.0 752.0 7.3550 343900.0 <1H OCEAN 3435 Avenida Simi Simi Valley 93063.0
20505 -118.72 34.29 22.0 3266.0 529.0 1595.0 494.0 6.0368 248000.0 <1H OCEAN 3889 Avenida Simi Simi Valley 93063.0
20506 -118.73 34.29 8.0 4983.0 754.0 2510.0 725.0 6.9454 276500.0 <1H OCEAN 3435 Avenida Simi Simi Valley 93063.0

11794 rows × 14 columns

Every pandas.Series and pandas.DataFrame also carries an index that labels the rows. For example, our age object has an index entry for each row (house) in the dataset:

age.index
Int64Index([    0,     1,     2,     3,     4,     5,     6,     7,     8,
                9,
            ...
            20630, 20631, 20632, 20633, 20634, 20635, 20636, 20637, 20638,
            20639],
           dtype='int64', length=20640)

We can also sort the entries according to their index, which in our example amounts to reversing the order:

age.sort_index(ascending=False)
20639    16.0
20638    18.0
20637    17.0
20636    18.0
20635    25.0
         ... 
4        52.0
3        52.0
2        52.0
1        21.0
0        41.0
Name: housing_median_age, Length: 20640, dtype: float64

Finally, there are times when you want to reset the index of the pandas.Series or pandas.DataFrame objects; this can be achieved as follows:

age.reset_index()
index housing_median_age
0 0 41.0
1 1 21.0
2 2 52.0
3 3 52.0
4 4 52.0
... ... ...
20635 20635 25.0
20636 20636 18.0
20637 20637 17.0
20638 20638 18.0
20639 20639 16.0

20640 rows × 2 columns

Note this moves the old index into a new column called index and assigns a fresh ascending integer index; the order of the rows themselves is unchanged.
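If you do not want to keep the old index, passing drop=True discards it instead of storing it as a column:

# discard the old index rather than keeping it as a new column
age.reset_index(drop=True)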


Which cities have the most houses?

Whenever you need to quickly find the frequencies associated with categorical data, the Series.value_counts() and Series.nlargest() methods come in handy. For example, if we want to see which cities have the most houses, we can run the following:

housing_merged['city'].value_counts().nlargest(10)
Los Angeles      2372
San Diego         809
San Francisco     574
San Jose          498
Sacramento        422
Oakland           374
Long Beach        317
Fresno            309
Stockton          242
Bakersfield       222
Name: city, dtype: int64

This seems to make sense, since Los Angeles, San Diego, and San Francisco have some of the largest populations. We can check whether this is indeed the case by aggregating the data to calculate the total population for each city. pandas provides a flexible DataFrame.groupby() interface that enables us to slice, dice, and summarise datasets in a natural way. In particular, pandas allows one to:

  • Split a pandas object into pieces using one or more keys
  • Calculate group summary statistics, like count, mean, standard deviation, or a user-defined function
  • Apply within-group transformations or other manipulations, like normalisation, rank, or subset selection
  • Compute pivot tables and cross-tabulations.
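The last two capabilities are not needed in this lesson, but as a rough sketch of what they might look like on our housing data:

# within-group transformation: each house's population as a share of its city total
housing_merged['population'] / housing_merged.groupby('city')['population'].transform('sum')

# pivot table: mean house value broken down by city and ocean proximity
pd.pivot_table(housing_merged, values='median_house_value', index='city', columns='ocean_proximity', aggfunc='mean')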

Let's combine these ideas to answer our question, followed by an explanation of how the GroupBy mechanics really work:

housing_merged.groupby('city').agg({'population':'sum'})
population
city
Acampo 9626.0
Acton 6740.0
Adelanto 6583.0
Adin 364.0
Agoura Hills 26776.0
... ...
Yreka 8971.0
Yuba City 49772.0
Yucaipa 31965.0
Yucca Valley 18955.0
Zenia 228.0

989 rows × 1 columns

This seems to work - we now have the total population for each city, but the result is sorted alphabetically by city. To get the cities with the largest populations, we can use the DataFrame.sort_values() method as follows:

housing_merged.groupby("city").agg({"population": "sum"}).sort_values(
    by="population", ascending=False
)
population
city
Los Angeles 3495957.0
San Diego 1069557.0
San Jose 818234.0
San Francisco 702282.0
Sacramento 614478.0
... ...
Parker 83.0
El Portal 79.0
Pearsonville 48.0
Forest Ranch 47.0
North Richmond 42.0

989 rows × 1 columns

That's much better! We can store the result as a new pandas.DataFrame and plot the distribution:

# use outer parentheses () to add line breaks
top10_largest_cities = (
    # group by key=city
    housing_merged.groupby("city")
    # calculate total population
    .agg({"population": "sum"})
    # select 10 largest
    .nlargest(10, columns="population")
    # reset index so city becomes column
    .reset_index()
)
top10_largest_cities.head()
city population
0 Los Angeles 3495957.0
1 San Diego 1069557.0
2 San Jose 818234.0
3 San Francisco 702282.0
4 Sacramento 614478.0
sns.barplot(x='population', y='city', data=top10_largest_cities, color='b');

GroupBy mechanics

The group operation above is best understood via Hadley Wickham's split-apply-combine strategy, where you break up a big problem into manageable pieces, operate on each piece independently, and then put all the pieces back together.

Split: In the first stage of the process, data contained in a pandas.Series or pandas.DataFrame object is split into groups based on one or more specified keys. The splitting is performed on a particular axis of the object (see the notebook from lesson 2). For example, a pandas.DataFrame can be grouped on its rows (axis=0) or its columns (axis=1).

Apply: Once the split is done, a function is applied to each group, producing a new value.

Combine: Finally, the results of all those function applications are combined into a result object. The form of the resulting object will usually depend on what's being done to the data.

See the figure below for an example of a simple group aggregation.


Figure: Illustration of a group aggregation.

In general, the grouping keys can take many forms and do not all have to be of the same type. Frequently, the grouping information is found in the same pandas.DataFrame as the data you want to work on, so the key is usually a column name. For example, let's create a simple pandas.DataFrame as follows:

df_foo = pd.DataFrame(
    {
        "key1": ["a", "a", "b", "b", "a"],
        "key2": ["one", "two", "one", "two", "one"],
        "data1": np.random.randn(5),
        "data2": np.random.randn(5),
    }
)
df_foo
key1 key2 data1 data2
0 a one 1.837510 0.875095
1 a two 0.771915 -0.256958
2 b one -0.830038 -0.037795
3 b two 0.727881 -0.254475
4 a one 0.168928 -1.287844

We can then use the column names as the group keys (similar to what we did above with the cities):

grouped = df_foo.groupby('key1'); grouped
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x132484ed0>

This grouped variable is now a GroupBy object. It has not actually calculated anything yet except for some intermediate data about the group key df_foo['key1']. The main idea is that this object has all of the information needed to then apply some operation to each of the groups. For example, we can get the mean per group as follows:

df_grouped = grouped.mean(); df_grouped
data1 data2
key1
a 0.926117 -0.223236
b -0.051078 -0.146135
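To see the split stage explicitly, we can iterate over the GroupBy object; we can also group by more than one key at a time:

# each iteration yields a (key, sub-DataFrame) pair
for key, group in grouped:
    print(key, len(group))

# grouping by two keys produces a result with a hierarchical index
df_foo.groupby(['key1', 'key2']).mean()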

Exercise #5

  • Use the above DataFrame.groupby() techniques to find the top 10 cities which have the most expensive houses on average. Look up some of the names on the web - do the results make sense?
  • Use the DataFrame.loc[] method to filter out the houses with the capped values of over $500,000. Repeat the same step as above.

Exercise #6

  • Use the DataFrame.groupby() and agg() techniques to find the distribution of mean house prices according to ocean_proximity.
  • Store the result from the above step in a new pandas.DataFrame and visualise it with seaborn's barplot.