- Understand the main steps involved in preparing data for machine learning algorithms
- Create Python functions to automate steps of the data cleaning process
- Gain an introduction to matplotlib's object-oriented interface to combine plots on the same figure
This lesson draws heavily on the following textbook chapter:
- Chapter 2 of Hands-On Machine Learning with Scikit-Learn and TensorFlow by Aurélien Géron
You may also find the following blog posts useful:
- Machine Learning with Kaggle: Feature Engineering
- Sections 2 and 3 of Intermediate Machine Learning on Kaggle Learn
- Effectively Using Matplotlib by C. Moffitt
When you receive a new dataset at the beginning of a project, the first task usually involves some form of data cleaning.
To solve the task at hand, you might need data from multiple sources that you must combine into one unified table. However, this is usually tricky: the different data sources might have different naming conventions, and some of them might be human-generated while others are automatic system reports. The steps you usually have to go through are the following:
- Merge multiple sources into one table
- Remove duplicate entries
- Clean corrupted entries
- Handle missing data
In lesson 2, we examined how to merge the table of housing data with their addresses; in this lesson we will focus on the remaining three steps.
Although building algorithms that are able to classify data or estimate outcomes is arguably the most interesting part of data science, data cleaning is the one that takes up most of the time. According to a study by CrowdFlower, data scientists spend about 60-80% of their time preparing datasets for machine learning algorithms.
In this lesson we will analyse the unified table of housing data and their addresses that we created in lesson 2:
housing_merged.csv
As in previous lessons, we will be making use of the pandas and seaborn libraries.
# reload modules before executing user code
%load_ext autoreload
# reload all modules every time before executing the Python code
%autoreload 2
# render plots in notebook
%matplotlib inline
# data wrangling
import pandas as pd
import numpy as np
from pathlib import Path
from dslectures.core import *
# data viz
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
# these commands define the color scheme
sns.set(color_codes=True)
sns.set_palette(sns.color_palette('muted'))
As usual, we can download our dataset using our helper function get_dataset:
get_dataset('housing_merged.csv')
We also make use of the pathlib module from Python's standard library to handle our filepaths:
DATA = Path('../data/')
!ls {DATA}
housing_data = pd.read_csv(DATA/'housing_merged.csv'); housing_data.head()
Before we prepare the data for training machine learning models, it is useful to experiment with creating new features (i.e. columns) that may provide more information than the raw data alone. For example, the column total_rooms refers to the total number of rooms in a housing district, so a more useful quantity is the number of rooms per household. In pandas, we can create this new column as follows:
housing_data['rooms_per_household'] = housing_data['total_rooms'] / housing_data['households']
# check we have added the column
housing_data.head(1)
Exercise #1
- Create a new feature called bedrooms_per_household from the total_bedrooms and households features
- Create a new feature called bedrooms_per_room from the total_bedrooms and total_rooms features
- Create a new feature called population_per_household from the population and households features
- Recalculate the correlation matrix from lesson 1 - what can you conclude about the correlation of the new features with the median house value?
Recall from lesson 1 that the quantity we wish to predict (median house value) has a cap around $500,000:
# use plt.subplots() to create multiple plots
fig, (ax0, ax1) = plt.subplots(nrows=1, ncols=2, figsize=(7, 4))
# put one plot on axis ax0
sns.histplot(housing_data["median_house_value"], ax=ax0)
# put second plot on axis ax1
sns.scatterplot(x="median_income", y="median_house_value", data=housing_data, ax=ax1)
# tight_layout() fixes spacing between plots
fig.tight_layout()
The presence of this cap is potentially problematic, since our machine learning algorithms may learn that housing prices never go beyond that limit. If we care about predicting housing prices above $500,000, these capped labels are unreliable, so we should remove those districts from the dataset.
Exercise #2
- Store the number of rows in housing_data in a variable called n_rows_raw
- Use the DataFrame.loc[] indexer to remove all rows where median_house_value is greater than or equal to $500,000
- Calculate the fraction of data that has been removed by this filter.
- Create new histogram and scatter plots to make sure you have removed the capped values correctly.
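A possible solution sketch for the first three steps (a boolean mask passed to loc keeps exactly the rows where the condition holds):
# store the original number of rows
n_rows_raw = len(housing_data)
# keep only the districts strictly below the $500,000 cap
housing_data = housing_data.loc[housing_data['median_house_value'] < 500000]
# fraction of the data removed by the filter
1 - len(housing_data) / n_rows_raw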
If we inspect the data types associated with our housing pandas.DataFrame
housing_data.dtypes
we see that in addition to numerical features, we have features of object data type, which pandas denotes with the string O:
housing_data['ocean_proximity'].dtype
# compare against numerical column
housing_data['median_house_value'].dtype
pandas has a handy set of functions to test the data type of each column. For example, to check whether a column is of object or numeric type we can import the following functions
from pandas.api.types import is_object_dtype, is_numeric_dtype
and then test them against some columns:
is_object_dtype(housing_data['ocean_proximity'])
is_numeric_dtype(housing_data['ocean_proximity'])
is_numeric_dtype(housing_data['median_house_value'])
In this case, we know these columns are strings and some step is needed to convert them to numerical form because most machine learning algorithms are best suited for doing computations on arrays of numbers, not strings.
pandas has a special Categorical type for holding data that uses an integer-based categorical representation or encoding. For example, housing_data['ocean_proximity'] is a pandas.Series of Python string objects with the unique values ['NEAR BAY', '<1H OCEAN', 'INLAND', 'NEAR OCEAN', 'ISLAND']. We can convert a pandas.DataFrame column to categorical as follows:
housing_data['ocean_proximity'] = housing_data['ocean_proximity'].astype('category')
The resulting Categorical object has categories and codes attributes that can be accessed as follows:
housing_data['ocean_proximity'].cat.categories
housing_data['ocean_proximity'].cat.codes
The Categorical type allows us to view categorical features as text, while pandas treats them internally as numerical. Sometimes we may want to reorder the categories by hand. For example, with our ocean_proximity feature, it makes more sense to order the categories by distance to the ocean:
housing_data["ocean_proximity"].cat.set_categories(
["INLAND", "<1H OCEAN", "NEAR BAY", "NEAR OCEAN", "ISLAND"], ordered=True, inplace=True
)
Exercise #3
- Create a function called convert_strings_to_categories that takes a pandas.DataFrame as an argument and converts all columns of object type into Categorical. Note that the operation can be done in-place, so your function should not return any objects. You may find DataFrame.columns and is_numeric_dtype useful.
- Check that the transformed housing_data has the expected data types.
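A possible implementation sketch, looping over the columns and converting every non-numerical one:
def convert_strings_to_categories(df):
    """Convert all non-numerical columns of df to Categorical, in-place."""
    for column in df.columns:
        if not is_numeric_dtype(df[column]):
            df[column] = df[column].astype('category')

# apply the function and check the resulting data types
convert_strings_to_categories(housing_data)
housing_data.dtypes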
Most machine learning algorithms fail to work with missing data, and you generally have three options to handle it:
- Get rid of the corresponding rows
- Get rid of the whole feature or column
- Replace the missing values with some value, such as zero or the mean or median of the column
A quick way to check if there's any missing data is to run the pandas DataFrame.info() method:
housing_data.info()
Since housing_data has 20,640 rows, we see that the following columns have missing values:
- total_bedrooms
- street_number
- street_name
- city
- postal_code
- bedrooms_per_household
- bedrooms_per_room
An alternative way to verify this is to apply the DataFrame.isnull() method and calculate the sum of missing values in housing_data:
housing_data.isnull().sum()
Exercise #4
It is often more informative to know the fraction or percentage of missing values in a pandas.DataFrame.
- Calculate the fraction of missing values in housing_data and sort them in descending order.
- Use seaborn to create a bar plot that shows the fraction of missing data that you calculated above.
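One possible approach: since the mean of a boolean mask equals the fraction of True entries, isnull() combined with mean() gives the fractions directly:
# fraction of missing values per column, sorted in descending order
missing_fractions = housing_data.isnull().mean().sort_values(ascending=False)
# visualise the fractions as horizontal bars
sns.barplot(x=missing_fractions.values, y=missing_fractions.index)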
Let's look at our first option to handle missing data: getting rid of rows. One candidate for this is the city column, since dropping its 188 rows with missing values amounts to less than 1% of the total dataset. To achieve this we can use the DataFrame.dropna() method as follows:
housing_data.dropna(subset=['city'], inplace=True)
# check city has no missing values
housing_data['city'].isnull().sum()
We still have missing values for the categorical features street_number and street_name. For such data, one simple approach is to replace the missing values with the most frequent entry. However, for these specific attributes it does not make much sense to replace e.g. the missing street names with the most common ones from some other city.
To that end, we will drop the street_number and street_name columns.
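One way to do this is with the DataFrame.drop() method:
# drop the two street-level columns entirely
housing_data.drop(columns=['street_number', 'street_name'], inplace=True)
# check that the columns are gone
housing_data.columns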
For the numerical columns, let's replace the missing values by the median. For example, with total_bedrooms this might look like the following:
# calculate median total number of bedrooms
total_bedrooms_median = housing_data['total_bedrooms'].median()
# use inplace=True to make replacement in place
housing_data['total_bedrooms'].fillna(total_bedrooms_median, inplace=True)
# check replacement worked
housing_data['total_bedrooms'].isnull().sum()
Although doing this replacement manually for each numerical column is feasible for this small dataset, it would be much better to have a function that automates this process.
Exercise #5
- Create a function called fill_missing_values_with_median that takes a pandas.DataFrame as an argument and replaces the missing values in each numerical column with the median. Note that the operation can be done in-place, so your function should not return any objects. You may find the command is_numeric_dtype to be useful.
- Check that the missing values are filled in the transformed pandas.DataFrame.
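A possible implementation sketch, analogous to convert_strings_to_categories above:
def fill_missing_values_with_median(df):
    """Replace missing values in each numerical column of df with that column's median, in-place."""
    for column in df.columns:
        if is_numeric_dtype(df[column]):
            df[column] = df[column].fillna(df[column].median())

# apply the function and verify that no missing values remain in the numerical columns
fill_missing_values_with_median(housing_data)
housing_data.isnull().sum()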
We've now reached the stage where we have a cleaned pandas.DataFrame, and the final step is to convert our categorical columns to numerical form. For example, we can numericalise the city column by replacing the categories with their corresponding codes:
# add +1 so codes start from 1
housing_data['city'] = housing_data['city'].cat.codes + 1
# check output
housing_data.head(1)
One potential problem with the above representation is that machine learning algorithms will treat two cities that are numerically close to each other as being similar. Thus an alternative approach is to apply a technique known as one-hot encoding, where we create a binary feature per category. In pandas we can do this by simply running pandas.get_dummies():
housing_data = pd.get_dummies(housing_data)
housing_data.head()
Note that the above has converted our ocean_proximity column into one new column per category! Also note that get_dummies leaves numerically encoded features untouched, such as the city or postal_code columns, even though they are strictly categorical.
# sanity check
housing_data.info()
housing_data.to_csv(DATA/'housing_processed.csv', index=False)