How to prepare data for machine learning algorithms and create new, informative features

Binder slides

Learning objectives

  • Understand the main steps involved in preparing data for machine learning algorithms
  • Create Python functions to automate steps of the data cleaning process
  • Gain an introduction to matplotlib's object-oriented interface to combine plots on the same figure

References

This lesson draws heavily on the following textbook chapter:

  • Chapter 2 of Hands-On Machine Learning with Scikit-Learn and TensorFlow by Aurèlien Geron

You may also find the following blog posts useful:

Homework

  • Solve the exercises included in this notebook
  • Read chapter 3 of Data Science for Business by Provost & Fawcett

What is data cleaning?

When you receive a new dataset at the beginning of a project, the first task usually involves some form of data cleaning.

To solve the task at hand, you might need data from multiple sources which you need to combine into one unified table. However, this is usually a tricky task; the different data sources might have different naming conventions, some of them might be human-generated, while others are automatic system reports. A list of things you usually have to go through are the following:

  • Merge multiple sources into one table
  • Remove duplicate entries
  • Clean corrupted entries
  • Handle missing data

In lesson 2, we examined how to merge the table of housing data with their addresses; in this lesson we will focus on the remainign three steps.

Although building algorithms that are able to classify data or estimate outcomes is arguably the most interesting part of data science, data cleaning is the one that takes up most of the time. According to a study by CrowdFlower, data scientists spend about 60-80% of their time preparing datasets for machine learning algorithms.

The data

In this lesson we will analyse the unified table of housing data and their addresses that we created in lesson 2:

  • housing_merged.csv

Import libraries

As in previous lessons, we will be making use of the pandas and seaborn libraries.

# reload modules before executing user code
%load_ext autoreload
# reload all modules every time before executing the Python code
%autoreload 2
# render plots in notebook
%matplotlib inline
# data wrangling
import pandas as pd
import numpy as np
from pathlib import Path
from dslectures.core import *

# data viz
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.image as mpimg

# these commands define the color scheme
sns.set(color_codes=True)
sns.set_palette(sns.color_palette('muted'))

Load the data

As usual, we can download our datasets using our helper function get_datasets:

get_dataset('housing_merged.csv')
Download of housing_merged.csv dataset complete.

We also make use of the pathlib library to handle our filepaths:

DATA = Path('../data/')
!ls {DATA}
housing.csv                housing_merged.csv
housing_addresses.csv      housing_processed.csv
housing_backup.csv         imdb.csv
housing_gmaps_data_raw.csv uc
housing_data = pd.read_csv(DATA/'housing_merged.csv'); housing_data.head()
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value ocean_proximity street_number street_name city postal_code
0 -122.23 37.88 41.0 880.0 129.0 322.0 126.0 8.3252 452600.0 NEAR BAY 3130 Grizzly Peak Boulevard Berkeley 94705.0
1 -122.22 37.86 21.0 7099.0 1106.0 2401.0 1138.0 8.3014 358500.0 NEAR BAY 2005 Tunnel Road Oakland 94611.0
2 -122.24 37.85 52.0 1467.0 190.0 496.0 177.0 7.2574 352100.0 NEAR BAY 6886 Chabot Road Oakland 94618.0
3 -122.25 37.85 52.0 1274.0 235.0 558.0 219.0 5.6431 341300.0 NEAR BAY 6365 Florio Street Oakland 94618.0
4 -122.25 37.85 52.0 1627.0 280.0 565.0 259.0 3.8462 342200.0 NEAR BAY 6365 Florio Street Oakland 94618.0

Feature engineering

Before we prepare the data for training machine learning models, it is useful to experiment with creating new features (i.e. columns) that may provide more information than the raw data alone. For example the column total_rooms refers to the total number of rooms in a housing district, and thus it is more useful to know the number of rooms per household. In pandas, we can create this new column as follows:

housing_data['rooms_per_household'] = housing_data['total_rooms'] / housing_data['households']
# check we have added the column
housing_data.head(1)
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value ocean_proximity street_number street_name city postal_code rooms_per_household
0 -122.23 37.88 41.0 880.0 129.0 322.0 126.0 8.3252 452600.0 NEAR BAY 3130 Grizzly Peak Boulevard Berkeley 94705.0 6.984127

Exercise #1

  • Create a new feature called bedrooms_per_household from the total_bedrooms and households features
  • Create a new feature called bedrooms_per_room from the total_bedrooms and total_rooms features
  • Create a new feature called population_per_household from the population and households features
  • Recalculate the correlation matrix from lesson 1 - what can you conclude about the correlation of the new features with the median house value?

Remove the housing price cap

Recall from lesson 1 that the quantity we wish to predict (median house value) has a cap around $500,000:

# use plt.subplots() to create multiple plots
fig, (ax0, ax1) = plt.subplots(nrows=1, ncols=2, figsize=(7, 4))
# put one plot on axis ax0
sns.distplot(housing_data["median_house_value"], kde=False, ax=ax0)
# put second plot on axis ax1
sns.scatterplot("median_income", "median_house_value", data=housing_data, ax=ax1)
# tight_layout() fixes spacing between plots
fig.tight_layout()

The presence of this cap is potentially problematic since our machine learning algorithms may learn that the housing prices never go beyond that limit. Let's assume that we want to predict housing prices above $500,000, in which case we should remove these districts from the dataset.


Exercise #2

  • Store the number of rows in housing_data in a variable called n_rows_raw
  • Use the DataFrame.loc[] method to remove all rows where median_house_value is greater than or equal to $500,000
  • Calculate the fraction of data that has been removed by this filter.
  • Create new histogram and scatter plots to make sure you have removed the capped values correctly.

Convert strings to categories

If we inspect the data types associated with our housing pandas.DataFrame

housing_data.dtypes
longitude                   float64
latitude                    float64
housing_median_age          float64
total_rooms                 float64
total_bedrooms              float64
population                  float64
households                  float64
median_income               float64
median_house_value          float64
ocean_proximity              object
street_number                object
street_name                  object
city                         object
postal_code                 float64
rooms_per_household         float64
bedrooms_per_household      float64
bedrooms_per_room           float64
population_per_household    float64
dtype: object

we see that in addition to numerical features, we have features of object data type, which pandas denotes with the string O:

housing_data['ocean_proximity'].dtype
dtype('O')
# compare against numerical column
housing_data['median_house_value'].dtype
dtype('float64')

pandas has a handy set of functions to test the data type of each column. For example, to check whether a column is of object or numeric type we can import the following functions

from pandas.api.types import is_object_dtype, is_numeric_dtype

and then test them against some columns:

is_object_dtype(housing_data['ocean_proximity'])
True
is_numeric_dtype(housing_data['ocean_proximity'])
False
is_numeric_dtype(housing_data['median_house_value'])
True

In this case, we know these columns are strings and some step is needed to convert them to numerical form because most machine learning algorithms are best suited for doing computations on arrays of numbers, not strings.

pandas has a special Categorical type for holding data that uses the integer-based categorical representation or encoding. For example housing_data['ocean_proximity'] is a pandas.Series of Python string objects ['NEAR BAY', '<1H OCEAN', 'INLAND', 'NEAR OCEAN', 'ISLAND']. We can convert a pandas.DataFrame column to categorical as follows:

housing_data['ocean_proximity'] = housing_data['ocean_proximity'].astype('category')

The resulting Categorical object has categories and codes attributes that can be accessed as follows:

housing_data['ocean_proximity'].cat.categories
Index(['<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN'], dtype='object')
housing_data['ocean_proximity'].cat.codes
0        3
1        3
2        3
3        3
4        3
        ..
20635    1
20636    1
20637    1
20638    1
20639    1
Length: 19648, dtype: int8

Sometimes we may want to reorder by hand the categorical variables. For example, with our ocean_proximity feature, it makes more sense to order the categories by distance to the ocean:

housing_data["ocean_proximity"].cat.set_categories(
    ["INLAND", "<1H OCEAN", "NEAR BAY", "NEAR OCEAN", "ISLAND"], ordered=True, inplace=True
)


Exercise #3

  • Create a function called convert_strings_to_categories that takes a pandas.DataFrame as an argument and converts all columns of object type into Categorical. Note that the operation can be done in-place and thus your function should not return any objects. You may find the commands DataFrame.columns and is_numeric_dtype are useful.
  • Check that the transformed housing_data has the expected data types.

Dealing with missing values

In general, machine learning algorithms will fail to work with missing data, and in general you have three options to handle them:

  • Get rid of the corresponding rows
  • Get rid of the whole feature or column
  • Replace the missing values with some value like zero or the mean, median of the column.

A quick way to check if there's any missing data is to run the pandas DataFrame.info() method:

housing_data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 19648 entries, 0 to 20639
Data columns (total 18 columns):
 #   Column                    Non-Null Count  Dtype   
---  ------                    --------------  -----   
 0   longitude                 19648 non-null  float64 
 1   latitude                  19648 non-null  float64 
 2   housing_median_age        19648 non-null  float64 
 3   total_rooms               19648 non-null  float64 
 4   total_bedrooms            19448 non-null  float64 
 5   population                19648 non-null  float64 
 6   households                19648 non-null  float64 
 7   median_income             19648 non-null  float64 
 8   median_house_value        19648 non-null  float64 
 9   ocean_proximity           19648 non-null  category
 10  street_number             18112 non-null  category
 11  street_name               19144 non-null  category
 12  city                      19488 non-null  category
 13  postal_code               19492 non-null  float64 
 14  rooms_per_household       19648 non-null  float64 
 15  bedrooms_per_household    19448 non-null  float64 
 16  bedrooms_per_room         19448 non-null  float64 
 17  population_per_household  19648 non-null  float64 
dtypes: category(4), float64(14)
memory usage: 3.2 MB

Since housing_data has 20,640 rows we see that the following columns are missing values:

  • total_bedrooms
  • street_number
  • street_name
  • city
  • postal_code
  • bedrooms_per_household
  • bedrooms_per_room

An alternative way to verify this is to apply the DataFrame.isnull() method and calculate the sum of missing values in housing_data:

housing_data.isnull().sum()
longitude                      0
latitude                       0
housing_median_age             0
total_rooms                    0
total_bedrooms               200
population                     0
households                     0
median_income                  0
median_house_value             0
ocean_proximity                0
street_number               1536
street_name                  504
city                         160
postal_code                  156
rooms_per_household            0
bedrooms_per_household       200
bedrooms_per_room            200
population_per_household       0
dtype: int64

Exercise #4

It is often more informative to know the fraction or percentage of missing values in a pandas.DataFrame.

  • Calculate the fraction of missing values in housing_data and sort them in descending order.
  • Use seaborn to create a bar plot that shows the fraction of missing data that you calculated above.

Getting rid of rows

Let's look at our first option to handle missing data: getting rid of rows. One candidate for this is the city column since dropping the 188 rows amounts to less than 1% of the total dataset. To achieve this we can use the DataFrame.dropna() method as follows:

housing_data.dropna(subset=['city'], inplace=True)
# check city has no missing values
housing_data['city'].isnull().sum()
0

Exercise #5

  • Use the same technique as above to remove the rows whether the postal_code values are missing
  • Use the Series.astype() method to convert the postal_code to int type

Getting rid of whole features

We still have missing values for the categorical features street_number and street_name. For such data one simple approach is to replace the missing values with the most frequent entry. However, for these specific attributes it does not make much sense to replace e.g. the missing street names with the most common ones in some other city.

To that end, we will drop the street_number and street_name columns.


Exercise #6

Drop the street_number, street_name and postal_code columns from housing_data


Fill missing values

For the numerical columns, let's replace the missing values by the median.

For example, with total_bedrooms this might look like the following:

# calculate median total number of bedrooms
total_bedrooms_median = housing_data['total_bedrooms'].median()
# use inplace=True to make replacement in place
housing_data['total_bedrooms'].fillna(total_bedrooms_median, inplace=True)
# check replacement worked
housing_data['total_bedrooms'].isnull().sum()
0

Although doing this replacement manually for each numerical column is feasible for this small dataset, it would be much better to have a function that automates this process.


Exercise #7

  • Create a function called fill_missing_values_with_median that takes a pandas.DataFrame as an argument and replaces missing values in each column with the median. Note that the operation can be don in-place, so your function should not return any objects. You may find the command is_numeric_dtype to be useful.
  • Check that missing values are filled in the transformed pandas.DataFrame.

Convert categories to numbers

We've now reached the stage where we have a cleaned pandas.DataFrame and the final step is to convert our categorical columns to numerical form. For example, we can numericalise the city column by replacing the categories with their corresponding codes:

# add +1 so codes start from 1
housing_data['city'] = housing_data['city'].cat.codes + 1
# check output
housing_data.head(1)
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value ocean_proximity city postal_code rooms_per_household bedrooms_per_household bedrooms_per_room population_per_household
0 -122.23 37.88 41.0 880.0 129.0 322.0 126.0 8.3252 452600.0 NEAR BAY 69 94705 6.984127 1.02381 0.146591 2.555556

One potential problem with the above representation is that machine learning algorithms will treat two cities that are numerically close to each other as being similar. Thus an alternative approach is to apply a technique known as one-hot encoding, where we create a binary feature per category. In pandas we can do this by simply running pandas.get_dummies():

housing_data = pd.get_dummies(housing_data)
housing_data.head()
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value city postal_code rooms_per_household bedrooms_per_household bedrooms_per_room population_per_household ocean_proximity_INLAND ocean_proximity_<1H OCEAN ocean_proximity_NEAR BAY ocean_proximity_NEAR OCEAN ocean_proximity_ISLAND
0 -122.23 37.88 41.0 880.0 129.0 322.0 126.0 8.3252 452600.0 69 94705 6.984127 1.023810 0.146591 2.555556 0 0 1 0 0
1 -122.22 37.86 21.0 7099.0 1106.0 2401.0 1138.0 8.3014 358500.0 620 94611 6.238137 0.971880 0.155797 2.109842 0 0 1 0 0
2 -122.24 37.85 52.0 1467.0 190.0 496.0 177.0 7.2574 352100.0 620 94618 8.288136 1.073446 0.129516 2.802260 0 0 1 0 0
3 -122.25 37.85 52.0 1274.0 235.0 558.0 219.0 5.6431 341300.0 620 94618 5.817352 1.073059 0.184458 2.547945 0 0 1 0 0
4 -122.25 37.85 52.0 1627.0 280.0 565.0 259.0 3.8462 342200.0 620 94618 6.281853 1.081081 0.172096 2.181467 0 0 1 0 0

Note that the above has converted our ocean_proximity column into one new column per category!

# sanity check
housing_data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 19443 entries, 0 to 20639
Data columns (total 20 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   longitude                   19443 non-null  float64
 1   latitude                    19443 non-null  float64
 2   housing_median_age          19443 non-null  float64
 3   total_rooms                 19443 non-null  float64
 4   total_bedrooms              19443 non-null  float64
 5   population                  19443 non-null  float64
 6   households                  19443 non-null  float64
 7   median_income               19443 non-null  float64
 8   median_house_value          19443 non-null  float64
 9   city                        19443 non-null  int16  
 10  postal_code                 19443 non-null  int64  
 11  rooms_per_household         19443 non-null  float64
 12  bedrooms_per_household      19443 non-null  float64
 13  bedrooms_per_room           19443 non-null  float64
 14  population_per_household    19443 non-null  float64
 15  ocean_proximity_INLAND      19443 non-null  uint8  
 16  ocean_proximity_<1H OCEAN   19443 non-null  uint8  
 17  ocean_proximity_NEAR BAY    19443 non-null  uint8  
 18  ocean_proximity_NEAR OCEAN  19443 non-null  uint8  
 19  ocean_proximity_ISLAND      19443 non-null  uint8  
dtypes: float64(13), int16(1), int64(1), uint8(5)
memory usage: 2.4 MB

Save processed data to disk

housing_data.to_csv(DATA/'housing_processed.csv', index=False)