- Understand the main steps involved in training a machine learning model
- Gain an introduction to scikit-learn's API
- Understand the need to generate a training and validation set
This lesson is adapted from Jeremy Howard's fantastic online course Introduction to Machine Learning for Coders, in particular:
You may also find the following textbook chapters and blog posts useful:
- Chapters 2 & 5 of Hands-On Machine Learning with Scikit-Learn and TensorFlow by Aurélien Géron
- About Train, Validation and Test Sets in Machine Learning
In this lesson we will analyse the preprocessed table of clean housing data and their addresses that we prepared in lesson 3:
- housing_processed.csv

Tom Mitchell, one of the pioneers of machine learning, proposed this definition:
A computer program is said to learn from experience $E$ with respect to some class of tasks $T$ and performance measure $P$ if its performance at tasks in $T$, as measured by $P$, improves with experience $E$.
Framed in our example to predict housing prices in California (task $T$), we can run a Random Forest algorithm on data about past housing prices (experience $E$) and, if it has successfully "learned", it will then do better at predicting future housing prices (performance measure $P$).
# reload modules before executing user code
%load_ext autoreload
# reload all modules every time before executing Python code
%autoreload 2
# render plots in notebook
%matplotlib inline
# data wrangling
import pandas as pd
import numpy as np
from dslectures.core import *
from pathlib import Path
# data viz
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(color_codes=True)
sns.set_palette(sns.color_palette("muted"))
# ml magic
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
As usual, we can download our dataset using our helper function get_dataset:
get_dataset('housing_processed.csv')
We also make use of the pathlib library to handle our filepaths:
DATA = Path('../data/')
!ls {DATA}
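As a brief aside, the `/` operator on a Path object joins path components, which is why an expression like DATA/'housing_processed.csv' works. The following is a minimal sketch; the variable name csv_path is only illustrative and not part of the lesson's code.

```python
from pathlib import Path

DATA = Path('../data/')

# The / operator joins path components in an OS-independent way
csv_path = DATA / 'housing_processed.csv'

print(csv_path.name)    # file name including the extension
print(csv_path.suffix)  # file extension only
print(csv_path.parent)  # directory that contains the file
```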
housing_data = pd.read_csv(DATA/'housing_processed.csv')
housing_data.head()
DataFrame.head() lets us peek at the first 5 rows of a pandas.DataFrame. When you have a lot of columns, you may find it simpler to peek at the transpose with DataFrame.head().T:
housing_data.head().T
Before we can train any model, we need to think about which performance measure we wish to optimise. For regression problems the Root Mean Square Error (RMSE) is often used, since it is comparable to the standard deviation of the errors the algorithm makes in its predictions and gives a higher weight to large errors. For example, if the errors are roughly normally distributed, an RMSE equal to 50,000 means that about 68% of the algorithm's predictions fall within 50,000 USD of the actual value, and about 95% fall within 100,000 USD.
Mathematically, the formula for RMSE is:
$$ \mathrm{RMSE} = \sqrt{\frac{1}{m}\sum_{i=1}^m \left(\hat{y}_i - y_i\right)^2}$$
where $m$ is the number of instances in the dataset you are measuring the RMSE on, $\hat{y}_i$ is the model's prediction for the $i^{th}$ instance, and $y_i$ is the actual label. Let's create a simple function that uses scikit-learn's mean_squared_error function (which is just RMSE$^2$):
def rmse(y, yhat):
    """A utility function to calculate the Root Mean Square Error (RMSE).
    
    Args:
        y (array): Actual values for target.
        yhat (array): Predicted values for target.
        
    Returns:
        rmse (double): The RMSE.
    """
    return np.sqrt(mean_squared_error(y, yhat))
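The 68% / 95% rule of thumb quoted earlier relies on the errors being roughly normally distributed. The sketch below checks this on simulated errors; the sample size, the seed and the variable names are assumptions made for illustration and have nothing to do with the housing data.

```python
import numpy as np

# Simulated prediction errors: zero mean, standard deviation 50,000 (hypothetical)
rng = np.random.default_rng(42)
errors = rng.normal(loc=0.0, scale=50_000, size=100_000)

# RMSE computed directly from the errors (y_hat - y)
rmse_sim = np.sqrt(np.mean(errors ** 2))

# Fraction of errors within one and two RMSEs of zero
within_1 = np.mean(np.abs(errors) <= rmse_sim)
within_2 = np.mean(np.abs(errors) <= 2 * rmse_sim)

print(f'RMSE: {rmse_sim:.0f}')
print(f'within 1 RMSE: {within_1:.1%}, within 2 RMSE: {within_2:.1%}')
```

For errors with heavy tails or strong skew these fractions can differ noticeably, so the rule is best read as an approximation.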
Exercise #2
Whenever you create a Python function it is a good idea to test that it behaves as you expect on some dummy data. Given the two NumPy arrays:
y_dummy = np.array([2,2,3])
yhat_dummy = np.array([0,2,6])
check that our rmse function matches what you would get by calculating the explicit formula for RMSE. You may find the numpy.sum() function and the array.size attribute useful.
Now that we've checked that the training data is clean and free from obvious anomalies, it's time to train our model! To do so, we will make use of the scikit-learn library.
scikit-learn is one of the best known Python libraries for machine learning and provides efficient implementations of a large number of common algorithms. It has a uniform Estimator API as well as excellent online documentation. The main benefit of its API is that once you understand the basic use and syntax of scikit-learn for one type of model, switching to a new model or algorithm is very easy.
Basics of the API
The most common steps one takes when building a model in scikit-learn are:
- Choose a class of model by importing the appropriate estimator class from scikit-learn.
- Choose model hyperparameters by instantiating this class with the desired values.
- Arrange data into a feature matrix and target vector (see discussion below).
- Fit the model to your data by calling the fit() method.
- Evaluate the predictions of the model:
  - For supervised learning we typically predict labels for new data using the predict() method.
  - For unsupervised learning, we often transform or infer properties of the data using the transform() or predict() methods.
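To illustrate how uniform this pattern is, here is a minimal sketch of the same steps on synthetic data, using two estimators that are not otherwise part of this lesson (LinearRegression for the supervised case and StandardScaler for the transform() case). The data and variable names are made up for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression   # step 1: choose a model class
from sklearn.preprocessing import StandardScaler

# Step 3: feature matrix (n_samples, n_features) and target vector (synthetic)
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(50, 1))
y_demo = 3.0 * X_demo[:, 0] + 2.0   # noiseless straight line

# Step 2: instantiate the class with the desired hyperparameters
lin_reg = LinearRegression(fit_intercept=True)

# Step 4: fit the model to the data
lin_reg.fit(X_demo, y_demo)

# Step 5 (supervised): predict labels
y_demo_pred = lin_reg.predict(X_demo)
print(lin_reg.coef_, lin_reg.intercept_)

# Step 5 (unsupervised-style): fit, then transform the data
scaler = StandardScaler()
scaler.fit(X_demo)
X_scaled = scaler.transform(X_demo)
print(X_scaled.mean(), X_scaled.std())
```

Only the import and the instantiation change between estimators; the fit() and predict() or transform() calls keep the same shape.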
Let's go through each of these steps to build a Random Forest regressor to predict California housing prices.
In scikit-learn, every class of model is represented by a Python class. We want a Random Forest regressor, so looking at the online docs we should import the RandomForestRegressor:
from sklearn.ensemble import RandomForestRegressor
Once we have chosen our model class, there are still some options open to us:
- What is the maximum depth of the tree? The default is None, which means the nodes are expanded until all leaves are pure.
- Other parameters can be found in the docs, but for now we take a simple model with just 10 trees.
The above choices are often referred to as hyperparameters, i.e. parameters that must be set before the model is fit to the data. We can instantiate the RandomForestRegressor class and specify the desired hyperparameters as follows:
model = RandomForestRegressor(n_estimators=10, n_jobs=-1, random_state=42)
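One way to check which hyperparameters an estimator ended up with, including the defaults we did not set, is the get_params() method that scikit-learn estimators provide. A small sketch, using the hypothetical name demo_model so as not to overwrite the lesson's model:

```python
from sklearn.ensemble import RandomForestRegressor

demo_model = RandomForestRegressor(n_estimators=10, n_jobs=-1, random_state=42)

# get_params() returns a dict of all hyperparameters, set explicitly or by default
params = demo_model.get_params()
print(params['n_estimators'])  # value we passed in
print(params['max_depth'])     # default we left untouched
```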
The random_state parameter appears throughout scikit-learn and other libraries in the PyData stack. It usually controls the random seed used to generate the function's output, and setting it explicitly allows us to have reproducible results.
scikit-learn requires that the data be arranged into a two-dimensional feature matrix and a one-dimensional target array. By convention:
- The feature matrix is often stored in a variable called X. This matrix is typically two-dimensional with shape [n_samples, n_features], where n_samples refers to the number of rows (i.e. housing districts in our example) and n_features refers to all columns except median_house_value, which is our target.
- The target or label array is usually denoted by y.
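A minimal sketch of this convention on a tiny, made-up table follows. Only the column name median_house_value comes from the lesson; the other columns, the values and the names toy_housing, X_toy and y_toy are hypothetical.

```python
import pandas as pd

# Hypothetical toy table standing in for the housing data
toy_housing = pd.DataFrame({
    'median_income': [8.3, 7.2, 5.6, 3.8],
    'housing_median_age': [41, 21, 52, 52],
    'median_house_value': [452600, 358500, 341300, 342200],
})

# Feature matrix: every column except the target
X_toy = toy_housing.drop('median_house_value', axis=1)
# Target vector: the column we want to predict
y_toy = toy_housing['median_house_value']

print(X_toy.shape, y_toy.shape)
```

Applied to this lesson's data, the same pattern would presumably read X = housing_data.drop('median_house_value', axis=1) and y = housing_data['median_house_value'], assuming all remaining columns are numeric features.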
Now it is time to apply our model to data! This can be done with the fit() method:
model.fit(X, y)
The final step is to generate predictions and evaluate them with our chosen performance metric, in this case the RMSE.
yhat = model.predict(X)
rmse(y, yhat)
This is not a bad score since the majority of the house prices fall in the range of $115,000-250,000
housing_data['median_house_value'].describe()
and thus we are looking at roughly a 10-20% error in our predictions.
One way to measure how well a model will generalise to new cases is to split your data into two sets: the training set and the validation set. As these names imply, you train your model using the training set and validate it using the validation set. The error rate on new cases is called the generalisation error and by evaluating your model on the validation set, you get an estimation of this error.
Creating a validation set is theoretically quite simple: just pick some instances randomly and set them aside (we set the random number generator's seed random_state so that it always generates the same shuffled indices):
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=42)
print(f'{len(X_train)} train rows + {len(X_valid)} valid rows')
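To see the effect of fixing random_state, the sketch below splits a small array of row indices twice with the same seed and checks that the two splits agree. The array and variable names are illustrative only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

indices = np.arange(10)  # stand-in for row indices of a dataset

# Two independent calls with the same seed
train_a, valid_a = train_test_split(indices, test_size=0.2, random_state=42)
train_b, valid_b = train_test_split(indices, test_size=0.2, random_state=42)

print(np.array_equal(train_a, train_b), np.array_equal(valid_a, valid_b))
print(len(train_a), len(valid_a))
```

Without random_state the shuffle would generally differ between calls, which makes scores harder to compare from one run to the next.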
With these two datasets, we first fit on the training set and evaluate the prediction on the validation one:
model = RandomForestRegressor(n_estimators=10, n_jobs=-1, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_valid)
rmse(y_valid, y_pred)
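It can be instructive to compare the error on the training set with the error on the validation set. The sketch below does this on synthetic data generated with make_regression; the dataset, its noise level and the variable names are assumptions for illustration and do not reproduce the housing results.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression problem (hypothetical stand-in for real data)
X_syn, y_syn = make_regression(n_samples=500, n_features=5, noise=20.0, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X_syn, y_syn, test_size=0.2, random_state=42)

forest = RandomForestRegressor(n_estimators=10, random_state=42)
forest.fit(X_tr, y_tr)

# RMSE on data the model has seen vs. data it has not seen
train_rmse = np.sqrt(mean_squared_error(y_tr, forest.predict(X_tr)))
valid_rmse = np.sqrt(mean_squared_error(y_va, forest.predict(X_va)))
print(f'train RMSE: {train_rmse:.1f}, valid RMSE: {valid_rmse:.1f}')
```

The training error is typically the more optimistic of the two, which is why the validation error is the better estimate of the generalisation error.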
Although numerical scores are useful, deeper insights can often be gained by visualising the errors the model makes. Let's look at two common ways to diagnose regression models.
To get a sense of how often our model is predicting values that are close to the expected values, we'll plot the actual median_house_value labels from the validation set against the predicted values generated by our model:
def plot_prediction_error(fitted_model, X, y):
    """
    A utility function to visualise the prediction errors of regression models.
    
    Args:
        fitted_model: A fitted scikit-learn regression model.
        X: The feature matrix to generate predictions on.
        y: The target vector to compare the predictions against.
    """
    # use the model passed as an argument rather than the global `model`
    y_pred = fitted_model.predict(X)
    plt.figure(figsize=(8, 4))
    # recent seaborn releases require x and y as keyword arguments
    sns.scatterplot(x=y, y=y_pred)
    # diagonal on which predicted and actual values coincide
    sns.lineplot(x=[y.min(), y.max()], y=[y.min(), y.max()], lw=2, color="r")
    plt.xlabel("Actual Median House Price")
    plt.ylabel("Predicted Median House Price")
    plt.title(f"Prediction Error for {fitted_model.__class__.__name__}")
    plt.show()
plot_prediction_error(model, X_valid, y_valid)
What we’re looking for here is a clear, linear relationship between the predicted and actual values. The red line denotes what could be considered an "optimal" model, so we want our points to be bunched around this line. We can see that, apart from a few outliers, the random forest performs fairly well. (In fact those outliers might suggest something is fishy with the data or that these houses are special for reasons not reflected in the data.)
A residual is the difference between the labeled value and the predicted value for each instance in our dataset:
$$ \mathrm{residual} = y_\mathrm{actual} - y_\mathrm{predicted} $$
We can plot residuals to visualize the extent to which our model has captured the behavior of the data. By plotting the residuals for a series of instances, we can check whether they’re consistent with random error; we should not be able to predict the error for any given instance. If the data points appear to be evenly (randomly) dispersed around the plotted line, our model is performing well. In some sense, the resulting plot is a rotated version of our prediction error one above:
def plot_residuals(fitted_model, X, y):
    '''
    A utility function to visualise the residuals of regression models.
    
    Args:
        fitted_model: A fitted scikit-learn regression model.
        X: The feature matrix to generate predictions on.
        y: The target vector to compare the predictions against.
    '''
    # use the model passed as an argument rather than the global `model`
    y_pred = fitted_model.predict(X)
    # recent seaborn releases require x and y as keyword arguments;
    # residplot also removes any linear trend of y in x before plotting
    sns.residplot(x=y_pred, y=y - y_pred)
    plt.ylabel('Residuals')
    plt.xlabel('Predicted Median House Price')
    plt.title(f'Residuals for {fitted_model.__class__.__name__}')
    plt.show()
plot_residuals(model, X_valid, y_valid)
What we’re looking for is a mostly symmetrical distribution with points that tend to cluster towards the middle of the plot, ideally around small values on the y-axis. If we observe some kind of structure that does not coincide with the plotted line, we have failed to capture the behavior of the data and should consider feature engineering, selecting a new model, or exploring the hyperparameters.
In the case above, we see that again the outliers suggest some room for improvement with our Random Forest model.
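Alongside the plot, a few summary numbers for the residuals can help. The sketch below computes them on synthetic data (the dataset and all variable names are hypothetical) and also shows how the residuals connect back to the RMSE: the squared RMSE equals the squared mean residual plus the variance of the residuals.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic regression problem (hypothetical stand-in for real data)
X_res, y_res = make_regression(n_samples=500, n_features=5, noise=20.0, random_state=1)
X_fit, X_hold, y_fit, y_hold = train_test_split(X_res, y_res, test_size=0.2, random_state=42)

res_model = RandomForestRegressor(n_estimators=10, random_state=42)
res_model.fit(X_fit, y_fit)

# residual = actual - predicted, as in the formula above
residuals = y_hold - res_model.predict(X_hold)

print(f'mean residual: {residuals.mean():.2f}')        # systematic over- or under-prediction
print(f'std of residuals: {residuals.std():.2f}')      # spread around the mean
print(f'largest absolute residual: {np.abs(residuals).max():.2f}')

# RMSE^2 = mean^2 + variance (np.std uses the population formula by default)
rmse_from_residuals = np.sqrt(np.mean(residuals ** 2))
rmse_from_moments = np.sqrt(residuals.mean() ** 2 + residuals.std() ** 2)
print(rmse_from_residuals, rmse_from_moments)
```

A mean residual far from zero would point to a systematic bias, whereas a large spread with a mean near zero points to noisy but unbiased predictions.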
Exercise #3
Use the plt.subplots() functionality from lesson 3 to create a new function plot_errors_and_residuals that combines the above plots into a single figure. You may find the ax.set_xlabel(), ax.set_ylabel(), and ax.set_title() functions are useful for configuring the labels and title on each individual plot.
Exercise #4
Instead of using an ensemble of decision trees, scikit-learn also provides an estimator to train a single decision tree on the data (see documentation here). Repeat the same 5 steps above for a decision tree regressor, using the default hyperparameters. Do you notice anything unusual in the performance metrics if you fit the model on the whole dataset?