- Know how to apply Random Forests to classification tasks
- Understand the performance metrics associated with binary classification
- Gain an introduction to fast.ai's data preprocessing functions
This lesson is inspired by the following textbooks and online courses:
- Chapter 3 of Hands-On Machine Learning with Scikit-Learn and TensorFlow by Aurélien Géron
- Chapter 7 of Data Science for Business by Provost and Fawcett
- Lessons 1 - 4 of Jeremy Howard's fantastic online course Introduction to Machine Learning for Coders
You may also find the following blog post useful:
Figure reference: https://s16353.pcdn.co/wp-content/uploads/2018/06/Churn.png
We will explore IBM's telecommunications dataset and determine which attributes are most informative for predicting customer retention (also known as customer churn). As described by IBM, the problem setting is as follows:
A telecommunications company is concerned about the number of customers leaving their landline business for cable competitors. They need to understand who is leaving. Imagine that you’re an analyst at this company and you have to find out who is leaving and why.
The kind of questions we'd like to find answers to are: Which customers are likely to leave? Which attributes influence customers who leave?
As noted above, in this lesson we will analyse IBM's customer churn dataset:
churn.csv
The dataset includes information about:
- Customers who left within the last month – the column is called
Churn
- Services that each customer has signed up for – phone, multiple lines, internet, online security, online backup, device protection, tech support, and streaming TV and movies
- Customer account information – how long they’ve been a customer (tenure), contract, payment method, paperless billing, monthly charges, and total charges
- Demographic info about customers – gender, whether they're a senior citizen or not, and if they have partners and dependents
# reload modules before executing user code
%load_ext autoreload
# reload all modules every time before executing Python code
%autoreload 2
# render plots in notebook
%matplotlib inline
# uncomment to update the library if working locally
# !pip install dslectures --upgrade
# data wrangling
import pandas as pd
import numpy as np
from dslectures.core import (
get_dataset,
display_large,
convert_strings_to_categories,
rf_feature_importance,
plot_feature_importance,
plot_dendogram,
)
from dslectures.structured import proc_df
from pathlib import Path
# data viz
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import plot_confusion_matrix, plot_roc_curve
from sklearn.tree import plot_tree
sns.set(color_codes=True)
sns.set_palette(sns.color_palette("muted"))
# ml magic
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score, roc_auc_score
from sklearn.ensemble import RandomForestClassifier
import scipy
from scipy.cluster import hierarchy as hc
get_dataset("churn.csv")
We also make use of the pathlib
library to handle our filepaths:
DATA = Path('../data/')
!ls {DATA}
churn_data = pd.read_csv(DATA / "churn.csv")
Sometimes you will find that the dataset has too many columns to be displayed with the standard DataFrame.head() method, which then just shows ... for the intermediate columns:
churn_data.head()
To fix that we can configure pandas' display options, which we have wrapped inside a simple helper function:
display_large(churn_data.head())
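For reference, a helper like display_large essentially just raises pandas' display limits temporarily; a minimal sketch of the idea (the actual dslectures implementation may differ in the limits it sets):

```python
import pandas as pd
from IPython.display import display

def display_large_sketch(df):
    """Show a DataFrame with (almost) no column or row truncation."""
    # temporarily raise pandas' display limits, then restore them afterwards
    with pd.option_context("display.max_columns", 1000, "display.max_rows", 1000):
        display(df)
```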
Alternatively, you can take the transpose to see all the columns more easily:
churn_data.head().T
# get number of rows
len(churn_data)
# get tuples of (n_rows, n_columns)
churn_data.shape
In this case, we see that we have 7043 customers and 21 variables or attributes that describe their telecom subscription. Let's have a look at the columns:
churn_data.columns
Our target variable is the Churn column, and thus we have a classification problem (rather than regression) because the target is a category (Yes or No) rather than a continuous number.
Whenever we see an ID column like customerID, it is useful to perform a sanity check that each value is unique. Otherwise you may have duplicates in your data that can bias your models and hence your conclusions.
churn_data["customerID"].nunique()
Good! The number of unique IDs matches the number of rows in our DataFrame.
churn_data.dtypes
Hmm, TotalCharges
is of type object (i.e. string) even though it is clearly a float. Since null values or NaNs don't produce this behaviour, there are presumably empty strings lurking in this column. Let's test this hypothesis using DataFrame.value_counts()
:
churn_data["TotalCharges"].value_counts()
We will deal with these empty strings in the preprocessing steps below.
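If you want to check the hypothesis directly, one way (purely illustrative, assuming the blanks are whitespace-only strings) is to count them and, optionally, coerce the column to floats:

```python
# count entries in TotalCharges that are empty after stripping whitespace
n_blank = churn_data["TotalCharges"].str.strip().eq("").sum()
print(f"{n_blank} blank entries in TotalCharges")

# one possible fix would be to coerce to float so blanks become NaN,
# but here we leave the column untouched and rely on the preprocessing below instead
# churn_data["TotalCharges"] = pd.to_numeric(churn_data["TotalCharges"], errors="coerce")
```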
Recall that in our housing analysis we needed to perform four main steps to bring our DataFrame into a form suitable for training a Random Forest:
- Convert strings to categorical data type
- Fill missing values
- Numericalise the DataFrame and create a features matrix $X$ and target vector $y$
- Create train and validation sets
Let's perform each of those steps below.
First we convert all the string columns to pandas' categorical data type:
convert_strings_to_categories(churn_data)
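Conceptually, convert_strings_to_categories just loops over the object (string) columns and changes their dtype in place; a rough sketch of the idea (the dslectures implementation may differ in its details):

```python
def convert_strings_to_categories_sketch(df):
    """Convert every object (string) column of df to pandas' category dtype, in place."""
    for col in df.columns:
        if pd.api.types.is_object_dtype(df[col]):
            df[col] = df[col].astype("category")
```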
churn_data.dtypes
This is almost correct, although a closer look at SeniorCitizen
reveals that it refers to a binary feature and thus should also be categorical:
churn_data["SeniorCitizen"].unique()
We can fix this easily by simply changing the data type:
churn_data["SeniorCitizen"] = churn_data["SeniorCitizen"].astype("category")
# sanity check on the transformation
churn_data.dtypes
A quick way to test for missing values is to apply pandas' isna method and calculate the fraction of missing values in each column of our DataFrame:
(churn_data.isna().sum() / len(churn_data)).sort_values(ascending=False)
In this case there are no NaN values, so it looks like we have a pre-cleaned dataset (note that the blank strings in TotalCharges do not count as missing values here).
Now that we have done some basic preprocessing, the final step is to numericalise the pandas.DataFrame
and create the feature matrix $X$ and target vector $y$. In previous lessons we created some functions to automate these steps. Below we use fast.ai's utility function proc_df
to wrap all these steps into a single step:
X, y, nas = proc_df(churn_data, "Churn")
display_large(X.head())
X.shape, y.shape
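Conceptually, proc_df replaces each categorical column with its integer codes and splits off the target vector; a heavily simplified sketch of the idea for this dataset (the real fast.ai function also fills missing numeric values and reports them via nas):

```python
def numericalise_sketch(df, target_col):
    """Illustration only: turn categories into integer codes and split off the target."""
    df = df.copy()
    y = df.pop(target_col).cat.codes.values   # e.g. No -> 0, Yes -> 1
    for col in df.select_dtypes("category").columns:
        df[col] = df[col].cat.codes + 1       # category codes, missing values mapped to 0
    return df, y
```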
For future use we can save our processed quantities:
churn_processed = X.join(pd.Series(y, name="Churn"))
churn_processed.to_csv(DATA / "churn_processed.csv", index=False)
X_train, X_valid, y_train, y_valid = train_test_split(
X, y, test_size=0.2, random_state=42
)
print(f"{len(X_train)} train rows + {len(X_valid)} valid rows")
Evaluating a classifier is often significantly trickier than evaluating a regressor. One way to do it is to compare the accuracy of each classifier, where
$$ \mbox{accuracy} = \frac{\mbox{Number of correct decisions made}}{\mbox{Total number of decisions made}} $$
In general, however, accuracy is not the preferred performance measure for classifiers, especially when you are dealing with skewed datasets (i.e. when some classes are much more frequent than others). For our churn example, suppose we build a model that achieves 75% accuracy. Is this any good? Let's have a look at the distribution of churn in the data:
sns.countplot(x="Churn", data=churn_data)
plt.show()
churn_data["Churn"].value_counts(normalize=True)
From the plot and numbers we see that the "No Churn" and "Churn" classes appear in approximately a 3:1 ratio. If we built a dumb classifier that just classifies every single customer as "No Churn", then we would be right about 73.5% of the time! In practice skews of 99:1 are common, for which a report of 99% accuracy is somewhat meaningless.
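To make this concrete, we can score an always-predict-"No Churn" baseline with scikit-learn's DummyClassifier; its accuracy should land near the 73.5% class frequency (using the train/validation split defined above):

```python
from sklearn.dummy import DummyClassifier

# a "model" that always predicts the majority class, i.e. "No Churn"
dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(X_train, y_train)
print("Dummy accuracy on valid:", round(dummy.score(X_valid, y_valid), 3))
```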
A much better way to evaluate the performance of a classifier is to look at the confusion matrix. Recall that a confusion matrix for a problem involving $n$ classes is an $n\times n$ matrix with the rows labelled by the actual classes and the columns with the predicted classes. Our churn example is a two-class problem ("Churn" vs "No Churn"), so the confusion matrix is $2\times 2$.
If we denote the actual classes as $\mathbf{p}$(ositive) and $\mathbf{n}$(egative), and the classes predicted by the model as $\mathbf{Y}$(es) and $\mathbf{N}$(o), then the confusion matrix has the form:
 | N | Y |
---|---|---|
n | True negatives | False positives |
p | False negatives | True positives |
The main diagonal contains the counts of correct decisions. The errors of the classifier are the false negatives (positives classified as negative) and false positives (negatives classified as positive).
You should know
In the Data Science for Business textbook, the confusion matrix is given a different layout, namely the columns are labelled by the actual classes and the rows by the predicted classes:
 | p | n |
---|---|---|
Y | True positives | False positives |
N | False negatives | True negatives |
The layout adopted above and in this notebook is the one produced by scikit-learn, since we'd like to make use of the in-built functions included in that library.
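In code, the raw counts follow the same layout via scikit-learn's confusion_matrix (already imported above); a toy example with made-up labels:

```python
# toy example: 0 = negative ("No Churn"), 1 = positive ("Churn")
y_true = [0, 0, 1, 1, 0, 1]
y_pred = [0, 1, 1, 0, 0, 1]

# rows are the actual classes, columns the predicted classes:
# [[TN, FP],
#  [FN, TP]]
print(confusion_matrix(y_true, y_pred))
```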
Let's begin by creating a baseline Random Forest model to build upon. First we need a scoring function for classifiers, similar to our $R^2$ and RMSE function for regression:
def print_scores(fitted_model):
res = {
"Accuracy on train:": accuracy_score(fitted_model.predict(X_train), y_train),
"ROC AUC on train:": roc_auc_score(
y_train, fitted_model.predict_proba(X_train)[:, 1]
),
"Accuracy on valid:": accuracy_score(fitted_model.predict(X_valid), y_valid),
"ROC AUC on valid:": roc_auc_score(
y_valid, fitted_model.predict_proba(X_valid)[:, 1]
),
}
if hasattr(fitted_model, "oob_score_"):
res["OOB accuracy:"] = fitted_model.oob_score_
for k, v in res.items():
print(k, round(v, 3))
model = RandomForestClassifier(n_estimators=10, n_jobs=-1, random_state=42)
model.fit(X_train, y_train)
print_scores(model)
ROC Curves and AUC
Note that in addition to accuracy, we also show a second value, the Area Under the ROC Curve (AUC). The AUC is a good summary statistic of the predictiveness of a binary classifier. It varies from zero to one. A value of 0.5 corresponds to randomness (the classifier cannot distinguish at all between "churn" and "no churn") and a value of 1.0 means that it is perfect.
The "ROC" refers to the Receiver Operating Characteristic (ROC) curve which plots the true positive rate
$$ \mbox{TPR} = \frac{\mbox{TP}}{\mbox{TP} + \mbox{FN}} \,, \qquad \mbox{TP (FN)} = \mbox{number of true positives (false negatives)}\,,$$
against the false positive rate $\mbox{FPR} = \mbox{FP} / (\mbox{FP} + \mbox{TN})$, i.e. the fraction of negative instances that are incorrectly classified as positive. In general there is a tradeoff between these two quantities: the higher the TPR, the more false positives the classifier tends to produce. A good classifier stays as close to the top-left corner of the ROC curve plot as possible.
In scikit-learn we can visualise the ROC curve of an estimator using the plotting API:
plot_roc_curve(model, X_valid, y_valid)
plt.show()
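If you prefer to work with the numbers behind the plot, the curve can be recomputed from the predicted churn probabilities with roc_curve; a short sketch:

```python
from sklearn.metrics import roc_curve, auc

# predicted probability of the positive class ("Churn") on the validation set
probs = model.predict_proba(X_valid)[:, 1]
fpr, tpr, thresholds = roc_curve(y_valid, probs)
print("AUC computed from the curve:", round(auc(fpr, tpr), 3))
```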
Next we plot the confusion matrix using scikit-learn's plotting API:
# extract labels for classes
class_names = churn_data["Churn"].cat.categories
plot_confusion_matrix(
model,
X_valid,
y_valid,
display_labels=class_names,
cmap=plt.cm.Blues,
normalize="true",
)
plt.grid(None)
From the confusion matrix we see that our baseline model correctly identifies only 42% of the customers who churned and misses the remaining 58%. We can do better, but first let's inspect how a single decision tree makes decisions on this data.
Now that we have a baseline model, let's make a single tree so we can gain some insight into how the decisions are being made:
model = RandomForestClassifier(
n_estimators=1, max_depth=3, bootstrap=False, n_jobs=-1, random_state=42
)
model.fit(X_train, y_train)
# get column names
feature_names = X_train.columns
# we need to specify the background color because of a bug in sklearn
fig, ax = plt.subplots(figsize=(30, 10), facecolor="k")
# generate tree plot
plot_tree(model.estimators_[0], filled=True, feature_names=feature_names, ax=ax)
plt.show()
The main difference here compared to our housing example is that the splitting criterion is no longer the mean squared error, but instead is something known as the Gini index:
$$ G = 1 - \sum_{i=1}^n p_i^2 $$
where $p_i$ is the fraction of samples in a node that belong to class $i$ (in our case "Yes" or "No"). For classification tasks, the goal is to find splits that minimise the Gini index of the resulting nodes, which amounts to finding the segments that are most "pure".
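As a quick worked example, a node that is an even 50/50 mix of the two classes has $G = 1 - (0.5^2 + 0.5^2) = 0.5$, while a perfectly pure node has $G = 0$; a tiny helper makes this explicit (illustrative only):

```python
def gini(class_probs):
    """Gini impurity for a list of class probabilities."""
    p = np.asarray(class_probs)
    return 1 - np.sum(p ** 2)

print(gini([0.5, 0.5]))  # 0.5 -> maximally mixed two-class node
print(gini([1.0, 0.0]))  # 0.0 -> perfectly pure node
```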
From the figure, we can already start seeing some features that might be interesting for predicting churn, e.g. TotalCharges
, tenure
, and TechSupport
seem like good indicators.
Our baseline model has an accuracy of 0.771 and ROC AUC score of 0.801 on the validation set. Let's now examine whether we can improve this by tuning the:
- number of trees in the forest
- minimum number of samples per leaf
- maximum number of features per split
In our previous lessons we manually inspected how the performance evolved when we changed these hyperparameters one at a time. Instead we can automate this process using scikit-learn's GridSearchCV
to search for the best combination of hyperparameter values:
from sklearn.model_selection import GridSearchCV
# define range of values for each hyperparameter
param_grid = [
{
"n_estimators": [10, 20, 40, 80, 100],
"max_features": [0.5, 1.0, "sqrt", "log2"],
"min_samples_leaf": [1, 3, 5, 10, 25],
}
]
# instantiate baseline model
model = RandomForestClassifier(n_estimators=10, n_jobs=-1, random_state=42)
# initialise grid search with cross-validation
grid_search = GridSearchCV(
model, param_grid=param_grid, cv=3, scoring="roc_auc", n_jobs=-1
)
%time grid_search.fit(X, y)
Once the search is finished, we can get the best combination of parameters as follows:
best_params = grid_search.best_params_
best_params
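If you are curious how all the other combinations fared, the fitted search also exposes the full cross-validation results, e.g.:

```python
# inspect every hyperparameter combination, sorted by mean cross-validated ROC AUC
cv_results = pd.DataFrame(grid_search.cv_results_)
cols = ["params", "mean_test_score", "std_test_score"]
display_large(cv_results.sort_values("mean_test_score", ascending=False)[cols].head())
```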
Similarly we can get the best model in the search:
best_model = grid_search.best_estimator_
best_model
Let's see how this model performs on our validation set in terms of metrics and the confusion matrix:
print_scores(best_model)
plot_confusion_matrix(
best_model,
X_valid,
y_valid,
display_labels=class_names,
cmap=plt.cm.Blues,
normalize="true",
)
plt.grid(None)
In terms of the AUC score, we see about a 10% boost over our baseline model - not bad! Our confusion matrix has also visibly improved.
As we did in the housing example, we now examine which features were deemed to be important for our Random Forest model:
# expected shape - (n_features, 2)
feature_importance = rf_feature_importance(best_model, X)
# peek at top 10
feature_importance[:10]
plot_feature_importance(feature_importance)
plt.show()
From the plot, it seems that the feature importance flattens out below about 0.01, so let's use 0.008 as a threshold to remove some uninformative features:
feature_importance_threshold = 0.008
cols_to_keep = feature_importance[
feature_importance["Importance"] > feature_importance_threshold
]["Column"]
len(cols_to_keep)
cols_to_keep
Let's retrain with this subset and see whether the scores drop (for reference, we have an accuracy of 0.807 and an AUC of 0.871 on the validation set):
# create a copy of the data with selected columns and create new train / test set
X_keep = X.copy()[cols_to_keep]
X_train, X_valid = train_test_split(X_keep, test_size=0.2, random_state=42)
model = RandomForestClassifier(
n_estimators=best_params["n_estimators"],
    min_samples_leaf=best_params["min_samples_leaf"],
max_features=best_params["max_features"],
n_jobs=-1,
oob_score=True,
random_state=42,
)
model.fit(X_train, y_train)
print_scores(model)
The score has not changed significantly, so let's continue and replot the feature importances:
feature_importance = rf_feature_importance(model, X_keep)
plot_feature_importance(feature_importance)
plt.show()
We can also examine whether one-hot encoding produces a better representation of the data, and hence a better model. As we did in the housing example, we need to create a new feature matrix and target vector, along with a threshold on the maximum number of categories allowed:
X2, y2, nas = proc_df(churn_data, "Churn", max_n_cat=7)
X_train, X_valid, y_train, y_valid = train_test_split(
X2, y2, test_size=0.2, random_state=42
)
model = RandomForestClassifier(
n_estimators=best_params["n_estimators"],
min_samples_leaf=best_params["min_samples_leaf"],
max_features=best_params["max_features"],
n_jobs=-1,
oob_score=True,
random_state=42,
)
model.fit(X_train, y_train)
print_scores(model)
feature_importance = rf_feature_importance(model, X2)
plot_feature_importance(feature_importance[:30])
plt.show()
This encoding actually gives us a slightly better model and has the added bonus of being slightly more interpretable since we can easily read off which characteristics about the customers are influencing the prediction.
Let's trim the features as before and see if we can improve this model further still:
feature_importance_threshold = 0.007
cols_to_keep = feature_importance[
feature_importance["Importance"] > feature_importance_threshold
]["Column"]
len(cols_to_keep)
# create a copy of the data with selected columns and create new train / test set
X_keep = X2.copy()[cols_to_keep]
X_train, X_valid = train_test_split(X_keep, test_size=0.2, random_state=42)
model = RandomForestClassifier(
n_estimators=best_params["n_estimators"],
min_samples_leaf=best_params["min_samples_leaf"],
max_features=best_params["max_features"],
n_jobs=-1,
oob_score=True,
random_state=42,
)
model.fit(X_train, y_train)
print_scores(model)
OK, the model has not improved, but we have managed to reduce the number of features.
feature_importance = rf_feature_importance(model, X_keep)
plot_feature_importance(feature_importance)
plt.show()
We still have a lot of features to contend with, so let's see if any of them are redundant. As in the housing example, we use clustering to work out which pairs of features are "close" in some sense:
plot_dendogram(X_keep)
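Under the hood, a feature dendrogram like this is typically built by clustering the columns on their rank (Spearman) correlations; a rough sketch of the idea, which is an assumption about plot_dendogram's internals rather than its exact implementation:

```python
from scipy.stats import spearmanr
from scipy.spatial.distance import squareform

# pairwise rank correlations between the remaining feature columns
corr = np.round(spearmanr(X_keep).correlation, 4)
# turn similarity into a condensed distance matrix and cluster hierarchically
corr_condensed = squareform(1 - corr, checks=False)
z = hc.linkage(corr_condensed, method="average")
plt.figure(figsize=(14, 8))
hc.dendrogram(z, labels=list(X_keep.columns), orientation="left", leaf_font_size=14)
plt.show()
```

One way to check whether any of these clustered features can be dropped is to compare the model's out-of-bag score with and without them, which is what the get_oob helper below does.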
def get_oob(df):
model = RandomForestClassifier(
n_estimators=best_params["n_estimators"],
min_samples_leaf=best_params["min_samples_leaf"],
max_features=best_params["max_features"],
n_jobs=-1,
oob_score=True,
random_state=42,
)
X, _ = train_test_split(df, test_size=0.2, random_state=42)
model.fit(X, y_train)
return model.oob_score_
get_oob(X_keep)
for c in (
"TechSupport_Yes",
"OnlineSecurity_Yes",
"Contract_Two year",
"tenure",
"OnlineSecurity_No internet service",
"DeviceProtection_No internet service",
"StreamingTV_No internet service",
"InternetService_No",
"TechSupport_No",
"OnlineSecurity_No",
"MonthlyCharges",
"InternetService_Fiber optic",
):
print(c, get_oob(X_keep.drop(c, axis=1)))
cols_to_drop = [
"OnlineSecurity_Yes",
"DeviceProtection_No internet service",
"StreamingTV_No internet service",
"OnlineSecurity_No",
]
get_oob(X_keep.drop(cols_to_drop, axis=1))
Good, our OOB accuracy has not changed and we've managed to remove a few features.
X_keep.drop(cols_to_drop, axis=1, inplace=True)
X_train, X_valid = train_test_split(X_keep, test_size=0.2, random_state=42)
model = RandomForestClassifier(
n_estimators=best_params["n_estimators"],
min_samples_leaf=best_params["min_samples_leaf"],
max_features=best_params["max_features"],
n_jobs=-1,
oob_score=True,
random_state=42,
)
model.fit(X_train, y_train)
print_scores(model)
feature_importance = rf_feature_importance(model, X_keep)
plot_feature_importance(feature_importance)
plt.show()
One thing we consistently see in our feature importance is the role of the tenure
attribute. One way to gain a deeper insight into this attribute is to consider converting it to a categorical variable. Continuous data is often discretised or separated into "bins" for analysis. This is typically done when you want to group people into discrete age buckets or in the case of our data, discrete tenure
buckets. Let's have a look at the distribution of tenure
values to get an idea about how to discretise the data:
sns.distplot(X_keep["tenure"], kde=False)
plt.show()
From the plot, we could reasonably divide the tenure
values into bins of 0 to 6, 7 to 20, 21 to 40, 41 to 60, and finally 61 and over. To do so we use the cut
function from pandas:
bins = [0, 6, 20, 40, 60, 100]
group_names = ["<=6", "7-20", "21-40", "41-60", "61+"]
X_keep["tenure_categories"] = pd.cut(
X_keep["tenure"], bins, labels=group_names, include_lowest=True
)
Note that the object pandas returns is also a special Categorical
object. We can also examine the value counts:
X_keep["tenure_categories"].value_counts()
Finally let's drop the tenure
column since it's redundant:
X_keep.drop(["tenure"], axis=1, inplace=True)
Since tenure_categories is a categorical rather than a numeric column, we also need to one-hot encode it like the rest of our features:
X_keep = pd.get_dummies(X_keep)
Finally we train our model:
X_train, X_valid = train_test_split(X_keep, test_size=0.2, random_state=42)
model = RandomForestClassifier(
n_estimators=1000,
min_samples_leaf=best_params["min_samples_leaf"],
max_features=best_params["max_features"],
n_jobs=-1,
oob_score=True,
random_state=42,
)
model.fit(X_train, y_train)
print_scores(model)
Although categorising the tenure
feature did not boost performance, it does help considerably with the interpretation:
feature_importance = rf_feature_importance(model, X_keep)
plot_feature_importance(feature_importance)
plt.show()
From the plot we can see that the following 3 features are the most important indicators of churn:
- Customers with month to month contracts
- Customers who have been with the Telco for less than 6 months
- Customers who do not have tech support
From our analysis we could advise the client to focus on customers with these characteristics and thereby minimise the risk that they churn.
Let's wrap up by plotting the confusion matrix of our final model:
plot_confusion_matrix(
model,
X_valid,
y_valid,
display_labels=class_names,
cmap=plt.cm.Blues,
normalize="true",
)
plt.grid(None)
From this figure we can conclude that:
- Of the customers who actually churn, our model correctly identifies 51% of them (the true positive rate)
- Of the customers who do not churn, our model correctly identifies 91% of them (the true negative rate)
Exercise
Use the techniques in this lesson to solve Kaggle's famous Titanic Challenge. As in the housing challenge from lesson 6, the goal is not to build a perfect model, but to get familiar with building models on new datasets.