Introduction to dimensionality reduction and visualisation.


Learning objectives

In this lecture we will have a look at dimensionality reduction and data visualisation. The feature vector of a dataset can contain many features, and each feature corresponds to its own dimension. In the first part of the notebook we will explore a method to reduce the dimensionality of a dataset. In the second part we will look at an algorithm that extracts information from high-dimensional data and displays its structure in two dimensions. Both methods fall into the category of unsupervised algorithms. The learning objectives are to be able to answer the following questions:

  • What is dimensionality reduction?
  • When is dimensionality reduction useful?
  • What is topology and how is it connected to Mapper?
  • What are the steps involved in Mapper?

References

Homework

As homework read the references, work carefully through the notebook and solve the exercises. This lecture covers several complex topics and it is important that you familiarise yourself by experimenting with the notebook.

Part 1: Principal component analysis

First we have a look at a method to reduce the number of dimensions of a dataset. There are two main reasons why we would like to do this:

  1. Some machine learning algorithms struggle with high dimensional data.
  2. It is hard to visualise high-dimensional data: in practice it is difficult to plot datasets with more than two or three dimensions.

There is one very common approach to reduce the dimensionality of data: principal component analysis (PCA). The idea is that not all axes contain the same amount of variance, and that there are even combinations of axes that capture most of the variance. PCA seeks new coordinate axes such that the variance along each axis is maximised and ordered (the first principal component contains the most variance).

Figure: PCA can look intimidating at the beginning.
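To make this concrete, here is a tiny synthetic sketch (not part of the housing analysis that follows): two strongly correlated features collapse almost entirely onto the first principal axis.

# A tiny illustration with two strongly correlated features: nearly all of the
# variance ends up along the first principal component.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
x = rng.randn(500)
X_toy = np.c_[x, x + 0.1 * rng.randn(500)]  # second feature is a noisy copy of the first

print(PCA(n_components=2).fit(X_toy).explained_variance_ratio_)
# expected: roughly [0.998, 0.002] -- the first axis captures nearly all the variance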

Imports

# data wrangling
import pandas as pd
import numpy as np
from dslectures.core import get_dataset, convert_strings_to_categories, rmse, fill_missing_values_with_median
from pathlib import Path
from tqdm import tqdm

# data viz
import matplotlib.pyplot as plt
import matplotlib as mpl
import matplotlib.image as mpimg
import seaborn as sns

sns.set(color_codes=True)
sns.set_palette(sns.color_palette("muted"))

# ml magic
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_validate

#dslecture
from dslectures.core import rmse
# uncomment if running on Binder / your laptop
# !pip install --upgrade dslectures

Load the data

We will use the processed housing dataset for the principal component analysis. First we have to load it.

get_dataset('housing_processed.csv')
Dataset already exists at '../data/housing_processed.csv' and is not downloaded again.
data_path = Path('../data/')
housing_data = pd.read_csv(data_path/'housing_processed.csv')
housing_data.head()
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value city postal_code rooms_per_household bedrooms_per_household bedrooms_per_room population_per_household ocean_proximity_INLAND ocean_proximity_<1H OCEAN ocean_proximity_NEAR BAY ocean_proximity_NEAR OCEAN ocean_proximity_ISLAND
0 -122.23 37.88 41.0 880.0 129.0 322.0 126.0 8.3252 452600.0 69 94705 6.984127 1.023810 0.146591 2.555556 0 0 1 0 0
1 -122.22 37.86 21.0 7099.0 1106.0 2401.0 1138.0 8.3014 358500.0 620 94611 6.238137 0.971880 0.155797 2.109842 0 0 1 0 0
2 -122.24 37.85 52.0 1467.0 190.0 496.0 177.0 7.2574 352100.0 620 94618 8.288136 1.073446 0.129516 2.802260 0 0 1 0 0
3 -122.25 37.85 52.0 1274.0 235.0 558.0 219.0 5.6431 341300.0 620 94618 5.817352 1.073059 0.184458 2.547945 0 0 1 0 0
4 -122.25 37.85 52.0 1627.0 280.0 565.0 259.0 3.8462 342200.0 620 94618 6.281853 1.081081 0.172096 2.181467 0 0 1 0 0
housing_data.shape
(19443, 20)
housing_data.describe()
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value city postal_code rooms_per_household bedrooms_per_household bedrooms_per_room population_per_household ocean_proximity_INLAND ocean_proximity_<1H OCEAN ocean_proximity_NEAR BAY ocean_proximity_NEAR OCEAN ocean_proximity_ISLAND
count 19443.000000 19443.000000 19443.000000 19443.000000 19443.000000 19443.000000 19443.000000 19443.000000 19443.000000 19443.000000 19443.000000 19443.000000 19443.000000 19443.000000 19443.000000 19443.000000 19443.000000 19443.000000 19443.000000 19443.000000
mean -119.560363 35.646739 28.435118 2617.678548 538.136964 1442.129970 501.427352 3.675099 191793.406162 541.629224 93030.145605 5.340245 1.091741 0.214812 3.095953 0.331482 0.439953 0.106774 0.121535 0.000257
std 2.002697 2.145335 12.504584 2179.553070 420.168532 1140.254218 383.064222 1.569687 96775.724042 260.704512 1853.684352 2.190405 0.429728 0.056667 10.679036 0.470758 0.496394 0.308833 0.326756 0.016035
min -124.350000 32.550000 1.000000 2.000000 2.000000 3.000000 2.000000 0.499900 14999.000000 1.000000 85344.000000 0.846154 0.333333 0.100000 0.750000 0.000000 0.000000 0.000000 0.000000 0.000000
25% -121.760000 33.930000 18.000000 1438.500000 299.000000 799.000000 282.000000 2.526500 116700.000000 328.000000 91706.000000 4.412378 1.006140 0.177906 2.449692 0.000000 0.000000 0.000000 0.000000 0.000000
50% -118.490000 34.260000 29.000000 2111.000000 436.000000 1181.000000 411.000000 3.446400 173400.000000 545.000000 92860.000000 5.180451 1.048276 0.204545 2.841155 0.000000 0.000000 0.000000 0.000000 0.000000
75% -117.990000 37.730000 37.000000 3119.000000 644.000000 1746.500000 606.000000 4.579750 247100.000000 770.000000 94606.000000 5.963796 1.097701 0.240414 3.308208 1.000000 1.000000 0.000000 0.000000 0.000000
max -114.310000 41.950000 52.000000 39320.000000 6445.000000 35682.000000 6082.000000 15.000100 499100.000000 977.000000 96161.000000 132.533333 34.066667 1.000000 1243.333333 1.000000 1.000000 1.000000 1.000000 1.000000

We drop the two columns that contain categorical data with many categories. We could create an extra column for each category, but since there are many categories that would create a lot of extra features.

housing_data.drop(['city', 'postal_code'], axis=1, inplace=True)

Then we split the data into features and labels as usual:

X = housing_data.drop('median_house_value', axis=1)
y = housing_data['median_house_value']
feature_labels = X.columns

Baseline

First we train a Random Forest with the settings we tuned in lesson 6 as a baseline for the models that follow.

rf = RandomForestRegressor(n_estimators=100, max_features='sqrt', n_jobs=-1, random_state=42)
                             
results = cross_validate(rf, X, y,
                         cv=5,
                         return_train_score=True,
                         scoring='neg_root_mean_squared_error')
rmse_full = -np.mean(results["test_score"])

We get an RMSE of roughly 60'000:

print(f'RMSE: {rmse_full}')
RMSE: 60301.958010516806

Standardisation

PCA works by finding new axes in the dataset that cover the most variance. Since the variance is the average squared distance of each sample to the sample mean, it depends on the scale of the feature. To illustrate this, think of two columns that contain a person's height, one measured in centimetres and one in metres. Although the heights described in the two columns are the same, the variance will be 10'000 times larger in the centimetre column (100^2) and the standard deviation 100 times larger.

rand_values = np.random.randn(1000)
rand_values.var()
0.8323069928850167
(100*rand_values).var()
8323.069928850167

For this reason it is important to scale the features such that they are comparable to each other. A common approach to do this is to transform the data such that the mean is zero and the standard deviation is one. The following formula achieves this:

$x_{std}=\frac{x_{old}-\mu}{\sigma}$

Scikit-learn provides a transformer to do this for us called StandardScaler. We can do this in one line with the method .fit_transform(). Similar to fit_predict(), it combines fitting, which means calculating the mean and standard deviation of the dataset, with transforming the data by applying the formula described above.

X_std = StandardScaler().fit_transform(X)
X_std.shape
(19443, 17)

We can see that the mean is zero (or close enough) and the standard deviation is one:

X_std.mean(axis=0)
array([-4.44386137e-16,  2.19854194e-15,  9.35549763e-17, -2.48505406e-17,
       -1.02325755e-16,  7.74752147e-17, -1.24252703e-17, -1.46179650e-16,
       -9.94021623e-17,  1.57874022e-16,  7.01662322e-17,  2.92359301e-17,
        1.60797615e-16,  1.43256057e-16,  6.13954532e-17, -3.14286248e-17,
        1.16943720e-17])
X_std.std(axis=0)
array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])
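As a quick sanity check we can reproduce the same result by applying the formula above directly to the DataFrame (a small sketch using the objects defined so far):

# Standardise manually: subtract the column mean and divide by the column
# standard deviation (ddof=0 to match StandardScaler's population std).
X_manual = (X - X.mean()) / X.std(ddof=0)
np.allclose(X_manual.values, X_std)  # should evaluate to True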

Principal component analysis

Now we apply principal component analysis to the standardised features. After initialising a PCA object we can use .fit_transform() to find the principal components and transform the dataset into these new coordinates.

pca = PCA()
X_pca = pca.fit_transform(X_std)

We can see that the transformed dataset still has the same shape. However, the columns are no longer the original features but the coordinates in the principal component system.

X_pca.shape
(19443, 17)

Explained variance

The explained variance is an important concept in principal component analysis. It tells us how much of the dataset's total variance is contained along one principal component axis. A component with a lot of variation can encode more information than one with little variation. The explained variance is stored in the pca object and we can use it to visualise its distribution. The attribute explained_variance_ratio_ tells us the percentage of the total variance that each component contains.

pca.explained_variance_ratio_
array([2.32152692e-01, 1.58210508e-01, 1.27146991e-01, 1.00792055e-01,
       7.93230339e-02, 7.26018573e-02, 5.93200071e-02, 5.87781344e-02,
       4.47812090e-02, 3.66692458e-02, 1.63533745e-02, 7.52994128e-03,
       2.65596682e-03, 1.50076877e-03, 1.26555424e-03, 9.18660939e-04,
       1.01541990e-32])
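The ratio is simply each component's variance divided by the total variance, which we can verify with the fitted pca object (a quick sketch):

# explained_variance_ratio_ is the per-component variance normalised by the
# total variance over all components.
np.allclose(pca.explained_variance_ratio_,
            pca.explained_variance_ / pca.explained_variance_.sum())  # should be True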
pca_labels = [f'principal component {i+1}' for i in range(len(pca.explained_variance_ratio_))]
barplot = sns.barplot(x=pca_labels, y=pca.explained_variance_ratio_)
barplot.set_xticklabels(pca_labels, rotation=90)
plt.title('explained variance ratio')
plt.show()

We can see that the first three components together already account for roughly half of the total variance in the dataset. We can visualise this more systematically by plotting the cumulative sum of the explained variance ratios:

sns.lineplot(np.arange(len(pca.explained_variance_ratio_)), np.cumsum(pca.explained_variance_ratio_))
plt.ylabel('total explained variance')
plt.xlabel('number of included PCA components')
plt.show()

We see that with 8 of the 17 axes we already explain close to 90% of the variance in the dataset and that after 10 we almost reach 100%. That means the last few components probably don't contain much information and can be discarded.
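Instead of reading the number of components off the plot, we can also let scikit-learn choose it for us: passing a float between 0 and 1 as n_components keeps just enough components to explain that fraction of the variance. A short sketch:

# Keep just enough principal components to explain 90% of the total variance.
pca_90 = PCA(n_components=0.9)
X_pca_90 = pca_90.fit_transform(X_std)
X_pca_90.shape  # based on the ratios above this should keep roughly 9 components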

PCA vector composition

Naturally, we would like to know what these principal components mean. For example, the first component contains roughly 23% of the total variance. What information is stored in that component? The pca object also contains the attribute components_, which defines the transformation between the original dataset and the transformed one. We can print a few of the components and interpret them:

def print_tabular(labels, values):
    df = pd.DataFrame(data=values, columns=labels)
    display(df.T)
print('The first three principal components:')
print_tabular(feature_labels, pca.components_[:3])
The first three principal components:
0 1 2
longitude 0.096240 -0.440492 0.279899
latitude -0.093538 0.511034 -0.231581
housing_median_age -0.228990 -0.102574 -0.203651
total_rooms 0.480823 0.110303 0.003569
total_bedrooms 0.481068 0.046098 -0.114091
population 0.466057 -0.002745 -0.121771
households 0.482901 0.023411 -0.149648
median_income 0.083829 0.066170 0.381905
rooms_per_household 0.029542 0.299368 0.519937
bedrooms_per_household 0.006354 0.204508 0.343374
bedrooms_per_room -0.029537 -0.230823 -0.379778
population_per_household -0.002497 -0.003664 0.003476
ocean_proximity_INLAND -0.010714 0.326173 0.037222
ocean_proximity_<1H OCEAN 0.054867 -0.404136 0.152350
ocean_proximity_NEAR BAY -0.067325 0.233822 -0.269103
ocean_proximity_NEAR OCEAN -0.003977 -0.076523 -0.030817
ocean_proximity_ISLAND -0.006251 -0.009071 0.001835

The way to read this table is to think of it as a recipe to construct the first three principal component coordinates from the original features. Given a new data point we can, for example, calculate the first principal component by multiplying the longitude by its value in column 0, adding the latitude multiplied by its value in column 0, and so on. This weighted sum of the original features yields the first coordinate in the PCA coordinate system.

$x_{PC,i} = \sum_{j=0}^{n-1} w_{i,j} \cdot x_{feat,j}$

From the table we can see that the first component mainly consists of the features total_rooms, total_bedrooms, population, and households:

$x_{PC,0} = 0.096 \cdot longitude + (-0.093) \cdot latitude + (-0.229) \cdot housing\_median\_age + \dots$
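We can verify this recipe directly: multiplying the standardised features by the weights of the first component reproduces the first PCA coordinate (a small sketch using the arrays defined above):

# The first PCA coordinate is the weighted sum of the standardised features,
# with the weights given by the first row of pca.components_.
pc0_manual = X_std @ pca.components_[0]
np.allclose(pc0_manual, X_pca[:, 0])  # should evaluate to True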

Visualisation

We can now use the principal components to visualise the data. Let's have a closer look at the first two components.

sns.scatterplot(X_pca[:,0], X_pca[:,1], alpha=0.1)
plt.show()
jplot = sns.jointplot(X_pca[:,0],  X_pca[:,1], kind="hex", color="#4CB391",xlim=[-5,10], ylim=[-5,10])
jplot.set_axis_labels('Principal component 1', 'Principal component 2')
<seaborn.axisgrid.JointGrid at 0x122ee7a58>

We observe that there seem to be two visible clusters along these two axes. We could now investigate these clusters further and might find some property of the dataset that is useful to us or our customer. Let's do it!

Exercise 1

Create plots of other principal components and investigate if you find any interesting features.

Advanced: Cluster investigation

In this section we have a closer look at the clusters we observed in the previous plot. The clustering seems to happen mainly along the second principal component. Let's plot a fine-grained histogram of that component:

plt.hist(X_pca[:,1], bins=np.linspace(-5, 5, 50))
plt.show()

We can see that the clusters seem to be separated around -0.5. We can create a mask for the data points that are below that threshold.

left_group = X_pca[:,1]<-0.5

Furthermore, we only want to investigate the features that contribute to the second principal component. Therefore, we set another threshold for the minimum absolute feature coefficient.

pc1_features = list(feature_labels[np.abs(pca.components_[1])>0.2])

These are the features that contribute most to the second principal component:

pc1_features
['longitude',
 'latitude',
 'rooms_per_household',
 'bedrooms_per_household',
 'bedrooms_per_room',
 'ocean_proximity_INLAND',
 'ocean_proximity_<1H OCEAN',
 'ocean_proximity_NEAR BAY']

Unfortunately, the StandardScaler transformed the DataFrame X into a NumPy array X_std. Let's turn it back into a DataFrame:

X_std = pd.DataFrame(data=X_std, columns=feature_labels)

With loc and the mask we just created we can now slice out the datapoints of the left and right group. We have a look at the statistics of the relevant features of the two groups with .describe().

X_std.loc[left_group, pc1_features].describe()
longitude latitude rooms_per_household bedrooms_per_household bedrooms_per_room ocean_proximity_INLAND ocean_proximity_<1H OCEAN ocean_proximity_NEAR BAY
count 8602.000000 8602.000000 8602.000000 8602.000000 8602.000000 8602.000000 8602.000000 8602.000000
mean 0.767273 -0.858163 -0.236650 -0.099011 0.289184 -0.630324 0.729185 -0.345365
std 0.374053 0.273084 0.523087 0.205740 1.184709 0.389128 0.802984 0.034913
min -1.577731 -1.443513 -2.051770 -1.764901 -1.787272 -0.704163 -0.886320 -0.345741
25% 0.619362 -0.902791 -0.611887 -0.208570 -0.519333 -0.704163 1.128260 -0.345741
50% 0.709243 -0.804902 -0.279346 -0.111908 0.105280 -0.704163 1.128260 -0.345741
75% 0.874025 -0.734981 0.105771 -0.012303 0.870705 -0.704163 1.128260 -0.345741
max 2.501871 1.367309 1.901482 2.484505 13.856510 1.420126 1.128260 2.892336
X_std.loc[~left_group, pc1_features].describe()
longitude latitude rooms_per_household bedrooms_per_household bedrooms_per_room ocean_proximity_INLAND ocean_proximity_<1H OCEAN ocean_proximity_NEAR BAY
count 10841.000000 10841.000000 10841.000000 10841.000000 10841.000000 10841.000000 10841.000000 10841.000000
mean -0.608807 0.680926 0.187775 0.078563 -0.229459 0.500143 -0.578586 0.274036
std 0.919116 0.828470 1.223449 1.321400 0.748986 1.052636 0.724778 1.273936
min -2.391655 -1.406222 -1.655426 -1.667938 -2.026123 -0.704163 -0.886320 -0.345741
25% -1.298101 0.299849 -0.246291 -0.191329 -0.724091 -0.704163 -0.886320 -0.345741
50% -0.943570 0.929138 0.056233 -0.092209 -0.309306 1.420126 -0.886320 -0.345741
75% -0.034773 1.143562 0.386952 0.040083 0.146278 1.420126 -0.886320 -0.345741
max 2.621713 2.938200 58.069790 76.736388 6.038055 1.420126 1.128260 2.892336

Just looking at the feature means of the two groups we see that there seems to be a significant difference in almost all features. To get a more detailed view, let's compare the histograms of each feature:

for col in pc1_features:
    bins = np.linspace(-3, 3, 20)
    fig, ax = plt.subplots(1,1, figsize=(10,5))
    X_std.loc[left_group, col].hist(ax=ax, bins=bins, color='C0', alpha=0.5, label='Group 1')
    X_std.loc[~left_group, col].hist(ax=ax, bins=bins, color='C1', alpha=0.5, label='Group 2')
    plt.legend(loc='best')
    plt.title(col)
    plt.show()

We can see that the most significant difference is in the longitude and latitude. Another interesting observation is that in the first group (blue) the rooms per household seem to be slightly lower and the bedrooms per room slightly higher. We will discuss this further below. First we want to look at the longitude and latitude distributions of the two groups. We start with a hexplot to see the hotspots.

import matplotlib.gridspec as gridspec

print("Group 1")
sns.jointplot(
    kind='hex',
    x="longitude",
    y="latitude",
    cmap='magma',

    data=X.loc[left_group],
)
plt.show()

print("Group 2")
sns.jointplot(
    kind='hex',
    x="longitude",
    y="latitude",
    cmap='magma',
    data=X.loc[~left_group],
)
plt.show()
Group 1
Group 2

We see that the first group has mainly two hotspots, while the hotspots of the second group are spread more widely. Recall that we can plot the longitude and latitude on top of a map. Let's investigate what these hot regions correspond to.

# ref - www.kaggle.com/camnugent/geospatial-feature-engineering-and-visualization/
california_img=mpimg.imread('images/california.png')
fig, axes = plt.subplots(1, 2, figsize=(12,6))

fig = sns.scatterplot(
    x="longitude",
    y="latitude",
    data=X.loc[left_group],
    alpha=0.1,
    s=1,
    color="r",
    edgecolor="none", 
    ax=axes[0]
)

axes[0].imshow(california_img, extent=[-124.55, -113.80, 32.45, 42.05], alpha=0.5)
axes[0].set_title("Group 1")
fig = sns.scatterplot(
    x="longitude",
    y="latitude",
    data=X.loc[~left_group],
    alpha=0.05,
    s=3,
    color="r",
    edgecolor="none",
    ax=axes[1]

)
axes[1].imshow(california_img, extent=[-124.55, -113.80, 32.45, 42.05], alpha=0.5)
axes[1].set_title("Group 2")

plt.show()

Comparing to a satellite image of California (see below) we see that the two hotspots in the first group correspond to Los Angeles and San Diego. The second group mainly includes San Francisco, Silicon Valley, Sacramento, Fresno and Bakersfield. What is interesting to observe is that the second group also includes Riverside, which is right next to Los Angeles. That means the second principal component does not simply encode the diagonal along the coast but specifically Los Angeles and San Diego. If one wanted to build a classifier to determine whether a house is in Los Angeles or another region of California, one would only need to look at the second principal component of the dataset.

Circling back to the observation about the rooms per household: it seems that in Los Angeles the apartments have slightly fewer rooms than in the rest of California but a similar number of bedrooms, hence the higher number of bedrooms per room. In other words, in Los Angeles you get fewer extra rooms per household.
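We can check this directly on the unstandardised features with the mask from above (a quick sketch; the exact numbers are left for you to inspect):

# Compare the group means of the room-related features on the original scale.
pd.DataFrame({
    'Group 1': X.loc[left_group, ['rooms_per_household', 'bedrooms_per_household', 'bedrooms_per_room']].mean(),
    'Group 2': X.loc[~left_group, ['rooms_per_household', 'bedrooms_per_household', 'bedrooms_per_room']].mean(),
})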

Figure: Satellite image of California from Google Maps.

Finally, we can also look at the median value of the houses in the two groups:

bins=50
fig, ax = plt.subplots(1,1, figsize=(10,5))
y.loc[left_group].hist(ax=ax, bins=bins, color='C0', alpha=0.5, label='Group 1')
y.loc[~left_group].hist(ax=ax, bins=bins, color='C1', alpha=0.5, label='Group 2')
plt.legend(loc='best')
plt.title('median value')
plt.show()

One observes that the median value is higher for the first group (Los Angeles & San Diego). However, the long tail in both groups is quite similar. So living seems to be more expensive for low- and medium-income people, while at the upper end of the market the prices seem to be comparable.

Compressed model

Now we want to use PCA to build a model that is more compact than the one that uses all features. We add one principal component after the other and measure how well a Random Forest performs with each subset of features. We do this in our standard cross-validation loop.

rmse_valid =[]
for i in tqdm(range(1, 18)):
    rf = RandomForestRegressor(n_estimators=100, max_features='sqrt', n_jobs=-1, random_state=42)
                             
    results = cross_validate(rf, X_pca[:, :i], y, cv=5,
                             return_train_score=True,
                             scoring='neg_root_mean_squared_error')
    rmse_valid.append(-np.mean(results["test_score"]))
100%|██████████| 17/17 [03:17<00:00, 11.60s/it]

As an additional baseline we also create a median regressor:

rmse_median = rmse(y, [np.median(y)]*len(y))

Now we can plot the performance for all subsets of components, with the full Random Forest and the median regressor as baselines.

lineplot = sns.lineplot(np.arange(len(rmse_valid))+1 , rmse_valid, label='pca')
lineplot.axhline(rmse_median, c='y', linestyle='--', label='median')
lineplot.axhline(rmse_full, c='r', linestyle='--', label='full model')
plt.legend(loc='best')
plt.xlabel('number of PCA components')
plt.ylabel('RMSE')
Text(0, 0.5, 'RMSE')

So we can see that already with 4-7 principal components we get results that are quite close to the full model. As mentioned previously this has two advantages:

  1. Visualising 4 components is still hard but feasible compared to 17.
  2. Training a Random Forest with half the features is twice as fast for training and inference.

Looking at the graph we see that there are several steep steps. It seems that adding these components improved the performance significantly. By taking the components with the biggest steps we can try to build a better selection of features:

features = [0,1,3,6,10]
rf = RandomForestRegressor(n_estimators=100, max_features='sqrt', n_jobs=-1, random_state=42)
                             
results = cross_validate(rf, X_pca[:, features], y, cv=5,
                         return_train_score=True,
                         scoring='neg_root_mean_squared_error')
rmse_pca = -np.mean(results["test_score"])
rmse_pca
64700.242256282574

With just 5 of the 17 features we get an RMSE that is less than 10% worse than the model that used all features. This model is 3-4 times faster to train and to make predictions. In practice there are more systematic ways of finding the best subset of features, such as forward and backward selection (see link).
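As an illustration, here is a hedged sketch of forward selection with scikit-learn's SequentialFeatureSelector (available in scikit-learn 0.24 and later). Note that it refits the model many times, so this can take a while; the estimator settings below are just an example.

# Greedily add the principal component that improves the cross-validated score
# the most, until 5 components are selected.
from sklearn.feature_selection import SequentialFeatureSelector

selector = SequentialFeatureSelector(
    RandomForestRegressor(n_estimators=20, max_features='sqrt', n_jobs=-1, random_state=42),
    n_features_to_select=5,
    direction='forward',
    cv=3,
)
selector.fit(X_pca, y)
print(np.where(selector.get_support())[0])  # indices of the selected components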

Exercise 2

Make your own selection of principal components (e.g. the first five) and see if you can beat the selection above.

Part 2: The Mapper algorithm

What is topology?

Topology is the mathematical field that studies shapes, where one aims at measuring and categorising different bodies. One way to do this is to look at the number of holes an object has. In this regard a coffee mug is equivalent to a donut in the sense that they both have exactly one hole. A deeper understanding of the shape of our dataset can help us find new, interesting properties in it. The Mapper algorithm aims at doing just that: within the unstructured data points of our dataset it finds links and clusters and helps us visualise them in two-dimensional space. This has been an exciting research area over the past few years with many applications.

Figure: A coffee mug is a donut.

Mapper overview

Mapper uses four steps to analyse the topological structure of a dataset. These four steps are:

Filtering/projecting: The high-dimensional data is passed through a function (called the filter function) that transforms the data into a lower-dimensional space. An example of such a transformation is projecting onto an axis ($[x_1, x_2, x_3, \dots] \rightarrow [x_1]$). In general this could be any function that takes a vector and produces a vector of lower dimension.

Covering: Now this lower-dimensional space is covered with intervals. It is important that these intervals overlap a little, as we will see when we go from clustering to constructing the network. In two dimensions such a cover could consist of overlapping square patches. These patches divide the dataset into subsets.

Clustering: Now we go back to the original space but keep the groupings from the covering. For each subset from the step before we now apply a clustering algorithm and store the clusters.

Constructing the topological network: Finally, we set up the topological network. To do so we create a node for each cluster we found in the previous step. Since we set up the patches such that they overlap, a single data point can belong to two clusters. If that is the case, we create a link between the two corresponding nodes (see the small sketch below).
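To make the last step concrete, here is a tiny self-contained sketch with made-up cluster memberships (this is not giotto-tda's implementation, just an illustration of the rule):

# Each cluster is a set of data point indices; clusters become nodes, and two
# nodes are linked whenever their clusters share at least one data point.
clusters = {"A": {0, 1, 2}, "B": {2, 3, 4}, "C": {5, 6}}
edges = [
    (u, v)
    for u in clusters
    for v in clusters
    if u < v and clusters[u] & clusters[v]
]
print(edges)  # [('A', 'B')] -- cluster 'C' stays an isolated node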

Figure: Schematic overview of Mapper [1].

[1] Liao, Tianhua, Yuchen Wei, Mingjing Luo, Guo-Ping Zhao, and Haokui Zhou. "tmap: an integrative framework based on topological data analysis for population-scale microbiome stratification and association studies." Genome Biology 20 (2019). doi:10.1186/s13059-019-1871-4

Giotto TDA

The application of topological methods to data is called topological data analysis (TDA). A Swiss company called L2F developed a scikit-learn compatible library for TDA called giotto-tda, and Lewis Tunstall is a core contributor. In the following section we will use giotto-tda to analyse a small toy dataset and give some examples of interesting applications.

Imports

# TDA magic
from gtda.mapper import (
    CubicalCover,
    make_mapper_pipeline,
    Projection,
    plot_static_mapper_graph,
    plot_interactive_mapper_graph,
)
from gtda.mapper.utils.visualization import set_node_sizeref

# ML tools
from sklearn import datasets
from sklearn.cluster import DBSCAN
from sklearn.decomposition import PCA

Helper functions

def plot_cover(cover, ax, every_n=1):
    """Plot cover intervals."""
    x_interval = list([i,j] for i, j in zip(cover._coverers[0].left_limits_,cover._coverers[0].right_limits_))
    y_interval = list([i,j] for i, j in zip(cover._coverers[1].left_limits_, cover._coverers[1].right_limits_))
            
    normalize_interval(x_interval)
    normalize_interval(y_interval)
    
    plot_2d_intervals(x_interval, y_interval, ax, every_n=every_n)
def normalize_interval(interval, x_min=-1.2, x_max=1.2):
    """Clip all values in interval to x_min and x_max."""
    for i in range(len(interval)):
        interval[i][0] = min(max(interval[i][0], x_min), x_max)
        interval[i][1] = min(max(interval[i][1], x_min), x_max)
def plot_2d_intervals(x_intervals, y_intervals, ax, cmap_name='magma', every_n=1):
    """Plot a set of 2d-intervals with rectangular boxes."""
    cmap = mpl.cm.get_cmap(cmap_name)

    
    for i, x_inter in enumerate(x_intervals):
        for j, y_inter in enumerate(y_intervals):
            if (i+j)%every_n==0:
                origin = (x_inter[0], y_inter[0])
                width = x_inter[1]-x_inter[0]
                height = y_inter[1]-y_inter[0]
                #print(origin, width, height)
                rect = plt.Rectangle(origin, width, height, facecolor=cmap(np.random.rand()), alpha=0.5, edgecolor=None)
                ax.add_patch(rect)

Dataset: two circles

Our toy dataset consists of two concentric circles. From a topological standpoint these are two separate objects with one hole each. In the end we will see that this is exactly the structure Mapper recovers.

data, _ = datasets.make_circles(n_samples=5000, noise=0.05, factor=0.3, random_state=42)
plt.figure(figsize=(10,10))
plt.scatter(data[:,0], data[:,1], c=np.random.rand(data.shape[0]), cmap='magma')
plt.show()

Projection

Since we are already in a low-dimensional space we do not need to transform the data further. Therefore, we project onto the existing axes, which does not change the data:

# Define filter function - can be any scikit-learn transformer
filter_func = Projection(columns=[0, 1])

Cover

Now we need to cover the space of our dataset with little patches. To do so we can use the CubicalCover class. We just need to set how many intervals we want per dimension and how much they should overlap. Then we can fit it with the same scikit-learn fit interface we already know.

cover = CubicalCover(n_intervals=10, overlap_frac=0.3)
cover.fit(data)
CubicalCover(kind='uniform', n_intervals=10, overlap_frac=0.3)

We can plot the patches on top of the dataset to see how they cover the space:

fig, ax = plt.subplots(figsize=(10,10))
plt.scatter(data[:,0], data[:,1], c='black', alpha=0.3)
plot_cover(cover, ax, every_n=1)
plt.show()

With so many overlapping patches it is a bit hard to see what is going on. Therefore, we plot only every other patch to see what the patches actually look like:

fig, ax = plt.subplots(figsize=(10,10))
plt.scatter(data[:,0], data[:,1], c='black', alpha=0.3)
plot_cover(cover, ax, every_n=2)
plt.show()

We can see that the patches overlap each other by about a third, which is what we would expect based on the initial setting. On each patch we will now run a clustering algorithm.

Clustering

Before we can construct the graph we need to cluster each patch. In the previous lesson we learned about k-means, which is a simple clustering algorithm; however, one needs to specify the expected number of clusters. There is another powerful clustering algorithm that does not require that parameter since it works with densities instead: Density-Based Spatial Clustering of Applications with Noise (DBSCAN).

Figure reference: https://github.com/NSHipster/DBSCAN

The algorithm for DBSCAN is the following:

  1. Start with a point that is not yet part of a cluster.
  2. Find all neighbours that are within a distance of at most R.
  3. If there are at least n neighbours, they form a cluster. Otherwise the point is an outlier; go back to 1.
  4. For each new point in the cluster repeat step 2 until no new connected points can be found. Then go back to 1.
  5. When all points are either in a cluster or outliers without enough neighbours, you are done.

The maximal distance and the minimal number of neighbours define a minimal density for a cluster, hence the name.

clusterer = DBSCAN()
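As an aside, here is a minimal sketch of how these two parameters (eps, which plays the role of R, and min_samples, which plays the role of n) behave when DBSCAN is applied directly to the two-circles data; the concrete values are just an example.

# Cluster the raw toy data with DBSCAN; points labelled -1 are outliers.
labels = DBSCAN(eps=0.1, min_samples=5).fit_predict(data)
print(np.unique(labels, return_counts=True))  # expect two large clusters, one per circle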

Network

Now we have everything in place to run all the steps and construct the topological network. We create a pipe object that combines all steps.

pipe = make_mapper_pipeline(
    filter_func=filter_func,
    cover=cover,
    clusterer=clusterer,
    verbose=False,
    n_jobs=1,
)

Finally, we can visualise the results. We can see that the two rings are correctly reconstructed in the topological network:

fig = plot_static_mapper_graph(pipe, data)
fig.show(config={'scrollZoom': True})

Exercise 3

Try different settings for the cover (no overlap, more/less intervals) and observe how the graph changes. Can you explain why?

Application: Breast cancer

A team of biologists and mathematicians applied the Mapper algorithm to genomic breast cancer data. Looking at the topological structure of the data they found a previously unknown subgroup of breast cancer. Patients on the newly discovered branch, called c-MYB+, show a 100% survival rate without metastasis, whereas the other branch is deadly. Clinicians can now use the discovered genetic profile to test patients and adapt the treatment accordingly. This highlights the impact modern data science can have on people's lives.

Figure: Mapper applied to breast cancer data.

Reference: Nicolau, Monica, Arnold J. Levine, and Gunnar Carlsson. "Topology based data analysis identifies a subgroup of breast cancers with a unique mutational profile and excellent survival." Proceedings of the National Academy of Sciences 108.17 (2011): 7265-7270.

Link: https://www.pnas.org/content/108/17/7265