Helper functions used throughout the lessons

Datasets

download_dataset[source]

download_dataset(dataset_name:str)

Download datasets from Google Drive.

SUSY

The SUSY dataset from the UCI Machine Learning Repository:

download_dataset("susy.csv.gz")
Download of susy.csv.gz dataset complete.

A compressed version in Feather format is also available for faster loading in class:

download_dataset("susy.feather")
Download of susy.feather dataset complete.

To get the training (first 4,500,000 rows) and test (last 500,000 rows) sets, run:

download_dataset("susy_train.feather")
download_dataset("susy_test.feather")
Download of susy_train.feather dataset complete.
Download of susy_test.feather dataset complete.
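The split above is purely positional (first 4,500,000 rows for training, last 500,000 for testing), so with the full susy.feather file you could reproduce it yourself with `iloc`. A sketch on a toy frame with the same 90/10 proportions:

```python
import pandas as pd

# Toy stand-in: 10 rows playing the role of the 5,000,000-row SUSY table.
df = pd.DataFrame({"x": range(10)})

# Positional split mirroring susy_train / susy_test:
# the first 90% of rows for training, the remainder for testing.
train = df.iloc[:9]
test = df.iloc[9:]
```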

To get a random sample of 100,000 rows from susy_train, run:

download_dataset("susy_sample.feather")
Download of susy_sample.feather dataset complete.

Topological data analysis

For the 3D shape classification task in lesson 6, you can download the ZIP file of real-world objects as follows:

download_dataset("shapes.zip")
Download of shapes.zip dataset complete.

We also provide precomputed persistence diagrams so you can save time when running on Binder / Colab:

# circles, spheres, tori
download_dataset("diagrams_basic.pkl")
# real-world objects
download_dataset("diagrams.pkl")
Download of diagrams_basic.pkl dataset complete.
Download of diagrams.pkl dataset complete.

For the computer vision experiments, you can download the images as follows:

download_dataset("Cells.jpg")
download_dataset("BlackHole.jpg")
Download of Cells.jpg dataset complete.
Download of BlackHole.jpg dataset complete.

make_point_clouds[source]

make_point_clouds(n_samples_per_shape:int, n_points:int, noise:float)

Make point clouds for circles, spheres, and tori with random noise.
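The implementation isn't shown here, but a minimal numpy sketch of what such a generator might look like (the parametrisations and torus radii are assumptions, not the helper's actual code):

```python
import numpy as np

def make_point_clouds(n_samples_per_shape: int, n_points: int, noise: float):
    """Sketch: noisy point clouds sampled from circles, spheres and tori."""
    rng = np.random.default_rng(0)
    clouds, labels = [], []
    for label in range(3):  # 0: circle, 1: sphere, 2: torus
        for _ in range(n_samples_per_shape):
            t = rng.uniform(0, 2 * np.pi, n_points)
            if label == 0:  # unit circle in the z = 0 plane
                pts = np.column_stack([np.cos(t), np.sin(t), np.zeros(n_points)])
            elif label == 1:  # unit sphere via normalised Gaussians
                pts = rng.normal(size=(n_points, 3))
                pts /= np.linalg.norm(pts, axis=1, keepdims=True)
            else:  # torus with radii R = 2, r = 1
                s = rng.uniform(0, 2 * np.pi, n_points)
                pts = np.column_stack([
                    (2 + np.cos(s)) * np.cos(t),
                    (2 + np.cos(s)) * np.sin(t),
                    np.sin(s),
                ])
            clouds.append(pts + rng.normal(scale=noise, size=pts.shape))
            labels.append(label)
    return np.asarray(clouds), np.asarray(labels)
```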

load_shapes[source]

load_shapes(path:Path, classes:List[T], n_points:int)

Load 3D shapes as a single pandas.DataFrame.

Gravitational waves

The following function generates noisy time series, some of which contain an embedded gravitational-wave signal. We thank C. Bersten for providing the code and data from his article with J.H. Jung: Detection of gravitational waves using topological data analysis and convolutional neural network: An improved approach.

make_gravitational_waves[source]

make_gravitational_waves(path_to_data:Path, n_signals:int=30, downsample_factor:int=2, r_min:float=0.075, r_max:float=0.65, n_snr_values:int=10)
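The `r_min`, `r_max` and `n_snr_values` arguments suggest that clean waveforms are injected into background noise over a grid of signal-to-noise ratios. A minimal sketch of that injection step (the ratio grid, `embed_signal` name and stand-in waveform are assumptions about the helper's internals, not its actual code):

```python
import numpy as np

def embed_signal(signal: np.ndarray, snr: float, rng) -> np.ndarray:
    """Sketch: scale a clean waveform by `snr` and add unit-variance noise."""
    noise = rng.normal(size=signal.shape)
    return noise + snr * signal

rng = np.random.default_rng(42)
clean = np.sin(np.linspace(0, 8 * np.pi, 500))  # stand-in waveform
snr_values = np.linspace(0.075, 0.65, 10)       # mirrors r_min, r_max, n_snr_values
noisy = [embed_signal(clean, r, rng) for r in snr_values]
```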

download_dataset('gravitational-wave-signals.npy')
Dataset already exists at '../data/gravitational-wave-signals.npy' and is not downloaded again.
from pathlib import Path
import numpy as np
import matplotlib.pyplot as plt

DATA = Path('../data')
noisy_signals, gw_signals, labels = make_gravitational_waves(path_to_data=DATA)
# get the index corresponding to the first pure noise time series
background_idx = np.argmin(labels)
# get the index corresponding to the first noise + gravitational wave time series
signal_idx = np.argmax(labels)

fig, (ax0, ax1) = plt.subplots(ncols=2, figsize=(12, 4), sharey=True)

ax0.plot(noisy_signals[background_idx])
ax0.set_ylabel("Amplitude")
ax0.set_xlabel("Time step")
ax0.set_title("Pure noise")

ax1.plot(noisy_signals[signal_idx])
ax1.plot(gw_signals[signal_idx])
ax1.set_xlabel("Time step")
ax1.set_title("Noise with gravitational wave signal")

plt.tight_layout()

Data wrangling

display_large[source]

display_large(df)

Displays up to 1000 columns and rows of pandas.DataFrame or pandas.Series objects.
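A plausible sketch of such a helper uses pandas' `option_context` to raise the display limits temporarily (the real helper would call IPython's `display()` rather than `print`, and its internals aren't shown here):

```python
import pandas as pd

def display_large(df):
    """Sketch: temporarily raise pandas' display limits to 1000 rows/columns."""
    with pd.option_context("display.max_rows", 1000, "display.max_columns", 1000):
        print(df)  # in a notebook, the real helper would use IPython's display()
```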

rf_feature_importance[source]

rf_feature_importance(fitted_model, df)

Creates a pandas.DataFrame of a Random Forest's feature importance for each column.
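Any scikit-learn-style model exposing `feature_importances_` (such as a fitted RandomForestRegressor) could feed such a helper. A sketch, with the `Column` / `Importance` names and descending sort being assumptions:

```python
import pandas as pd

def rf_feature_importance(fitted_model, df):
    """Sketch: pair each column with the model's feature_importances_ score."""
    return (
        pd.DataFrame(
            {"Column": df.columns, "Importance": fitted_model.feature_importances_}
        )
        .sort_values("Importance", ascending=False)
        .reset_index(drop=True)
    )
```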

Data visualisation

plot_feature_importance[source]

plot_feature_importance(feature_importance)
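Paired with the DataFrame produced by rf_feature_importance, a natural rendering is a horizontal bar chart. A sketch under that assumption:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for the sketch
import matplotlib.pyplot as plt

def plot_feature_importance(feature_importance):
    """Sketch: horizontal bar chart of a Column/Importance DataFrame."""
    fig, ax = plt.subplots()
    ordered = feature_importance.sort_values("Importance")
    ax.barh(ordered["Column"], ordered["Importance"])
    ax.set_xlabel("Importance")
    return ax
```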

plot_regression_tree[source]

plot_regression_tree(fitted_model, feature_names, fontsize=18)
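Given the signature, a reasonable guess is a thin wrapper around scikit-learn's `plot_tree`; the figure size and `filled=True` below are assumptions:

```python
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

def plot_regression_tree(fitted_model, feature_names, fontsize=18):
    """Sketch: render a fitted tree with scikit-learn's plot_tree."""
    fig, ax = plt.subplots(figsize=(14, 8))
    plot_tree(fitted_model, feature_names=feature_names, filled=True,
              fontsize=fontsize, ax=ax)
    return ax
```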

plot_predictions[source]

plot_predictions(regressors, X, y, axes, label=None, style='r-', data_style='b.', data_label=None)
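The `regressors` list and `axes=[x0, x1, y0, y1]` bounds suggest plotting the data alongside the summed predictions of an ensemble, as in staged gradient-boosting demos; that summation is an assumption about the helper's internals. A sketch:

```python
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
import numpy as np

def plot_predictions(regressors, X, y, axes, label=None, style="r-",
                     data_style="b.", data_label=None):
    """Sketch: scatter the data and overlay the ensemble's summed predictions."""
    x1 = np.linspace(axes[0], axes[1], 500)
    y_pred = sum(r.predict(x1.reshape(-1, 1)) for r in regressors)
    plt.plot(X[:, 0], y, data_style, label=data_label)
    plt.plot(x1, y_pred, style, linewidth=2, label=label)
    if label or data_label:
        plt.legend(loc="upper center")
    plt.axis(axes)
```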