{ "cells": [ { "cell_type": "markdown", "id": "4375c2f8-249a-4f6d-8089-2b647bd83ecb", "metadata": {}, "source": [ "# Lecture 3 - Neural network deep dive" ] }, { "cell_type": "markdown", "id": "07d6c97c-1780-40a8-a739-f4c962182986", "metadata": {}, "source": [ "> A deep dive into optimising neural networks with stochastic gradient descent" ] }, { "cell_type": "markdown", "id": "dc5f2762-c6b8-4368-8042-336c40a448d8", "metadata": {}, "source": [ "## Learning objectives\n", "\n", "* Understand how to implement neural networks from scratch\n", "* Understand all the ingredients needed to define a `Learner` in fastai" ] }, { "cell_type": "markdown", "id": "47dfe094-686e-40e7-8eaa-0d4991a2243c", "metadata": {}, "source": [ "## References\n", "\n", "* Chapter 4 of [_Deep Learning for Coders with fastai & PyTorch_](https://github.com/fastai/fastbook) by Jeremy Howard and Sylvain Gugger.\n", "* [What is `torch.nn` really?](https://pytorch.org/tutorials/beginner/nn_tutorial.html#what-is-torch-nn-really) by Jeremy Howard." ] }, { "cell_type": "markdown", "id": "f0ee1908-d0d4-455b-b758-cd48ac5a377b", "metadata": {}, "source": [ "## Setup" ] }, { "cell_type": "code", "execution_count": null, "id": "9f53116e-b38d-43c3-bcb5-8bc9636a0b57", "metadata": {}, "outputs": [], "source": [ "# Uncomment and run this cell if using Colab, Kaggle etc\n", "# %pip install fastai==2.6.0 datasets" ] }, { "cell_type": "markdown", "id": "699fa5d4-0c13-4e79-821c-e2701c4279c3", "metadata": { "tags": [] }, "source": [ "## Imports" ] }, { "cell_type": "code", "execution_count": 103, "id": "73261150-b566-48b0-812f-293e97564d7c", "metadata": {}, "outputs": [], "source": [ "import math\n", "\n", "import torch\n", "from datasets import load_dataset\n", "from fastai.tabular.all import *\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.preprocessing import MinMaxScaler\n", "from torch.utils.data import DataLoader, TensorDataset\n", "from tqdm.auto import tqdm" ] }, { "cell_type": "code", "execution_count": 2, "id": "62c70888-089f-42ec-904a-a271a66c5060", "metadata": {}, "outputs": [], "source": [ "import datasets\n", "\n", "# Suppress logs to keep things tidy\n", "datasets.logging.set_verbosity_error()" ] }, { "cell_type": "markdown", "id": "230c5b5f-3b6c-4c99-a4aa-94393105d7a4", "metadata": {}, "source": [ "## The dataset" ] }, { "cell_type": "markdown", "id": "0cb5955a-ecce-4d19-8f2b-08581a0fdac6", "metadata": {}, "source": [ "In lecture 2, we focused on optimising simple functions with stochastic gradient descent. Let's now tackle a real-world problem using neural networks! We'll use the $N$-subjettiness dataset from lecture 1 that represents jets in terms of $\\tau_N^{(\\beta)}$ variables that measure the radiation about $N$ axes in the jet according to an angular exponent $\\beta>0$. 
As usual, we'll load the dataset from the Hugging Face Hub and convert it to a Pandas `DataFrame` via the `to_pandas()` method:" ] }, { "cell_type": "code", "execution_count": 3, "id": "d67c6c7d-f7fc-44bf-8ba0-448d66495554", "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "6fb4a1359f144a2ba5a44a814e7b586b", "version_major": 2, "version_minor": 0 }, "text/plain": [ " 0%| | 0/3 [00:00\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n", "" ], "text/plain": [ " pT mass tau_1_0.5 tau_1_1 tau_1_2 tau_2_0.5 tau_2_1 \\\n", "0 543.633944 25.846792 0.165122 0.032661 0.002262 0.048830 0.003711 \n", "1 452.411860 13.388679 0.162938 0.027598 0.000876 0.095902 0.015461 \n", "2 429.495258 32.021091 0.244436 0.065901 0.005557 0.155202 0.038807 \n", "3 512.675443 6.684734 0.102580 0.011369 0.000170 0.086306 0.007760 \n", "4 527.956859 133.985415 0.407009 0.191839 0.065169 0.291460 0.105479 \n", "\n", " tau_2_2 tau_3_0.5 tau_3_1 ... tau_4_0.5 tau_4_1 tau_4_2 \\\n", "0 0.000044 0.030994 0.001630 ... 0.024336 0.001115 0.000008 \n", "1 0.000506 0.079750 0.009733 ... 0.056854 0.005454 0.000072 \n", "2 0.002762 0.123285 0.025339 ... 0.078205 0.012678 0.000567 \n", "3 0.000071 0.068169 0.005386 ... 0.044705 0.002376 0.000008 \n", "4 0.029753 0.209341 0.049187 ... 0.143768 0.033249 0.003689 \n", "\n", " tau_5_0.5 tau_5_1 tau_5_2 tau_6_0.5 tau_6_1 tau_6_2 label \n", "0 0.004252 0.000234 7.706005e-07 0.000000 0.000000 0.000000e+00 0 \n", "1 0.044211 0.004430 6.175314e-05 0.037458 0.003396 3.670517e-05 0 \n", "2 0.052374 0.005935 9.395772e-05 0.037572 0.002932 2.237277e-05 0 \n", "3 0.027895 0.001364 4.400042e-06 0.009012 0.000379 6.731099e-07 0 \n", "4 0.135407 0.029054 2.593460e-03 0.110805 0.023179 2.202088e-03 0 \n", "\n", "[5 rows x 21 columns]" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nsubjet_ds = load_dataset(\"dl4phys/top_tagging_nsubjettiness\")\n", "df = nsubjet_ds[\"train\"].to_pandas()\n", "df.head()" ] }, { "cell_type": "markdown", "id": "525f61c3-7372-42a9-ba7e-12a28260fecf", "metadata": {}, "source": [ "### Preparing the data" ] }, { "cell_type": "markdown", "id": "4c9e87bc-05c4-4b26-8bdf-a7e8180b457b", "metadata": {}, "source": [ "In lecture 1, we used the `TabularDataLoaders.from_df()` method from fastai to quickly create dataloaders for the train and validation sets. In this lecture, we'll be working with PyTorch tensors directly, so we'll take a different approach. To get started, we'll need to split our data into training and validation sets. We can do this easily via the `train_test_split()` function from scikit-learn:" ] }, { "cell_type": "code", "execution_count": 4, "id": "8e62de99-b6ed-477e-82a2-ba8f880d53ed", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "((908250, 21), (302750, 21))" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train_df, valid_df = train_test_split(df, random_state=42)\n", "train_df.shape, valid_df.shape" ] }, { "cell_type": "markdown", "id": "4b9cf9bf-e72c-4c27-9722-57f1272199eb", "metadata": {}, "source": [ "This has allocated 75% of our original dataset to `train_df` and the remainder to `valid_df`. Now that we have these `DataFrames`, the next thing we'll need are tensors for the features $(p_T, m, \\tau_1^{(0.5)}, \\tau_1^{(1)}, \\tau_1^{(2)}, \\ldots )$ and labels. There is, however, one potential problem: the jet $p_T$ and mass have much larger scales than the $N$-subjettiness $\\tau_N^{(\\beta)}$ features. We can see this by summarising the statistics of the training set with the `describe()` function: " ] }, { "cell_type": "code", "execution_count": 5, "id": "439362a9-115f-4250-8b2a-e226dab6a4d2", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
pTmasstau_1_0.5tau_1_1tau_1_2tau_2_0.5tau_2_1tau_2_2tau_3_0.5tau_3_1...tau_4_0.5tau_4_1tau_4_2tau_5_0.5tau_5_1tau_5_2tau_6_0.5tau_6_1tau_6_2label
count908250.000000908250.000000908250.000000908250.000000908250.000000908250.000000908250.000000908250.000000908250.000000908250.000000...908250.000000908250.000000908250.000000908250.000000908250.000000908250.000000908250.000000908250.000000908250.000000908250.000000
mean487.10739388.0905200.3667160.1984460.3195590.2227590.0792430.0725350.1481370.035372...0.1120240.0221500.0086700.0884000.0153290.0048750.0706790.0110190.0029140.500366
std48.56826748.3936460.1869220.3395422.0038980.1109550.1251550.6740910.0726270.051869...0.0593930.0320040.1554680.0519490.0228660.1076410.0465710.0171330.0782470.500000
min225.490387-0.4335730.0000000.0000000.0000000.0000000.0000000.0000000.0000000.000000...0.0000000.0000000.0000000.0000000.0000000.0000000.0000000.0000000.0000000.000000
25%452.87928939.9581780.2244560.0583810.0064430.1392690.0256380.0015650.0946030.013308...0.0690370.0079490.0001880.0510120.0049360.0000790.0361420.0029770.0000330.000000
50%485.89405099.8874180.3801720.1660160.0458870.2227630.0615970.0087880.1488100.028501...0.1102200.0176090.0007870.0860450.0117550.0003870.0677970.0080280.0001931.000000
75%520.506446126.5185450.4771220.2405500.0744170.2997080.1082070.0224410.1961560.046588...0.1511370.0299900.0020060.1219050.0210890.0011030.1004370.0153590.0006351.000000
max647.493145299.2115552.4318886.01330937.7024222.2189565.39268333.3522491.9179124.502011...1.6162803.75371621.1619481.4073563.15835217.6456031.3888793.12737117.3409701.000000
\n", "

8 rows × 21 columns

\n", "
" ], "text/plain": [ " pT mass tau_1_0.5 tau_1_1 \\\n", "count 908250.000000 908250.000000 908250.000000 908250.000000 \n", "mean 487.107393 88.090520 0.366716 0.198446 \n", "std 48.568267 48.393646 0.186922 0.339542 \n", "min 225.490387 -0.433573 0.000000 0.000000 \n", "25% 452.879289 39.958178 0.224456 0.058381 \n", "50% 485.894050 99.887418 0.380172 0.166016 \n", "75% 520.506446 126.518545 0.477122 0.240550 \n", "max 647.493145 299.211555 2.431888 6.013309 \n", "\n", " tau_1_2 tau_2_0.5 tau_2_1 tau_2_2 \\\n", "count 908250.000000 908250.000000 908250.000000 908250.000000 \n", "mean 0.319559 0.222759 0.079243 0.072535 \n", "std 2.003898 0.110955 0.125155 0.674091 \n", "min 0.000000 0.000000 0.000000 0.000000 \n", "25% 0.006443 0.139269 0.025638 0.001565 \n", "50% 0.045887 0.222763 0.061597 0.008788 \n", "75% 0.074417 0.299708 0.108207 0.022441 \n", "max 37.702422 2.218956 5.392683 33.352249 \n", "\n", " tau_3_0.5 tau_3_1 ... tau_4_0.5 tau_4_1 \\\n", "count 908250.000000 908250.000000 ... 908250.000000 908250.000000 \n", "mean 0.148137 0.035372 ... 0.112024 0.022150 \n", "std 0.072627 0.051869 ... 0.059393 0.032004 \n", "min 0.000000 0.000000 ... 0.000000 0.000000 \n", "25% 0.094603 0.013308 ... 0.069037 0.007949 \n", "50% 0.148810 0.028501 ... 0.110220 0.017609 \n", "75% 0.196156 0.046588 ... 0.151137 0.029990 \n", "max 1.917912 4.502011 ... 1.616280 3.753716 \n", "\n", " tau_4_2 tau_5_0.5 tau_5_1 tau_5_2 \\\n", "count 908250.000000 908250.000000 908250.000000 908250.000000 \n", "mean 0.008670 0.088400 0.015329 0.004875 \n", "std 0.155468 0.051949 0.022866 0.107641 \n", "min 0.000000 0.000000 0.000000 0.000000 \n", "25% 0.000188 0.051012 0.004936 0.000079 \n", "50% 0.000787 0.086045 0.011755 0.000387 \n", "75% 0.002006 0.121905 0.021089 0.001103 \n", "max 21.161948 1.407356 3.158352 17.645603 \n", "\n", " tau_6_0.5 tau_6_1 tau_6_2 label \n", "count 908250.000000 908250.000000 908250.000000 908250.000000 \n", "mean 0.070679 0.011019 0.002914 0.500366 \n", "std 0.046571 0.017133 0.078247 0.500000 \n", "min 0.000000 0.000000 0.000000 0.000000 \n", "25% 0.036142 0.002977 0.000033 0.000000 \n", "50% 0.067797 0.008028 0.000193 1.000000 \n", "75% 0.100437 0.015359 0.000635 1.000000 \n", "max 1.388879 3.127371 17.340970 1.000000 \n", "\n", "[8 rows x 21 columns]" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train_df.describe()" ] }, { "cell_type": "markdown", "id": "710585ed-85ec-4684-98ef-b740c93e2687", "metadata": {}, "source": [ "Here we can see that the jet $p_T$ and mass have average values of around 480 and 90 GeV, while the $N$-subjettiness variables $\\tau_N^{(\\beta)}$ have values that are orders of magnitude smaller. As we saw in lecture 2, SGD can struggle to optimise the loss function when the feature scales are very different. To handle this, it is common to _normalize_ the features. 
One way to do this is by rescaling all the features $x_i$ to lie in the interval $[0,1]$:\n", "\n", "$$ x_i' = \\frac{x_i - x_{i,\\mathrm{min}}}{x_{i,\\mathrm{max}} - x_{i,\\mathrm{min}}} $$" ] }, { "cell_type": "markdown", "id": "02585396-0a97-40a0-8461-a1921e6461f1", "metadata": {}, "source": [ "To apply this _minmax_ normalization, let's first grab the NumPy arrays of the features and labels:" ] }, { "cell_type": "code", "execution_count": 6, "id": "2745a555-ffa3-4528-836e-3934d9293f08", "metadata": {}, "outputs": [], "source": [ "# Slice out all feature columns\n", "train_x = train_df.iloc[:, :-1].values\n", "# Slice out the label column\n", "train_y = train_df.iloc[:, -1].values" ] }, { "cell_type": "markdown", "id": "636dd6fe-cd29-4a67-906f-3aaaa90d2289", "metadata": {}, "source": [ "Next, we use the `MinMaxScaler` from scikit-learn to apply the normalization on the features array:" ] }, { "cell_type": "code", "execution_count": 7, "id": "3ad9afff-ba7a-44af-8e6c-678574411ee1", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(0.0, 1.0)" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "scaler = MinMaxScaler()\n", "train_x = scaler.fit_transform(train_x)\n", "# Sanity check the normalization worked\n", "np.min(train_x), np.max(train_x)" ] }, { "cell_type": "markdown", "id": "22ec2e54-0a7f-4abc-944f-29a403aeee21", "metadata": {}, "source": [ "Great, this worked! Now that our features are all nicely normalised, let's convert these NumPy arrays to PyTorch tensors. PyTorch provides a handy `from_numpy()` method that allows us to do the conversion easily:" ] }, { "cell_type": "code", "execution_count": 8, "id": "a69d9dfe-246e-4087-99ef-864480d3bc67", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(torch.Size([908250, 20]), torch.Size([908250]))" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Cast to float32\n", "train_x = torch.from_numpy(train_x).float()\n", "train_y = torch.from_numpy(train_df.iloc[:, -1].values)\n", "# Sanity check on the shapes\n", "train_x.shape, train_y.shape" ] }, { "cell_type": "markdown", "id": "c056e46a-9386-4652-8996-d5ce37adfad1", "metadata": {}, "source": [ "Okay, now that we have our tensors it's time to train a neural network!" ] }, { "cell_type": "markdown", "id": "123cce97-501f-46f9-b7cb-3641aac5b572", "metadata": {}, "source": [ "## Logistic regression as a neural network" ] }, { "cell_type": "markdown", "id": "6b8505a2-1cad-478d-81b5-819e2252324b", "metadata": {}, "source": [ "To warm up, let's train the simplest type of neural network for classification tasks: logistic regression! You might be surprised to hear that logistic regression can be viewed as a neural network. However, a _one-layer network_ has the same properties, so let's look at how we can implement this in PyTorch." ] }, { "cell_type": "markdown", "id": "b96c6ae3-879d-4356-9678-1ed9b7ff68ed", "metadata": {}, "source": [ "To get started, we'll need some weights and biases, so let's create random tensors using a type of intialization called [_Xavier initialization_](http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf). This initializes the biases to zero, while the weights $W_{ij}$ are sampled from a normal distribution in the interval $(-1/\\sqrt{n},1/\\sqrt{n})$, where $n$ is the number of features. 
We can implement Xavier initialization in PyTorch as follows:" ] }, { "cell_type": "code", "execution_count": 9, "id": "091368ff-3b2a-44db-a656-eb1a73162057", "metadata": {}, "outputs": [], "source": [ "set_seed(42)\n", "# Xavier initialisation\n", "weights = torch.randn(20, 2) / math.sqrt(20)\n", "# Track grads after initialization\n", "weights.requires_grad_()\n", "bias = torch.zeros(2, requires_grad=True)" ] }, { "cell_type": "markdown", "id": "510a17e2-90bf-453a-afe4-fedd64af57f1", "metadata": {}, "source": [ "Now that we have the weights and biases, the next ingredient we need is an activation function. For binary classification tasks, this usually takes the form of a sigmoid function, whose generalization to $K>2$ classes is called the _softmax_ function:\n", "\n", "$$ \\sigma(\\mathbf{x})_i = \\frac{e^{x_i}}{\\sum_{j=1}^K e^{x_j}} \\qquad \\mbox{for } i=1, \\ldots , K$$\n", "\n", "The sigmoid and the softmax functions have the effect of normalizing the output of the network to be a probability distribution. To keep things general, we'll use the softmax in this lecture. However, implementing softmax naively presents some numerical stability challenges. Consider, for example, computing the following:" ] }, { "cell_type": "code", "execution_count": 10, "id": "4b171af8-f983-4855-ba94-0125e8cb1d03", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "tensor([inf, inf, inf])" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "x = torch.tensor([1000.0, 1000.0, 1000.0])\n", "x.exp()" ] }, { "cell_type": "markdown", "id": "6a4070fb-b113-4f9c-94f9-8daf93bd7ab8", "metadata": {}, "source": [ "Hmm, a network that outputs infinity values will will cause the learning process to crash. This is an example of _numerical overflow_. Similarly, when the inputs are large negative numbers, we end up rounding the results to zero, an example of _numerical underflow_:" ] }, { "cell_type": "code", "execution_count": 11, "id": "1b92b415-f151-4878-af6a-fff56ebba84d", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "tensor([0., 0., 0.])" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "x = torch.tensor([-1000.0, -1000.0, -1000.0])\n", "x.exp()" ] }, { "cell_type": "markdown", "id": "4fe18f26-368d-4610-9a42-f23e3fcd35bc", "metadata": {}, "source": [ "To deal with these two problems, we can apply the [_log-sum-exp_ trick](https://www.xarg.org/2016/06/the-log-sum-exp-trick-in-machine-learning/):\n", "\n", "$$\\log \\sum_{i=1}^n e^{x_i} = a + \\log \\sum_{i=1}^n e^{x_i-a} $$\n", "\n", "where $a = \\max x_i$ is a constant that forces the greatest value to be zero. Since $\\log a/b = \\log a - \\log b$, taking the logarithm of the softmax function gives:" ] }, { "cell_type": "code", "execution_count": 12, "id": "e8bac3c6-0613-4bf0-830a-91b8e0b95a14", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "tensor([-1.0986, -1.0986, -1.0986])" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def log_softmax(x):\n", " return (x - x.max()) - (x - x.max()).exp().sum(-1).log().unsqueeze(-1)\n", "\n", "\n", "log_softmax(x)" ] }, { "cell_type": "markdown", "id": "08ec92d5-d6f3-49e7-8286-f610cc44ff5f", "metadata": {}, "source": [ "Great, we now have an activation function that is numerically stable. 
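As a quick sanity check, we can compare our implementation against PyTorch's built-in `torch.log_softmax()` on an input that would overflow a naive softmax; something along these lines should print `True`:

```python
# An input that would overflow/underflow a naive softmax
x_check = torch.tensor([[1000.0, -1000.0, 0.0]])
# Our stabilised implementation should agree with PyTorch's built-in version
print(torch.allclose(log_softmax(x_check), torch.log_softmax(x_check, dim=-1)))
```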
Let's now define our logistic regression model to take a mini-batch `xb` of inputs and output the log-softmax values:" ] }, { "cell_type": "code", "execution_count": 13, "id": "501541d1-94d2-4d50-88b2-92f420d2ad84", "metadata": {}, "outputs": [], "source": [ "def model(xb):\n", " return log_softmax(xb @ weights + bias)" ] }, { "cell_type": "markdown", "id": "7ba4aa45-9e4b-4485-93b0-5c460746a00f", "metadata": {}, "source": [ "Let's test this model with a batch of data from our training set (also called a _forward pass_):" ] }, { "cell_type": "code", "execution_count": 14, "id": "38fbb26d-2247-4284-b34a-c84359aef066", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(tensor([-0.5103, -0.9171], grad_fn=), torch.Size([1024, 2]))" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Batch size\n", "bs = 1024\n", "# A mini-batch from x\n", "xb = train_x[0:bs]\n", "# Model predictions\n", "preds = model(xb)\n", "preds[0], preds.shape" ] }, { "cell_type": "markdown", "id": "3254aa2b-e620-4ea0-ab2f-4417d8a1aaac", "metadata": {}, "source": [ "At this state the predictions are random, since we started with random weights. To improve these values, the next thing we need is a loss function. For classification tasks, one computes the _cross entropy_, which is the log likelihood of the softmax:\n", "\n", "$$ {\\cal L} = - \\frac{1}{m} \\sum_{i=1}^m \\sum_{k=1}^K y_k^{(i)}\\log\\hat{p}_k^{(i)} \\,.$$\n", "\n", "However, we've already taken the log of the softmax values $\\hat{p}_k^{(i)}$, so instead our loss will be the _negative log likelihood_, which doesn't include the logarithm. We can implement this easily in PyTorch as follows:" ] }, { "cell_type": "code", "execution_count": 15, "id": "771c575a-9cb9-4013-a9e7-1f1ced1c892d", "metadata": {}, "outputs": [], "source": [ "def nll_loss(predictions, target):\n", " # Mask predictions according to whether y_hat is 1 or 0\n", " return -predictions[range(target.shape[0]), target].mean()\n", "\n", "\n", "loss_func = nll_loss" ] }, { "cell_type": "markdown", "id": "e26105c8-89e3-48c3-854b-7690bb46c218", "metadata": {}, "source": [ "Now that we have a loss function, let's test we can compute the loss by comparing our mini-batch of predictions against a mini-batch of target values:" ] }, { "cell_type": "code", "execution_count": 16, "id": "1a5f66f4-fdd1-4de3-a730-1ab9aded0859", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "tensor(0.7619, grad_fn=)\n" ] } ], "source": [ "yb = train_y[0:bs]\n", "print(loss_func(preds, yb))" ] }, { "cell_type": "markdown", "id": "8ad3d556-7068-46c1-8148-5f93308272b4", "metadata": {}, "source": [ "Again, the loss value is random, but we can minimise this function with backpropagation. Before doing that, let's also compute the accuracy of the model so that we track progress during training: " ] }, { "cell_type": "code", "execution_count": 17, "id": "02d93a12-2d60-4583-ad65-b2fbeb1b753b", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "tensor(0.5020)" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def accuracy(out, yb):\n", " preds = torch.argmax(out, dim=1)\n", " return (preds == yb).float().mean()\n", "\n", "\n", "accuracy(preds, yb)" ] }, { "cell_type": "markdown", "id": "61018c54-2b7b-4b59-abda-666b433ed87f", "metadata": {}, "source": [ "Indeed, the random model has an accuracy of 50% which is what we expect before any training. 
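Before setting up the training loop, it's worth unpacking the indexing trick inside `nll_loss`: `predictions[range(n), target]` uses integer-array indexing to pick out, for each sample, the log-probability assigned to its true class. Here is a tiny illustration with made-up numbers:

```python
# Hypothetical log-probabilities for three jets and two classes
log_probs = torch.tensor([[-0.2, -1.7], [-1.2, -0.4], [-0.9, -0.5]])
targets = torch.tensor([0, 1, 0])
# Select the log-probability of the true class for each jet
picked = log_probs[range(3), targets]
print(picked)          # tensor([-0.2000, -0.4000, -0.9000])
print(-picked.mean())  # the negative log likelihood, i.e. the loss
```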
To implement the training loop, we'll take the following steps:\n", "\n", "1. Select a mini-batch of data of size `bs`\n", "2. Generate predictions from the model by computing the forward pass\n", "3. Compute the loss\n", "4. Compute the gradients of the loss wrt to the parameters by applying `loss.backward()`\n", "4. Update the weights and biases of the model by taking a step of gradient descent\n", "\n", "In code, this looks as follows:" ] }, { "cell_type": "code", "execution_count": 18, "id": "6e60d196-bfb9-47f3-af60-187661015f46", "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "9a0ef6e38a874b309f48d52e719a74bd", "version_major": 2, "version_minor": 0 }, "text/plain": [ "num_epochs: 0%| | 0/3 [00:00)" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model = LogisticRegressor()\n", "loss_func(model(xb), yb)" ] }, { "cell_type": "markdown", "id": "7099005e-8e7a-4d90-8202-a8e6eb8f9a77", "metadata": {}, "source": [ "The big advatnage of the `nn.Module` and `nn.Parameter` classes is that we no longer have to manually update each parameter by name and zero out the gradients. We just need to iterate over the parameters associated with `nn.Module` and apply `model.zero_grad()` at the end of the updates. Let's wrap the training loop in a `fit()` function for later use:" ] }, { "cell_type": "code", "execution_count": 24, "id": "d766653c-6063-4189-bad2-4a0fa2a9ae4b", "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "6eb967e973dc4aecbb05ff9657039dca", "version_major": 2, "version_minor": 0 }, "text/plain": [ "num_epochs: 0%| | 0/3 [00:00)" ] }, "execution_count": 57, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def get_model():\n", " model = LogisticRegressor()\n", " return model, torch.optim.SGD(model.parameters(), lr=lr)\n", "\n", "\n", "model, optimizer = get_model()\n", "loss_func(model(xb), yb)" ] }, { "cell_type": "markdown", "id": "db42d8c2-8bc6-44e0-8ea8-b35dc53802d4", "metadata": {}, "source": [ "Now that we have a model and optimizer, we can refactor our `fit()` function as follows:" ] }, { "cell_type": "code", "execution_count": 33, "id": "cb7c8b86-bd9e-4181-bd89-c966be22da16", "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "c61a7c1bd6814920b7eafd87214c1637", "version_major": 2, "version_minor": 0 }, "text/plain": [ "num_epochs: 0%| | 0/3 [00:00 The network consists of four fully connected hidden\n", "layers, the first two with 200 nodes and a dropout regularization of 0.2, and the last two\n", "with 50 nodes and a dropout regularization of 0.1. The output layer consists of two nodes.\n", "We use a ReLu activation function throughout and minimize the cross-entropy using Adam\n", "optimization" ] }, { "cell_type": "markdown", "id": "51af2132-6764-4b07-9238-1c561ab87b23", "metadata": {}, "source": [ "We briefly encountered dropout in the last lecture, so let's quckly explain how it works. Dropout is a _regularization technique_ (not the type of regularization you're familiar from QFT though!), that is designed to prevent the model from overfitting. The basic idea is to randomly change some of the activations in the network to zero during training time. 
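In PyTorch, this behaviour is provided by the `nn.Dropout` module. A minimal illustration with made-up activations shows that some entries are zeroed in training mode, while in evaluation mode dropout does nothing:

```python
torch.manual_seed(42)
dropout = nn.Dropout(p=0.5)
activations = torch.ones(8)
# Training mode (the default for a freshly created module): each activation is
# zeroed with probability p, and the survivors are rescaled (PyTorch divides by
# 1 - p) so the overall activation scale stays consistent
print(dropout(activations))
# Evaluation mode: dropout is the identity
dropout.eval()
print(dropout(activations))
```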
An animation of the process is shown below, which shows how this process introduces some noise into the process and produces a more robust network:" ] }, { "cell_type": "markdown", "id": "c05e27f9-fd24-43f7-bb3d-54b0629be5bd", "metadata": {}, "source": [ "![](images/dropout.gif)" ] }, { "cell_type": "markdown", "id": "fe579ea9-02fe-4ece-b4d4-964f898f5fac", "metadata": {}, "source": [ "Now we can't just zero out activations naively because this will screw up the scales across each layer. Insted we apply dropout with probability `p` and then rescale all activations by `1-p` to keep the scales well behaved.\n", "\n", "The resulting model from the review article thus looks like:" ] }, { "cell_type": "code", "execution_count": 76, "id": "204e0f9c-85b3-4efa-943e-ab8113982c63", "metadata": {}, "outputs": [], "source": [ "model = nn.Sequential(\n", " nn.Linear(20, 200),\n", " nn.ReLU(),\n", " nn.Linear(200, 200),\n", " nn.ReLU(),\n", " nn.Dropout(p=0.2),\n", " nn.Linear(200, 50),\n", " nn.ReLU(),\n", " nn.Linear(50, 50),\n", " nn.ReLU(),\n", " nn.Dropout(p=0.1),\n", " nn.Linear(50, 2),\n", ")" ] }, { "cell_type": "markdown", "id": "02efd65c-6ad9-41e2-8dca-7fb14856a10b", "metadata": {}, "source": [ "And just like before, we can define the optimizer. In this case we'll use a special optimizer called Adam, which combines SGD with some other techniques to speed up training. You can find the details of Adam in Chapter 16 of the fastai book, but for now, we'll just instantiate it from PyTorch:" ] }, { "cell_type": "code", "execution_count": 80, "id": "49f48b18-d64d-4b1e-b2f3-4fd3454baf59", "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "28c66060a17340c6a90305f8beefe93b", "version_major": 2, "version_minor": 0 }, "text/plain": [ "num_epochs: 0%| | 0/3 [00:00\n", " /* Turns off some styling */\n", " progress {\n", " /* gets rid of default border in Firefox and Opera. */\n", " border: none;\n", " /* Needs to be in here for Safari polyfill so background images work as expected. */\n", " background-size: auto;\n", " }\n", " .progress-bar-interrupted, .progress-bar-interrupted::-webkit-progress-bar {\n", " background: #F44336;\n", " }\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
epoch  train_loss  valid_loss  accuracy  time
0      0.251400    0.311221    0.834794  00:13
1      0.243241    0.369533    0.796215  00:13
2      0.242842    0.313126    0.863372  00:13
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "learn.fit(3, lr)" ] }, { "cell_type": "markdown", "id": "87edc26a-93b5-4a38-97ab-30c3417197ca", "metadata": {}, "source": [ "Well, this was quite a deep dive into traiing neural networks from scratch and ending with with all the components that go into a fastai `Learner`! \n", "\n", "Next week, we'll move away from tabular data and take a look a class of neural networks for images that are based on convolutions 👀." ] }, { "cell_type": "markdown", "id": "5250f65a-222e-409b-aa97-2ebc862f39f7", "metadata": {}, "source": [ "## Exercises\n", "\n", "* Instead of using `nn.Sequential` to create our neural network, try implementing this as a subclass of `nn.Module` and training the resulting model.\n", "* Using the validation dataset and dataloader, try computing the validation loss and accuracy within the `fit()` function.\n", "* Read the [_Xavier initialization_ paper](http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf)" ] }, { "cell_type": "code", "execution_count": null, "id": "6b1130f9-ed4b-46a7-b546-b46470dffb86", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.11" } }, "nbformat": 4, "nbformat_minor": 5 }