{ "cells": [ { "cell_type": "markdown", "id": "4375c2f8-249a-4f6d-8089-2b647bd83ecb", "metadata": {}, "source": [ "# Lecture 3 - Neural network deep dive" ] }, { "cell_type": "markdown", "id": "07d6c97c-1780-40a8-a739-f4c962182986", "metadata": {}, "source": [ "> A deep dive into optimising neural networks with stochastic gradient descent" ] }, { "cell_type": "markdown", "id": "dc5f2762-c6b8-4368-8042-336c40a448d8", "metadata": {}, "source": [ "## Learning objectives\n", "\n", "* Understand how to implement neural networks from scratch\n", "* Understand all the ingredients needed to define a `Learner` in fastai" ] }, { "cell_type": "markdown", "id": "47dfe094-686e-40e7-8eaa-0d4991a2243c", "metadata": {}, "source": [ "## References\n", "\n", "* Chapter 4 of [_Deep Learning for Coders with fastai & PyTorch_](https://github.com/fastai/fastbook) by Jeremy Howard and Sylvain Gugger.\n", "* [What is `torch.nn` really?](https://pytorch.org/tutorials/beginner/nn_tutorial.html#what-is-torch-nn-really) by Jeremy Howard." ] }, { "cell_type": "markdown", "id": "f0ee1908-d0d4-455b-b758-cd48ac5a377b", "metadata": {}, "source": [ "## Setup" ] }, { "cell_type": "code", "execution_count": null, "id": "9f53116e-b38d-43c3-bcb5-8bc9636a0b57", "metadata": {}, "outputs": [], "source": [ "# Uncomment and run this cell if using Colab, Kaggle etc\n", "# %pip install fastai==2.6.0 datasets" ] }, { "cell_type": "markdown", "id": "699fa5d4-0c13-4e79-821c-e2701c4279c3", "metadata": { "tags": [] }, "source": [ "## Imports" ] }, { "cell_type": "code", "execution_count": 103, "id": "73261150-b566-48b0-812f-293e97564d7c", "metadata": {}, "outputs": [], "source": [ "import math\n", "\n", "import torch\n", "from datasets import load_dataset\n", "from fastai.tabular.all import *\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.preprocessing import MinMaxScaler\n", "from torch.utils.data import DataLoader, TensorDataset\n", "from tqdm.auto import tqdm" ] }, { "cell_type": "code", "execution_count": 2, "id": "62c70888-089f-42ec-904a-a271a66c5060", "metadata": {}, "outputs": [], "source": [ "import datasets\n", "\n", "# Suppress logs to keep things tidy\n", "datasets.logging.set_verbosity_error()" ] }, { "cell_type": "markdown", "id": "230c5b5f-3b6c-4c99-a4aa-94393105d7a4", "metadata": {}, "source": [ "## The dataset" ] }, { "cell_type": "markdown", "id": "0cb5955a-ecce-4d19-8f2b-08581a0fdac6", "metadata": {}, "source": [ "In lecture 2, we focused on optimising simple functions with stochastic gradient descent. Let's now tackle a real-world problem using neural networks! We'll use the $N$-subjettiness dataset from lecture 1 that represents jets in terms of $\\tau_N^{(\\beta)}$ variables that measure the radiation about $N$ axes in the jet according to an angular exponent $\\beta>0$. 
As usual, we'll load the dataset from the Hugging Face Hub and convert it to a Pandas `DataFrame` via the `to_pandas()` method:" ] }, { "cell_type": "code", "execution_count": 3, "id": "d67c6c7d-f7fc-44bf-8ba0-448d66495554", "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "6fb4a1359f144a2ba5a44a814e7b586b", "version_major": 2, "version_minor": 0 }, "text/plain": [ " 0%| | 0/3 [00:00\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n", "" ], "text/plain": [ " pT mass tau_1_0.5 tau_1_1 tau_1_2 tau_2_0.5 tau_2_1 \\\n", "0 543.633944 25.846792 0.165122 0.032661 0.002262 0.048830 0.003711 \n", "1 452.411860 13.388679 0.162938 0.027598 0.000876 0.095902 0.015461 \n", "2 429.495258 32.021091 0.244436 0.065901 0.005557 0.155202 0.038807 \n", "3 512.675443 6.684734 0.102580 0.011369 0.000170 0.086306 0.007760 \n", "4 527.956859 133.985415 0.407009 0.191839 0.065169 0.291460 0.105479 \n", "\n", " tau_2_2 tau_3_0.5 tau_3_1 ... tau_4_0.5 tau_4_1 tau_4_2 \\\n", "0 0.000044 0.030994 0.001630 ... 0.024336 0.001115 0.000008 \n", "1 0.000506 0.079750 0.009733 ... 0.056854 0.005454 0.000072 \n", "2 0.002762 0.123285 0.025339 ... 0.078205 0.012678 0.000567 \n", "3 0.000071 0.068169 0.005386 ... 0.044705 0.002376 0.000008 \n", "4 0.029753 0.209341 0.049187 ... 0.143768 0.033249 0.003689 \n", "\n", " tau_5_0.5 tau_5_1 tau_5_2 tau_6_0.5 tau_6_1 tau_6_2 label \n", "0 0.004252 0.000234 7.706005e-07 0.000000 0.000000 0.000000e+00 0 \n", "1 0.044211 0.004430 6.175314e-05 0.037458 0.003396 3.670517e-05 0 \n", "2 0.052374 0.005935 9.395772e-05 0.037572 0.002932 2.237277e-05 0 \n", "3 0.027895 0.001364 4.400042e-06 0.009012 0.000379 6.731099e-07 0 \n", "4 0.135407 0.029054 2.593460e-03 0.110805 0.023179 2.202088e-03 0 \n", "\n", "[5 rows x 21 columns]" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nsubjet_ds = load_dataset(\"dl4phys/top_tagging_nsubjettiness\")\n", "df = nsubjet_ds[\"train\"].to_pandas()\n", "df.head()" ] }, { "cell_type": "markdown", "id": "525f61c3-7372-42a9-ba7e-12a28260fecf", "metadata": {}, "source": [ "### Preparing the data" ] }, { "cell_type": "markdown", "id": "4c9e87bc-05c4-4b26-8bdf-a7e8180b457b", "metadata": {}, "source": [ "In lecture 1, we used the `TabularDataLoaders.from_df()` method from fastai to quickly create dataloaders for the train and validation sets. In this lecture, we'll be working with PyTorch tensors directly, so we'll take a different approach. To get started, we'll need to split our data into training and validation sets. We can do this easily via the `train_test_split()` function from scikit-learn:" ] }, { "cell_type": "code", "execution_count": 4, "id": "8e62de99-b6ed-477e-82a2-ba8f880d53ed", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "((908250, 21), (302750, 21))" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train_df, valid_df = train_test_split(df, random_state=42)\n", "train_df.shape, valid_df.shape" ] }, { "cell_type": "markdown", "id": "4b9cf9bf-e72c-4c27-9722-57f1272199eb", "metadata": {}, "source": [ "This has allocated 75% of our original dataset to `train_df` and the remainder to `valid_df`. Now that we have these `DataFrames`, the next thing we'll need are tensors for the features $(p_T, m, \\tau_1^{(0.5)}, \\tau_1^{(1)}, \\tau_1^{(2)}, \\ldots )$ and labels. There is, however, one potential problem: the jet $p_T$ and mass have much larger scales than the $N$-subjettiness $\\tau_N^{(\\beta)}$ features. We can see this by summarising the statistics of the training set with the `describe()` function: " ] }, { "cell_type": "code", "execution_count": 5, "id": "439362a9-115f-4250-8b2a-e226dab6a4d2", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
pTmasstau_1_0.5tau_1_1tau_1_2tau_2_0.5tau_2_1tau_2_2tau_3_0.5tau_3_1...tau_4_0.5tau_4_1tau_4_2tau_5_0.5tau_5_1tau_5_2tau_6_0.5tau_6_1tau_6_2label
count908250.000000908250.000000908250.000000908250.000000908250.000000908250.000000908250.000000908250.000000908250.000000908250.000000...908250.000000908250.000000908250.000000908250.000000908250.000000908250.000000908250.000000908250.000000908250.000000908250.000000
mean487.10739388.0905200.3667160.1984460.3195590.2227590.0792430.0725350.1481370.035372...0.1120240.0221500.0086700.0884000.0153290.0048750.0706790.0110190.0029140.500366
std48.56826748.3936460.1869220.3395422.0038980.1109550.1251550.6740910.0726270.051869...0.0593930.0320040.1554680.0519490.0228660.1076410.0465710.0171330.0782470.500000
min225.490387-0.4335730.0000000.0000000.0000000.0000000.0000000.0000000.0000000.000000...0.0000000.0000000.0000000.0000000.0000000.0000000.0000000.0000000.0000000.000000
25%452.87928939.9581780.2244560.0583810.0064430.1392690.0256380.0015650.0946030.013308...0.0690370.0079490.0001880.0510120.0049360.0000790.0361420.0029770.0000330.000000
50%485.89405099.8874180.3801720.1660160.0458870.2227630.0615970.0087880.1488100.028501...0.1102200.0176090.0007870.0860450.0117550.0003870.0677970.0080280.0001931.000000
75%520.506446126.5185450.4771220.2405500.0744170.2997080.1082070.0224410.1961560.046588...0.1511370.0299900.0020060.1219050.0210890.0011030.1004370.0153590.0006351.000000
max647.493145299.2115552.4318886.01330937.7024222.2189565.39268333.3522491.9179124.502011...1.6162803.75371621.1619481.4073563.15835217.6456031.3888793.12737117.3409701.000000
\n", "

8 rows × 21 columns

\n", "
" ], "text/plain": [ " pT mass tau_1_0.5 tau_1_1 \\\n", "count 908250.000000 908250.000000 908250.000000 908250.000000 \n", "mean 487.107393 88.090520 0.366716 0.198446 \n", "std 48.568267 48.393646 0.186922 0.339542 \n", "min 225.490387 -0.433573 0.000000 0.000000 \n", "25% 452.879289 39.958178 0.224456 0.058381 \n", "50% 485.894050 99.887418 0.380172 0.166016 \n", "75% 520.506446 126.518545 0.477122 0.240550 \n", "max 647.493145 299.211555 2.431888 6.013309 \n", "\n", " tau_1_2 tau_2_0.5 tau_2_1 tau_2_2 \\\n", "count 908250.000000 908250.000000 908250.000000 908250.000000 \n", "mean 0.319559 0.222759 0.079243 0.072535 \n", "std 2.003898 0.110955 0.125155 0.674091 \n", "min 0.000000 0.000000 0.000000 0.000000 \n", "25% 0.006443 0.139269 0.025638 0.001565 \n", "50% 0.045887 0.222763 0.061597 0.008788 \n", "75% 0.074417 0.299708 0.108207 0.022441 \n", "max 37.702422 2.218956 5.392683 33.352249 \n", "\n", " tau_3_0.5 tau_3_1 ... tau_4_0.5 tau_4_1 \\\n", "count 908250.000000 908250.000000 ... 908250.000000 908250.000000 \n", "mean 0.148137 0.035372 ... 0.112024 0.022150 \n", "std 0.072627 0.051869 ... 0.059393 0.032004 \n", "min 0.000000 0.000000 ... 0.000000 0.000000 \n", "25% 0.094603 0.013308 ... 0.069037 0.007949 \n", "50% 0.148810 0.028501 ... 0.110220 0.017609 \n", "75% 0.196156 0.046588 ... 0.151137 0.029990 \n", "max 1.917912 4.502011 ... 1.616280 3.753716 \n", "\n", " tau_4_2 tau_5_0.5 tau_5_1 tau_5_2 \\\n", "count 908250.000000 908250.000000 908250.000000 908250.000000 \n", "mean 0.008670 0.088400 0.015329 0.004875 \n", "std 0.155468 0.051949 0.022866 0.107641 \n", "min 0.000000 0.000000 0.000000 0.000000 \n", "25% 0.000188 0.051012 0.004936 0.000079 \n", "50% 0.000787 0.086045 0.011755 0.000387 \n", "75% 0.002006 0.121905 0.021089 0.001103 \n", "max 21.161948 1.407356 3.158352 17.645603 \n", "\n", " tau_6_0.5 tau_6_1 tau_6_2 label \n", "count 908250.000000 908250.000000 908250.000000 908250.000000 \n", "mean 0.070679 0.011019 0.002914 0.500366 \n", "std 0.046571 0.017133 0.078247 0.500000 \n", "min 0.000000 0.000000 0.000000 0.000000 \n", "25% 0.036142 0.002977 0.000033 0.000000 \n", "50% 0.067797 0.008028 0.000193 1.000000 \n", "75% 0.100437 0.015359 0.000635 1.000000 \n", "max 1.388879 3.127371 17.340970 1.000000 \n", "\n", "[8 rows x 21 columns]" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train_df.describe()" ] }, { "cell_type": "markdown", "id": "710585ed-85ec-4684-98ef-b740c93e2687", "metadata": {}, "source": [ "Here we can see that the jet $p_T$ and mass have average values of around 480 and 90 GeV, while the $N$-subjettiness variables $\\tau_N^{(\\beta)}$ have values that are orders of magnitude smaller. As we saw in lecture 2, SGD can struggle to optimise the loss function when the feature scales are very different. To handle this, it is common to _normalize_ the features. 
One way to do this is by rescaling all the features $x_i$ to lie in the interval $[0,1]$:\n", "\n", "$$ x_i' = \\frac{x_i - x_{i,\\mathrm{min}}}{x_{i,\\mathrm{max}} - x_{i,\\mathrm{min}}} $$" ] }, { "cell_type": "markdown", "id": "02585396-0a97-40a0-8461-a1921e6461f1", "metadata": {}, "source": [ "To apply this _minmax_ normalization, let's first grab the NumPy arrays of the features and labels:" ] }, { "cell_type": "code", "execution_count": 6, "id": "2745a555-ffa3-4528-836e-3934d9293f08", "metadata": {}, "outputs": [], "source": [ "# Slice out all feature columns\n", "train_x = train_df.iloc[:, :-1].values\n", "# Slice out the label column\n", "train_y = train_df.iloc[:, -1].values" ] }, { "cell_type": "markdown", "id": "636dd6fe-cd29-4a67-906f-3aaaa90d2289", "metadata": {}, "source": [ "Next, we use the `MinMaxScaler` from scikit-learn to apply the normalization on the features array:" ] }, { "cell_type": "code", "execution_count": 7, "id": "3ad9afff-ba7a-44af-8e6c-678574411ee1", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(0.0, 1.0)" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "scaler = MinMaxScaler()\n", "train_x = scaler.fit_transform(train_x)\n", "# Sanity check the normalization worked\n", "np.min(train_x), np.max(train_x)" ] }, { "cell_type": "markdown", "id": "22ec2e54-0a7f-4abc-944f-29a403aeee21", "metadata": {}, "source": [ "Great, this worked! Now that our features are all nicely normalised, let's convert these NumPy arrays to PyTorch tensors. PyTorch provides a handy `from_numpy()` method that allows us to do the conversion easily:" ] }, { "cell_type": "code", "execution_count": 8, "id": "a69d9dfe-246e-4087-99ef-864480d3bc67", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(torch.Size([908250, 20]), torch.Size([908250]))" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Cast to float32\n", "train_x = torch.from_numpy(train_x).float()\n", "train_y = torch.from_numpy(train_df.iloc[:, -1].values)\n", "# Sanity check on the shapes\n", "train_x.shape, train_y.shape" ] }, { "cell_type": "markdown", "id": "c056e46a-9386-4652-8996-d5ce37adfad1", "metadata": {}, "source": [ "Okay, now that we have our tensors it's time to train a neural network!" ] }, { "cell_type": "markdown", "id": "123cce97-501f-46f9-b7cb-3641aac5b572", "metadata": {}, "source": [ "## Logistic regression as a neural network" ] }, { "cell_type": "markdown", "id": "6b8505a2-1cad-478d-81b5-819e2252324b", "metadata": {}, "source": [ "To warm up, let's train the simplest type of neural network for classification tasks: logistic regression! You might be surprised to hear that logistic regression can be viewed as a neural network. However, a _one-layer network_ has the same properties, so let's look at how we can implement this in PyTorch." ] }, { "cell_type": "markdown", "id": "b96c6ae3-879d-4356-9678-1ed9b7ff68ed", "metadata": {}, "source": [ "To get started, we'll need some weights and biases, so let's create random tensors using a type of intialization called [_Xavier initialization_](http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf). This initializes the biases to zero, while the weights $W_{ij}$ are sampled from a normal distribution in the interval $(-1/\\sqrt{n},1/\\sqrt{n})$, where $n$ is the number of features. 
We can implement Xavier initialization in PyTorch as follows:" ] }, { "cell_type": "code", "execution_count": 9, "id": "091368ff-3b2a-44db-a656-eb1a73162057", "metadata": {}, "outputs": [], "source": [ "set_seed(42)\n", "# Xavier initialisation\n", "weights = torch.randn(20, 2) / math.sqrt(20)\n", "# Track grads after initialization\n", "weights.requires_grad_()\n", "bias = torch.zeros(2, requires_grad=True)" ] }, { "cell_type": "markdown", "id": "510a17e2-90bf-453a-afe4-fedd64af57f1", "metadata": {}, "source": [ "Now that we have the weights and biases, the next ingredient we need is an activation function. For binary classification tasks, this usually takes the form of a sigmoid function, whose generalization to $K>2$ classes is called the _softmax_ function:\n", "\n", "$$ \\sigma(\\mathbf{x})_i = \\frac{e^{x_i}}{\\sum_{j=1}^K e^{x_j}} \\qquad \\mbox{for } i=1, \\ldots , K$$\n", "\n", "The sigmoid and the softmax functions have the effect of normalizing the output of the network to be a probability distribution. To keep things general, we'll use the softmax in this lecture. However, implementing softmax naively presents some numerical stability challenges. Consider, for example, computing the following:" ] }, { "cell_type": "code", "execution_count": 10, "id": "4b171af8-f983-4855-ba94-0125e8cb1d03", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "tensor([inf, inf, inf])" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "x = torch.tensor([1000.0, 1000.0, 1000.0])\n", "x.exp()" ] }, { "cell_type": "markdown", "id": "6a4070fb-b113-4f9c-94f9-8daf93bd7ab8", "metadata": {}, "source": [ "Hmm, a network that outputs infinity values will will cause the learning process to crash. This is an example of _numerical overflow_. Similarly, when the inputs are large negative numbers, we end up rounding the results to zero, an example of _numerical underflow_:" ] }, { "cell_type": "code", "execution_count": 11, "id": "1b92b415-f151-4878-af6a-fff56ebba84d", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "tensor([0., 0., 0.])" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "x = torch.tensor([-1000.0, -1000.0, -1000.0])\n", "x.exp()" ] }, { "cell_type": "markdown", "id": "4fe18f26-368d-4610-9a42-f23e3fcd35bc", "metadata": {}, "source": [ "To deal with these two problems, we can apply the [_log-sum-exp_ trick](https://www.xarg.org/2016/06/the-log-sum-exp-trick-in-machine-learning/):\n", "\n", "$$\\log \\sum_{i=1}^n e^{x_i} = a + \\log \\sum_{i=1}^n e^{x_i-a} $$\n", "\n", "where $a = \\max x_i$ is a constant that forces the greatest value to be zero. Since $\\log a/b = \\log a - \\log b$, taking the logarithm of the softmax function gives:" ] }, { "cell_type": "code", "execution_count": 12, "id": "e8bac3c6-0613-4bf0-830a-91b8e0b95a14", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "tensor([-1.0986, -1.0986, -1.0986])" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def log_softmax(x):\n", " return (x - x.max()) - (x - x.max()).exp().sum(-1).log().unsqueeze(-1)\n", "\n", "\n", "log_softmax(x)" ] }, { "cell_type": "markdown", "id": "08ec92d5-d6f3-49e7-8286-f610cc44ff5f", "metadata": {}, "source": [ "Great, we now have an activation function that is numerically stable. 
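As a quick sanity check, we can compare our implementation against PyTorch's built-in `torch.log_softmax()` on an input that would overflow a naive softmax; something along these lines should print `True`:

```python
# An input that would overflow/underflow a naive softmax
x_check = torch.tensor([[1000.0, -1000.0, 0.0]])
# Our stabilised implementation should agree with PyTorch's built-in version
print(torch.allclose(log_softmax(x_check), torch.log_softmax(x_check, dim=-1)))
```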
Let's now define our logistic regression model to take a mini-batch `xb` of inputs and output the log-softmax values:" ] }, { "cell_type": "code", "execution_count": 13, "id": "501541d1-94d2-4d50-88b2-92f420d2ad84", "metadata": {}, "outputs": [], "source": [ "def model(xb):\n", " return log_softmax(xb @ weights + bias)" ] }, { "cell_type": "markdown", "id": "7ba4aa45-9e4b-4485-93b0-5c460746a00f", "metadata": {}, "source": [ "Let's test this model with a batch of data from our training set (also called a _forward pass_):" ] }, { "cell_type": "code", "execution_count": 14, "id": "38fbb26d-2247-4284-b34a-c84359aef066", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(tensor([-0.5103, -0.9171], grad_fn=), torch.Size([1024, 2]))" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Batch size\n", "bs = 1024\n", "# A mini-batch from x\n", "xb = train_x[0:bs]\n", "# Model predictions\n", "preds = model(xb)\n", "preds[0], preds.shape" ] }, { "cell_type": "markdown", "id": "3254aa2b-e620-4ea0-ab2f-4417d8a1aaac", "metadata": {}, "source": [ "At this state the predictions are random, since we started with random weights. To improve these values, the next thing we need is a loss function. For classification tasks, one computes the _cross entropy_, which is the log likelihood of the softmax:\n", "\n", "$$ {\\cal L} = - \\frac{1}{m} \\sum_{i=1}^m \\sum_{k=1}^K y_k^{(i)}\\log\\hat{p}_k^{(i)} \\,.$$\n", "\n", "However, we've already taken the log of the softmax values $\\hat{p}_k^{(i)}$, so instead our loss will be the _negative log likelihood_, which doesn't include the logarithm. We can implement this easily in PyTorch as follows:" ] }, { "cell_type": "code", "execution_count": 15, "id": "771c575a-9cb9-4013-a9e7-1f1ced1c892d", "metadata": {}, "outputs": [], "source": [ "def nll_loss(predictions, target):\n", " # Mask predictions according to whether y_hat is 1 or 0\n", " return -predictions[range(target.shape[0]), target].mean()\n", "\n", "\n", "loss_func = nll_loss" ] }, { "cell_type": "markdown", "id": "e26105c8-89e3-48c3-854b-7690bb46c218", "metadata": {}, "source": [ "Now that we have a loss function, let's test we can compute the loss by comparing our mini-batch of predictions against a mini-batch of target values:" ] }, { "cell_type": "code", "execution_count": 16, "id": "1a5f66f4-fdd1-4de3-a730-1ab9aded0859", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "tensor(0.7619, grad_fn=)\n" ] } ], "source": [ "yb = train_y[0:bs]\n", "print(loss_func(preds, yb))" ] }, { "cell_type": "markdown", "id": "8ad3d556-7068-46c1-8148-5f93308272b4", "metadata": {}, "source": [ "Again, the loss value is random, but we can minimise this function with backpropagation. Before doing that, let's also compute the accuracy of the model so that we track progress during training: " ] }, { "cell_type": "code", "execution_count": 17, "id": "02d93a12-2d60-4583-ad65-b2fbeb1b753b", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "tensor(0.5020)" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def accuracy(out, yb):\n", " preds = torch.argmax(out, dim=1)\n", " return (preds == yb).float().mean()\n", "\n", "\n", "accuracy(preds, yb)" ] }, { "cell_type": "markdown", "id": "61018c54-2b7b-4b59-abda-666b433ed87f", "metadata": {}, "source": [ "Indeed, the random model has an accuracy of 50% which is what we expect before any training. 
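Before setting up the training loop, it's worth unpacking the indexing trick inside `nll_loss`: `predictions[range(n), target]` uses integer-array indexing to pick out, for each sample, the log-probability assigned to its true class. Here is a tiny illustration with made-up numbers:

```python
# Hypothetical log-probabilities for three jets and two classes
log_probs = torch.tensor([[-0.2, -1.7], [-1.2, -0.4], [-0.9, -0.5]])
targets = torch.tensor([0, 1, 0])
# Select the log-probability of the true class for each jet
picked = log_probs[range(3), targets]
print(picked)          # tensor([-0.2000, -0.4000, -0.9000])
print(-picked.mean())  # the negative log likelihood, i.e. the loss
```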
To implement the training loop, we'll take the following steps:\n", "\n", "1. Select a mini-batch of data of size `bs`\n", "2. Generate predictions from the model by computing the forward pass\n", "3. Compute the loss\n", "4. Compute the gradients of the loss wrt to the parameters by applying `loss.backward()`\n", "4. Update the weights and biases of the model by taking a step of gradient descent\n", "\n", "In code, this looks as follows:" ] }, { "cell_type": "code", "execution_count": 18, "id": "6e60d196-bfb9-47f3-af60-187661015f46", "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "9a0ef6e38a874b309f48d52e719a74bd", "version_major": 2, "version_minor": 0 }, "text/plain": [ "num_epochs: 0%| | 0/3 [00:00)" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model = LogisticRegressor()\n", "loss_func(model(xb), yb)" ] }, { "cell_type": "markdown", "id": "7099005e-8e7a-4d90-8202-a8e6eb8f9a77", "metadata": {}, "source": [ "The big advatnage of the `nn.Module` and `nn.Parameter` classes is that we no longer have to manually update each parameter by name and zero out the gradients. We just need to iterate over the parameters associated with `nn.Module` and apply `model.zero_grad()` at the end of the updates. Let's wrap the training loop in a `fit()` function for later use:" ] }, { "cell_type": "code", "execution_count": 24, "id": "d766653c-6063-4189-bad2-4a0fa2a9ae4b", "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "6eb967e973dc4aecbb05ff9657039dca", "version_major": 2, "version_minor": 0 }, "text/plain": [ "num_epochs: 0%| | 0/3 [00:00)" ] }, "execution_count": 57, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def get_model():\n", " model = LogisticRegressor()\n", " return model, torch.optim.SGD(model.parameters(), lr=lr)\n", "\n", "\n", "model, optimizer = get_model()\n", "loss_func(model(xb), yb)" ] }, { "cell_type": "markdown", "id": "db42d8c2-8bc6-44e0-8ea8-b35dc53802d4", "metadata": {}, "source": [ "Now that we have a model and optimizer, we can refactor our `fit()` function as follows:" ] }, { "cell_type": "code", "execution_count": 33, "id": "cb7c8b86-bd9e-4181-bd89-c966be22da16", "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "c61a7c1bd6814920b7eafd87214c1637", "version_major": 2, "version_minor": 0 }, "text/plain": [ "num_epochs: 0%| | 0/3 [00:00 The network consists of four fully connected hidden\n", "layers, the first two with 200 nodes and a dropout regularization of 0.2, and the last two\n", "with 50 nodes and a dropout regularization of 0.1. The output layer consists of two nodes.\n", "We use a ReLu activation function throughout and minimize the cross-entropy using Adam\n", "optimization" ] }, { "cell_type": "markdown", "id": "51af2132-6764-4b07-9238-1c561ab87b23", "metadata": {}, "source": [ "We briefly encountered dropout in the last lecture, so let's quckly explain how it works. Dropout is a _regularization technique_ (not the type of regularization you're familiar from QFT though!), that is designed to prevent the model from overfitting. The basic idea is to randomly change some of the activations in the network to zero during training time. 
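In PyTorch, this behaviour is provided by the `nn.Dropout` module. A minimal illustration with made-up activations shows that some entries are zeroed in training mode, while in evaluation mode dropout does nothing:

```python
torch.manual_seed(42)
dropout = nn.Dropout(p=0.5)
activations = torch.ones(8)
# Training mode (the default for a freshly created module): each activation is
# zeroed with probability p, and the survivors are rescaled (PyTorch divides by
# 1 - p) so the overall activation scale stays consistent
print(dropout(activations))
# Evaluation mode: dropout is the identity
dropout.eval()
print(dropout(activations))
```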
An animation of the process is shown below, which shows how this process introduces some noise into the process and produces a more robust network:" ] }, { "cell_type": "markdown", "id": "c05e27f9-fd24-43f7-bb3d-54b0629be5bd", "metadata": {}, "source": [ "![](images/dropout.gif)" ] }, { "cell_type": "markdown", "id": "fe579ea9-02fe-4ece-b4d4-964f898f5fac", "metadata": {}, "source": [ "Now we can't just zero out activations naively because this will screw up the scales across each layer. Insted we apply dropout with probability `p` and then rescale all activations by `1-p` to keep the scales well behaved.\n", "\n", "The resulting model from the review article thus looks like:" ] }, { "cell_type": "code", "execution_count": 76, "id": "204e0f9c-85b3-4efa-943e-ab8113982c63", "metadata": {}, "outputs": [], "source": [ "model = nn.Sequential(\n", " nn.Linear(20, 200),\n", " nn.ReLU(),\n", " nn.Linear(200, 200),\n", " nn.ReLU(),\n", " nn.Dropout(p=0.2),\n", " nn.Linear(200, 50),\n", " nn.ReLU(),\n", " nn.Linear(50, 50),\n", " nn.ReLU(),\n", " nn.Dropout(p=0.1),\n", " nn.Linear(50, 2),\n", ")" ] }, { "cell_type": "markdown", "id": "02efd65c-6ad9-41e2-8dca-7fb14856a10b", "metadata": {}, "source": [ "And just like before, we can define the optimizer. In this case we'll use a special optimizer called Adam, which combines SGD with some other techniques to speed up training. You can find the details of Adam in Chapter 16 of the fastai book, but for now, we'll just instantiate it from PyTorch:" ] }, { "cell_type": "code", "execution_count": 80, "id": "49f48b18-d64d-4b1e-b2f3-4fd3454baf59", "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "28c66060a17340c6a90305f8beefe93b", "version_major": 2, "version_minor": 0 }, "text/plain": [ "num_epochs: 0%| | 0/3 [00:00\n", " /* Turns off some styling */\n", " progress {\n", " /* gets rid of default border in Firefox and Opera. */\n", " border: none;\n", " /* Needs to be in here for Safari polyfill so background images work as expected. */\n", " background-size: auto;\n", " }\n", " .progress-bar-interrupted, .progress-bar-interrupted::-webkit-progress-bar {\n", " background: #F44336;\n", " }\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
epoch  train_loss  valid_loss  accuracy  time
0      0.251400    0.311221    0.834794  00:13
1      0.243241    0.369533    0.796215  00:13
2      0.242842    0.313126    0.863372  00:13
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "learn.fit(3, lr)" ] }, { "cell_type": "markdown", "id": "87edc26a-93b5-4a38-97ab-30c3417197ca", "metadata": {}, "source": [ "Well, this was quite a deep dive into traiing neural networks from scratch and ending with with all the components that go into a fastai `Learner`! \n", "\n", "Next week, we'll move away from tabular data and take a look a class of neural networks for images that are based on convolutions 👀." ] }, { "cell_type": "markdown", "id": "5250f65a-222e-409b-aa97-2ebc862f39f7", "metadata": {}, "source": [ "## Exercises\n", "\n", "* Instead of using `nn.Sequential` to create our neural network, try implementing this as a subclass of `nn.Module` and training the resulting model.\n", "* Using the validation dataset and dataloader, try computing the validation loss and accuracy within the `fit()` function.\n", "* Read the [_Xavier initialization_ paper](http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf)" ] }, { "cell_type": "code", "execution_count": null, "id": "6b1130f9-ed4b-46a7-b546-b46470dffb86", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.11" } }, "nbformat": 4, "nbformat_minor": 5 }