Jekyll2021-04-30T16:25:43-05:00https://lewtun.github.io/blog/feed.xmlLewis Tunstall’s BlogPosts on machine learning, physics, and topology at irregularly spaced intervals.Weeknotes: Fine-pruning transformers, universal data augmentation2021-01-24T00:00:00-06:002021-01-24T00:00:00-06:00https://lewtun.github.io/blog/weeknotes/nlp/huggingface/transformers/compression/few-shot/2021/01/24/wknotes-pruning-transformers<!--
-->
<div class="container" id="notebook-container">
<div class="cell border-box-sizing code_cell rendered">
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>This week I split my time between wrapping up a book chapter on abstractive summarisation, trying to get UDA to work, and getting my hands dirty with movement pruning.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h2 id="A-first-look-at-movement-pruning">A first look at movement pruning<a class="anchor-link" href="#A-first-look-at-movement-pruning"> </a></h2>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>This week I focused on a compression technique for Transformers called <em>pruning</em>, whose goal is to selectively delete the weights of a model according to some importance criterion. In particular I wanted to understand how movement pruning<sup id="fnref-1" class="footnote-ref"><a href="#fn-1">1</a></sup> worked and how I could adapt <a href="https://twitter.com/SanhEstPasMoi?s=20">Victor Sanh's</a> implementation to run in Jupyter notebooks with the <code>Trainer</code> API from <code>transformers</code>.</p>
<p>The basic idea behind movement pruning is to <em>gradually</em> remove weights during <em>fine-tuning</em> such that the model becomes progressively <em>sparser</em>. As the authors observe, this "fine-pruning" approach addresses one of the main problems with other approaches like <em>magnitude pruning</em> that are designed for pure supervised learning tasks:</p>
<blockquote><p>While magnitude pruning is highly effective for standard supervised learning, it is inherently less useful in the transfer learning regime. In supervised learning, weight values are primarily determined by the end-task training data. In transfer learning, weight values are mostly predetermined by the original model and are only fine-tuned on the end task. This prevents these methods from learning to prune based on the fine-tuning step, or “fine-pruning.”</p>
</blockquote>
<p>Mathematically, the way most pruning methods work is to calculate a matrix ${\bf S}$ of <em>importance scores</em> and then select the top-$v$ percent of weights by importance:$$ \mathrm{Top}_v({\bf S})_{ij} = \left\{ \begin{aligned} 1 && \mathrm{if} \, S_{ij} \mathrm{ \,in\, top\, } v\% \\ 0 && \mathrm{otherwise}\end{aligned} \right.$$
From these scores we can then define a <em>mask</em> ${\bf M} \in \{0,1\}^{n\times n}$ that masks the weights during the forward pass with some input $x_i$ and effectively creates a sparse network:</p>
<p>
$$ a_i = W_{ik}M_{ik}x_k \,.$$
</p>
<p>For example, magnitude pruning calculates the scores according to the magnitude of the weights ${\bf S} = \left(\mid W_{ij} \mid\right)_{1\leq i, j\leq n}$ and then the masks are derived from ${\bf M} = \mathrm{Top}_v({\bf S})$.</p>
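<p>To make the $\mathrm{Top}_v$ operation concrete, here is a minimal sketch of magnitude pruning on a tiny weight matrix. The helper name <code>top_v_mask</code> and the toy values are my own assumptions for illustration and are not taken from the paper's code.</p>

```python
import torch


def top_v_mask(scores: torch.Tensor, v: float) -> torch.Tensor:
    """Return a 0/1 mask that keeps the top v percent of entries in `scores`."""
    # Number of entries to keep (at least one, to avoid an empty mask).
    k = max(1, int(round(scores.numel() * v / 100.0)))
    # Indices of the k largest scores in the flattened matrix.
    top_idx = torch.topk(scores.flatten(), k).indices
    mask = torch.zeros(scores.numel(), dtype=scores.dtype)
    mask[top_idx] = 1.0
    return mask.view_as(scores)


# Toy weights (hypothetical values).
W = torch.tensor([[0.1, -2.0], [0.5, -0.05]])
# Magnitude pruning: the scores are simply S = |W|.
M = top_v_mask(W.abs(), v=50)
# Masked forward pass: a_i = W_ik M_ik x_k
x = torch.tensor([1.0, 1.0])
a = (W * M) @ x
```

<p>With $v = 50$ only the two largest-magnitude weights ($-2.0$ and $0.5$) survive, so the forward pass ignores the other two entries entirely.</p>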
<p>The key novelty with movement pruning is that both the weights <em>and</em> the scores are <em>learned</em> during fine-tuning. This implies that in the backward pass, we also track the gradient of the loss ${\cal L}$ with respect to $S_{ij}$:<sup id="fnref-2" class="footnote-ref"><a href="#fn-2">2</a></sup></p>
<p>
$$ \frac{\partial{\cal L}}{\partial S_{ij}} = \frac{\partial {\cal L}}{\partial a_i}\frac{\partial a_i}{\partial S_{ij}} = \frac{\partial {\cal L}}{\partial a_i}W_{ij}x_j$$
</p>
<p>Once the scores are learned, it is then straightforward to generate the mask using ${\bf M} = \mathrm{Top}_v({\bf S})$. The authors also propose a "soft" version of movement pruning where instead of picking the top-$v$% of weights, one uses a global threshold $\tau$ to define the binary mask: ${\bf M} = ({\bf S} > \tau)$.</p>
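<p>The straight-through gradient and the soft threshold mask can be sketched with a custom autograd function. This is only an illustration under my own assumptions: the class name <code>ThresholdSTE</code>, the random tensors and the threshold of zero are all hypothetical.</p>

```python
import torch


class ThresholdSTE(torch.autograd.Function):
    """Binary mask M = (S > tau) with a straight-through backward pass."""

    @staticmethod
    def forward(ctx, scores, tau):
        # Hard 0/1 decision in the forward pass.
        return (scores > tau).to(scores.dtype)

    @staticmethod
    def backward(ctx, grad_output):
        # Ignore the thresholding: pass the gradient straight to the scores.
        # The threshold itself receives no gradient.
        return grad_output, None


torch.manual_seed(0)
W = torch.randn(3, 4)                      # frozen weights for this demo
S = torch.randn(3, 4, requires_grad=True)  # learnable importance scores
x = torch.randn(4)

M = ThresholdSTE.apply(S, 0.0)
a = (W * M) @ x          # a_i = W_ik M_ik x_k
a.sum().backward()       # dL/da_i = 1 for every i

# The formula above predicts dL/dS_ij = (dL/da_i) * W_ij * x_j = W_ij * x_j
expected_grad = W * x
```

<p>Comparing <code>S.grad</code> with <code>expected_grad</code> confirms that the scores receive exactly the gradient given by the formula above, even for weights that are currently masked out.</p>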
<p>The paper has a nice visualisation of how the pretrained weights of BERT are pruned during fine-tuning and shows how magnitude pruning tends to make the pruning decision mostly on the basis of the pretrained weights (i.e. weights that have small absolute value during pre-training get pruned).</p>
<p><img src="/blog/images/copied_from_nb/my_icons/mag-vs-mov.png" alt="" /></p>
<p>In their experiments, the authors use a cubic sparsity scheduler to increase the amount of sparsity after some $t_i$ steps of warm-up:</p>
<p>
$$v^{(t)} = v_f + (v_i-v_f)\left(1 - \frac{t-t_i}{N\Delta t}\right)^3 \,.$$
</p>
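<p>As a rough sketch, the schedule can be written as a small Python function. The function name, argument names and the clamping outside the pruning window are my own assumptions; only the cubic formula itself comes from the text above.</p>

```python
def cubic_sparsity_schedule(t, v_i, v_f, t_i, n_steps, dt=1):
    """Fraction of weights to keep at step t under the cubic schedule.

    v_i / v_f: initial / final fraction of weights kept
    t_i:       number of warm-up steps before pruning starts
    n_steps:   number of pruning steps N, each of length dt
    """
    # Progress through the pruning phase, clamped to [0, 1] so that the
    # value stays at v_i during warm-up and at v_f once pruning is done
    # (the clamping is an assumption, not part of the formula).
    progress = min(max((t - t_i) / (n_steps * dt), 0.0), 1.0)
    return v_f + (v_i - v_f) * (1.0 - progress) ** 3


# Example: keep 100% of weights during 100 warm-up steps,
# then prune down to 10% over the following 900 steps.
start = cubic_sparsity_schedule(0, v_i=1.0, v_f=0.1, t_i=100, n_steps=900)
middle = cubic_sparsity_schedule(550, v_i=1.0, v_f=0.1, t_i=100, n_steps=900)
end = cubic_sparsity_schedule(1000, v_i=1.0, v_f=0.1, t_i=100, n_steps=900)
```

<p>Because of the cubic exponent, most of the pruning happens early: halfway through the pruning window the kept fraction has already dropped from 1.0 to roughly 0.21.</p>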
<p>The results for both hard and soft movement pruning on SQuAD and other benchmarks are quite impressive, especially in the high-sparsity regimes where less than 5% of the weights are retained!</p>
<p><img src="/blog/images/copied_from_nb/my_icons/mov-pruning-results.png" alt="" /></p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h2 id="Implementing-movement-pruning">Implementing movement pruning<a class="anchor-link" href="#Implementing-movement-pruning"> </a></h2><p>As noted above, I wanted to adapt Victor Sanh's implementation to work with the <code>Trainer</code> API from <code>transformers</code> so that I can run it in a Jupyter notebook. Implementing the <code>Trainer</code> itself was pretty straightforward and I was able to reuse a lot of Victor's code with minor adjustments. The first thing to do was override the <code>compute_loss</code> function as follows:</p>
<div class="highlight"><pre><span></span><span class="k">def</span> <span class="nf">compute_loss</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">model</span><span class="p">,</span> <span class="n">inputs</span><span class="p">):</span>
    <span class="n">threshold</span><span class="p">,</span> <span class="n">_</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">_schedule_threshold</span><span class="p">(</span><span class="o">...</span><span class="p">)</span>
    <span class="n">inputs</span><span class="p">[</span><span class="s2">"threshold"</span><span class="p">]</span> <span class="o">=</span> <span class="n">threshold</span>
    <span class="n">outputs</span> <span class="o">=</span> <span class="n">model</span><span class="p">(</span><span class="o">**</span><span class="n">inputs</span><span class="p">)</span>
    <span class="n">loss</span><span class="p">,</span> <span class="n">_</span><span class="p">,</span> <span class="n">_</span> <span class="o">=</span> <span class="n">outputs</span>
    <span class="k">return</span> <span class="n">loss</span>
</pre></div>
<p>Here we use the sparsity scheduler to get the threshold value $\tau$ needed for soft movement pruning, add it to the inputs and then extract the loss from the forward pass. The next step was to override the <code>create_optimizer_and_scheduler</code> function to account for the fact that there is a learning rate $\alpha_S$ associated with calculating the scores matrix:</p>
<p>
$$ S_{ij}^{(T)} = -\alpha_S \sum_{t<T} \left( \frac{\partial {\cal L}}{\partial W_{ij}}\right)^{(t)} W_{ij}^{(t)} $$
</p>
<p>In practice, this amounts to adding a term to the parameters we wish to optimize over</p>
<div class="highlight"><pre><span></span><span class="n">optimizer_grouped_parameters</span> <span class="o">=</span> <span class="p">[</span>
    <span class="p">{</span>
        <span class="s2">"params"</span><span class="p">:</span> <span class="p">[</span><span class="n">p</span> <span class="k">for</span> <span class="n">n</span><span class="p">,</span> <span class="n">p</span> <span class="ow">in</span> <span class="bp">self</span><span class="o">.</span><span class="n">model</span><span class="o">.</span><span class="n">named_parameters</span><span class="p">()</span>
                   <span class="k">if</span> <span class="s2">"mask_score"</span> <span class="ow">in</span> <span class="n">n</span> <span class="ow">and</span> <span class="n">p</span><span class="o">.</span><span class="n">requires_grad</span><span class="p">],</span>
        <span class="s2">"lr"</span><span class="p">:</span> <span class="bp">self</span><span class="o">.</span><span class="n">args</span><span class="o">.</span><span class="n">mask_scores_learning_rate</span><span class="p">,</span>
    <span class="p">},</span> <span class="o">...</span>
<span class="p">]</span>
</pre></div>
<p>so that the final function takes the form:</p>
<div class="highlight"><pre><span></span><span class="k">def</span> <span class="nf">create_optimizer_and_scheduler</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">num_training_steps</span><span class="p">):</span>
    <span class="n">no_decay</span> <span class="o">=</span> <span class="p">[</span><span class="s2">"bias"</span><span class="p">,</span> <span class="s2">"LayerNorm.weight"</span><span class="p">]</span>
    <span class="n">optimizer_grouped_parameters</span> <span class="o">=</span> <span class="p">[</span>
        <span class="p">{</span>
            <span class="s2">"params"</span><span class="p">:</span> <span class="p">[</span><span class="n">p</span> <span class="k">for</span> <span class="n">n</span><span class="p">,</span> <span class="n">p</span> <span class="ow">in</span> <span class="bp">self</span><span class="o">.</span><span class="n">model</span><span class="o">.</span><span class="n">named_parameters</span><span class="p">()</span>
                       <span class="k">if</span> <span class="s2">"mask_score"</span> <span class="ow">in</span> <span class="n">n</span> <span class="ow">and</span> <span class="n">p</span><span class="o">.</span><span class="n">requires_grad</span><span class="p">],</span>
            <span class="s2">"lr"</span><span class="p">:</span> <span class="bp">self</span><span class="o">.</span><span class="n">args</span><span class="o">.</span><span class="n">mask_scores_learning_rate</span><span class="p">,</span>
        <span class="p">},</span>
        <span class="p">{</span>
            <span class="s2">"params"</span><span class="p">:</span> <span class="p">[</span>
                <span class="n">p</span>
                <span class="k">for</span> <span class="n">n</span><span class="p">,</span> <span class="n">p</span> <span class="ow">in</span> <span class="bp">self</span><span class="o">.</span><span class="n">model</span><span class="o">.</span><span class="n">named_parameters</span><span class="p">()</span>
                <span class="k">if</span> <span class="s2">"mask_score"</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">n</span> <span class="ow">and</span> <span class="n">p</span><span class="o">.</span><span class="n">requires_grad</span>
                <span class="ow">and</span> <span class="ow">not</span> <span class="nb">any</span><span class="p">(</span><span class="n">nd</span> <span class="ow">in</span> <span class="n">n</span> <span class="k">for</span> <span class="n">nd</span> <span class="ow">in</span> <span class="n">no_decay</span><span class="p">)</span>
            <span class="p">],</span>
            <span class="s2">"lr"</span><span class="p">:</span> <span class="bp">self</span><span class="o">.</span><span class="n">args</span><span class="o">.</span><span class="n">learning_rate</span><span class="p">,</span>
            <span class="s2">"weight_decay"</span><span class="p">:</span> <span class="bp">self</span><span class="o">.</span><span class="n">args</span><span class="o">.</span><span class="n">weight_decay</span><span class="p">,</span>
        <span class="p">},</span>
        <span class="p">{</span>
            <span class="s2">"params"</span><span class="p">:</span> <span class="p">[</span>
                <span class="n">p</span>
                <span class="k">for</span> <span class="n">n</span><span class="p">,</span> <span class="n">p</span> <span class="ow">in</span> <span class="bp">self</span><span class="o">.</span><span class="n">model</span><span class="o">.</span><span class="n">named_parameters</span><span class="p">()</span>
                <span class="k">if</span> <span class="s2">"mask_score"</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">n</span> <span class="ow">and</span> <span class="n">p</span><span class="o">.</span><span class="n">requires_grad</span>
                <span class="ow">and</span> <span class="nb">any</span><span class="p">(</span><span class="n">nd</span> <span class="ow">in</span> <span class="n">n</span> <span class="k">for</span> <span class="n">nd</span> <span class="ow">in</span> <span class="n">no_decay</span><span class="p">)</span>
            <span class="p">],</span>
            <span class="s2">"lr"</span><span class="p">:</span> <span class="bp">self</span><span class="o">.</span><span class="n">args</span><span class="o">.</span><span class="n">learning_rate</span><span class="p">,</span>
            <span class="s2">"weight_decay"</span><span class="p">:</span> <span class="mf">0.0</span><span class="p">,</span>
        <span class="p">},</span>
    <span class="p">]</span>
    <span class="bp">self</span><span class="o">.</span><span class="n">optimizer</span> <span class="o">=</span> <span class="n">AdamW</span><span class="p">(</span><span class="n">optimizer_grouped_parameters</span><span class="p">,</span>
                           <span class="n">lr</span><span class="o">=</span><span class="bp">self</span><span class="o">.</span><span class="n">args</span><span class="o">.</span><span class="n">learning_rate</span><span class="p">,</span> <span class="n">eps</span><span class="o">=</span><span class="bp">self</span><span class="o">.</span><span class="n">args</span><span class="o">.</span><span class="n">adam_epsilon</span><span class="p">)</span>
    <span class="bp">self</span><span class="o">.</span><span class="n">lr_scheduler</span> <span class="o">=</span> <span class="n">get_linear_schedule_with_warmup</span><span class="p">(</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">optimizer</span><span class="p">,</span> <span class="n">num_warmup_steps</span><span class="o">=</span><span class="bp">self</span><span class="o">.</span><span class="n">args</span><span class="o">.</span><span class="n">warmup_steps</span><span class="p">,</span>
        <span class="n">num_training_steps</span><span class="o">=</span><span class="bp">self</span><span class="o">.</span><span class="n">t_total</span><span class="p">)</span>
</pre></div>
<p>So far, so good ... but what I had not appreciated is that one needs <em>special</em> model classes to deal with sparse matrices! In Victor's implementation, this requires a wholesale rewrite of the BERT classes to replace all the <code>torch.nn.Linear</code> layers with a custom <code>MaskedLinear</code> layer and additional parameters to calculate the adaptive mask in the forward pass.</p>
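<p>To give a feel for what such a layer involves, here is a stripped-down sketch of a masked linear layer. It is <em>not</em> Victor's <code>MaskedLinear</code>: the class name <code>MaskedLinearSketch</code>, the zero initialisation of the scores and the <code>detach</code> trick used for the straight-through gradient are all my own assumptions.</p>

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MaskedLinearSketch(nn.Module):
    """Linear layer whose weights are masked by thresholded, learnable scores."""

    def __init__(self, in_features, out_features):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)
        # One learnable score per weight. The parameter name contains
        # "mask_score", so an optimizer that groups parameters on that
        # substring will pick it up.
        self.mask_scores = nn.Parameter(torch.zeros(out_features, in_features))

    def forward(self, x, threshold):
        hard_mask = (self.mask_scores > threshold).to(x.dtype)
        # Straight-through trick: the forward value equals hard_mask, while
        # the backward pass treats the mask as the identity in the scores.
        mask = hard_mask + (self.mask_scores - self.mask_scores.detach())
        return F.linear(x, self.linear.weight * mask, self.linear.bias)


torch.manual_seed(0)
layer = MaskedLinearSketch(4, 2)
x = torch.randn(5, 4)

dense_out = layer(x, threshold=-1.0)   # all scores (0) exceed -1: nothing pruned
pruned_out = layer(x, threshold=1.0)   # no score exceeds 1: everything pruned
dense_out.sum().backward()             # scores receive a gradient
```

<p>The <code>threshold</code> argument plays the role of the value injected into the model inputs by <code>compute_loss</code>; a real implementation would also need to thread it through every layer of the model.</p>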
<p>Although there is <a href="https://discuss.huggingface.co/t/hugging-face-reads-01-2021-sparsity-and-pruning/3144/4?u=lewtun">no plan</a> to include these masked versions of BERT into the main <code>transformers</code> library, <a href="https://twitter.com/madlag?s=20">François Lagunas</a> at HuggingFace pointed me to work he's done on making <a href="https://github.com/huggingface/pytorch_block_sparse">sparse matrices efficient in PyTorch</a>.</p>
<p>In any case, I went ahead with Victor's masked models and ran a first set of experiments using 10% of the SQuAD data. To warm up, I used Victor's scripts as a benchmark and observed some peculiar features of fine-pruning: the metrics are flat for half the training before suddenly shooting up! Similarly, the loss gets <em>worse</em> before getting better. This is somewhat surprising, since fine-tuning usually gets most of the performance in the first 1-2 epochs of training before plateauing.</p>
<p><img src="/blog/images/copied_from_nb/my_icons/pruning-scores.png" alt="" /></p>
<p>So far I have not been able to reproduce these results in my implementation, with my model failing to recover from the characteristic dip in performance during training:</p>
<p><img src="/blog/images/copied_from_nb/my_icons/pruning-scores-fail.png" alt="" /></p>
<p>So my focus for next week is to figure out what's going wrong and gradually scale out to fine-pruning on the full SQuAD dataset!</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h2 id="Universal-data-augmentation">Unsupervised data augmentation<a class="anchor-link" href="#Universal-data-augmentation"> </a></h2>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>For the last two weeks, <a href="https://twitter.com/lvwerra?s=20">Leandro von Werra</a> and I have been dabbling in few-shot learning with Google's <a href="https://arxiv.org/abs/1904.12848"><em>Unsupervised Data Augmentation for Consistency Training</em></a> or UDA for short. For the book we're working on, one idea is to provide readers with a solution to a common problem in industry: what do you do when you've got tons of unlabelled text data and only 10s-100s of labelled examples?</p>
<p>The UDA paper proposes an elegant approach to this problem by applying data augmentation on the unlabelled data and then minimising the KL divergence between the model's predicted probability distribution for the raw and augmented data. This "unsupervised consistency loss" is then added to the standard cross-entropy loss coming from the labelled examples and the model is trained jointly across the two tasks.</p>
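<p>In PyTorch terms, the combined objective might look roughly like the sketch below. The function name <code>uda_loss</code>, the weighting factor <code>lam</code> and the toy logits are my own assumptions, and the sketch leaves out any additional training tricks.</p>

```python
import torch
import torch.nn.functional as F


def uda_loss(sup_logits, sup_labels, raw_logits, aug_logits, lam=1.0):
    """Cross-entropy on labelled data plus a consistency term on unlabelled data."""
    sup_loss = F.cross_entropy(sup_logits, sup_labels)
    # Predictions on the raw unlabelled text act as a fixed target,
    # so no gradient flows through them.
    raw_probs = F.softmax(raw_logits, dim=-1).detach()
    aug_log_probs = F.log_softmax(aug_logits, dim=-1)
    # F.kl_div(input, target) computes KL(target || input), with the
    # input given as log-probabilities.
    consistency = F.kl_div(aug_log_probs, raw_probs, reduction="batchmean")
    return sup_loss + lam * consistency


# Toy logits for one labelled and one unlabelled example (hypothetical values).
sup_logits = torch.tensor([[1.0, -1.0]])
sup_labels = torch.tensor([0])
raw_logits = torch.tensor([[2.0, 0.0]])

# Identical predictions on raw and augmented text: no consistency penalty.
loss_consistent = uda_loss(sup_logits, sup_labels, raw_logits, raw_logits)
# Flipped predictions on the augmented text: the penalty kicks in.
loss_inconsistent = uda_loss(sup_logits, sup_labels, raw_logits,
                             torch.tensor([[0.0, 2.0]]))
```

<p>When the model makes the same prediction on the raw and augmented text the loss reduces to plain cross-entropy, and it grows as the two predictions diverge.</p>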
<p><img src="/blog/images/copied_from_nb/my_icons/uda.png" alt="" /></p>
<p>The paper reports some spectacular results: using just <em>20 examples</em> from IMDB, UDA gets an error-rate that surpasses BERT-large fine-tuned on the full 25k examples in the training set!</p>
<p>There was just one hitch: the Google implementation is in <a href="https://github.com/google-research/uda/issues/8">Python2</a> and TensorFlow v1 🤮</p>
<p><img src="/blog/images/copied_from_nb/my_icons/uda-python.png" alt="" /></p>
<p>Being allergic to both, we decided to see if we could reproduce the results from an open-source port to PyTorch. In hindsight, this turned out to be a foolish decision because now we were debugging against 3 frameworks! It was also a humbling lesson in not believing what is reported in some random repo you find on the internet 😉.</p>
<p>So in the end, I bit the bullet and decided to run Google's implementation which unsurprisingly worked out of the box.<sup id="fnref-3" class="footnote-ref"><a href="#fn-3">3</a></sup> With just 10k steps and a few hours of training on a single GPU, UDA can indeed achieve > 90% accuracy on IMDB:</p>
<pre><code>=== step 500 ===
INFO:tensorflow: eval_classify_loss = 0.3957828
INFO:tensorflow: eval_classify_accuracy = 0.57844
INFO:tensorflow: loss = 0.80444646
=== step 1000 ===
INFO:tensorflow: eval_classify_loss = 0.68793213
INFO:tensorflow: eval_classify_accuracy = 0.56504
INFO:tensorflow: loss = 1.4864826
=== step 2000 ===
INFO:tensorflow: eval_classify_loss = 0.14758773
INFO:tensorflow: eval_classify_accuracy = 0.89524
INFO:tensorflow: loss = 0.71094906
=== step 10000 ===
INFO:tensorflow: eval_classify_loss = 0.23858581
INFO:tensorflow: eval_classify_accuracy = 0.91296
INFO:tensorflow: loss = 0.23858581</code></pre>
<p>So now that we're confident UDA really works, the next step will be to do a proper port to PyTorch - yay!</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h2 id="TIL-this-week">TIL this week<a class="anchor-link" href="#TIL-this-week"> </a></h2><ul>
<li><a href="https://lewtun.github.io/blog/til/nlp/pytorch/2021/01/24/til-slicing-torch-datasets.html">Slicing PyTorch Datasets</a></li>
</ul>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p><div class="footnotes"><p id="fn-1">1. <a href="https://arxiv.org/abs/2005.07683"><em>Movement Pruning: Adaptive Sparsity by Fine-Tuning</em></a> by Victor Sanh, Thomas Wolf, Alexander M. Rush (2020)<a href="#fnref-1" class="footnote footnotes">↩</a></p></div></p>
<p><div class="footnotes"><p id="fn-2">2. In a new term for me, this estimator is called straight-through because the top-$v$ function is ignored in the backward pass.<a href="#fnref-2" class="footnote footnotes">↩</a></p></div></p>
<p><div class="footnotes"><p id="fn-3">3. Well, <em>almost</em>. It took me a while to realise that when TensorFlow's <code>TPUEstimator</code> says it's running on a CPU, it's actually running on a GPU 🤷.<a href="#fnref-3" class="footnote footnotes">↩</a></p></div></p>
</div>
</div>
</div>
</div>Slicing PyTorch Datasets2021-01-24T00:00:00-06:002021-01-24T00:00:00-06:00https://lewtun.github.io/blog/til/nlp/pytorch/2021/01/24/til-slicing-torch-datasets<!--
-->
<div class="container" id="notebook-container">
<div class="cell border-box-sizing code_cell rendered">
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>I wanted to run some experiments with <a href="https://twitter.com/SanhEstPasMoi?s=20">Victor Sanh's</a> implementation of <a href="https://github.com/huggingface/transformers/tree/master/examples/research_projects/movement-pruning">movement pruning</a> so that I could compare against a custom <code>Trainer</code> I had implemented. Since each epoch of training on SQuAD takes around 2 hours on a single GPU, I wanted to speed up the comparison by prune-tuning on a <em>subset</em> of the data.</p>
<p>Since it's been a while since I last worked directly with PyTorch <code>Dataset</code> objects,<sup id="fnref-1" class="footnote-ref"><a href="#fn-1">1</a></sup> I'd forgotten that one can't naively slice a dataset. For example, the following will fail:</p>
<div class="highlight"><pre><span></span><span class="kn">from</span> <span class="nn">torch.utils.data</span> <span class="kn">import</span> <span class="n">RandomSampler</span><span class="p">,</span> <span class="n">DataLoader</span>
<span class="n">train_ds</span> <span class="o">=</span> <span class="o">...</span>
<span class="n">sample_ds</span> <span class="o">=</span> <span class="n">train_ds</span><span class="p">[:</span><span class="mi">10</span><span class="p">]</span> <span class="c1"># folly!</span>
<span class="n">sample_sampler</span> <span class="o">=</span> <span class="n">RandomSampler</span><span class="p">(</span><span class="n">sample_ds</span><span class="p">)</span>
<span class="n">sample_dl</span> <span class="o">=</span> <span class="n">DataLoader</span><span class="p">(</span><span class="n">sample_ds</span><span class="p">,</span> <span class="n">sampler</span><span class="o">=</span><span class="n">sample_sampler</span><span class="p">,</span> <span class="n">batch_size</span><span class="o">=</span><span class="mi">4</span><span class="p">)</span>
<span class="nb">next</span><span class="p">(</span><span class="nb">iter</span><span class="p">(</span><span class="n">sample_dl</span><span class="p">))</span> <span class="c1"># KeyError or similar :(</span>
</pre></div>
<p>This happens because slicing <code>train_ds</code> returns an object of a different <em>type</em> to <code>Dataset</code> (e.g. a <code>dict</code>), so the <code>RandomSampler</code> doesn't know how to produce appropriate samples for the <code>DataLoader</code>.</p>
<p>The solution I ended up with is to use the <code>Subset</code> class to create the desired subset:</p>
<div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>
<span class="kn">from</span> <span class="nn">torch.utils.data</span> <span class="kn">import</span> <span class="n">RandomSampler</span><span class="p">,</span> <span class="n">DataLoader</span><span class="p">,</span> <span class="n">Subset</span>
<span class="n">train_ds</span> <span class="o">=</span> <span class="o">...</span>
<span class="n">num_train_samples</span> <span class="o">=</span> <span class="mi">100</span>
<span class="n">sample_ds</span> <span class="o">=</span> <span class="n">Subset</span><span class="p">(</span><span class="n">train_ds</span><span class="p">,</span> <span class="n">np</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="n">num_train_samples</span><span class="p">))</span>
<span class="n">sample_sampler</span> <span class="o">=</span> <span class="n">RandomSampler</span><span class="p">(</span><span class="n">sample_ds</span><span class="p">)</span>
<span class="n">sample_dl</span> <span class="o">=</span> <span class="n">DataLoader</span><span class="p">(</span><span class="n">sample_ds</span><span class="p">,</span> <span class="n">sampler</span><span class="o">=</span><span class="n">sample_sampler</span><span class="p">,</span> <span class="n">batch_size</span><span class="o">=</span><span class="mi">4</span><span class="p">)</span>
<span class="nb">next</span><span class="p">(</span><span class="nb">iter</span><span class="p">(</span><span class="n">sample_dl</span><span class="p">))</span>
</pre></div>
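<p>The difference between the two approaches is easy to check on a toy dataset. The <code>ToyDataset</code> class below is a hypothetical stand-in that mimics the dictionary-of-encodings pattern, with made-up values.</p>

```python
import torch
from torch.utils.data import DataLoader, Dataset, RandomSampler, Subset


class ToyDataset(Dataset):
    """Tiny dataset that stores its features in a dict of lists."""

    def __init__(self, n):
        self.encodings = {"input_ids": [[i, i + 1] for i in range(n)]}
        self.labels = [i % 2 for i in range(n)]

    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)


toy_ds = ToyDataset(20)

# Slicing goes through __getitem__ and hands back a plain dict of tensors,
# whose "length" is the number of keys rather than the number of samples.
sliced = toy_ds[:10]

# Subset keeps the Dataset interface and just remaps the indices.
subset = Subset(toy_ds, range(10))
dl = DataLoader(subset, sampler=RandomSampler(subset), batch_size=4)
batch = next(iter(dl))
```

<p>Here <code>len(sliced)</code> is 2 (one entry per key), so a sampler built on it would only ever draw the indices 0 and 1, and indexing the dict with those integers raises a <code>KeyError</code>. The <code>Subset</code>, by contrast, reports the expected 10 samples.</p>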
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h2 id="A-simple-example">A simple example<a class="anchor-link" href="#A-simple-example"> </a></h2>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>To see this in action, we'll use the IMDB dataset as an example. First let's download and unpack the dataset:</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span>wget -nc http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz -P data
<span class="o">!</span>tar -xf data/aclImdb_v1.tar.gz -C data/
</pre></div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Following the <code>transformers</code> <a href="https://huggingface.co/transformers/custom_datasets.html#sequence-classification-with-imdb-reviews">docs</a>, the next thing we need is to read the samples and labels. The following code does the trick:</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="kn">from</span> <span class="nn">pathlib</span> <span class="kn">import</span> <span class="n">Path</span>
<span class="n">DATA</span> <span class="o">=</span> <span class="n">Path</span><span class="p">(</span><span class="s1">'data/aclImdb'</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">read_imdb_split</span><span class="p">(</span><span class="n">split_dir</span><span class="p">):</span>
    <span class="n">split_dir</span> <span class="o">=</span> <span class="n">Path</span><span class="p">(</span><span class="n">split_dir</span><span class="p">)</span>
    <span class="n">texts</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="n">labels</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="k">for</span> <span class="n">label_dir</span> <span class="ow">in</span> <span class="p">[</span><span class="s2">"pos"</span><span class="p">,</span> <span class="s2">"neg"</span><span class="p">]:</span>
        <span class="k">for</span> <span class="n">text_file</span> <span class="ow">in</span> <span class="p">(</span><span class="n">split_dir</span><span class="o">/</span><span class="n">label_dir</span><span class="p">)</span><span class="o">.</span><span class="n">iterdir</span><span class="p">():</span>
            <span class="n">texts</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">text_file</span><span class="o">.</span><span class="n">read_text</span><span class="p">())</span>
            <span class="n">labels</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="mi">0</span> <span class="k">if</span> <span class="n">label_dir</span> <span class="o">==</span> <span class="s2">"neg"</span> <span class="k">else</span> <span class="mi">1</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">texts</span><span class="p">,</span> <span class="n">labels</span>
<span class="n">train_texts</span><span class="p">,</span> <span class="n">train_labels</span> <span class="o">=</span> <span class="n">read_imdb_split</span><span class="p">(</span><span class="sa">f</span><span class="s1">'</span><span class="si">{</span><span class="n">DATA</span><span class="si">}</span><span class="s1">/train'</span><span class="p">)</span>
<span class="c1"># peek at first sample and label</span>
<span class="n">train_texts</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">train_labels</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_text output_subarea output_execute_result">
<pre>('For a movie that gets no respect there sure are a lot of memorable quotes listed for this gem. Imagine a movie where Joe Piscopo is actually funny! Maureen Stapleton is a scene stealer. The Moroni character is an absolute scream. Watch for Alan "The Skipper" Hale jr. as a police Sgt.',
1)</pre>
</div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Next we need to tokenize the texts, which can be done as follows:</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="kn">from</span> <span class="nn">transformers</span> <span class="kn">import</span> <span class="n">AutoTokenizer</span>
<span class="n">tokenizer</span> <span class="o">=</span> <span class="n">AutoTokenizer</span><span class="o">.</span><span class="n">from_pretrained</span><span class="p">(</span><span class="s1">'distilbert-base-uncased'</span><span class="p">)</span>
<span class="n">train_encodings</span> <span class="o">=</span> <span class="n">tokenizer</span><span class="p">(</span><span class="n">train_texts</span><span class="p">,</span> <span class="n">truncation</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">padding</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
</pre></div>
</div>
</div>
</div>
</div>
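<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>To make <code>truncation=True</code> and <code>padding=True</code> a bit more concrete, here's a rough pure-Python sketch of what the two options do to a batch of token ids. The helper <code>pad_and_truncate</code> and the toy ids are made up for illustration and are not part of <code>transformers</code>; the real tokenizer also keeps the special tokens intact when it truncates, which this simple slicing does not.</p>

```python
def pad_and_truncate(batch_ids, max_length, pad_id=0):
    """Toy stand-in for truncation=True and padding=True (not the real tokenizer)."""
    # Truncation: drop anything beyond max_length tokens.
    truncated = [ids[:max_length] for ids in batch_ids]
    # Padding: extend every sequence to the longest one left in the batch.
    longest = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in truncated]
    # The attention mask marks real tokens with 1 and padding with 0.
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Hypothetical token ids; 101 and 102 mimic the [CLS] and [SEP] ids of BERT.
toy_batch = [[101, 2005, 1037, 102], [101, 2023, 102]]
toy_encodings = pad_and_truncate(toy_batch, max_length=8)
```

<p>The shorter sequence gets a trailing <code>0</code> in <code>input_ids</code> and a matching <code>0</code> in <code>attention_mask</code>, which is the same pattern as the padded rows in the tensors further down.</p>
</div>
</div>
</div>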
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Finally we can define a custom <code>Dataset</code> object:</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="kn">import</span> <span class="nn">torch</span>
<span class="k">class</span> <span class="nc">IMDbDataset</span><span class="p">(</span><span class="n">torch</span><span class="o">.</span><span class="n">utils</span><span class="o">.</span><span class="n">data</span><span class="o">.</span><span class="n">Dataset</span><span class="p">):</span>
    <span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">encodings</span><span class="p">,</span> <span class="n">labels</span><span class="p">):</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">encodings</span> <span class="o">=</span> <span class="n">encodings</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">labels</span> <span class="o">=</span> <span class="n">labels</span>
    <span class="k">def</span> <span class="fm">__getitem__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">idx</span><span class="p">):</span>
        <span class="n">item</span> <span class="o">=</span> <span class="p">{</span><span class="n">key</span><span class="p">:</span> <span class="n">torch</span><span class="o">.</span><span class="n">tensor</span><span class="p">(</span><span class="n">val</span><span class="p">[</span><span class="n">idx</span><span class="p">])</span> <span class="k">for</span> <span class="n">key</span><span class="p">,</span> <span class="n">val</span> <span class="ow">in</span> <span class="bp">self</span><span class="o">.</span><span class="n">encodings</span><span class="o">.</span><span class="n">items</span><span class="p">()}</span>
        <span class="n">item</span><span class="p">[</span><span class="s1">'labels'</span><span class="p">]</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">tensor</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">labels</span><span class="p">[</span><span class="n">idx</span><span class="p">])</span>
        <span class="k">return</span> <span class="n">item</span>
    <span class="k">def</span> <span class="fm">__len__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="k">return</span> <span class="nb">len</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">labels</span><span class="p">)</span>
<span class="n">train_ds</span> <span class="o">=</span> <span class="n">IMDbDataset</span><span class="p">(</span><span class="n">train_encodings</span><span class="p">,</span> <span class="n">train_labels</span><span class="p">)</span>
</pre></div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Each element of <code>train_ds</code> is a <code>dict</code> with keys corresponding to the inputs expected in the <code>forward</code> pass of a Transformer model like BERT. If we take a slice, then we get tensors for each of the keys:</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">train_ds</span><span class="p">[:</span><span class="mi">10</span><span class="p">]</span>
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_text output_subarea output_execute_result">
<pre>{'input_ids': tensor([[ 101, 2005, 1037, ..., 0, 0, 0],
[ 101, 13576, 5469, ..., 0, 0, 0],
[ 101, 1037, 5024, ..., 0, 0, 0],
...,
[ 101, 2023, 2001, ..., 0, 0, 0],
[ 101, 2081, 2044, ..., 3286, 1011, 102],
[ 101, 2005, 1037, ..., 0, 0, 0]]),
'attention_mask': tensor([[1, 1, 1, ..., 0, 0, 0],
[1, 1, 1, ..., 0, 0, 0],
[1, 1, 1, ..., 0, 0, 0],
...,
[1, 1, 1, ..., 0, 0, 0],
[1, 1, 1, ..., 1, 1, 1],
[1, 1, 1, ..., 0, 0, 0]]),
'labels': tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1])}</pre>
</div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>A slice like this returns a single <code>dict</code> of batched tensors rather than a dataset of individual examples, so it is not suitable for sampling from. The solution is to wrap our <code>Dataset</code> with <code>Subset</code> as follows:</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>
<span class="kn">from</span> <span class="nn">torch.utils.data</span> <span class="kn">import</span> <span class="n">Subset</span>
<span class="n">num_train_examples</span> <span class="o">=</span> <span class="mi">100</span>
<span class="n">sample_ds</span> <span class="o">=</span> <span class="n">Subset</span><span class="p">(</span><span class="n">train_ds</span><span class="p">,</span> <span class="n">np</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="n">num_train_examples</span><span class="p">))</span>
<span class="k">assert</span> <span class="nb">len</span><span class="p">(</span><span class="n">sample_ds</span><span class="p">)</span> <span class="o">==</span> <span class="n">num_train_examples</span>
</pre></div>
</div>
</div>
</div>
</div>
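<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>For intuition on why the wrapper helps, here's a minimal pure-Python sketch of the idea. <code>ToyDataset</code> and <code>ToySubset</code> are hypothetical stand-ins I've made up for illustration (plain lists instead of tensors); the real <code>Subset</code> has a few more bells and whistles, but the core trick is the same: keep a list of indices and look up one example at a time.</p>

```python
class ToyDataset:
    """Simplified stand-in for the IMDbDataset above, using plain lists."""

    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        # With an int this returns one example; with a slice it returns
        # a single dict whose values each hold several examples.
        item = {key: val[idx] for key, val in self.encodings.items()}
        item["labels"] = self.labels[idx]
        return item

    def __len__(self):
        return len(self.labels)


class ToySubset:
    """Simplified stand-in for Subset: an index list on top of a dataset."""

    def __init__(self, dataset, indices):
        self.dataset = dataset
        self.indices = list(indices)

    def __getitem__(self, idx):
        # Map the position in the subset back to a position in the dataset.
        return self.dataset[self.indices[idx]]

    def __len__(self):
        return len(self.indices)


toy_ds = ToyDataset({"input_ids": [[101, 7, 102], [101, 8, 102], [101, 9, 102]]}, [1, 0, 1])
toy_slice = toy_ds[:2]                    # one dict holding two examples
toy_subset = ToySubset(toy_ds, range(2))
toy_example = toy_subset[1]               # one dict holding a single example
```

<p>The slice has no useful length and can't be indexed example by example, whereas the subset still behaves like a dataset, which is what a sampler needs.</p>
</div>
</div>
</div>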
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>As a sanity check, let's compare the raw text against the decoded examples in the dataset:</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">tokenizer</span><span class="o">.</span><span class="n">decode</span><span class="p">(</span><span class="n">sample_ds</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="s1">'input_ids'</span><span class="p">],</span> <span class="n">skip_special_tokens</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_text output_subarea output_execute_result">
<pre>'for a movie that gets no respect there sure are a lot of memorable quotes listed for this gem. imagine a movie where joe piscopo is actually funny! maureen stapleton is a scene stealer. the moroni character is an absolute scream. watch for alan " the skipper " hale jr. as a police sgt.'</pre>
</div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>This looks good; how about the last example?</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="nb">print</span><span class="p">(</span><span class="n">tokenizer</span><span class="o">.</span><span class="n">decode</span><span class="p">(</span><span class="n">sample_ds</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">][</span><span class="s1">'input_ids'</span><span class="p">],</span> <span class="n">skip_special_tokens</span><span class="o">=</span><span class="kc">True</span><span class="p">),</span> <span class="s2">"</span><span class="se">\n</span><span class="s2">"</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="n">train_texts</span><span class="p">[</span><span class="mi">99</span><span class="p">])</span>
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_subarea output_stream output_stdout output_text">
<pre>beautiful film, pure cassavetes style. gena rowland gives a stunning performance of a declining actress, dealing with success, aging, loneliness... and alcoholism. she tries to escape her own subconscious ghosts, embodied by the death spectre of a young girl. acceptance of oneself, of human condition, though its overall difficulties, is the real purpose of the film. the parallel between the theatrical sequences and the film itself are puzzling : it's like if the stage became a way out for the heroin. if all american movies could only be that top - quality, dealing with human relations on an adult level, not trying to infantilize and standardize feelings... one of the best dramas ever. 10 / 10.
Beautiful film, pure Cassavetes style. Gena Rowland gives a stunning performance of a declining actress, dealing with success, aging, loneliness...and alcoholism. She tries to escape her own subconscious ghosts, embodied by the death spectre of a young girl. Acceptance of oneself, of human condition, though its overall difficulties, is the real purpose of the film. The parallel between the theatrical sequences and the film itself are puzzling: it's like if the stage became a way out for the Heroin. If all american movies could only be that top-quality, dealing with human relations on an adult level, not trying to infantilize and standardize feelings... One of the best dramas ever. 10/10.
</pre>
</div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>The final step is to define the sampler and dataloader and we're done!</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="kn">from</span> <span class="nn">torch.utils.data</span> <span class="kn">import</span> <span class="n">RandomSampler</span><span class="p">,</span> <span class="n">DataLoader</span>
<span class="n">sample_sampler</span> <span class="o">=</span> <span class="n">RandomSampler</span><span class="p">(</span><span class="n">sample_ds</span><span class="p">)</span>
<span class="n">sample_dl</span> <span class="o">=</span> <span class="n">DataLoader</span><span class="p">(</span><span class="n">sample_ds</span><span class="p">,</span> <span class="n">sampler</span><span class="o">=</span><span class="n">sample_sampler</span><span class="p">,</span> <span class="n">batch_size</span><span class="o">=</span><span class="mi">4</span><span class="p">)</span>
<span class="nb">next</span><span class="p">(</span><span class="nb">iter</span><span class="p">(</span><span class="n">sample_dl</span><span class="p">))</span>
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_text output_subarea output_execute_result">
<pre>{'input_ids': tensor([[ 101, 13576, 5469, ..., 0, 0, 0],
[ 101, 1037, 5024, ..., 0, 0, 0],
[ 101, 2005, 1037, ..., 0, 0, 0]]),
'attention_mask': tensor([[1, 1, 1, ..., 0, 0, 0],
[1, 1, 1, ..., 0, 0, 0],
[1, 1, 1, ..., 0, 0, 0]]),
'labels': tensor([1, 1, 1])}</pre>
</div>
</div>
</div>
</div>
</div>
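<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Conceptually, the sampler hands out a random permutation of the dataset indices and the dataloader groups them into batches. Here's a small sketch of that division of labour; <code>random_batches</code> is a hypothetical helper written in plain Python, and it leaves out the collation step that stacks the examples into tensors.</p>

```python
import random


def random_batches(num_examples, batch_size, seed=None):
    """Yield lists of indices, mimicking a random sampler feeding a batcher."""
    rng = random.Random(seed)
    # Sampler: every index exactly once, in random order.
    indices = list(range(num_examples))
    rng.shuffle(indices)
    # Batcher: consecutive chunks of the shuffled indices; the last may be smaller.
    for start in range(0, num_examples, batch_size):
        yield indices[start:start + batch_size]


toy_batches = list(random_batches(num_examples=10, batch_size=4, seed=0))
```

<p>With 10 examples and a batch size of 4, this gives two full batches and a final batch of 2, and every index shows up exactly once per pass.</p>
</div>
</div>
</div>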
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p><div class="footnotes"><p id="fn-1">1. Mostly because I've been corrupted by the <code>datasets</code> and <code>fastai</code> APIs<a href="#fnref-1" class="footnote footnotes">↩</a></p></div></p>
</div>
</div>
</div>
</div>Weeknotes: Distilling distilled transformers2021-01-17T00:00:00-06:002021-01-17T00:00:00-06:00https://lewtun.github.io/blog/weeknotes/nlp/huggingface/transformers/2021/01/17/wknotes-distillation-and-generation<!--
#################################################
### THIS FILE WAS AUTOGENERATED! DO NOT EDIT! ###
#################################################
# file to edit: _notebooks/2021-01-17-wknotes-distillation-and-generation.ipynb
-->
<div class="container" id="notebook-container">
<div class="cell border-box-sizing code_cell rendered">
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>This week I mostly worked on getting my knowledge distillation code up and running, doing some pair-programming with <a href="https://twitter.com/lvwerra">Leandro von Werra</a> to re-implement Google's <a href="https://arxiv.org/abs/1904.12848"><em>Unsupervised Data Augmentation for Consistency Training</em></a>, and reviewing a book chapter on decoding strategies for text generation.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h2 id="$(\mathrm{DistilBERT})^2$">$(\mathrm{DistilBERT})^2$<a class="anchor-link" href="#$(\mathrm{DistilBERT})^2$"> </a></h2>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>I extended my question answering <a href="https://github.com/lewtun/transformerlab/tree/master">analysis</a> with <code>transformers</code> to implement a proof-of-concept for <em>task-specific</em> knowledge distillation.<sup id="fnref-1" class="footnote-ref"><a href="#fn-1">1</a></sup> Unlike <em>task-agnostic</em> distillation where the transfer of knowledge from teacher to student is done during <em>pretraining</em>, the task-specific approach involves using a teacher to augment the cross-entropy loss of the student during <em>finetuning</em>:</p>
<p>
$${\cal L}(\mathbf{x}|T) = - \sum_i \bar{y}_i\log y_i(\mathbf{x}|T) -T^2 \sum_i \hat{y}_i(\mathbf{x}|T)\log y_i(\mathbf{x}|T)$$
</p>
<p>Here $T$ is the temperature, $\bar{y}$ are the ground-truth labels, $\hat{y}$ are the temperature-softened outputs of the teacher, and $y_i$ is the student's softmax with temperature.</p>
<p>This neat idea comes from the <a href="https://arxiv.org/pdf/1910.01108.pdf">DistilBERT paper</a>, where the authors found that including a "second step of distillation" produced a student that performed better than simply finetuning the distilled language model:</p>
<blockquote><p>We also studied whether we could add another step of distillation during the adaptation phase by fine-tuning DistilBERT on SQuAD using a BERT model previously fine-tuned on SQuAD as a teacher for an additional term in the loss (knowledge distillation). In this setting, there are thus two successive steps of distillation, one during the pre-training phase and one during the adaptation phase. In this case, we were able to reach interesting performances given the size of the model: 79.8 F1 and 70.4 EM, i.e. within 3 points of the full model.</p>
</blockquote>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>A comparison of the two approaches is shown in the figure below:</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p><figure>
<img class="docimage" src="/blog/images/copied_from_nb/my_icons/distillation.png" alt="distillation" />
<figcaption>Task-specific distillation (left) versus task-agnostic distillation (right). Figure from FastFormers by Y. Kim and H. Awadalla [arXiv:2010.13382].</figcaption>
</figure>
</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>I find this idea to be quite appealing for deploying Transformers in production environments as we get the benefits of speed from using a distilled language model, yet largely preserve the performance of the teacher.</p>
<p>So my task this week was to reproduce the SQuAD v1.1 results from Table 2 of the DistilBERT paper. To do this I integrated <a href="https://twitter.com/GuggerSylvain?s=20">Sylvain Gugger's</a> question answering material (see <a href="https://lewtun.github.io/blog/weeknotes/nlp/huggingface/transformers/2021/01/10/wknotes-question-answering.html">last weeknotes</a>) together with <a href="https://twitter.com/SanhEstPasMoi?s=20">Victor Sanh's</a> <a href="https://github.com/huggingface/transformers/tree/master/examples/research_projects/distillation">implementation</a> of knowledge distillation.<sup id="fnref-2" class="footnote-ref"><a href="#fn-2">2</a></sup></p>
<p>The main bit of work was to create a <code>Trainer</code> class that could:</p>
<ul>
<li>handle two models at once, i.e. for the teacher and student</li>
<li>run evaluation during training to get feedback on the distillation process</li>
</ul>
<p>The solution I ended up with involved subclassing the <code>QuestionAnsweringTrainer</code> I had previously adapted from Sylvain and simply overriding the <code>compute_loss</code> function:</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">torch</span>
<span class="kn">import</span> <span class="nn">torch.nn</span> <span class="k">as</span> <span class="nn">nn</span>
<span class="kn">import</span> <span class="nn">torch.nn.functional</span> <span class="k">as</span> <span class="nn">F</span>
<span class="k">class</span> <span class="nc">DistillationTrainer</span><span class="p">(</span><span class="n">QuestionAnsweringTrainer</span><span class="p">):</span>
    <span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="o">*</span><span class="n">args</span><span class="p">,</span> <span class="n">teacher_model</span><span class="o">=</span><span class="kc">None</span><span class="p">,</span> <span class="o">**</span><span class="n">kwargs</span><span class="p">):</span>
        <span class="nb">super</span><span class="p">()</span><span class="o">.</span><span class="fm">__init__</span><span class="p">(</span><span class="o">*</span><span class="n">args</span><span class="p">,</span> <span class="o">**</span><span class="n">kwargs</span><span class="p">)</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">teacher</span> <span class="o">=</span> <span class="n">teacher_model</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">teacher</span><span class="o">.</span><span class="n">eval</span><span class="p">()</span>
        <span class="o">...</span>
    <span class="k">def</span> <span class="nf">compute_loss</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">model</span><span class="p">,</span> <span class="n">inputs</span><span class="p">):</span>
        <span class="n">inputs_stu</span> <span class="o">=</span> <span class="p">{</span><span class="o">...</span><span class="p">}</span>
        <span class="n">outputs_stu</span> <span class="o">=</span> <span class="n">model</span><span class="p">(</span><span class="o">**</span><span class="n">inputs_stu</span><span class="p">)</span>
        <span class="n">loss</span> <span class="o">=</span> <span class="n">outputs_stu</span><span class="o">.</span><span class="n">loss</span>
        <span class="n">start_logits_stu</span> <span class="o">=</span> <span class="n">outputs_stu</span><span class="o">.</span><span class="n">start_logits</span>
        <span class="n">end_logits_stu</span> <span class="o">=</span> <span class="n">outputs_stu</span><span class="o">.</span><span class="n">end_logits</span>
        <span class="k">with</span> <span class="n">torch</span><span class="o">.</span><span class="n">no_grad</span><span class="p">():</span>
            <span class="n">outputs_tea</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">teacher</span><span class="p">(</span><span class="o">**</span><span class="n">inputs</span><span class="p">)</span>
            <span class="n">start_logits_tea</span> <span class="o">=</span> <span class="n">outputs_tea</span><span class="o">.</span><span class="n">start_logits</span>
            <span class="n">end_logits_tea</span> <span class="o">=</span> <span class="n">outputs_tea</span><span class="o">.</span><span class="n">end_logits</span>
        <span class="n">loss_fct</span> <span class="o">=</span> <span class="n">nn</span><span class="o">.</span><span class="n">KLDivLoss</span><span class="p">(</span><span class="n">reduction</span><span class="o">=</span><span class="s2">"batchmean"</span><span class="p">)</span>
        <span class="n">loss_start</span> <span class="o">=</span> <span class="p">(</span>
            <span class="n">loss_fct</span><span class="p">(</span>
                <span class="n">F</span><span class="o">.</span><span class="n">log_softmax</span><span class="p">(</span><span class="n">start_logits_stu</span> <span class="o">/</span> <span class="bp">self</span><span class="o">.</span><span class="n">args</span><span class="o">.</span><span class="n">temperature</span><span class="p">,</span> <span class="n">dim</span><span class="o">=-</span><span class="mi">1</span><span class="p">),</span>
                <span class="n">F</span><span class="o">.</span><span class="n">softmax</span><span class="p">(</span><span class="n">start_logits_tea</span> <span class="o">/</span> <span class="bp">self</span><span class="o">.</span><span class="n">args</span><span class="o">.</span><span class="n">temperature</span><span class="p">,</span> <span class="n">dim</span><span class="o">=-</span><span class="mi">1</span><span class="p">))</span>
            <span class="o">*</span> <span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">args</span><span class="o">.</span><span class="n">temperature</span> <span class="o">**</span> <span class="mi">2</span><span class="p">))</span>
        <span class="n">loss_end</span> <span class="o">=</span> <span class="p">(</span>
            <span class="n">loss_fct</span><span class="p">(</span>
                <span class="n">F</span><span class="o">.</span><span class="n">log_softmax</span><span class="p">(</span><span class="n">end_logits_stu</span> <span class="o">/</span> <span class="bp">self</span><span class="o">.</span><span class="n">args</span><span class="o">.</span><span class="n">temperature</span><span class="p">,</span> <span class="n">dim</span><span class="o">=-</span><span class="mi">1</span><span class="p">),</span>
                <span class="n">F</span><span class="o">.</span><span class="n">softmax</span><span class="p">(</span><span class="n">end_logits_tea</span> <span class="o">/</span> <span class="bp">self</span><span class="o">.</span><span class="n">args</span><span class="o">.</span><span class="n">temperature</span><span class="p">,</span> <span class="n">dim</span><span class="o">=-</span><span class="mi">1</span><span class="p">))</span>
            <span class="o">*</span> <span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">args</span><span class="o">.</span><span class="n">temperature</span> <span class="o">**</span> <span class="mi">2</span><span class="p">))</span>
        <span class="n">loss_ce</span> <span class="o">=</span> <span class="p">(</span><span class="n">loss_start</span> <span class="o">+</span> <span class="n">loss_end</span><span class="p">)</span> <span class="o">/</span> <span class="mf">2.0</span>
        <span class="n">loss</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">args</span><span class="o">.</span><span class="n">alpha_ce</span> <span class="o">*</span> <span class="n">loss_ce</span> <span class="o">+</span> <span class="bp">self</span><span class="o">.</span><span class="n">args</span><span class="o">.</span><span class="n">alpha_squad</span> <span class="o">*</span> <span class="n">loss</span>
        <span class="k">return</span> <span class="n">loss</span>
</pre></div>
<p>By using DistilBERT-base as the student and BERT-base fine-tuned on SQuAD v1.1 as the teacher, I was able to get within spitting distance of the published results (Exact Match/F1 = 79.1/86.9), with the differences likely due to the choice of hyperparameters:</p>
<p><figure>
<img class="docimage" src="/blog/images/copied_from_nb/my_icons/distillation-results.png" alt="distillation" style="max-width: 500px" />
<figcaption>Evaluation metrics on SQuAD v1.1 for task-specific distillation</figcaption>
</figure>
</p>
<p>Overall, I'm pretty happy with how this turned out and am starting to appreciate the power of the "trainer paradigm", where one can abstract away tons of boilerplate (and error-prone) code for the training loop, evaluation, prediction, etc. and just focus on overriding the parts you need. I'm looking forward to pushing this analysis one step further with pruning and quantization - that's on the menu for next week!</p>
</div>
</div>
</div>
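<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>To get a feel for the distillation term, here's a tiny worked example in plain Python for a single set of logits. The helper names and the toy logits are my own assumptions for illustration; the trainer uses the PyTorch ops and averages over a batch, which this sketch does not do. Note that it computes a KL divergence rather than the cross-entropy written in the formula: the two differ only by the entropy of the teacher, which does not depend on the student, so the gradients for the student are the same.</p>

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax with temperature; higher temperatures give flatter outputs."""
    scaled = [x / temperature for x in logits]
    largest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - largest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_kl(student_logits, teacher_logits, temperature):
    """KL(teacher || student) on softened distributions, scaled by T**2."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(t * (math.log(t) - math.log(s)) for t, s in zip(p_teacher, p_student))
    return kl * temperature ** 2


# Hypothetical logits over three positions for one example.
teacher_logits = [4.0, 1.0, 0.5]
student_logits = [2.5, 1.5, 0.5]
loss_kd = distillation_kl(student_logits, teacher_logits, temperature=2.0)
```

<p>The loss is zero when the student reproduces the teacher's logits and positive otherwise, and the factor of $T^2$ matches the one in the formula and in <code>compute_loss</code>.</p>
</div>
</div>
</div>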
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h2 id="Papers-this-week">Papers this week<a class="anchor-link" href="#Papers-this-week"> </a></h2>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>This week I've been reading up on OpenAI's GPT papers to better understand how decoding methods for text generation work with conditional language models:</p>
<ul>
<li><a href="https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf"><em>Language Models are Unsupervised Multitask Learners</em></a> by Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever (2019)</li>
<li><a href="https://arxiv.org/abs/2005.14165"><em>Language Models are Few-Shot Learners</em></a> by Tom B. Brown et al. (2020)</li>
</ul>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h2 id="TIL-this-week">TIL this week<a class="anchor-link" href="#TIL-this-week"> </a></h2><ul>
<li><a href="https://lewtun.github.io/blog/til/nlp/huggingface/transformers/2021/01/15/til-recovering-hidden-trainer-columns.html">Recovering columns hidden by the 🤗 Trainer</a></li>
</ul>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p><div class="footnotes"><p id="fn-1">1. As far as I know, this term was coined in the <a href="https://arxiv.org/abs/2010.13382"><em>FastFormers: Highly Efficient Transformer Models for Natural Language Understanding</em></a> paper by Y. Kim and H. Awadalla in their<a href="#fnref-1" class="footnote footnotes">↩</a></p></div></p>
<p><div class="footnotes"><p id="fn-2">2. Thanks to <a href="https://twitter.com/Thom_Wolf?s=20">Thomas Wolf</a> for pointing me to this resource.<a href="#fnref-2" class="footnote footnotes">↩</a></p></div></p>
</div>
</div>
</div>
</div>Recovering columns hidden by the 🤗 Trainer2021-01-15T00:00:00-06:002021-01-15T00:00:00-06:00https://lewtun.github.io/blog/til/nlp/huggingface/transformers/2021/01/15/til-recovering-hidden-trainer-columns<!--
#################################################
### THIS FILE WAS AUTOGENERATED! DO NOT EDIT! ###
#################################################
# file to edit: _notebooks/2021-01-15-til-recovering-hidden-trainer-columns.ipynb
-->
<div class="container" id="notebook-container">
<div class="cell border-box-sizing code_cell rendered">
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Lately, I've been using the <code>transformers</code> trainer together with the <code>datasets</code> library and I was a bit mystified by the disappearance of some columns in the training and validation sets after fine-tuning. It wasn't until I saw <a href="https://twitter.com/GuggerSylvain?s=20">Sylvain Gugger's</a> tutorial on <a href="https://github.com/huggingface/notebooks/blob/master/examples/question_answering.ipynb">question answering</a> that I realised this is by design! Indeed, as noted in the <a href="https://huggingface.co/transformers/main_classes/trainer.html?highlight=trainer#id1">docs</a><sup id="fnref-1" class="footnote-ref"><a href="#fn-1">1</a></sup> for the <code>train_dataset</code> and <code>eval_dataset</code> arguments of the <code>Trainer</code>:</p>
<blockquote><p>If it is an <code>datasets.Dataset</code>, columns not accepted by the <code>model.forward()</code> method are automatically removed.</p>
</blockquote>
<p>A simple one-liner to restore the missing columns is the following:</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<div class="highlight"><pre><span></span><span class="n">dataset</span><span class="o">.</span><span class="n">set_format</span><span class="p">(</span><span class="nb">type</span><span class="o">=</span><span class="n">dataset</span><span class="o">.</span><span class="n">format</span><span class="p">[</span><span class="s2">"type"</span><span class="p">],</span> <span class="n">columns</span><span class="o">=</span><span class="nb">list</span><span class="p">(</span><span class="n">dataset</span><span class="o">.</span><span class="n">features</span><span class="o">.</span><span class="n">keys</span><span class="p">()))</span>
</pre></div>
</div>
</div>
</div>
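<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>For intuition about which columns get hidden in the first place, here's a rough standalone sketch of the rule quoted from the docs: keep only the columns that match an argument of the model's <code>forward</code> method. <code>ToyModel</code>, <code>split_columns</code> and the column names are hypothetical and only there for illustration.</p>

```python
import inspect


class ToyModel:
    """Hypothetical model whose forward signature decides which columns survive."""

    def forward(self, input_ids=None, attention_mask=None, labels=None):
        return None


def split_columns(model, column_names):
    # Argument names of forward (the bound method already excludes self).
    signature_columns = list(inspect.signature(model.forward).parameters.keys())
    # Labels may also be stored under these alternative names.
    signature_columns += ["label", "label_ids"]
    kept = [k for k in signature_columns if k in column_names]
    ignored = sorted(set(column_names) - set(signature_columns))
    return kept, ignored


kept, ignored = split_columns(
    ToyModel(), ["input_ids", "attention_mask", "labels", "idx", "sentence"]
)
```

<p>Here <code>idx</code> and <code>sentence</code> end up in <code>ignored</code> because <code>forward</code> has no arguments with those names. The data itself is not deleted, which is why resetting the format brings the columns back.</p>
</div>
</div>
</div>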
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>To understand <em>why</em> this works, we can peek inside the relevant <code>Trainer</code> code:</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="o">??</span>Trainer._remove_unused_columns
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_text output_subarea ">
<pre><span class="ansi-red-fg">Signature:</span>
Trainer<span class="ansi-blue-fg">.</span>_remove_unused_columns<span class="ansi-blue-fg">(</span>
self<span class="ansi-blue-fg">,</span>
dataset<span class="ansi-blue-fg">:</span><span class="ansi-blue-fg">'datasets.Dataset'</span><span class="ansi-blue-fg">,</span>
description<span class="ansi-blue-fg">:</span>Union<span class="ansi-blue-fg">[</span>str<span class="ansi-blue-fg">,</span> NoneType<span class="ansi-blue-fg">]</span><span class="ansi-blue-fg">=</span><span class="ansi-green-fg">None</span><span class="ansi-blue-fg">,</span>
<span class="ansi-blue-fg">)</span>
<span class="ansi-red-fg">Docstring:</span> &lt;no docstring&gt;
<span class="ansi-red-fg">Source:</span>
<span class="ansi-green-fg">def</span> _remove_unused_columns<span class="ansi-blue-fg">(</span>self<span class="ansi-blue-fg">,</span> dataset<span class="ansi-blue-fg">:</span> <span class="ansi-blue-fg">"datasets.Dataset"</span><span class="ansi-blue-fg">,</span> description<span class="ansi-blue-fg">:</span> Optional<span class="ansi-blue-fg">[</span>str<span class="ansi-blue-fg">]</span> <span class="ansi-blue-fg">=</span> <span class="ansi-green-fg">None</span><span class="ansi-blue-fg">)</span><span class="ansi-blue-fg">:</span>
<span class="ansi-green-fg">if</span> <span class="ansi-green-fg">not</span> self<span class="ansi-blue-fg">.</span>args<span class="ansi-blue-fg">.</span>remove_unused_columns<span class="ansi-blue-fg">:</span>
<span class="ansi-green-fg">return</span>
<span class="ansi-red-fg"># Inspect model forward signature to keep only the arguments it accepts.</span>
signature <span class="ansi-blue-fg">=</span> inspect<span class="ansi-blue-fg">.</span>signature<span class="ansi-blue-fg">(</span>self<span class="ansi-blue-fg">.</span>model<span class="ansi-blue-fg">.</span>forward<span class="ansi-blue-fg">)</span>
signature_columns <span class="ansi-blue-fg">=</span> list<span class="ansi-blue-fg">(</span>signature<span class="ansi-blue-fg">.</span>parameters<span class="ansi-blue-fg">.</span>keys<span class="ansi-blue-fg">(</span><span class="ansi-blue-fg">)</span><span class="ansi-blue-fg">)</span>
<span class="ansi-red-fg"># Labels may be named label or label_ids, the default data collator handles that.</span>
signature_columns <span class="ansi-blue-fg">+=</span> <span class="ansi-blue-fg">[</span><span class="ansi-blue-fg">"label"</span><span class="ansi-blue-fg">,</span> <span class="ansi-blue-fg">"label_ids"</span><span class="ansi-blue-fg">]</span>
columns <span class="ansi-blue-fg">=</span> <span class="ansi-blue-fg">[</span>k <span class="ansi-green-fg">for</span> k <span class="ansi-green-fg">in</span> signature_columns <span class="ansi-green-fg">if</span> k <span class="ansi-green-fg">in</span> dataset<span class="ansi-blue-fg">.</span>column_names<span class="ansi-blue-fg">]</span>
ignored_columns <span class="ansi-blue-fg">=</span> list<span class="ansi-blue-fg">(</span>set<span class="ansi-blue-fg">(</span>dataset<span class="ansi-blue-fg">.</span>column_names<span class="ansi-blue-fg">)</span> <span class="ansi-blue-fg">-</span> set<span class="ansi-blue-fg">(</span>signature_columns<span class="ansi-blue-fg">)</span><span class="ansi-blue-fg">)</span>
dset_description <span class="ansi-blue-fg">=</span> <span class="ansi-blue-fg">""</span> <span class="ansi-green-fg">if</span> description <span class="ansi-green-fg">is</span> <span class="ansi-green-fg">None</span> <span class="ansi-green-fg">else</span> <span class="ansi-blue-fg">f"in the {description} set "</span>
logger<span class="ansi-blue-fg">.</span>info<span class="ansi-blue-fg">(</span>
<span class="ansi-blue-fg">f"The following columns {dset_description}don't have a corresponding argument in `{self.model.__class__.__name__}.forward` and have been ignored: {', '.join(ignored_columns)}."</span>
<span class="ansi-blue-fg">)</span>
dataset<span class="ansi-blue-fg">.</span>set_format<span class="ansi-blue-fg">(</span>type<span class="ansi-blue-fg">=</span>dataset<span class="ansi-blue-fg">.</span>format<span class="ansi-blue-fg">[</span><span class="ansi-blue-fg">"type"</span><span class="ansi-blue-fg">]</span><span class="ansi-blue-fg">,</span> columns<span class="ansi-blue-fg">=</span>columns<span class="ansi-blue-fg">)</span>
<span class="ansi-red-fg">File:</span> /usr/local/lib/python3.6/dist-packages/transformers/trainer.py
<span class="ansi-red-fg">Type:</span> function
</pre>
</div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>and see that we're effectively undoing the final <code>dataset.set_format()</code> operation.</p>
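<p>As a rough illustration of that filtering logic, here is a minimal sketch in plain Python. The <code>forward</code> function and the list of column names are stand-ins I made up to mimic a sequence classification model and an encoded dataset; they are not taken from <code>transformers</code> itself:</p>
<div class="highlight"><pre><span></span>import inspect


def forward(input_ids=None, attention_mask=None, labels=None):
    """Hypothetical stand-in for a model's forward method."""
    return None


# Hypothetical column names of an encoded dataset
dataset_columns = ["attention_mask", "idx", "input_ids", "label", "sentence"]

# Same steps as in Trainer._remove_unused_columns, minus the logging
signature_columns = list(inspect.signature(forward).parameters.keys())
signature_columns += ["label", "label_ids"]
kept = [k for k in signature_columns if k in dataset_columns]
ignored = sorted(set(dataset_columns) - set(signature_columns))

print(kept)     # ['input_ids', 'attention_mask', 'label']
print(ignored)  # ['idx', 'sentence']
</pre></div>
<p>Only the columns that match the signature (plus the label) are kept in the format, which is why raw text or index columns are hidden once the <code>Trainer</code> has touched the dataset.</p>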
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h2 id="A-simple-example">A simple example<a class="anchor-link" href="#A-simple-example"> </a></h2>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>To see this in action, let's grab 1,000 examples from the CoLA dataset:</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="kn">from</span> <span class="nn">datasets</span> <span class="kn">import</span> <span class="n">load_dataset</span>
<span class="n">cola</span> <span class="o">=</span> <span class="n">load_dataset</span><span class="p">(</span><span class="s1">'glue'</span><span class="p">,</span> <span class="s1">'cola'</span><span class="p">,</span> <span class="n">split</span><span class="o">=</span><span class="s1">'train[:1000]'</span><span class="p">)</span>
<span class="n">cola</span>
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_text output_subarea output_execute_result">
<pre>Dataset({
features: ['sentence', 'label', 'idx'],
num_rows: 1000
})</pre>
</div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Here we can see that the dataset has three <code>Dataset.features</code>: <code>sentence</code>, <code>label</code>, and <code>idx</code>. By inspecting the <code>Dataset.format</code> attribute</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">cola</span><span class="o">.</span><span class="n">format</span>
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_text output_subarea output_execute_result">
<pre>{'type': None,
'format_kwargs': {},
'columns': ['idx', 'label', 'sentence'],
'output_all_columns': False}</pre>
</div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>we also see that the <code>type</code> is <code>None</code>. Next, let's load a pretrained model and its corresponding tokenizer:</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="kn">import</span> <span class="nn">torch</span>
<span class="kn">from</span> <span class="nn">transformers</span> <span class="kn">import</span> <span class="n">AutoTokenizer</span><span class="p">,</span> <span class="n">AutoModelForSequenceClassification</span>
<span class="n">num_labels</span> <span class="o">=</span> <span class="mi">2</span>
<span class="n">model_name</span> <span class="o">=</span> <span class="s1">'distilbert-base-uncased'</span>
<span class="n">device</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">device</span><span class="p">(</span><span class="s2">"cuda"</span> <span class="k">if</span> <span class="n">torch</span><span class="o">.</span><span class="n">cuda</span><span class="o">.</span><span class="n">is_available</span><span class="p">()</span> <span class="k">else</span> <span class="s2">"cpu"</span><span class="p">)</span>
<span class="n">tokenizer</span> <span class="o">=</span> <span class="n">AutoTokenizer</span><span class="o">.</span><span class="n">from_pretrained</span><span class="p">(</span><span class="n">model_name</span><span class="p">)</span>
<span class="n">model</span> <span class="o">=</span> <span class="p">(</span><span class="n">AutoModelForSequenceClassification</span>
<span class="o">.</span><span class="n">from_pretrained</span><span class="p">(</span><span class="n">model_name</span><span class="p">,</span> <span class="n">num_labels</span><span class="o">=</span><span class="n">num_labels</span><span class="p">)</span>
<span class="o">.</span><span class="n">to</span><span class="p">(</span><span class="n">device</span><span class="p">))</span>
</pre></div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Before fine-tuning the model, we need to tokenize and encode the dataset, so let's do that with a simple <code>Dataset.map</code> operation:</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="k">def</span> <span class="nf">tokenize_and_encode</span><span class="p">(</span><span class="n">batch</span><span class="p">):</span>
<span class="k">return</span> <span class="n">tokenizer</span><span class="p">(</span><span class="n">batch</span><span class="p">[</span><span class="s1">'sentence'</span><span class="p">],</span> <span class="n">truncation</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="n">cola_enc</span> <span class="o">=</span> <span class="n">cola</span><span class="o">.</span><span class="n">map</span><span class="p">(</span><span class="n">tokenize_and_encode</span><span class="p">,</span> <span class="n">batched</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="n">cola_enc</span>
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_text output_subarea output_execute_result">
<pre>Dataset({
features: ['attention_mask', 'idx', 'input_ids', 'label', 'sentence'],
num_rows: 1000
})</pre>
</div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Note that the encoding process has added two new <code>Dataset.features</code> to our dataset: <code>attention_mask</code> and <code>input_ids</code>. Since we don't care about evaluation, let's create a minimal trainer and fine-tune the model for one epoch:</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="kn">from</span> <span class="nn">transformers</span> <span class="kn">import</span> <span class="n">TrainingArguments</span><span class="p">,</span> <span class="n">Trainer</span>
<span class="n">batch_size</span> <span class="o">=</span> <span class="mi">16</span>
<span class="n">logging_steps</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">cola_enc</span><span class="p">)</span> <span class="o">//</span> <span class="n">batch_size</span>
<span class="n">training_args</span> <span class="o">=</span> <span class="n">TrainingArguments</span><span class="p">(</span>
<span class="n">output_dir</span><span class="o">=</span><span class="s2">"results"</span><span class="p">,</span>
<span class="n">num_train_epochs</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span>
<span class="n">per_device_train_batch_size</span><span class="o">=</span><span class="n">batch_size</span><span class="p">,</span>
<span class="n">disable_tqdm</span><span class="o">=</span><span class="kc">False</span><span class="p">,</span>
<span class="n">logging_steps</span><span class="o">=</span><span class="n">logging_steps</span><span class="p">)</span>
<span class="n">trainer</span> <span class="o">=</span> <span class="n">Trainer</span><span class="p">(</span>
<span class="n">model</span><span class="o">=</span><span class="n">model</span><span class="p">,</span>
<span class="n">args</span><span class="o">=</span><span class="n">training_args</span><span class="p">,</span>
<span class="n">train_dataset</span><span class="o">=</span><span class="n">cola_enc</span><span class="p">,</span>
<span class="n">tokenizer</span><span class="o">=</span><span class="n">tokenizer</span><span class="p">)</span>
<span class="n">trainer</span><span class="o">.</span><span class="n">train</span><span class="p">();</span>
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_html rendered_html output_subarea ">
<div>
<style>
/* Turns off some styling */
progress {
/* gets rid of default border in Firefox and Opera. */
border: none;
/* Needs to be in here for Safari polyfill so background images work as expected. */
background-size: auto;
}
</style>
<progress value="63" max="63" style="width:300px; height:20px; vertical-align: middle;"></progress>
[63/63 00:03, Epoch 1/1]
</div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: left;">
<th>Step</th>
<th>Training Loss</th>
</tr>
</thead>
<tbody>
<tr>
<td>62</td>
<td>0.630255</td>
</tr>
</tbody>
</table><p>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>By inspecting one of the training examples</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">cola_enc</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_text output_subarea output_execute_result">
<pre>{'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
'input_ids': [101,
2256,
2814,
2180,
1005,
1056,
4965,
2023,
4106,
1010,
2292,
2894,
1996,
2279,
2028,
2057,
16599,
1012,
102],
'label': 1}</pre>
</div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>it seems that we've lost our <code>sentence</code> and <code>idx</code> columns! However, by inspecting the <code>features</code> attribute</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">cola_enc</span><span class="o">.</span><span class="n">features</span>
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_text output_subarea output_execute_result">
<pre>{'attention_mask': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None),
'idx': Value(dtype='int32', id=None),
'input_ids': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None),
'label': ClassLabel(num_classes=2, names=['unacceptable', 'acceptable'], names_file=None, id=None),
'sentence': Value(dtype='string', id=None)}</pre>
</div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>we see that they are still present in the dataset. Applying our one-liner to restore them gives the desired result:</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">cola_enc</span><span class="o">.</span><span class="n">set_format</span><span class="p">(</span><span class="nb">type</span><span class="o">=</span><span class="n">cola_enc</span><span class="o">.</span><span class="n">format</span><span class="p">[</span><span class="s2">"type"</span><span class="p">],</span> <span class="n">columns</span><span class="o">=</span><span class="nb">list</span><span class="p">(</span><span class="n">cola_enc</span><span class="o">.</span><span class="n">features</span><span class="o">.</span><span class="n">keys</span><span class="p">()))</span>
<span class="n">cola_enc</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_text output_subarea output_execute_result">
<pre>{'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
'idx': 0,
'input_ids': [101,
2256,
2814,
2180,
1005,
1056,
4965,
2023,
4106,
1010,
2292,
2894,
1996,
2279,
2028,
2057,
16599,
1012,
102],
'label': 1,
'sentence': "Our friends won't buy this analysis, let alone the next one we propose."}</pre>
</div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p><div class="footnotes"><p id="fn-1">1. Proof positive that I only read documentation after some threshold of confusion.<a href="#fnref-1" class="footnote footnotes">↩</a></p></div></p>
</div>
</div>
</div>
</p></div></div></div></div></div></div>Weeknotes: Question answering with 🤗 transformers, mock interviews2021-01-10T00:00:00-06:002021-01-10T00:00:00-06:00https://lewtun.github.io/blog/weeknotes/nlp/huggingface/transformers/2021/01/10/wknotes-question-answering<!--
#################################################
### THIS FILE WAS AUTOGENERATED! DO NOT EDIT! ###
#################################################
# file to edit: _notebooks/2021-01-10-wknotes-question-answering.ipynb
-->
<div class="container" id="notebook-container">
<div class="cell border-box-sizing code_cell rendered">
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>During my PhD and postdoc, I kept detailed research notes that I would often revisit to reproduce a lengthy calculation or simply take stock of the progress I'd made on my projects.</p>
<p><img src="/blog/images/copied_from_nb/my_icons/physics.jpg" alt="" title="Good luck trying to remember what the colours mean one year later ..." /></p>
<p>For various reasons, I dropped this habit when I switched to industry<sup id="fnref-1" class="footnote-ref"><a href="#fn-1">1</a></sup> and nowadays find myself digging out code snippets or techniques from a tangle of Google Docs, Git repositories, and Markdown files that I've built up over the years.</p>
<p>To break this anti-pattern, I've decided to "work in public" as much as possible this year, mostly in the form of <a href="https://www.urbandictionary.com/define.php?term=TIL">TILs</a> and weeknotes. Here, I am drawing inspiration from the prolific <a href="https://twitter.com/simonw?s=20">Simon Willison</a>, whose <a href="https://simonwillison.net/">blog</a> meticulously documents the development of his open-source projects.<sup id="fnref-2" class="footnote-ref"><a href="#fn-2">2</a></sup></p>
<p>To that end, here are the first weeknotes of the year; hopefully they're not the last!</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h2 id="Question-answering">Question answering<a class="anchor-link" href="#Question-answering"> </a></h2><p>This week I've been doing a deep dive into extractive question answering as part of a book chapter I'm writing on compression methods for Transformers. Although I built a question answering PoC with BERT in the dark ages of 2019, I was curious to see how the implementation could be done in the <code>transformers</code> library, specifically with a custom <code>Trainer</code> class and running everything inside Jupyter notebooks.</p>
<p>Fortunately, <a href="https://twitter.com/GuggerSylvain?s=20">Sylvain Gugger</a> at HuggingFace had already implemented</p>
<ul>
<li>A <a href="https://github.com/huggingface/notebooks/blob/master/examples/question_answering.ipynb">tutorial</a> on fine-tuning language models for question answering, but without a custom <code>Trainer</code></li>
<li>A custom <code>QuestionAnsweringTrainer</code> as part of the <a href="https://github.com/huggingface/transformers/tree/master/examples/question-answering">question answering scripts</a> in <code>transformers</code></li>
</ul>
<p>so my warm-up task this week was to simply merge the two in a single notebook and fine-tune <code>bert-base-uncased</code> on SQuAD v1.</p>
<p>I implemented a <em>very</em> scrappy version that achieves this in my <code>transformerlab</code> repository, and the main lesson I learnt is that</p>
<blockquote><p>Dealing with context size is tricky for long documents</p>
</blockquote>
<p>Transformer models can only process a finite number of input tokens, a property usually referred to as the maximum context size. As described in Sylvain's tutorial, naive truncation of documents for question answering is problematic because</p>
<blockquote><p>removing part of the context might result in losing the answer we are looking for.</p>
</blockquote>
<p>The solution is to apply a <em>sliding window</em><sup id="fnref-3" class="footnote-ref"><a href="#fn-3">3</a></sup> to the input context, so that long contexts are split into <em>multiple</em> features. An example from the tutorial shows how this works by introducing two new hyperparameters <code>max_length</code> and <code>doc_stride</code> that control the degree of overlap (bold shows the overlapping region):</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<blockquote><p>[CLS] how many wins does the notre dame men's basketball team have? [SEP] the men's basketball team has over 1, 600 wins, one of only 12 schools who have reached that mark, and have appeared in 28 ncaa tournaments. former player austin carr holds the record for most points scored in a single game of the tournament with 61. although the team has never won the ncaa tournament, they were named by the helms athletic foundation as national champions twice. the team has orchestrated a number of upsets of number one ranked teams, the most notable of which was ending ucla's record 88 - game winning streak in 1974. the team has beaten an additional eight number - one teams, and those nine wins rank second, to ucla's 10, all - time in wins against the top team. the team plays in newly renovated purcell pavilion ( within the edmund p. joyce center ), which reopened for the beginning of the 2009 – 2010 season. the team is coached by mike brey, who, as of the 2014 – 15 season, his fifteenth at notre dame, has achieved a 332 - 165 record. in 2009 they were invited to the nit, where they advanced to the semifinals but were beaten by penn state who went on and beat baylor in the <em><strong>championship. the 2010 – 11 team concluded its regular season ranked number seven in the country, with a record of 25 – 5, brey's fifth straight 20 - win season, and a second - place finish in the big east. during the 2014 - 15 season, the team went 32 - 6 and won the acc conference tournament, later advancing to the elite 8, where the fighting irish lost on a missed buzzer - beater against then undefeated kentucky. led by nba draft picks jerian grant and pat connaughton, the fighting irish beat the eventual national champion duke blue devils twice during the season. the 32 wins were</strong></em> [SEP]</p>
<p>[CLS] how many wins does the notre dame men's basketball team have? [SEP] championship. the 2010 – 11 team concluded its regular season ranked number seven in the country, with a record of 25 – 5, brey's fifth straight 20 - win season, and a second - place finish in the big east. during the 2014 - 15 season, the team went 32 - 6 and won the acc conference tournament, later advancing to the elite 8, where the fighting irish lost on a missed buzzer - beater against then undefeated kentucky. led by nba draft picks jerian grant and pat connaughton, the fighting irish beat the eventual national champion duke blue devils twice during the season. the 32 wins were the most by the fighting irish team since 1908 - 09. [SEP]</p>
</blockquote>
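<p>To make the windowing concrete, here is a toy sketch on a plain list of token ids. The function name <code>sliding_windows</code> is hypothetical, and the sketch ignores the question and special tokens that the real tokenizer adds to every feature. I also assume that <code>doc_stride</code> counts the tokens shared by consecutive windows:</p>
<div class="highlight"><pre><span></span>def sliding_windows(tokens, max_length, doc_stride):
    """Split tokens into windows of at most max_length tokens,
    where consecutive windows share doc_stride tokens."""
    step = max_length - doc_stride  # how far each new window advances
    if step not in range(1, max_length + 1):
        raise ValueError("doc_stride must be non-negative and smaller than max_length")
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + max_length])
        if not tokens[start + max_length:]:  # nothing left beyond this window
            break
    return windows


tokens = list(range(10))  # pretend these are 10 context token ids
for window in sliding_windows(tokens, max_length=6, doc_stride=2):
    print(window)
# prints [0, 1, 2, 3, 4, 5] and then [4, 5, 6, 7, 8, 9]
</pre></div>
<p>In this toy case the two windows overlap on tokens 4 and 5, which plays the same role as the bold region in the example above.</p>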
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Remarkably, <code>transformers</code> supports this preprocessing logic out of the box, so one just has to specify a few arguments in the tokenizer:</p>
<div class="highlight"><pre><span></span><span class="n">tokenized_example</span> <span class="o">=</span> <span class="n">tokenizer</span><span class="p">(</span>
<span class="n">example</span><span class="p">[</span><span class="s2">"question"</span><span class="p">],</span>
<span class="n">example</span><span class="p">[</span><span class="s2">"context"</span><span class="p">],</span>
<span class="n">max_length</span><span class="o">=</span><span class="n">max_length</span><span class="p">,</span>
<span class="n">truncation</span><span class="o">=</span><span class="s2">"only_second"</span><span class="p">,</span>
<span class="n">return_overflowing_tokens</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span>
<span class="n">return_offsets_mapping</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span>
<span class="n">stride</span><span class="o">=</span><span class="n">doc_stride</span><span class="p">)</span>
</pre></div>
<p>One drawback of this approach is that it introduces significant complexity into the data preparation step:</p>
<ul>
<li>With multiple features per example, one needs to do some heavy wrangling to pick out the start and end positions of each answer. For example, the <code>postprocess_qa_predictions</code> function in Sylvain's tutorial is about 80 lines long, and breaking this down for readers is likely to distract from the main focus on compression methods.</li>
<li>We need slightly different logic for preprocessing the training and validation sets (see the <code>prepare_train_features</code> and <code>prepare_validation_features</code> functions).</li>
</ul>
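<p>To give a flavour of the first point, here is a simplified sketch of how one might label a single window during training. The function <code>label_window</code> and its arguments are hypothetical, all positions are token indices into the full context, and I assume that a window which does not fully contain the answer is pointed at the <code>[CLS]</code> position (index 0). The offset introduced by the question tokens is ignored to keep things short:</p>
<div class="highlight"><pre><span></span>def label_window(window_start, window_len, answer_start, answer_end, cls_index=0):
    """Return the (start, end) answer positions relative to one window,
    or the CLS index twice if the answer is not fully inside the window."""
    window = range(window_start, window_start + window_len)
    if answer_start in window and answer_end in window:
        return answer_start - window_start, answer_end - window_start
    return cls_index, cls_index


# Answer spans tokens 7 to 8 of the context; two windows of 6 tokens each
print(label_window(0, 6, 7, 8))  # (0, 0): answer is outside the first window
print(label_window(4, 6, 7, 8))  # (3, 4): answer is inside the second window
</pre></div>
<p>The real preprocessing also has to map character offsets to token indices and account for the question prefix, which is where most of the extra lines come from.</p>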
<p>Instead, I may opt for the simpler, but less rigorous approach of truncating the long examples. As shown in the <code>transformers</code> <a href="https://huggingface.co/transformers/custom_datasets.html#question-answering-with-squad-2-0">docs</a>, we'd only need to define a custom dataset</p>
<div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">torch</span>
<span class="k">class</span> <span class="nc">SquadDataset</span><span class="p">(</span><span class="n">torch</span><span class="o">.</span><span class="n">utils</span><span class="o">.</span><span class="n">data</span><span class="o">.</span><span class="n">Dataset</span><span class="p">):</span>
<span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">encodings</span><span class="p">):</span>
<span class="bp">self</span><span class="o">.</span><span class="n">encodings</span> <span class="o">=</span> <span class="n">encodings</span>
<span class="k">def</span> <span class="fm">__getitem__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">idx</span><span class="p">):</span>
<span class="k">return</span> <span class="p">{</span><span class="n">key</span><span class="p">:</span> <span class="n">torch</span><span class="o">.</span><span class="n">tensor</span><span class="p">(</span><span class="n">val</span><span class="p">[</span><span class="n">idx</span><span class="p">])</span> <span class="k">for</span> <span class="n">key</span><span class="p">,</span> <span class="n">val</span> <span class="ow">in</span> <span class="bp">self</span><span class="o">.</span><span class="n">encodings</span><span class="o">.</span><span class="n">items</span><span class="p">()}</span>
<span class="k">def</span> <span class="fm">__len__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="k">return</span> <span class="nb">len</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">encodings</span><span class="o">.</span><span class="n">input_ids</span><span class="p">)</span>
</pre></div>
<p>and then pass the encoding for the training and validation sets as follows:</p>
<div class="highlight"><pre><span></span><span class="n">train_encodings</span> <span class="o">=</span> <span class="n">tokenizer</span><span class="p">(</span><span class="n">train_contexts</span><span class="p">,</span> <span class="n">train_questions</span><span class="p">,</span> <span class="n">truncation</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">padding</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="n">val_encodings</span> <span class="o">=</span> <span class="n">tokenizer</span><span class="p">(</span><span class="n">val_contexts</span><span class="p">,</span> <span class="n">val_questions</span><span class="p">,</span> <span class="n">truncation</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">padding</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="n">train_dataset</span> <span class="o">=</span> <span class="n">SquadDataset</span><span class="p">(</span><span class="n">train_encodings</span><span class="p">)</span>
<span class="n">val_dataset</span> <span class="o">=</span> <span class="n">SquadDataset</span><span class="p">(</span><span class="n">val_encodings</span><span class="p">)</span>
</pre></div>
<p>From here we can just use the native <code>Trainer</code> in <code>transformers</code>, together with the <code>squad</code> metric from <code>datasets</code>. By looking at the distribution of question and context lengths, we can see that this simplification will only fail in a very small number of examples:</p>
<p><img src="/blog/images/copied_from_nb/my_icons/squad-lengths.png" alt="" /></p>
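<p>One quick way to put a number on this is to count the examples whose combined question and context length exceeds the model's limit. The sketch below uses made-up lengths rather than the real SQuAD statistics, and the helper name <code>fraction_truncated</code> is hypothetical. The result is only an upper bound on the failures, since truncation hurts only when the answer sits in the part that gets cut:</p>
<div class="highlight"><pre><span></span>from bisect import bisect_right


def fraction_truncated(total_lengths, max_length):
    """Fraction of examples whose token count is larger than max_length."""
    ordered = sorted(total_lengths)
    fits = bisect_right(ordered, max_length)  # examples with at most max_length tokens
    return (len(ordered) - fits) / len(ordered)


lengths = [120, 200, 384, 385, 600]  # made-up question + context token counts
print(fraction_truncated(lengths, max_length=384))  # 0.4
</pre></div>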
<p>Another alternative would be to adopt the "retriever-reader" architecture that I used in my PoC (where I split long documents into smaller paragraphs), but that introduces its own complexity that I'd like to avoid.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h2 id="Running-a-mock-interview">Running a mock interview<a class="anchor-link" href="#Running-a-mock-interview"> </a></h2><p>A friend of mine is applying for a research scientist position and we thought it would be fun to run a couple of mock interviews together. Since the position is likely to involve Transformers, I asked my friend a few GPT-related questions (e.g. how does the architecture differ from BERT and what is the difference between GPT / GPT-2 and GPT-3?), followed by a coding session to see how fast one could implement GPT from scratch. The goal was to approach a skeleton of <a href="https://karpathy.ai/">Andrej Karpathy's</a> excellent <a href="https://github.com/karpathy/minGPT">minGPT</a> implementation</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>
<center>
<div class="jekyll-twitter-plugin"><blockquote class="twitter-tweet"><p lang="en" dir="ltr">I wrote a minimal/educational GPT training library in PyTorch, am calling it minGPT as it is only around ~300 lines of code: <a href="https://t.co/79S9lShJRN">https://t.co/79S9lShJRN</a> +demos for addition and character-level language model. (quick weekend project, may contain sharp edges)</p>— Andrej Karpathy (@karpathy) <a href="https://twitter.com/karpathy/status/1295410274095095810?ref_src=twsrc%5Etfw">August 17, 2020</a></blockquote>
<script async="" src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
</div>
</center>
</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>and the experience taught me a few lessons:</p>
<ul>
<li>There's a significant difference between being a power-user of a library like <code>transformers</code> and deeply knowing how every layer, activation function, etc. in a deep neural architecture is put together. Running the interview reminded me that I should block out some time each week to hone the foundations of my machine learning knowledge.</li>
<li>Open-ended coding interviews like this are way more fun to conduct than the LeetCode / HackerRank problems one usually encounters in industry. To me, they resemble a pair-programming interaction that gives the interviewer a pretty good feel for what it would be like to work closely with the candidate. Something to remember the next time I'm interviewing people for a real job!</li>
</ul>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h2 id="Papers-this-week">Papers this week<a class="anchor-link" href="#Papers-this-week"> </a></h2><p>This week I've been mostly reading papers on compressing Transformers and how to improve few-shot learning <em>without</em> resorting to massive scaling:</p>
<ul>
<li><a href="https://arxiv.org/abs/1910.01108"><em>DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter</em></a> by Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf (2019)</li>
<li><a href="https://arxiv.org/abs/2010.13382"><em>FastFormers: Highly Efficient Transformer Models for Natural Language Understanding</em></a> by Young Jin Kim and Hany Hassan Awadalla (2020)</li>
<li><a href="https://arxiv.org/abs/2006.15315"><em>Uncertainty-aware Self-training for Text Classification with Few Labels</em></a> by Subhabrata Mukherjee and Ahmed Hassan Awadallah (2020)</li>
</ul>
<p>This week also coincided with the release of OpenAI's <a href="https://openai.com/blog/dall-e/">DALL-E</a> which, although light on implementation details, provided a fun interface to see how far you can push the limits of text-to-image generation:</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p><img src="/blog/images/copied_from_nb/my_icons/dalle.png" alt="" title="The DALL-E blog post has many examples involving Capybaras, which happen to be my favourite animal." /></p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h2 id="TIL-this-week">TIL this week<a class="anchor-link" href="#TIL-this-week"> </a></h2><ul>
<li><a href="https://lewtun.github.io/blog/til/2021/01/07/til-poll-api-with-bash.html">Polling a web service with bash and jq</a></li>
</ul>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p><div class="footnotes"><p id="fn-1">1. Mostly due to playing an insane game of "data science catch-up" at an early-stage startup.<a href="#fnref-1" class="footnote footnotes">↩</a></p></div></p>
<p><div class="footnotes"><p id="fn-2">2. Even down to the level of reviewing his own <a href="https://github.com/simonw/datasette/pull/1117">pull requests</a>!<a href="#fnref-2" class="footnote footnotes">↩</a></p></div></p>
<p><div class="footnotes"><p id="fn-3">3. We want a <em>sliding</em> window instead of a <em>tumbling</em> one because the answer might appear across the boundary of the two windows.<a href="#fnref-3" class="footnote footnotes">↩</a></p></div></p>
</div>
</div>
</div>
</div>Polling a web service with bash and jq2021-01-07T00:00:00-06:002021-01-07T00:00:00-06:00https://lewtun.github.io/blog/til/2021/01/07/til-poll-api-with-bash<!--
#################################################
### THIS FILE WAS AUTOGENERATED! DO NOT EDIT! ###
#################################################
# file to edit: _notebooks/2021-01-07-til-poll-api-with-bash.ipynb
-->
<div class="container" id="notebook-container">
<div class="cell border-box-sizing code_cell rendered">
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>As part of a book project with O'Reilly, I wanted to trigger and monitor the build process that converts <code>AsciiDoc</code> files into PDF, HTML, etc. O'Reilly has a production system called <a href="https://docs.atlas.oreilly.com/index.html">Atlas</a> that allows users to trigger builds through a UI, but I wanted to do this via their <a href="https://docs.atlas.oreilly.com/working_locally.html#atlas-api">JSON API</a> instead.</p>
<p>The <code>bash</code> script I ended up with was built on top of an elegant <a href="https://keestalkstech.com/2020/01/poll-json-endpoint-until-value-changes-with-bash-curl/">blog post</a> by <a href="https://www.linkedin.com/in/keescbakker/">Kees C. Bakker</a>:</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<div class="highlight"><pre><span></span><span class="ch">#! /bin/bash</span>
<span class="c1"># load atlas credentials</span>
. .env
<span class="nv">job_id</span><span class="o">=</span><span class="k">$(</span>curl -X POST <span class="se">\</span>
-F <span class="s2">"project=oreillymedia/secret-project"</span> <span class="se">\</span>
-F <span class="s2">"branch=main"</span> <span class="se">\</span>
-F <span class="s2">"formats=pdf"</span> <span class="se">\</span>
-F <span class="s2">"auth_token=</span><span class="nv">$auth_token</span><span class="s2">"</span> <span class="se">\</span>
https://atlas.oreilly.com/api/builds -s <span class="p">|</span> jq <span class="s2">".id"</span><span class="k">)</span><span class="p">;</span>
<span class="nb">printf</span> <span class="s2">"\nSent job to Atlas with ID </span><span class="nv">$job_id</span><span class="s2">\n"</span>
<span class="nv">build_url</span><span class="o">=</span><span class="s2">"https://atlas.oreilly.com/api/builds/</span><span class="nv">$job_id</span><span class="s2">?auth_token=</span><span class="nv">$auth_token</span><span class="s2">"</span>
<span class="nv">atlas_url</span><span class="o">=</span><span class="s2">"https://atlas.oreilly.com/oreillymedia/project-name"</span>
<span class="nv">interval_in_seconds</span><span class="o">=</span><span class="m">5</span>
<span class="nv">status_path</span><span class="o">=</span><span class="s2">".status[0].status"</span>
<span class="nv">download_path</span><span class="o">=</span><span class="s2">".status[0].download_url"</span>
<span class="nb">printf</span> <span class="s2">"\nPolling '</span><span class="si">${</span><span class="nv">build_url</span><span class="p">%</span><span class="se">\?</span><span class="p">*</span><span class="si">}</span><span class="s2">' every </span><span class="nv">$interval_in_seconds</span><span class="s2"> seconds, until 'complete'\n"</span>
<span class="k">while</span> true<span class="p">;</span>
<span class="k">do</span>
<span class="nv">status</span><span class="o">=</span><span class="k">$(</span>curl <span class="nv">$build_url</span> -s <span class="p">|</span> jq <span class="nv">$status_path</span><span class="k">)</span><span class="p">;</span>
<span class="nb">printf</span> <span class="s2">"\r</span><span class="k">$(</span>date +%H:%M:%S<span class="k">)</span><span class="s2">: </span><span class="nv">$status</span><span class="s2">"</span><span class="p">;</span>
<span class="k">if</span> <span class="o">[[</span> <span class="s2">"</span><span class="nv">$status</span><span class="s2">"</span> <span class="o">==</span> <span class="s2">"\"complete\""</span> <span class="o">||</span> <span class="s2">"</span><span class="nv">$status</span><span class="s2">"</span> <span class="o">==</span> <span class="s2">"\"failed\""</span> <span class="o">]]</span><span class="p">;</span> <span class="k">then</span>
<span class="k">if</span> <span class="o">[[</span> <span class="s2">"</span><span class="nv">$status</span><span class="s2">"</span> <span class="o">==</span> <span class="s2">"\"failed\""</span> <span class="o">]]</span><span class="p">;</span> <span class="k">then</span>
<span class="nb">printf</span> <span class="s2">"\n Build failed! View logs at </span><span class="nv">$atlas_url</span><span class="s2">"</span>
<span class="k">else</span>
<span class="nv">download_url</span><span class="o">=</span><span class="k">$(</span>curl <span class="nv">$build_url</span> -s <span class="p">|</span> jq <span class="nv">$download_path</span><span class="k">)</span><span class="p">;</span>
<span class="nb">printf</span> <span class="s2">"\nBuild complete! Download URL: </span><span class="nv">$download_url</span><span class="s2">"</span><span class="p">;</span>
<span class="k">fi</span>
break<span class="p">;</span>
<span class="k">fi</span><span class="p">;</span>
sleep <span class="nv">$interval_in_seconds</span><span class="p">;</span>
<span class="k">done</span>
</pre></div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>This script performs two tasks:</p>
<ul>
<li>Trigger a PDF build</li>
<li>Poll the endpoint until the build is "complete" or "failed", then print the download URL (or a pointer to the logs if the build failed)</li>
</ul>
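<p>For readers more at home in Python, the polling half of the script can be sketched roughly as follows. This is only an illustration of the pattern, not part of the original script: <code>poll_until</code>, the <code>max_attempts</code> safety limit, and the canned status sequence (including the "working" state) are all assumptions made for the example, and no real request is sent to Atlas.</p>

```python
import time
from typing import Any, Callable, Set


def poll_until(
    fetch_status: Callable[[], Any],
    terminal_states: Set[Any],
    interval_in_seconds: float = 5.0,
    max_attempts: int = 100,  # hypothetical safety limit; the bash loop runs forever
) -> Any:
    """Call fetch_status() repeatedly until it returns one of terminal_states."""
    for _ in range(max_attempts):
        status = fetch_status()
        if status in terminal_states:
            return status
        time.sleep(interval_in_seconds)
    raise TimeoutError(f"No terminal state after {max_attempts} attempts")


# Stand-in for the API: a canned sequence of statuses (made-up values)
fake_statuses = iter(["queued", "working", "complete"])

final_status = poll_until(
    lambda: next(fake_statuses),
    terminal_states={"complete", "failed"},
    interval_in_seconds=0.0,  # no waiting needed for the canned example
)
print(final_status)
```

<p>In a real setting, <code>fetch_status</code> would wrap the HTTP request and the JSON parsing, which keeps the loop itself easy to test in isolation.</p>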
<p>We use <code>jq</code> to pick out various fields of interest from the JSON responses, e.g. the build request</p>
<div class="highlight"><pre><span></span>curl -X POST <span class="se">\</span>
-F <span class="s2">"project=oreillymedia/secret-project"</span> <span class="se">\</span>
-F <span class="s2">"branch=main"</span> <span class="se">\</span>
-F <span class="s2">"formats=pdf"</span> <span class="se">\</span>
-F <span class="s2">"auth_token=</span><span class="nv">$auth_token</span><span class="s2">"</span> <span class="se">\</span>
https://atlas.oreilly.com/api/builds
</pre></div>
<p>returns something like</p>
<div class="highlight"><pre><span></span><span class="p">{</span>
<span class="nt">"id"</span><span class="p">:</span> <span class="mi">308588</span><span class="p">,</span>
<span class="nt">"branch"</span><span class="p">:</span> <span class="s2">"main"</span><span class="p">,</span>
<span class="nt">"created_at"</span><span class="p">:</span> <span class="s2">"2021-01-07T21:40:15.655Z"</span><span class="p">,</span>
<span class="nt">"project"</span><span class="p">:</span> <span class="s2">"oreillymedia/secret-project"</span><span class="p">,</span>
<span class="nt">"clone_url"</span><span class="p">:</span> <span class="kc">null</span><span class="p">,</span>
<span class="nt">"build_url"</span><span class="p">:</span> <span class="s2">"/api/builds/308588"</span><span class="p">,</span>
<span class="nt">"status"</span><span class="p">:</span> <span class="p">[</span>
<span class="p">{</span>
<span class="nt">"id"</span><span class="p">:</span> <span class="mi">363930</span><span class="p">,</span>
<span class="nt">"format"</span><span class="p">:</span> <span class="s2">"pdf"</span><span class="p">,</span>
<span class="nt">"download_url"</span><span class="p">:</span> <span class="kc">null</span><span class="p">,</span>
<span class="nt">"status"</span><span class="p">:</span> <span class="s2">"queued"</span><span class="p">,</span>
<span class="nt">"message"</span><span class="p">:</span> <span class="kc">null</span>
<span class="p">}</span>
<span class="p">],</span>
<span class="nt">"user"</span><span class="p">:</span> <span class="p">{</span>
<span class="nt">"id"</span><span class="p">:</span> <span class="mi">1234</span><span class="p">,</span>
<span class="nt">"nickname"</span><span class="p">:</span> <span class="s2">"foobar"</span><span class="p">,</span>
<span class="nt">"avatar"</span><span class="p">:</span> <span class="s2">"https://secure.gravatar.com/avatar/123456"</span>
<span class="p">}</span>
<span class="p">}</span>
</pre></div>
<p>so we can get the ID by just using <code>jq ".id"</code>. Ditto for the response from polling the <code>/api/builds/$job_id</code> endpoint, which returns JSON of the form:</p>
<div class="highlight"><pre><span></span><span class="p">{</span>
<span class="nt">"id"</span><span class="p">:</span> <span class="mi">308595</span><span class="p">,</span>
<span class="nt">"branch"</span><span class="p">:</span> <span class="s2">"main"</span><span class="p">,</span>
<span class="nt">"created_at"</span><span class="p">:</span> <span class="s2">"2021-01-07T21:55:04.381Z"</span><span class="p">,</span>
<span class="nt">"project"</span><span class="p">:</span> <span class="s2">"oreillymedia/secret-project"</span><span class="p">,</span>
<span class="nt">"clone_url"</span><span class="p">:</span> <span class="kc">null</span><span class="p">,</span>
<span class="nt">"build_url"</span><span class="p">:</span> <span class="s2">"/api/builds/308595"</span><span class="p">,</span>
<span class="nt">"status"</span><span class="p">:</span> <span class="p">[</span>
<span class="p">{</span>
<span class="nt">"id"</span><span class="p">:</span> <span class="mi">363938</span><span class="p">,</span>
<span class="nt">"format"</span><span class="p">:</span> <span class="s2">"pdf"</span><span class="p">,</span>
<span class="nt">"download_url"</span><span class="p">:</span> <span class="s2">"..."</span><span class="p">,</span>
<span class="nt">"status"</span><span class="p">:</span> <span class="s2">"complete"</span><span class="p">,</span>
<span class="nt">"message"</span><span class="p">:</span> <span class="p">{</span>
<span class="nt">"debug"</span><span class="p">:</span> <span class="p">[],</span>
<span class="nt">"info"</span><span class="p">:</span> <span class="p">[</span>
<span class="s2">"Building web PDF"</span><span class="p">,</span>
<span class="err">...</span>
<span class="p">],</span>
<span class="nt">"warn"</span><span class="p">:</span> <span class="p">[],</span>
<span class="nt">"error"</span><span class="p">:</span> <span class="p">[]</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="p">],</span>
<span class="nt">"user"</span><span class="p">:</span> <span class="p">{</span>
<span class="nt">"id"</span><span class="p">:</span> <span class="mi">1234</span><span class="p">,</span>
<span class="nt">"nickname"</span><span class="p">:</span> <span class="s2">"foobar"</span><span class="p">,</span>
<span class="nt">"avatar"</span><span class="p">:</span> <span class="s2">"https://secure.gravatar.com/avatar/123456"</span>
<span class="p">}</span>
<span class="p">}</span>
</pre></div>
<p>In this case the parent <code>status</code> field is an array, hence the need to pick out the first element with <code>.status[0]</code> before reading its <code>status</code> and <code>download_url</code> fields.</p>
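<p>To make the <code>jq</code> paths concrete, here is a small Python sketch that performs the same lookups on an abridged copy of the response above. The abridged JSON and the variable names are just for illustration.</p>

```python
import json

# Abridged copy of the build response shown above
response_text = """
{
  "id": 308595,
  "status": [
    {"id": 363938, "format": "pdf", "download_url": "...", "status": "complete"}
  ]
}
"""

build = json.loads(response_text)

job_id = build["id"]                          # jq ".id"
first_format = build["status"][0]             # jq ".status[0]"
status = first_format["status"]               # jq ".status[0].status"
download_url = first_format["download_url"]   # jq ".status[0].download_url"

print(job_id, status, download_url)
```

<p>One difference worth noting: by default <code>jq</code> prints string values with their surrounding double quotes, which is why the script compares against <code>"\"complete\""</code>. Passing <code>-r</code> to <code>jq</code> would print the raw string instead.</p>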
</div>
</div>
</div>
</div>Calculating named entity frequencies2021-01-02T00:00:00-06:002021-01-02T00:00:00-06:00https://lewtun.github.io/blog/til/nlp/huggingface/2021/01/02/til-counting-ner-tokens<!--
#################################################
### THIS FILE WAS AUTOGENERATED! DO NOT EDIT! ###
#################################################
# file to edit: _notebooks/2021-01-02-til-counting-ner-tokens.ipynb
-->
<div class="container" id="notebook-container">
<div class="cell border-box-sizing code_cell rendered">
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>For named entity recognition tasks, a handy measure of class imbalance is to calculate the frequency of named entities in the data. I wanted to do this with the <code>datasets</code> library for documents annotated in the <a href="https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)">"inside-outside-beginning" (IOB2) format</a>.</p>
<p>One problem I encountered was that <code>datasets</code> tends to represent the entities in terms of <em>label IDs</em></p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="kn">from</span> <span class="nn">datasets</span> <span class="kn">import</span> <span class="n">load_dataset</span>
<span class="n">conll</span> <span class="o">=</span> <span class="n">load_dataset</span><span class="p">(</span><span class="s2">"conll2003"</span><span class="p">)</span>
<span class="n">conll</span><span class="p">[</span><span class="s1">'train'</span><span class="p">][</span><span class="mi">0</span><span class="p">]</span>
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_text output_subarea output_execute_result">
<pre>{'chunk_tags': [11, 21, 11, 12, 21, 22, 11, 12, 0],
'id': '0',
'ner_tags': [3, 0, 7, 0, 0, 0, 7, 0, 0],
'pos_tags': [22, 42, 16, 21, 35, 37, 16, 21, 7],
'tokens': ['EU',
'rejects',
'German',
'call',
'to',
'boycott',
'British',
'lamb',
'.']}</pre>
</div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>so I created a simple function that makes use of the <code>Dataset.features</code> attribute and <code>ClassLabel.int2str</code> method to perform the mapping from ID to human-readable string:</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="kn">from</span> <span class="nn">datasets</span> <span class="kn">import</span> <span class="n">Dataset</span><span class="p">,</span> <span class="n">DatasetDict</span>
<span class="k">def</span> <span class="nf">create_tag_names</span><span class="p">(</span><span class="n">ds</span><span class="p">:</span> <span class="n">DatasetDict</span><span class="p">,</span> <span class="n">tags_col</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-></span> <span class="n">DatasetDict</span><span class="p">:</span>
    <span class="c1"># pick out the ClassLabel from the Sequence feature of the train split</span>
    <span class="n">tags</span> <span class="o">=</span> <span class="n">ds</span><span class="p">[</span><span class="s2">"train"</span><span class="p">]</span><span class="o">.</span><span class="n">features</span><span class="p">[</span><span class="n">tags_col</span><span class="p">]</span><span class="o">.</span><span class="n">feature</span>
    <span class="c1"># apply the ClassLabel.int2str method to each token</span>
    <span class="n">proc_fn</span> <span class="o">=</span> <span class="k">lambda</span> <span class="n">x</span> <span class="p">:</span> <span class="p">{</span><span class="sa">f</span><span class="s2">"</span><span class="si">{</span><span class="n">tags_col</span><span class="si">}</span><span class="s2">_str"</span><span class="p">:</span> <span class="p">[</span><span class="n">tags</span><span class="o">.</span><span class="n">int2str</span><span class="p">(</span><span class="n">idx</span><span class="p">)</span> <span class="k">for</span> <span class="n">idx</span> <span class="ow">in</span> <span class="n">x</span><span class="p">[</span><span class="n">tags_col</span><span class="p">]]}</span>
    <span class="k">return</span> <span class="n">ds</span><span class="o">.</span><span class="n">map</span><span class="p">(</span><span class="n">proc_fn</span><span class="p">)</span>
<span class="n">conll</span> <span class="o">=</span> <span class="n">create_tag_names</span><span class="p">(</span><span class="n">conll</span><span class="p">,</span> <span class="s1">'ner_tags'</span><span class="p">)</span>
<span class="n">conll</span><span class="p">[</span><span class="s1">'train'</span><span class="p">][</span><span class="mi">0</span><span class="p">]</span>
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_text output_subarea output_execute_result">
<pre>{'chunk_tags': [11, 21, 11, 12, 21, 22, 11, 12, 0],
'id': '0',
'ner_tags': [3, 0, 7, 0, 0, 0, 7, 0, 0],
'ner_tags_str': ['B-ORG', 'O', 'B-MISC', 'O', 'O', 'O', 'B-MISC', 'O', 'O'],
'pos_tags': [22, 42, 16, 21, 35, 37, 16, 21, 7],
'tokens': ['EU',
'rejects',
'German',
'call',
'to',
'boycott',
'British',
'lamb',
'.']}</pre>
</div>
</div>
</div>
</div>
</div>
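<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Conceptually, the ID-to-string mapping is just a lookup into an ordered list of label names. The sketch below mimics it in plain Python without <code>datasets</code>. The full label ordering is an assumption on my part: only the IDs 0, 3, and 7 are confirmed by the output above, and <code>ids_to_names</code> is a made-up helper.</p>

```python
# Label names in ID order. Assumed ordering for illustration; the output above
# only confirms 0 -> 'O', 3 -> 'B-ORG', and 7 -> 'B-MISC'.
NER_LABEL_NAMES = [
    "O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "B-MISC", "I-MISC",
]


def ids_to_names(tag_ids, label_names):
    """Map each integer label ID to its human-readable name."""
    return [label_names[idx] for idx in tag_ids]


ner_tags = [3, 0, 7, 0, 0, 0, 7, 0, 0]
ner_tags_str = ids_to_names(ner_tags, NER_LABEL_NAMES)
print(ner_tags_str)
```

</div>
</div>
</div>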
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>With some help from my <a href="https://twitter.com/lvwerra?s=20">partner-in-crime</a>, the final step was to iterate over each example, collect all the <em>B-</em> tags in a list (since the <em>I-</em> tags refer to the same entity), and then use a bit of <code>chain</code> magic to flatten the list of lists per split:</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>
<span class="kn">from</span> <span class="nn">itertools</span> <span class="kn">import</span> <span class="n">chain</span>
<span class="kn">from</span> <span class="nn">collections</span> <span class="kn">import</span> <span class="n">Counter</span>
<span class="kn">from</span> <span class="nn">datasets</span> <span class="kn">import</span> <span class="n">DatasetDict</span>
<span class="k">def</span> <span class="nf">calculate_tag_frequencies</span><span class="p">(</span><span class="n">ds</span><span class="p">:</span> <span class="n">DatasetDict</span><span class="p">,</span> <span class="n">tags_col</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-></span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">:</span>
    <span class="n">split2freqs</span> <span class="o">=</span> <span class="p">{}</span>
    <span class="k">for</span> <span class="n">split</span> <span class="ow">in</span> <span class="n">ds</span><span class="o">.</span><span class="n">keys</span><span class="p">():</span>
        <span class="n">tag_names</span> <span class="o">=</span> <span class="p">[]</span>
        <span class="k">for</span> <span class="n">row</span> <span class="ow">in</span> <span class="n">ds</span><span class="p">[</span><span class="n">split</span><span class="p">][</span><span class="n">tags_col</span><span class="p">]:</span>
            <span class="n">tag_names</span><span class="o">.</span><span class="n">append</span><span class="p">([</span><span class="n">tag</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s1">'-'</span><span class="p">)[</span><span class="mi">1</span><span class="p">]</span> <span class="k">for</span> <span class="n">tag</span> <span class="ow">in</span> <span class="n">row</span> <span class="k">if</span> <span class="n">tag</span><span class="o">.</span><span class="n">startswith</span><span class="p">(</span><span class="s2">"B"</span><span class="p">)])</span>
        <span class="c1"># chain.from_iterable(['ABC', 'DEF']) --> A B C D E F</span>
        <span class="n">split2freqs</span><span class="p">[</span><span class="n">split</span><span class="p">]</span> <span class="o">=</span> <span class="n">Counter</span><span class="p">(</span><span class="n">chain</span><span class="o">.</span><span class="n">from_iterable</span><span class="p">(</span><span class="n">tag_names</span><span class="p">))</span>
    <span class="k">return</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="o">.</span><span class="n">from_dict</span><span class="p">(</span><span class="n">split2freqs</span><span class="p">,</span> <span class="n">orient</span><span class="o">=</span><span class="s2">"index"</span><span class="p">)</span>
<span class="n">calculate_tag_frequencies</span><span class="p">(</span><span class="n">conll</span><span class="p">,</span> <span class="s1">'ner_tags_str'</span><span class="p">)</span>
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_html rendered_html output_subarea output_execute_result">
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>ORG</th>
<th>MISC</th>
<th>PER</th>
<th>LOC</th>
</tr>
</thead>
<tbody>
<tr>
<th>train</th>
<td>6321</td>
<td>3438</td>
<td>6600</td>
<td>7140</td>
</tr>
<tr>
<th>validation</th>
<td>1341</td>
<td>922</td>
<td>1842</td>
<td>1837</td>
</tr>
<tr>
<th>test</th>
<td>1661</td>
<td>702</td>
<td>1617</td>
<td>1668</td>
</tr>
</tbody>
</table>
</div>
</div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>As a sanity check, let's compare with Table 2 from the <a href="https://www.aclweb.org/anthology/W03-0419.pdf">CoNLL-2003 paper</a>:</p>
<p><img src="/blog/images/copied_from_nb/my_icons/conll.png" alt="" /></p>
<p>It works!</p>
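<p>To see why counting only the <em>B-</em> tags gives one count per entity, here is a toy version of the same logic on two made-up sentences, using nothing but the standard library. The tag sequences are invented for illustration.</p>

```python
from collections import Counter
from itertools import chain

# Two made-up sentences in IOB2 format
rows = [
    ["B-ORG", "O", "B-MISC", "O"],
    ["B-PER", "I-PER", "O", "B-LOC", "B-ORG", "I-ORG"],
]

# Keep one entry per entity: the B- tag marks its start, I- tags continue it
entities_per_row = [
    [tag.split("-")[1] for tag in row if tag.startswith("B-")] for row in rows
]

# Flatten the list of lists and count entity types
freqs = Counter(chain.from_iterable(entities_per_row))
print(freqs)
```

<p>The two-token person (<code>B-PER</code>, <code>I-PER</code>) and the two-token organisation (<code>B-ORG</code>, <code>I-ORG</code>) are each counted once, which is the behaviour we want.</p>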
</div>
</div>
</div>
</div>Using data collators for training and error analysis2021-01-01T00:00:00-06:002021-01-01T00:00:00-06:00https://lewtun.github.io/blog/til/nlp/huggingface/transformers/2021/01/01/til-data-collator<!--
#################################################
### THIS FILE WAS AUTOGENERATED! DO NOT EDIT! ###
#################################################
# file to edit: _notebooks/2021-01-01-til-data-collator.ipynb
-->
<div class="container" id="notebook-container">
<div class="cell border-box-sizing code_cell rendered">
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Recently, <a href="https://twitter.com/GuggerSylvain?s=20">Sylvain Gugger</a> from HuggingFace has created some nice tutorials on using <code>transformers</code> for <a href="https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/text_classification.ipynb">text classification</a> and <a href="https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/token_classification.ipynb#scrollTo=545PP3o8IrJV">named entity recognition</a>. One trick that caught my attention was the use of a <em>data collator</em> in the trainer, which automatically pads the model inputs in a batch to the length of the longest example. This bypasses the need to set a <em>global</em> maximum sequence length, and in practice leads to faster training since we perform fewer redundant computations on the padded tokens and attention masks.</p>
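<p>As a rough illustration of what padding to the longest example in a batch means, here is a plain-Python sketch. It is not the implementation used in <code>transformers</code>: <code>pad_to_longest</code>, the token IDs, and the choice of 0 as the padding ID are all assumptions for the example.</p>

```python
from typing import Dict, List


def pad_to_longest(
    batch: List[List[int]], pad_token_id: int = 0
) -> Dict[str, List[List[int]]]:
    """Pad every sequence to the length of the longest one in this batch."""
    max_len = max(len(ids) for ids in batch)
    input_ids, attention_mask = [], []
    for ids in batch:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_token_id] * n_pad)
        # 1 marks a real token, 0 marks padding
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Made-up token IDs for two sequences of different lengths
batch = [[101, 7592, 102], [101, 7592, 2088, 999, 102]]
padded = pad_to_longest(batch)
print(padded["input_ids"])
print(padded["attention_mask"])
```

<p>Here the batch is padded to length 5 rather than to a global maximum such as 512, which is where the savings come from.</p>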
<p>I wanted to use a data collator for both training <em>and</em> error analysis (e.g. by inspecting the top losses of the model). One problem: during training, each batch is collated on the fly, so how do I pad my inputs in subsequent <code>Dataset.map</code> operations?</p>
<p>For <em>sequence classification</em> tasks, the solution I ended up with was to simply grab the data collator from the trainer and use it in my post-processing functions:</p>
<div class="highlight"><pre><span></span><span class="n">data_collator</span> <span class="o">=</span> <span class="n">trainer</span><span class="o">.</span><span class="n">data_collator</span>
<span class="k">def</span> <span class="nf">processing_function</span><span class="p">(</span><span class="n">batch</span><span class="p">):</span>
    <span class="c1"># pad inputs</span>
    <span class="n">batch</span> <span class="o">=</span> <span class="n">data_collator</span><span class="p">(</span><span class="n">batch</span><span class="p">)</span>
    <span class="o">...</span>
    <span class="k">return</span> <span class="n">batch</span>
</pre></div>
<p>For <em>token classification</em> tasks, there is a dedicated <code>DataCollatorForTokenClassification</code> which expects a <code>list</code> of <code>dicts</code>, where each <code>dict</code> represents a single example in the dataset. Since a <code>Dataset</code> slice returns a <code>dict</code> of <code>lists</code>, we need two more lines to wrangle the data into the expected format:</p>
<div class="highlight"><pre><span></span><span class="kn">from</span> <span class="nn">transformers</span> <span class="kn">import</span> <span class="n">DataCollatorForTokenClassification</span>
<span class="n">data_collator</span> <span class="o">=</span> <span class="n">DataCollatorForTokenClassification</span><span class="p">(</span><span class="n">trainer</span><span class="o">.</span><span class="n">tokenizer</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">processing_function</span><span class="p">(</span><span class="n">batch</span><span class="p">):</span>
    <span class="c1"># convert dict of lists to list of dicts</span>
    <span class="n">features</span> <span class="o">=</span> <span class="p">[</span><span class="nb">dict</span><span class="p">(</span><span class="nb">zip</span><span class="p">(</span><span class="n">batch</span><span class="p">,</span> <span class="n">t</span><span class="p">))</span> <span class="k">for</span> <span class="n">t</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="o">*</span><span class="n">batch</span><span class="o">.</span><span class="n">values</span><span class="p">())]</span>
    <span class="c1"># pad inputs and labels</span>
    <span class="n">batch</span> <span class="o">=</span> <span class="n">data_collator</span><span class="p">(</span><span class="n">features</span><span class="p">)</span>
    <span class="o">...</span>
    <span class="k">return</span> <span class="n">batch</span>
</pre></div>
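<p>The dict-of-lists to list-of-dicts one-liner is a little dense, so here it is applied to a tiny made-up batch. The column names and values are invented for illustration.</p>

```python
# A made-up batch as returned by a Dataset slice: one list per column
batch = {
    "input_ids": [[101, 7, 102], [101, 7, 8, 102]],
    "labels": [[-100, 3, -100], [-100, 3, 0, -100]],
}

# zip(*batch.values()) walks the columns in lockstep, yielding one tuple per example;
# zipping each tuple with the column names rebuilds a dict for that example
features = [dict(zip(batch, t)) for t in zip(*batch.values())]

for example in features:
    print(example)
```

<p>Each element of <code>features</code> is now a single example with its own <code>input_ids</code> and <code>labels</code>, which is the shape the collator expects.</p>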
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>For an end-to-end example, let's grab 1,000 examples from the IMDB dataset:</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="kn">from</span> <span class="nn">datasets</span> <span class="kn">import</span> <span class="n">load_dataset</span>
<span class="n">imdb</span> <span class="o">=</span> <span class="p">(</span><span class="n">load_dataset</span><span class="p">(</span><span class="s1">'imdb'</span><span class="p">,</span> <span class="n">split</span><span class="o">=</span><span class="s1">'train'</span><span class="p">)</span>
<span class="o">.</span><span class="n">train_test_split</span><span class="p">(</span><span class="n">train_size</span><span class="o">=</span><span class="mi">800</span><span class="p">,</span> <span class="n">test_size</span><span class="o">=</span><span class="mi">200</span><span class="p">))</span>
<span class="n">imdb</span>
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_text output_subarea output_execute_result">
<pre>DatasetDict({
train: Dataset({
features: ['text', 'label'],
num_rows: 800
})
test: Dataset({
features: ['text', 'label'],
num_rows: 200
})
})</pre>
</div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Next, let's load a pretrained model and its corresponding tokenizer:</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="kn">import</span> <span class="nn">torch</span>
<span class="kn">from</span> <span class="nn">transformers</span> <span class="kn">import</span> <span class="n">AutoTokenizer</span><span class="p">,</span> <span class="n">AutoModelForSequenceClassification</span>
<span class="n">num_labels</span> <span class="o">=</span> <span class="mi">2</span>
<span class="n">model_name</span> <span class="o">=</span> <span class="s1">'distilbert-base-cased'</span>
<span class="n">device</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">device</span><span class="p">(</span><span class="s2">"cuda"</span> <span class="k">if</span> <span class="n">torch</span><span class="o">.</span><span class="n">cuda</span><span class="o">.</span><span class="n">is_available</span><span class="p">()</span> <span class="k">else</span> <span class="s2">"cpu"</span><span class="p">)</span>
<span class="n">tokenizer</span> <span class="o">=</span> <span class="n">AutoTokenizer</span><span class="o">.</span><span class="n">from_pretrained</span><span class="p">(</span><span class="n">model_name</span><span class="p">)</span>
<span class="n">model</span> <span class="o">=</span> <span class="p">(</span><span class="n">AutoModelForSequenceClassification</span>
<span class="o">.</span><span class="n">from_pretrained</span><span class="p">(</span><span class="n">model_name</span><span class="p">,</span> <span class="n">num_labels</span><span class="o">=</span><span class="n">num_labels</span><span class="p">)</span>
<span class="o">.</span><span class="n">to</span><span class="p">(</span><span class="n">device</span><span class="p">))</span>
</pre></div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Before fine-tuning the model, we need to tokenize and encode the dataset, so let's do that with a simple <code>Dataset.map</code> operation:</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="k">def</span> <span class="nf">tokenize_and_encode</span><span class="p">(</span><span class="n">batch</span><span class="p">):</span>
    <span class="k">return</span> <span class="n">tokenizer</span><span class="p">(</span><span class="n">batch</span><span class="p">[</span><span class="s1">'text'</span><span class="p">],</span> <span class="n">truncation</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="n">imdb_enc</span> <span class="o">=</span> <span class="n">imdb</span><span class="o">.</span><span class="n">map</span><span class="p">(</span><span class="n">tokenize_and_encode</span><span class="p">,</span> <span class="n">batched</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="n">imdb_enc</span>
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_text output_subarea output_execute_result">
<pre>DatasetDict({
train: Dataset({
features: ['attention_mask', 'input_ids', 'label', 'text'],
num_rows: 800
})
test: Dataset({
features: ['attention_mask', 'input_ids', 'label', 'text'],
num_rows: 200
})
})</pre>
</div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>The final step is to define the metrics:</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>
<span class="kn">from</span> <span class="nn">datasets</span> <span class="kn">import</span> <span class="n">load_metric</span>
<span class="n">accuracy_score</span> <span class="o">=</span> <span class="n">load_metric</span><span class="p">(</span><span class="s2">"accuracy"</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">compute_metrics</span><span class="p">(</span><span class="n">eval_pred</span><span class="p">):</span>
    <span class="n">predictions</span><span class="p">,</span> <span class="n">labels</span> <span class="o">=</span> <span class="n">eval_pred</span>
    <span class="n">predictions</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">argmax</span><span class="p">(</span><span class="n">predictions</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">accuracy_score</span><span class="o">.</span><span class="n">compute</span><span class="p">(</span><span class="n">predictions</span><span class="o">=</span><span class="n">predictions</span><span class="p">,</span> <span class="n">references</span><span class="o">=</span><span class="n">labels</span><span class="p">)</span>
</pre></div>
</div>
</div>
</div>
</div>
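<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>To make explicit what <code>compute_metrics</code> computes, here's a dependency-free sketch of the same argmax-then-accuracy logic. The helper names <code>argmax</code> and <code>accuracy</code> and the toy logits are hypothetical; in the notebook the work is done by <code>np.argmax</code> and the <code>accuracy</code> metric from <code>datasets</code>.</p>
</div>
</div>
</div>

```python
def argmax(row):
    """Index of the largest value in a list of class scores."""
    return max(range(len(row)), key=row.__getitem__)


def accuracy(logits, labels):
    """Fraction of rows whose highest-scoring class matches the label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}


# Four toy examples with two classes each (values invented for illustration)
toy_logits = [[2.0, -1.0], [0.1, 0.3], [-0.5, 1.5], [1.2, 0.4]]
toy_labels = [0, 1, 0, 0]

result = accuracy(toy_logits, toy_labels)
print(result)
```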
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Then we define the arguments for the trainer:</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="kn">from</span> <span class="nn">transformers</span> <span class="kn">import</span> <span class="n">TrainingArguments</span>
<span class="n">batch_size</span> <span class="o">=</span> <span class="mi">16</span>
<span class="n">logging_steps</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">imdb_enc</span><span class="p">[</span><span class="s1">'train'</span><span class="p">])</span> <span class="o">//</span> <span class="n">batch_size</span>
<span class="n">training_args</span> <span class="o">=</span> <span class="n">TrainingArguments</span><span class="p">(</span>
<span class="n">output_dir</span><span class="o">=</span><span class="s2">"results"</span><span class="p">,</span>
<span class="n">num_train_epochs</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span>
<span class="n">per_device_train_batch_size</span><span class="o">=</span><span class="n">batch_size</span><span class="p">,</span>
<span class="n">per_device_eval_batch_size</span><span class="o">=</span><span class="n">batch_size</span><span class="p">,</span>
<span class="n">evaluation_strategy</span><span class="o">=</span><span class="s2">"epoch"</span><span class="p">,</span>
<span class="n">disable_tqdm</span><span class="o">=</span><span class="kc">False</span><span class="p">,</span>
<span class="n">logging_steps</span><span class="o">=</span><span class="n">logging_steps</span><span class="p">)</span>
</pre></div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>and the trainer itself:</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p><div class="flash flash-warn">
<svg class="octicon octicon-zap" viewBox="0 0 16 16" version="1.1" width="16" height="16" aria-hidden="true"><path fill-rule="evenodd" d="M10.561 1.5a.016.016 0 00-.01.004L3.286 8.571A.25.25 0 003.462 9H6.75a.75.75 0 01.694 1.034l-1.713 4.188 6.982-6.793A.25.25 0 0012.538 7H9.25a.75.75 0 01-.683-1.06l2.008-4.418.003-.006a.02.02 0 00-.004-.009.02.02 0 00-.006-.006L10.56 1.5zM9.504.43a1.516 1.516 0 012.437 1.713L10.415 5.5h2.123c1.57 0 2.346 1.909 1.22 3.004l-7.34 7.142a1.25 1.25 0 01-.871.354h-.302a1.25 1.25 0 01-1.157-1.723L5.633 10.5H3.462c-1.57 0-2.346-1.909-1.22-3.004L9.503.429z"></path></svg>
<strong>Important: </strong>The trainer will remove <em>in-place</em> any dataset columns that the model's <code>forward</code> method does not accept (such as columns of <code>str</code> type), so in this example <code>imdb_enc</code> loses the <code>text</code> column.
</div></p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="kn">from</span> <span class="nn">transformers</span> <span class="kn">import</span> <span class="n">Trainer</span>
<span class="n">trainer</span> <span class="o">=</span> <span class="n">Trainer</span><span class="p">(</span>
<span class="n">model</span><span class="o">=</span><span class="n">model</span><span class="p">,</span>
<span class="n">args</span><span class="o">=</span><span class="n">training_args</span><span class="p">,</span>
<span class="n">compute_metrics</span><span class="o">=</span><span class="n">compute_metrics</span><span class="p">,</span>
<span class="n">train_dataset</span><span class="o">=</span><span class="n">imdb_enc</span><span class="p">[</span><span class="s1">'train'</span><span class="p">],</span>
<span class="n">eval_dataset</span><span class="o">=</span><span class="n">imdb_enc</span><span class="p">[</span><span class="s1">'test'</span><span class="p">],</span>
<span class="n">tokenizer</span><span class="o">=</span><span class="n">tokenizer</span><span class="p">)</span>
<span class="n">trainer</span><span class="o">.</span><span class="n">train</span><span class="p">();</span>
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_html rendered_html output_subarea ">
<div>
<style>
/* Turns off some styling */
progress {
/* gets rid of default border in Firefox and Opera. */
border: none;
/* Needs to be in here for Safari polyfill so background images work as expected. */
background-size: auto;
}
</style>
<progress value="50" max="50" style="width:300px; height:20px; vertical-align: middle;"></progress>
[50/50 00:32, Epoch 1/1]
</div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: left;">
<th>Epoch</th>
<th>Training Loss</th>
<th>Validation Loss</th>
<th>Accuracy</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>0.390015</td>
<td>0.328747</td>
<td>0.875000</td>
</tr>
</tbody>
</table><p>
</div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>By default, the <code>Trainer</code> class uses the simple <code>default_data_collator</code> to collate batches of dict-like objects, but by passing the tokenizer we get a <code>DataCollatorWithPadding</code> instead:</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">data_collator</span> <span class="o">=</span> <span class="n">trainer</span><span class="o">.</span><span class="n">data_collator</span>
<span class="nb">type</span><span class="p">(</span><span class="n">data_collator</span><span class="p">)</span>
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_text output_subarea output_execute_result">
<pre>transformers.data.data_collator.DataCollatorWithPadding</pre>
</div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>To see how this collator works, let's pass a dummy batch and observe that both the <code>input_ids</code> and <code>attention_mask</code> are padded as expected:</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">batch</span> <span class="o">=</span> <span class="p">{</span><span class="s1">'input_ids'</span><span class="p">:</span> <span class="p">[[</span><span class="mi">0</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span><span class="mi">2</span><span class="p">],</span> <span class="p">[</span><span class="mi">0</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span><span class="mi">2</span><span class="p">,</span><span class="mi">3</span><span class="p">,</span><span class="mi">4</span><span class="p">,</span><span class="mi">5</span><span class="p">]]}</span>
<span class="n">data_collator</span><span class="p">(</span><span class="n">batch</span><span class="p">)</span>
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_text output_subarea output_execute_result">
<pre>{'input_ids': tensor([[0, 1, 2, 0, 0, 0],
[0, 1, 2, 3, 4, 5]]), 'attention_mask': tensor([[1, 1, 1, 0, 0, 0],
[1, 1, 1, 1, 1, 1]])}</pre>
</div>
</div>
</div>
</div>
</div>
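<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Under the hood, dynamic padding boils down to something like the following pure-Python sketch. The function <code>pad_batch</code> is hypothetical and assumes right-padding with a pad token ID of 0, which matches the output above; the real collator takes both settings from the tokenizer and returns tensors rather than lists.</p>
</div>
</div>
</div>

```python
def pad_batch(input_ids, pad_token_id=0):
    """Right-pad each sequence to the longest one and build the matching mask."""
    max_len = max(len(ids) for ids in input_ids)
    padded, mask = [], []
    for ids in input_ids:
        n_pad = max_len - len(ids)
        padded.append(ids + [pad_token_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": padded, "attention_mask": mask}


padded_batch = pad_batch([[0, 1, 2], [0, 1, 2, 3, 4, 5]])
print(padded_batch)
```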
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Finally, we can calculate the loss per example with the following function:<sup id="fnref-1" class="footnote-ref"><a href="#fn-1">1</a></sup></p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="k">def</span> <span class="nf">loss_per_example</span><span class="p">(</span><span class="n">batch</span><span class="p">):</span>
    <span class="n">batch</span> <span class="o">=</span> <span class="n">data_collator</span><span class="p">(</span><span class="n">batch</span><span class="p">)</span>
    <span class="c1"># the collator already returns tensors, so just move them to the device</span>
    <span class="n">input_ids</span> <span class="o">=</span> <span class="n">batch</span><span class="p">[</span><span class="s2">"input_ids"</span><span class="p">]</span><span class="o">.</span><span class="n">to</span><span class="p">(</span><span class="n">device</span><span class="p">)</span>
    <span class="n">attention_mask</span> <span class="o">=</span> <span class="n">batch</span><span class="p">[</span><span class="s2">"attention_mask"</span><span class="p">]</span><span class="o">.</span><span class="n">to</span><span class="p">(</span><span class="n">device</span><span class="p">)</span>
    <span class="n">labels</span> <span class="o">=</span> <span class="n">batch</span><span class="p">[</span><span class="s2">"labels"</span><span class="p">]</span><span class="o">.</span><span class="n">to</span><span class="p">(</span><span class="n">device</span><span class="p">)</span>
    <span class="k">with</span> <span class="n">torch</span><span class="o">.</span><span class="n">no_grad</span><span class="p">():</span>
        <span class="n">output</span> <span class="o">=</span> <span class="n">model</span><span class="p">(</span><span class="n">input_ids</span><span class="p">,</span> <span class="n">attention_mask</span><span class="p">)</span>
        <span class="n">batch</span><span class="p">[</span><span class="s2">"predicted_label"</span><span class="p">]</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">argmax</span><span class="p">(</span><span class="n">output</span><span class="o">.</span><span class="n">logits</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
    <span class="n">loss</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">nn</span><span class="o">.</span><span class="n">functional</span><span class="o">.</span><span class="n">cross_entropy</span><span class="p">(</span>
        <span class="n">output</span><span class="o">.</span><span class="n">logits</span><span class="p">,</span> <span class="n">labels</span><span class="p">,</span> <span class="n">reduction</span><span class="o">=</span><span class="s2">"none"</span><span class="p">)</span>
    <span class="n">batch</span><span class="p">[</span><span class="s2">"loss"</span><span class="p">]</span> <span class="o">=</span> <span class="n">loss</span>
    <span class="c1"># datasets requires list of NumPy array data types</span>
    <span class="k">for</span> <span class="n">k</span><span class="p">,</span> <span class="n">v</span> <span class="ow">in</span> <span class="n">batch</span><span class="o">.</span><span class="n">items</span><span class="p">():</span>
        <span class="n">batch</span><span class="p">[</span><span class="n">k</span><span class="p">]</span> <span class="o">=</span> <span class="n">v</span><span class="o">.</span><span class="n">cpu</span><span class="p">()</span><span class="o">.</span><span class="n">numpy</span><span class="p">()</span>
    <span class="k">return</span> <span class="n">batch</span>
<span class="n">losses_ds</span> <span class="o">=</span> <span class="n">imdb_enc</span><span class="p">[</span><span class="s1">'test'</span><span class="p">]</span><span class="o">.</span><span class="n">map</span><span class="p">(</span>
<span class="n">loss_per_example</span><span class="p">,</span> <span class="n">batched</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">batch_size</span><span class="o">=</span><span class="n">batch_size</span><span class="p">)</span>
</pre></div>
</div>
</div>
</div>
</div>
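<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>The key ingredient in <code>loss_per_example</code> is <code>reduction="none"</code>, which returns one cross-entropy value per row instead of the batch mean. As a rough sketch of what that means, here's the same quantity computed by hand in pure Python. The function <code>cross_entropy_per_example</code> and the toy logits are hypothetical, and this is not how PyTorch implements it internally.</p>
</div>
</div>
</div>

```python
import math


def cross_entropy_per_example(logits, labels):
    """Return -log softmax(logits)[label] for every row, with no averaging."""
    losses = []
    for row, label in zip(logits, labels):
        # log-sum-exp with the max subtracted for numerical stability
        m = max(row)
        log_norm = m + math.log(sum(math.exp(v - m) for v in row))
        losses.append(log_norm - row[label])
    return losses


# Three toy examples: undecided, confidently right, confidently wrong
toy_logits = [[0.0, 0.0], [3.0, -3.0], [3.0, -3.0]]
toy_labels = [0, 0, 1]

losses = cross_entropy_per_example(toy_logits, toy_labels)
print(losses)
```

<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>The confidently wrong example gets by far the largest loss, which is why sorting by loss surfaces the examples worth inspecting.</p>
</div>
</div>
</div>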
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>It's then a simple matter to convert <code>losses_ds</code> to a <code>pandas.DataFrame</code> and sort by loss to find the examples where the model is most confused:</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>
<span class="n">pd</span><span class="o">.</span><span class="n">set_option</span><span class="p">(</span><span class="s2">"display.max_colwidth"</span><span class="p">,</span> <span class="kc">None</span><span class="p">)</span>
<span class="n">losses_ds</span><span class="o">.</span><span class="n">set_format</span><span class="p">(</span><span class="s1">'pandas'</span><span class="p">)</span>
<span class="n">losses_df</span> <span class="o">=</span> <span class="n">losses_ds</span><span class="p">[:][[</span><span class="s1">'label'</span><span class="p">,</span> <span class="s1">'predicted_label'</span><span class="p">,</span> <span class="s1">'loss'</span><span class="p">]]</span>
<span class="c1"># add the text column removed by the trainer</span>
<span class="n">losses_df</span><span class="p">[</span><span class="s1">'text'</span><span class="p">]</span> <span class="o">=</span> <span class="n">imdb</span><span class="p">[</span><span class="s1">'test'</span><span class="p">][</span><span class="s1">'text'</span><span class="p">]</span>
<span class="n">losses_df</span><span class="o">.</span><span class="n">sort_values</span><span class="p">(</span><span class="s2">"loss"</span><span class="p">,</span> <span class="n">ascending</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_html rendered_html output_subarea output_execute_result">
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>label</th>
<th>predicted_label</th>
<th>loss</th>
<th>text</th>
</tr>
</thead>
<tbody>
<tr>
<th>147</th>
<td>1</td>
<td>0</td>
<td>3.477502</td>
<td>Was the script more fitting for a 30 minute sitcom? Yes, but they still make it work! I thought the actors did a fantastic job with an otherwise bland script, especially Jack Black and Christopher Walken. Most people on the board seem to really hate this film. I personally can't see how that could be, but Envy is just one of those film that you either love it or hate it. Much like Napoleon Dynamite and every Leslie Neilsen movie ever made. You either think it's one of the worst movies ever made or one of the funniest. Don't avoid this movie because of the reviews. Watch it and see if you're one of the ones who really like it! If you do, I guarantee it's worth your money. If you don't like it... well, now you know.</td>
</tr>
<tr>
<th>143</th>
<td>1</td>
<td>0</td>
<td>2.925410</td>
<td>I would just like to say, that no matter how low budget the film is, it needs to be shown throughout this world the point to these movies. We don't read that much anymore, instead people want to see movies. Having this series out on DVD, has made me want to read the whole series, and want more. PLEASE MAKE ALL 8 MOVIES. Please don't change any of the characters either, it ruins the effect. Because I have grown to love the actors who have played the characters. PLEASE MAKE ALL 8 MOVIES. I want to see the message, and watch the message that these books and now movies are here to portray. We don't get that enough anymore. AWESOME JOB!!!</td>
</tr>
<tr>
<th>57</th>
<td>0</td>
<td>1</td>
<td>2.873445</td>
<td>I like Brad Pitt enormously. He is an actor with brains and wit, not to mention face, pectorals and all the rest. Since I saw him in "Thelma and Louise" a thought has been bothering me, who does he remind me of? "Troy" did it for me. He is the new Brigitte Bardot. The differences are obvious of course. Male, American etc but Brigitte Bardot comes to mind nonetheless. He is so beautiful that he is at his most effective when he plays against it. "Kalifornia" "12 Monkeys" "Fight Club" "Snatch" His self deprecating humor makes him human, almost accessible. Fortunately "Troy" will soon be forgotten. Only still photographs with Pitt, semi naked in ravishing sprint positions will decorate the walls of legions of salivating fans. Strange, "Das Boot" is one of the great films of the second part of the 20th Century. What is Wolfgang Petersen doing directing this? Well, I suppose it would be very hard to say no at the chance of working with the new Brigitte Bardot.</td>
</tr>
<tr>
<th>151</th>
<td>1</td>
<td>0</td>
<td>2.861723</td>
<td>SOLDIER is not as bad as many have made it out to be. I found the film to have some of the sacarstic, cynical humour like that in Paul Verhoven's Starship Troopers. The lack of dialogue and over the top action is deliberate and adds to the comic-book atmosphere.<br /><br />One particular trivia-bit stands out for me - Todd has the names of several space-war campaigns tattoo'd onto his chest and one of these battles is TANNHAUSER GATE. For the oblivious ones out there, Tannhauser Gate is mentioned in Roy Batty's elegiac last lines in Blade Runner. To imagine that Todd could have fought alongside android troops like Roy is mind boggling to say the least. Maybe script writer David Peoples was nostalgic?<br /><br />I'll give this one 3 out of 5.</td>
</tr>
<tr>
<th>53</th>
<td>0</td>
<td>1</td>
<td>2.849806</td>
<td>Reed Diamond plays a man suffering from amnesia who's been in a mental asylum for over a decade after he was found wondering the back roads with blood on his hands. The doctors want to test out an experimental new drug that'll return his lost memories if it works. But when the drugs give him hallucinations of a demon, he chooses to escape instead. While outside he befriends a young boy whose stepfather (Greg Grunberg) mistreats his mother, won't let her near the darkroom in his basement & acts suspicious in general.<br /><br />While the general 'mystery' of the film is a tad easy to identify way before it's revealed, I found Mr. Diamond's acting to be enthralling enough to keep my attention throughout. (In the interest of full disclosure, I've been a huge fan of his since Homicide and his brief, but extremely pivotal, role in The Shield up through Journeyman & Dollhouse) Not a great film nor a good one, but serviceable enough. Although I did like it better than the previous films that I've seen from Director/writer Michael Hurst (Room 6, Pumkinhead 4, Mansquito)<br /><br />Eye Candy: one fleeting pair of boobs in a hallucination<br /><br />My Grade: C-</td>
</tr>
</tbody>
</table>
</div>
</div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p><div class="footnotes"><p id="fn-1">1. The non-padded version of this function is adapted from an implementation by <a href="https://twitter.com/lvwerra?s=20">Leandro von Werra</a>.<a href="#fnref-1" class="footnote footnotes">↩</a></p></div></p>
</div>
</div>
</div>
</div>
</p></div></div></div></div></div></div>Highlights from ICML 20202020-07-31T00:00:00-05:002020-07-31T00:00:00-05:00https://lewtun.github.io/blog/research/conference/2020/07/31/icml2020<!--
#################################################
### THIS FILE WAS AUTOGENERATED! DO NOT EDIT! ###
#################################################
# file to edit: _notebooks/2020-07-31-icml2020.ipynb
-->
<div class="container" id="notebook-container">
<div class="cell border-box-sizing code_cell rendered">
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>This year I had the opportunity to attend the <a href="https://icml.cc/">International Conference on Machine Learning</a> (ICML) and decided to highlight some of the talks I found especially interesting. Although the conference was hosted entirely online, this provided two key benefits over attending in person:</p>
<ul>
<li><strong>Clash resolution:</strong> with <a href="https://syncedreview.com/2020/06/01/icml-2020-announces-accepted-papers/#:~:text=Conference%20Industry-,ICML%202020%20Announces%20Accepted%20Papers,the%20prestigious%20machine%20learning%20conference.">1,088 papers accepted</a>, it was inevitable that multiple talks of interest would clash in the timetable. Watching the pre-recorded presentations in my own time provided a simple solution, not to mention the ability to quickly switch to a new talk if desired.</li>
<li><strong>Better Q&A sessions:</strong> at large conferences it is not easy to get your questions answered directly after a talk, usually because the whole session is running overtime and the moderator wants to move on to the next speaker. With two (!) dedicated Q&A sessions for each talk, I found the discussions to be extremely insightful and much more personalised.</li>
</ul>
<p>Since I'm resigned to being in quarantine until 2050, I hope other virtual conferences will adopt a similar format. Conference highlights are below!</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h2 id="Transformers">Transformers<a class="anchor-link" href="#Transformers"> </a></h2>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h3 id="Generative-Pretraining-from-Pixels"><a href="https://proceedings.icml.cc/static/paper_files/icml/2020/6022-Paper.pdf">Generative Pretraining from Pixels</a><a class="anchor-link" href="#Generative-Pretraining-from-Pixels"> </a></h3>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p><img src="/blog/images/copied_from_nb/my_icons/igpt.png" alt="" /></p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p><em>Predicting the next pixel with a GPT-2 scale model yields high quality representations. The best representations lie in the middle of the network.</em></p>
<p>This talk showed that with enough compute, it is possible to adapt transformer architectures to images and achieve strong results in self-supervised learning benchmarks. Dubbed iGPT, this approach relies on a three-step process:</p>
<ol>
<li>Downsize the images, cluster the RGB pixel values to create a 9-bit colour map, and reshape to 1D.<sup id="fnref-1" class="footnote-ref"><a href="#fn-1">1</a></sup></li>
<li>Pre-train on either an autoregressive next pixel or masked pixel prediction task.</li>
<li>Evaluate the quality of the learned representations on downstream tasks.</li>
</ol>
<p>One surprising result of the linear probe<sup id="fnref-2" class="footnote-ref"><a href="#fn-2">2</a></sup> experiments is that representation quality tends to be highest in the <em>middle</em> of the network.</p>
<p>I think this work provides a compelling example of Sutton's <a href="http://incompleteideas.net/IncIdeas/BitterLesson.html">"bitter lesson"</a>:</p>
<blockquote><p>Early methods conceived of vision as searching for edges, or generalized cylinders, or in terms of SIFT features. But today all this is discarded. Modern deep-learning neural networks use only the notions of convolution and certain kinds of invariances, and perform much better.</p>
</blockquote>
<p>but takes it one step further by discarding knowledge of the 2D structure in images entirely!</p>
<p>Although the iGPT models are 2-30 times larger than ResNet-152, I expect it is only a matter of time before people find ways to make this approach more efficient. In the meantime, it's nice to see that the pre-trained models have been <a href="https://github.com/openai/image-gpt">open-sourced</a> and a <a href="https://github.com/huggingface/transformers/issues/5088">port</a> to HuggingFace's transformers library is already underway.</p>
</div>
</div>
</div>
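<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>To get a feel for the first step, here's a toy sketch of mapping RGB pixels to colour-map indices and flattening the result to 1D. A 9-bit colour map has $2^9 = 512$ entries; the sketch uses a hypothetical 4-colour palette and a 2x2 image instead, and the helper names are my own rather than anything from the iGPT codebase.</p>
</div>
</div>
</div>

```python
def nearest_colour(pixel, palette):
    """Index of the palette entry closest to the pixel in squared RGB distance."""
    return min(
        range(len(palette)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(pixel, palette[i])),
    )


def image_to_sequence(image, palette):
    """Flatten a 2D grid of RGB pixels into a 1D list of palette indices (raster order)."""
    return [nearest_colour(pixel, palette) for row in image for pixel in row]


# Hypothetical palette: black, white, red, blue
palette = [(0, 0, 0), (255, 255, 255), (255, 0, 0), (0, 0, 255)]
# Hypothetical 2x2 image with slightly noisy colours
image = [
    [(10, 5, 0), (250, 240, 245)],
    [(200, 30, 20), (5, 10, 220)],
]

tokens = image_to_sequence(image, palette)
print(tokens)
```

<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>The resulting sequence of integer tokens plays the same role as word-piece IDs in a language model, so the next-pixel and masked-pixel objectives carry over directly.</p>
</div>
</div>
</div>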
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h3 id="Retrieval-Augmented-Language-Model-Pre-Training"><a href="https://proceedings.icml.cc/static/paper_files/icml/2020/3102-Paper.pdf">Retrieval Augmented Language Model Pre-Training</a><a class="anchor-link" href="#Retrieval-Augmented-Language-Model-Pre-Training"> </a></h3>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p><img src="/blog/images/copied_from_nb/my_icons/realm.png" alt="" /></p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p><em>Augmenting language models with knowledge retrieval sets a new benchmark for open-domain question answering.</em></p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>I liked this talk a lot because it takes a non-trivial step towards integrating world knowledge into language models and addresses Gary Marcus' <a href="https://thegradient.pub/gpt2-and-the-nature-of-intelligence/">common complaint</a> that data and compute aren't enough to produce Real Intelligence™.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>To integrate knowledge into language model pretraining, this talk proposes adding a text retriever that is <em>learned</em> during the training process. Unsurprisingly, this introduces a major computational challenge because the conditional probability now involves a sum over <em>all</em> documents in a corpus $\mathcal{Z}$:</p>
<p>
$$ p(y|x) = \sum_{z\in \mathcal{Z}} p(y|x,z)\,p(z|x)\,.$$
</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>To deal with this, the authors compute an embedding for every document in the corpus and then use <a href="https://ai.googleblog.com/2020/07/announcing-scann-efficient-vector.html">Maximum Inner Product Search</a> algorithms to find the approximate top $k$ documents. The result is a hybrid model that significantly outperforms other approaches in open-domain question answering.</p>
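<p>To make the retrieve-then-marginalise idea concrete, here is a minimal NumPy sketch. It is not the REALM implementation: the function names, the random embeddings and the stand-in reader probabilities are all hypothetical, and the exact inner-product search below is what an approximate MIPS index would replace in practice.</p>

```python
import numpy as np


def top_k_documents(query_emb, doc_embs, k):
    """Exact maximum inner product search; returns the k best documents, best first."""
    scores = doc_embs @ query_emb              # one inner product per document
    top = np.argpartition(-scores, k - 1)[:k]  # unordered top-k in linear time
    top = top[np.argsort(-scores[top])]        # sort only the k survivors
    return top, scores[top]


def marginal_likelihood(reader_probs, retrieval_scores):
    """Approximate the sum over the whole corpus by a sum over the retrieved documents."""
    shifted = retrieval_scores - retrieval_scores.max()      # numerically stable softmax
    retrieval_probs = np.exp(shifted) / np.exp(shifted).sum()
    return float(np.sum(reader_probs * retrieval_probs))


rng = np.random.default_rng(0)
doc_embs = rng.normal(size=(1000, 16))   # hypothetical corpus of 1000 document embeddings
query_emb = rng.normal(size=16)          # hypothetical query embedding

top_idx, top_scores = top_k_documents(query_emb, doc_embs, k=5)
reader_probs = rng.uniform(size=5)       # stand-in for p(y | x, z) from a reader model
p_y_given_x = marginal_likelihood(reader_probs, top_scores)
```

<p>The softmax over the top $k$ scores plays the role of the retrieval distribution, so the expensive sum over $\mathcal{Z}$ collapses to $k$ terms.</p>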
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h3 id="Transformers-are-RNNs:-Fast-Autoregressive-Transformers-with-Linear-Attention"><a href="https://proceedings.icml.cc/static/paper_files/icml/2020/2935-Paper.pdf">Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention</a><a class="anchor-link" href="#Transformers-are-RNNs:-Fast-Autoregressive-Transformers-with-Linear-Attention"> </a></h3>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p><img src="/blog/images/copied_from_nb/my_icons/transformers-are-rnns.png" alt="" /></p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p><em>A clever choice of kernel reduces the computational complexity of attention from $O(N^2)$ to $O(N)$. Generate images 4000x faster than vanilla transformers :fire:.</em></p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>It's refreshing to see a transformer talk that isn't about using a "bonfire worth of GPU-TPU-neuromorphic wafer scale silicon"<sup id="fnref-4" class="footnote-ref"><a href="#fn-4">4</a></sup> to break NLP benchmarks. This talk observes that the main bottleneck in vanilla transformer models is the softmax attention computation</p>
<p>
$$ V' = \mathrm{softmax} \left(\frac{QK^T}{\sqrt{D}} \right) V $$
</p>
<p>whose time and space complexity is $O(N^2)$ for sequence length $N$. To get around this, the authors first use a similarity function to obtain a <em>generalised</em> form of self-attention</p>
<p>
$$ V_i' = \frac{\sum_j \mathrm{sim}(Q_i, K_j)V_j}{\sum_j \mathrm{sim}(Q_i, K_j)} $$
</p>
<p>which can be simplified via a choice of kernel and matrix associativity:</p>
<p>
$$V_i' = \frac{\phi(Q_i)^T\sum_j\phi(K_j)V_j^T}{\phi(Q_i)^T\sum_j\phi(K_j)}\,. $$
</p>
<p>The result is a self-attention step that is $O(N)$ because the sums in the above expression can be computed once and reused for every query. In practice, this turns out to be especially powerful for inference, with speed-ups of 4000x reported in the talk!</p>
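<p>The associativity trick is easy to check numerically. The sketch below is my own illustration rather than the authors' code: it assumes the feature map $\phi(x) = \mathrm{elu}(x) + 1$ (the post above doesn't name the kernel, so treat that choice as an assumption) and compares the $O(N^2)$ form against the $O(N)$ form on random inputs.</p>

```python
import numpy as np


def feature_map(x):
    """phi(x) = elu(x) + 1, a strictly positive feature map (assumed for illustration)."""
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))


def quadratic_attention(Q, K, V):
    """Generalised attention with sim(q, k) = phi(q)^T phi(k); builds an N x N matrix."""
    sim = feature_map(Q) @ feature_map(K).T           # shape (N, N)
    return (sim @ V) / sim.sum(axis=1, keepdims=True)


def linear_attention(Q, K, V):
    """Same result, but the sums over keys are computed once and shared by all queries."""
    phi_Q, phi_K = feature_map(Q), feature_map(K)
    kv = phi_K.T @ V                                  # (D, M): sum_j phi(K_j) V_j^T
    k_sum = phi_K.sum(axis=0)                         # (D,):   sum_j phi(K_j)
    return (phi_Q @ kv) / (phi_Q @ k_sum)[:, None]


rng = np.random.default_rng(0)
N, D, M = 128, 16, 32                                 # sequence length, key dim, value dim
Q = rng.normal(size=(N, D))
K = rng.normal(size=(N, D))
V = rng.normal(size=(N, M))

out_quadratic = quadratic_attention(Q, K, V)
out_linear = linear_attention(Q, K, V)
```

<p>Both functions return the same array up to floating-point error, but only the first one ever materialises an $N \times N$ matrix.</p>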
<p>The authors go on to show that their formulation can also be used to express transformers as RNNs, which might be an interesting way to explore the <a href="https://mostafadehghani.com/2019/05/05/universal-transformers/">shortcomings</a> of these large language models.</p>
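<p>To see the RNN view in action, here is a small sketch of causal attention written as a recurrence with a fixed-size state. As before this is an illustration under assumptions (the feature map $\phi(x) = \mathrm{elu}(x) + 1$ and all names are mine), checked against an explicitly masked reference.</p>

```python
import numpy as np


def feature_map(x):
    """phi(x) = elu(x) + 1, assumed here for illustration."""
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))


def causal_attention_rnn(Q, K, V):
    """Process one token at a time, carrying a fixed-size state like an RNN."""
    D, M = Q.shape[1], V.shape[1]
    S = np.zeros((D, M))   # running sum of phi(K_j) V_j^T over the tokens seen so far
    z = np.zeros(D)        # running sum of phi(K_j), used as the normaliser
    outputs = []
    for q, k, v in zip(feature_map(Q), feature_map(K), V):
        S += np.outer(k, v)
        z += k
        outputs.append((q @ S) / (q @ z))
    return np.stack(outputs)


def causal_attention_masked(Q, K, V):
    """Reference: the same computation with an explicit lower-triangular N x N mask."""
    sim = np.tril(feature_map(Q) @ feature_map(K).T)
    return (sim @ V) / sim.sum(axis=1, keepdims=True)


rng = np.random.default_rng(1)
Q = rng.normal(size=(64, 8))
K = rng.normal(size=(64, 8))
V = rng.normal(size=(64, 4))

out_rnn = causal_attention_rnn(Q, K, V)
out_masked = causal_attention_masked(Q, K, V)
```

<p>The state <code>(S, z)</code> has the same size no matter how many tokens have been consumed, which is why generation costs constant time and memory per step.</p>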
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h3 id="XTREME:-A-Massively-Multilingual-Multi-task-Benchmark-for-Evaluating-Cross-lingual-Generalisation"><a href="https://proceedings.icml.cc/static/paper_files/icml/2020/4220-Paper.pdf">XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalisation</a><a class="anchor-link" href="#XTREME:-A-Massively-Multilingual-Multi-task-Benchmark-for-Evaluating-Cross-lingual-Generalisation"> </a></h3>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p><img src="/blog/images/copied_from_nb/my_icons/xtreme.png" alt="" title="Image credit: https://ai.googleblog.com/2020/04/xtreme-massively-multilingual-multi.html" /></p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p><em>A new benchmark to test zero-shot cross-lingual transfer from English to 39 diverse languages.</em></p>
<p>In this talk, the authors introduce the <a href="https://sites.research.google/xtreme">XTREME benchmark</a> to evaluate the ability of multilingual representations to generalise across 40 languages and 9 tasks. To evaluate a model in XTREME, the main idea is to follow a three-stage recipe:</p>
<ol>
<li>Pre-train on a large corpus of multilingual text.</li>
<li>Fine-tune on English data for each task.</li>
<li>Evaluate the model on <em>zero-shot transfer</em> performance, e.g. evaluate the accuracy on a German text classification task.</li>
</ol>
<p>English is chosen for fine-tuning because it's the language with the most labelled data, and the authors employ a neat trick using Google Translate to generate proxy test sets for the tasks where a pre-existing translation does not exist.</p>
<p>Although not strictly about Transformers, the baseline models for this benchmark are all variants of the Transformer architecture, and the authors find that <a href="https://arxiv.org/abs/1911.02116">XLM-R</a> achieves the best zero-shot transfer performance across all languages in each task. What I especially like about XTREME is that the tasks are designed to be trainable on a single GPU for less than a day. This should make it possible for research labs with tight budgets to create competitive models, where the gains in performance are likely to come from architectural design rather than simply scaling up the compute.</p>
<p>I'm excited about this benchmark because I expect it will produce models that have a direct impact on my professional work in Switzerland. With <a href="https://en.wikipedia.org/wiki/Languages_of_Switzerland">four national languages</a> and a smattering of English, building natural language applications that serve the whole population is a constant challenge.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h2 id="Time-series">Time series<a class="anchor-link" href="#Time-series"> </a></h2>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h3 id="Set-Functions-for-Time-Series"><a href="https://proceedings.icml.cc/static/paper_files/icml/2020/4750-Paper.pdf">Set Functions for Time Series</a><a class="anchor-link" href="#Set-Functions-for-Time-Series"> </a></h3>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p><img src="/blog/images/copied_from_nb/my_icons/seft.png" alt="" /></p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p><em>High-performance classification for multivariate, irregularly sampled time series.</em></p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Time series seems to be the neglected child of machine learning research, so I was excited to see a talk that combines a lot of cool ideas like <a href="https://arxiv.org/abs/1703.06114">Deep Sets</a>, attention, and positional encodings in a new architecture. The motivation for this work is based on the observation that:</p>
<ul>
<li>Imputation techniques for sparse or irregularly sampled time series introduce bias or don't make sense at all.<sup id="fnref-5" class="footnote-ref"><a href="#fn-5">5</a></sup> </li>
<li>Many time series of practical interest are multivariate in nature, and often with <em>unaligned</em> measurements.</li>
</ul>
<p>The authors note that for time series classification tasks, the <em>order</em> of input measurements is not important and thus one can reframe the problem as classifying a <em>set</em> of observations. By representing each observation as a tuple $(t_i, z_i, m_i)$ of timestamp $t_i$, observation $z_i$ and indicator $m_i$, an entire time series can be written as</p>
<p>
$$\mathcal{S} = \{(t_1,z_1,m_1), \ldots , (t_M, z_M, m_M) \}$$
</p>
<p>The goal is then to learn a function $f: \mathcal{S} \to \mathbb{R}^C$, which the authors do via the Deep Sets approach to obtain a highly scalable architecture. One aspect I especially liked in this talk is the use of attention to visualise which observations contributed to the model output.</p>
<p><img src="/blog/images/copied_from_nb/my_icons/seft-attention.png" alt="" /></p>
<p>In industry it is quite common for domain experts to have a different mental model of how to interpret the predictions from your model, and visualisations like these could be really handy as a common discussion point. I'm quite excited to see if I can use this approach to tackle some thorny time series problems at work!</p>
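<p>The set-based framing above can be sketched in a few lines of NumPy. This is a toy under stated assumptions, not the paper's architecture: I treat $m_i$ as an integer channel index, use untrained random weights, and swap in plain mean pooling where the authors use attention-based aggregation. All names are hypothetical.</p>

```python
import numpy as np


def time_encoding(t, dim=8, max_scale=1000.0):
    """Sinusoidal encoding of timestamps, in the spirit of Transformer positional encodings."""
    freqs = max_scale ** (-np.arange(dim // 2) / (dim // 2))
    angles = np.outer(t, freqs)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)


def set_classifier(t, z, m, params, n_channels):
    """f(S) = rho(mean_i h(t_i, z_i, m_i)): encode each observation, pool, then classify."""
    one_hot = np.eye(n_channels)[m]                  # which variable was measured
    x = np.concatenate([time_encoding(t), z[:, None], one_hot], axis=1)
    h = np.tanh(x @ params["W_h"])                   # per-observation encoder h
    pooled = h.mean(axis=0)                          # permutation-invariant pooling
    return pooled @ params["W_rho"]                  # logits in R^C


rng = np.random.default_rng(0)
n_obs, n_channels, n_classes, hidden = 20, 3, 2, 16
t = np.sort(rng.uniform(0, 48, size=n_obs))          # irregular timestamps, e.g. in hours
z = rng.normal(size=n_obs)                           # observed values
m = rng.integers(0, n_channels, size=n_obs)          # channel index of each observation
params = {
    "W_h": rng.normal(size=(8 + 1 + n_channels, hidden)),
    "W_rho": rng.normal(size=(hidden, n_classes)),
}

logits = set_classifier(t, z, m, params, n_channels)

# Shuffling the observations must not change the output
perm = rng.permutation(n_obs)
logits_shuffled = set_classifier(t[perm], z[perm], m[perm], params, n_channels)
```

<p>Because the pooling step ignores ordering, no imputation or alignment of the channels is needed: each measurement simply contributes one element to the set.</p>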
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h3 id="Interpretable,-Multidimensional,-Multimodal-Anomaly-Detection-with-Negative-Sampling-for-Detection-of-Device-Failure"><a href="https://proceedings.icml.cc/static/paper_files/icml/2020/2557-Paper.pdf">Interpretable, Multidimensional, Multimodal Anomaly Detection with Negative Sampling for Detection of Device Failure</a><a class="anchor-link" href="#Interpretable,-Multidimensional,-Multimodal-Anomaly-Detection-with-Negative-Sampling-for-Detection-of-Device-Failure"> </a></h3>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p><img src="/blog/images/copied_from_nb/my_icons/anomaly-detection.png" alt="" /></p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p><em>A new unsupervised anomaly detection algorithm for IoT devices.</em></p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>This talk proposes a new technique to distinguish "normal" from "abnormal" events in streams of telemetry data from IoT devices. Like almost every real-world anomaly detection problem, one rarely has training data with labelled anomalies.<sup id="fnref-6" class="footnote-ref"><a href="#fn-6">6</a></sup></p>
<p>The main novelty in this talk is a method to deal with the lack of labels by framing the problem as a binary classification task, where one class contains <em>positive</em> (mostly "normal") samples while the other contains <em>negative</em> samples that are supposed to represent the space of anomalies. A sample ratio parameter $r_s$ controls the ratio of negative to positive sample sizes and acts as a tunable hyperparameter.</p>
<p>Although this method will generate false positive and false negative labelling errors, the author notes that the former are rare (by definition) and the latter decay exponentially for high-dimensional time series. Once the "labelled" dataset is created, it is then a simple matter to train a classifier and the talk notes that both neural nets and random forests perform comparably well.</p>
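<p>Here is a rough sketch of how such a "labelled" dataset could be assembled. I'm assuming the negatives are drawn uniformly from a slightly padded bounding box around the observed data; the function name, the <code>margin</code> parameter and the synthetic telemetry are hypothetical and not taken from the paper's code.</p>

```python
import numpy as np


def build_labelled_dataset(positives, sample_ratio, rng, margin=0.1):
    """Label observed data as 1 and uniformly sampled points as 0."""
    lower, upper = positives.min(axis=0), positives.max(axis=0)
    pad = margin * (upper - lower)                   # widen the box slightly per dimension
    n_neg = int(sample_ratio * len(positives))       # sample_ratio plays the role of r_s
    negatives = rng.uniform(lower - pad, upper + pad, size=(n_neg, positives.shape[1]))
    X = np.vstack([positives, negatives])
    y = np.concatenate([np.ones(len(positives)), np.zeros(n_neg)])
    return X, y


rng = np.random.default_rng(0)
telemetry = rng.normal(size=(500, 4))                # stand-in for observed device readings
X, y = build_labelled_dataset(telemetry, sample_ratio=2.0, rng=rng)
```

<p>Any off-the-shelf binary classifier can then be trained on <code>(X, y)</code>, with its predicted probability of the negative class serving as an anomaly score.</p>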
<p>One really neat aspect of this work is that it also introduces a novel way to interpret anomalies for root-cause analysis. The aim here is to figure out which dimensions contribute most to an anomaly score and the talk proposes a method based on <em><a href="https://www.youtube.com/watch?v=iVSIFm0UN9I">integrated gradients</a></em>. Here the basic idea is to identify which dimensions of the time series must be changed to transform an anomalous point into a normal one.</p>
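<p>For intuition, integrated gradients can be sketched on a toy model whose gradient is known in closed form. Everything here is an assumption for illustration: a logistic "anomaly score" with hand-picked weights, a baseline standing in for a nearby normal point, and a midpoint-rule approximation of the path integral.</p>

```python
import numpy as np


def anomaly_score(x, w, b):
    """Toy differentiable classifier: probability that x is anomalous."""
    return 1.0 / (1.0 + np.exp(-(x @ w + b)))


def anomaly_score_grad(x, w, b):
    """Analytic gradient of the logistic score with respect to the input x."""
    s = anomaly_score(x, w, b)
    return s * (1.0 - s) * w


def integrated_gradients(x, baseline, w, b, steps=200):
    """Midpoint-rule approximation of the path integral from baseline to x."""
    alphas = (np.arange(steps) + 0.5) / steps
    path = [baseline + a * (x - baseline) for a in alphas]
    grads = np.stack([anomaly_score_grad(p, w, b) for p in path])
    return (x - baseline) * grads.mean(axis=0)


w = np.array([2.0, -1.0, 0.5, 0.0])       # hypothetical model weights
b = -1.0
baseline = np.zeros(4)                    # stand-in for a nearby "normal" point
x = np.array([3.0, 0.1, 0.2, 5.0])        # anomalous point to explain

attributions = integrated_gradients(x, baseline, w, b)
most_blamed_dimension = int(np.argmax(attributions))
score_gap = anomaly_score(x, w, b) - anomaly_score(baseline, w, b)
```

<p>The attributions sum (approximately) to the difference in score between the anomalous point and the baseline, so each entry can be read as that dimension's share of the anomaly. Note that the last dimension gets zero blame even though it moved the most, because the toy model ignores it.</p>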
<p>I think the methods in this paper can have a direct impact in my day job and I'm interested to see how they perform on the challenging <a href="https://numenta.com/machine-intelligence-technology/numenta-anomaly-benchmark/">Numenta Anomaly Benchmark</a>. Since the code is <a href="https://github.com/google/madi">open-sourced</a>, this will be a nice weekend project!</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h2 id="Physics">Physics<a class="anchor-link" href="#Physics"> </a></h2>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h3 id="Learning-to-Simulate-Complex-Physics-with-Graph-Networks"><a href="https://proceedings.icml.cc/static/paper_files/icml/2020/6892-Paper.pdf">Learning to Simulate Complex Physics with Graph Networks</a><a class="anchor-link" href="#Learning-to-Simulate-Complex-Physics-with-Graph-Networks"> </a></h3><p>
<center>
<iframe width="560" height="315" src="https://www.youtube.com/embed/h7h9zF8OO7E" frameborder="0" allowfullscreen=""></iframe>
</center>
</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p><em>A single architecture creates high-fidelity particle simulations of various interacting materials.</em></p>
<p>I'm a sucker for flashy demos and this talk from DeepMind didn't disappoint. They propose an "encode-process-decode" architecture to calculate the dynamics of physical systems, where particle states are represented as graphs and a graph neural network learns the particle interactions.</p>
<p><img src="/blog/images/copied_from_nb/my_icons/gns.png" alt="" /></p>
<p>During training, the model predicts each particle's position and velocity one timestep into the future, and these predictions are compared against the ground-truth values of a simulator. Remarkably, this approach generalises to <em>thousands of timesteps</em> at test time, even under different initial conditions and an order of magnitude more particles!<sup id="fnref-3" class="footnote-ref"><a href="#fn-3">3</a></sup></p>
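<p>The gap between one-step training and long rollouts is easier to appreciate in code. The sketch below is only a skeleton under my own assumptions: particles are connected when they lie within a connectivity radius, a hand-written repulsion-plus-gravity function stands in for the learned graph network, and a simple Euler update feeds each prediction back in as the next input.</p>

```python
import numpy as np


def radius_graph(positions, radius):
    """Connect every pair of distinct particles closer than the connectivity radius."""
    diffs = positions[:, None, :] - positions[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    np.fill_diagonal(dists, np.inf)                  # no self-edges
    senders, receivers = np.nonzero(radius > dists)
    return senders, receivers


def toy_interaction_model(positions, velocities, senders, receivers):
    """Stand-in for the learned GNN: soft repulsion between neighbours plus gravity.

    A learned model would also use the velocities; this toy ignores them.
    """
    acc = np.zeros_like(positions)
    acc[:, 1] -= 9.81                                # gravity along the second axis
    rel = positions[receivers] - positions[senders]  # points from sender to receiver
    np.add.at(acc, receivers, 5.0 * rel)             # sum the messages arriving at each particle
    return acc


def rollout(positions, velocities, n_steps, dt=0.01, radius=0.1):
    """Feed each one-step prediction back in as the next input."""
    trajectory = [positions]
    for _ in range(n_steps):
        senders, receivers = radius_graph(positions, radius)  # graph is rebuilt every step
        acc = toy_interaction_model(positions, velocities, senders, receivers)
        velocities = velocities + dt * acc
        positions = positions + dt * velocities
        trajectory.append(positions)
    return np.stack(trajectory)


rng = np.random.default_rng(0)
initial_positions = rng.uniform(0.0, 1.0, size=(50, 2))   # 50 particles in 2D
initial_velocities = np.zeros((50, 2))
trajectory = rollout(initial_positions, initial_velocities, n_steps=100)
```

<p>Since every step only looks at local neighbourhoods, nothing in the loop depends on the total number of particles, which is at least consistent with the generalisation to much larger systems.</p>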
<p>I think this work is a great example of how machine learning can help physicists build better simulations of complex phenomena. It will be interesting to see whether this approach can scale to systems with <em>billions</em> of particles, like those found in <a href="https://wwwmpa.mpa-garching.mpg.de/galform/virgo/millennium/">dark matter simulations</a> or <a href="https://www.youtube.com/watch?v=NhXMXiXOWAA">high-energy collisions</a> at the Large Hadron Collider.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p><div class="footnotes"><p id="fn-1">1. Downscaling is needed because naively training on a $224^2 \times 3$ sequence length would blow up the memory of the largest TPU!<a href="#fnref-1" class="footnote footnotes">↩</a></p></div></p>
<p><div class="footnotes"><p id="fn-2">2. A <em>linear probe</em> refers to using the model as a feature extractor and passing those features through a linear model like logistic regression.<a href="#fnref-2" class="footnote footnotes">↩</a></p></div></p>
<p><div class="footnotes"><p id="fn-3">3. The authors ascribe this generalisation power to the fact that each particle is only aware of local interactions in some 'connectivity radius', so the model is flexible enough to generalise to out-of-distribution inputs.<a href="#fnref-3" class="footnote footnotes">↩</a></p></div></p>
<p><div class="footnotes"><p id="fn-4">4. Quote from Stephen Merity's brilliant <em>Single Headed Attention RNN: Stop Thinking With Your Head</em>.<a href="#fnref-4" class="footnote footnotes">↩</a></p></div></p>
<p><div class="footnotes"><p id="fn-5">5. For example, in a medical context where a patient's vitals may only be measured if the doctor orders a test.<a href="#fnref-5" class="footnote footnotes">↩</a></p></div></p>
<p><div class="footnotes"><p id="fn-6">6. And even if you did, supervised approaches tend to experience 'model rot' quite quickly when dealing with vast streams of data.<a href="#fnref-6" class="footnote footnotes">↩</a></p></div></p>
</div>
</div>
</div>
</div>