During my PhD and postdoc, I kept detailed research notes that I would often revisit to reproduce a lengthy calculation or simply take stock of the progress I'd made on my projects.

For various reasons, I dropped this habit when I switched to industry¹ and nowadays find myself digging out code snippets or techniques from a tangle of Google Docs, Git repositories, and Markdown files that I've built up over the years.

To break this anti-pattern, I've decided to "work in public" as much as possible this year, mostly in the form of TILs and weeknotes. Here, I am drawing inspiration from the prolific Simon Willison, whose blog meticulously documents the development of his open-source projects.²

To that end, here are the first weeknotes of the year; hopefully they're not the last!

Question answering

This week I've been doing a deep dive into extractive question answering as part of a book chapter I'm writing on compression methods for Transformers. Although I built a question answering PoC with BERT in the dark ages of 2019, I was curious to see how the implementation could be done in the transformers library, specifically with a custom Trainer class and with everything running inside Jupyter notebooks.

Fortunately, Sylvain Gugger at HuggingFace had already implemented

A tutorial on fine-tuning language models for question answering, but without a custom Trainer
A custom QuestionAnsweringTrainer as part of the question answering scripts in transformers

so my warm-up task this week was to simply merge the two in a single notebook and fine-tune bert-base-uncased on SQuAD v1.

I implemented a very scrappy version that achieves this in my transformerlab repository, and the main lesson I learnt is that

Dealing with context size is tricky for long documents

Transformer models can only process a finite number of input tokens, a property usually referred to as the maximum context size. As described in Sylvain's tutorial, naive truncation of documents for question answering is problematic because

removing part of the context might result in losing the answer we are looking for.

The solution is to apply a sliding window³ to the input context, so that long contexts are split into multiple features. An example from the tutorial shows how this works by introducing two new hyperparameters, max_length and doc_stride, that control the degree of overlap (the overlapping region is the text that ends the first feature and begins the second):

[CLS] how many wins does the notre dame men's basketball team have? [SEP] the men's basketball team has over 1, 600 wins, one of only 12 schools who have reached that mark, and have appeared in 28 ncaa tournaments. former player austin carr holds the record for most points scored in a single game of the tournament with 61. although the team has never won the ncaa tournament, they were named by the helms athletic foundation as national champions twice. the team has orchestrated a number of upsets of number one ranked teams, the most notable of which was ending ucla's record 88 - game winning streak in 1974. the team has beaten an additional eight number - one teams, and those nine wins rank second, to ucla's 10, all - time in wins against the top team. the team plays in newly renovated purcell pavilion ( within the edmund p. joyce center ), which reopened for the beginning of the 2009 – 2010 season. the team is coached by mike brey, who, as of the 2014 – 15 season, his fifteenth at notre dame, has achieved a 332 - 165 record. in 2009 they were invited to the nit, where they advanced to the semifinals but were beaten by penn state who went on and beat baylor in the championship. the 2010 – 11 team concluded its regular season ranked number seven in the country, with a record of 25 – 5, brey's fifth straight 20 - win season, and a second - place finish in the big east. during the 2014 - 15 season, the team went 32 - 6 and won the acc conference tournament, later advancing to the elite 8, where the fighting irish lost on a missed buzzer - beater against then undefeated kentucky. led by nba draft picks jerian grant and pat connaughton, the fighting irish beat the eventual national champion duke blue devils twice during the season. the 32 wins were [SEP]

[CLS] how many wins does the notre dame men's basketball team have? [SEP] championship. the 2010 – 11 team concluded its regular season ranked number seven in the country, with a record of 25 – 5, brey's fifth straight 20 - win season, and a second - place finish in the big east. during the 2014 - 15 season, the team went 32 - 6 and won the acc conference tournament, later advancing to the elite 8, where the fighting irish lost on a missed buzzer - beater against then undefeated kentucky. led by nba draft picks jerian grant and pat connaughton, the fighting irish beat the eventual national champion duke blue devils twice during the season. the 32 wins were the most by the fighting irish team since 1908 - 09. [SEP]

Remarkably, transformers supports this preprocessing logic out of the box, so one just has to specify a few arguments in the tokenizer:

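# `tokenizer`, `example`, `max_length` and `doc_stride` are assumed to be defined already
tokenized_example = tokenizer(
    example["question"],
    example["context"],
    max_length=max_length,           # maximum tokens per feature (question + context window)
    truncation="only_second",        # only ever truncate the context, never the question
    return_overflowing_tokens=True,  # emit the truncated remainder as additional features
    return_offsets_mapping=True,     # keep character offsets to map tokens back to the text
    stride=doc_stride,               # number of tokens shared by consecutive windows
)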
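
To make the windowing logic concrete, here is a minimal, library-free sketch of a sliding window over a list of tokens. It is not the tokenizer's actual implementation: the function name sliding_windows and its parameters window_size and overlap are hypothetical, and it ignores the question and special tokens that also count towards max_length. The overlap parameter plays the role of stride in the tokenizer call above, i.e. the number of tokens shared by consecutive windows.

```python
def sliding_windows(tokens, window_size, overlap):
    """Split `tokens` into windows of at most `window_size` items,
    where consecutive windows share `overlap` items."""
    if not 0 <= overlap < window_size:
        raise ValueError("overlap must satisfy 0 <= overlap < window_size")
    step = window_size - overlap  # how far each new window advances
    windows = []
    start = 0
    while True:
        windows.append(tokens[start:start + window_size])
        if start + window_size >= len(tokens):
            break  # this window already reaches the end of the input
        start += step
    return windows


# Toy "context" of ten token ids, invented purely for illustration
context_tokens = list(range(10))
windows = sliding_windows(context_tokens, window_size=4, overlap=2)
for window in windows:
    print(window)
```

Setting overlap=0 in this sketch gives the non-overlapping (tumbling) windows that the footnote warns against, since an answer straddling a boundary would then be split across two features.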

One drawback of this approach is that it introduces significant complexity into the data preparation step:

With multiple features per example, one needs to do some heavy wrangling to pick out the start and end positions of each answer. For example, the postprocess_qa_predictions function in Sylvain's tutorial is about 80 lines long, and breaking this down for readers is likely to distract from the main focus on compression methods.
We need slightly different logic for preprocessing the training and validation sets (see the prepare_train_features and prepare_validation_features functions).
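
To give a flavour of the wrangling involved, the sketch below maps an answer's character span onto token indices using an offsets mapping like the one requested with return_offsets_mapping=True. It is a simplified illustration rather than the tutorial's code: the function name char_span_to_token_span is hypothetical, and the offsets come from a toy whitespace tokenization instead of a real tokenizer.

```python
def char_span_to_token_span(offsets, answer_start, answer_end):
    """Return (start_token, end_token) for the characters
    [answer_start, answer_end), or None if the answer is not fully
    contained in this window."""
    token_start = token_end = None
    for i, (start, end) in enumerate(offsets):
        if start == end:
            continue  # skip tokens with empty offsets (e.g. special tokens)
        if token_start is None and start <= answer_start < end:
            token_start = i
        if start < answer_end <= end:
            token_end = i
    if token_start is None or token_end is None:
        return None
    return token_start, token_end


# Toy whitespace tokenization that records (start_char, end_char) per token
context = "the team has over 1600 wins"
offsets = []
position = 0
for word in context.split(" "):
    offsets.append((position, position + len(word)))
    position += len(word) + 1  # +1 for the space separator

answer = "1600 wins"
answer_start = context.index(answer)
answer_end = answer_start + len(answer)

full_window = char_span_to_token_span(offsets, answer_start, answer_end)
cut_window = char_span_to_token_span(offsets[:5], answer_start, answer_end)
print(full_window, cut_window)
```

The second call mimics a window that ends before the answer does, so no complete token span exists. Any real preprocessing function needs a convention for labelling such features, which is part of why the logic grows quickly.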

Instead, I may opt for the simpler but less rigorous approach of truncating the long examples. As shown in the transformers docs, we'd only need to define a custom dataset

import torch

class SquadDataset(torch.utils.data.Dataset):
    def __init__(self, encodings):
        self.encodings = encodings

    def __getitem__(self, idx):
        return {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}

    def __len__(self):
        return len(self.encodings["input_ids"])

and then pass the encodings for the training and validation sets as follows:

train_encodings = tokenizer(train_contexts, train_questions, truncation=True, padding=True)
val_encodings = tokenizer(val_contexts, val_questions, truncation=True, padding=True)

train_dataset = SquadDataset(train_encodings)
val_dataset = SquadDataset(val_encodings)

From here we can just use the native Trainer in transformers, together with the squad metric from datasets. By looking at the distribution of question and context lengths, we can see that this simplification will only fail in a very small number of examples:
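
A rough version of this length check can be sketched as follows. The helper fraction_too_long is hypothetical, the token counts are invented toy values, and max_length=384 is an arbitrary illustrative limit. The default of three special tokens assumes a BERT-style pair encoding of the form [CLS] question [SEP] context [SEP]. Note that the result is only an upper bound on the failures, since truncation loses the answer only when the answer sits in the part that gets cut off.

```python
def fraction_too_long(question_lengths, context_lengths, max_length, num_special_tokens=3):
    """Fraction of (question, context) pairs whose encoded length exceeds `max_length`."""
    if len(question_lengths) != len(context_lengths):
        raise ValueError("expected one question length per context length")
    if not question_lengths:
        return 0.0
    too_long = sum(
        q + c + num_special_tokens > max_length
        for q, c in zip(question_lengths, context_lengths)
    )
    return too_long / len(question_lengths)


# Toy token counts, invented purely for illustration
question_lengths = [12, 9, 15, 11]
context_lengths = [150, 420, 200, 90]
fraction = fraction_too_long(question_lengths, context_lengths, max_length=384)
print(f"{fraction:.0%} of the toy examples would be truncated")
```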

Another alternative would be to adopt the "retriever-reader" architecture that I used in my PoC (where I split long documents into smaller paragraphs), but that introduces its own set of complexities that I'd like to avoid.

Running a mock interview

A friend of mine is applying for a research scientist position and we thought it would be fun to run a couple of mock interviews together. Since the position is likely to involve Transformers, I asked my friend a few GPT-related questions (e.g. how does the architecture differ from BERT and what is the difference between GPT / GPT-2 and GPT-3?), followed by a coding session to see how fast one could implement GPT from scratch. The goal was to approach a skeleton of Andrej Karpathy's excellent minGPT implementation

I wrote a minimal/educational GPT training library in PyTorch, am calling it minGPT as it is only around ~300 lines of code: https://t.co/79S9lShJRN +demos for addition and character-level language model. (quick weekend project, may contain sharp edges)
— Andrej Karpathy (@karpathy) August 17, 2020

and the experience taught me a few lessons:

There's a significant difference between being a power-user of a library like transformers and deeply knowing how every layer, activation function, etc. in a deep neural architecture is put together. Running the interview reminded me that I should aim to block some time per week to hone the foundations of my machine learning knowledge.
Open-ended coding interviews like this are way more fun to conduct than the LeetCode / HackerRank problems one usually encounters in industry. To me, they resemble a pair-programming interaction that gives the interviewer a pretty good feel for what it would be like to work closely with the candidate. Something to remember the next time I'm interviewing people for a real job!

Papers this week

This week I've mostly been reading papers on compressing Transformers and on improving few-shot learning without resorting to massive scaling:

DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter by Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf (2019)
FastFormers: Highly Efficient Transformer Models for Natural Language Understanding by Young Jin Kim and Hany Hassan Awadalla (2020)
Uncertainty-aware Self-training for Text Classification with Few Labels by Subhabrata Mukherjee and Ahmed Hassan Awadallah (2020)

This week also coincided with the release of OpenAI's DALL-E which, although light on implementation details, provided a fun interface to see how far you can push the limits of text-to-image generation:

TIL this week

Polling a web service with bash and jq

1. Mostly due to playing an insane game of "data science catch-up" at an early-stage startup.↩

2. Even down to the level of reviewing his own pull requests!↩

3. We want a sliding window instead of a tumbling one because the answer might appear across the boundary of the two windows.↩