This year I had the opportunity to attend the International Conference on Machine Learning (ICML) and decided to highlight some of the talks I found especially interesting. Although the conference was hosted entirely online, this provided two key benefits over attending in person:

  • Clash resolution: with 1,088 papers accepted, it is inevitable that multiple talks of interest would clash in the timetable. Watching the pre-recorded presentations in my own time provided a simple solution, not to mention the ability to quickly switch to a new talk if desired.
  • Better Q&A sessions: at large conferences it is not easy to get your questions answered directly after a talk, usually because the whole session is running overtime and the moderator wants to move on to the next speaker. With two (!) dedicated Q&A sessions for each talk, I found the discussions to be extremely insightful and much more personalised.

Since I'm resigned to being in quarantine until 2050, I hope other virtual conferences will adopt a similar format. Conference highlights are below!


Predicting the next pixel with a GPT-2 scale model yields high quality representations. The best representations lie in the middle of the network.

This talk showed that with enough compute, it is possible to adapt transformer architectures to images and achieve strong results in self-supervised learning benchmarks. Dubbed iGPT, this approach relies on a three-step process:

  1. Downsize the images, cluster the RGB pixel values to create a 9-bit colour map, and reshape to 1D.[1]
  2. Pre-train on either an autoregressive next pixel or masked pixel prediction task.
  3. Evaluate the quality of the learned representations on downstream tasks.
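To make step 1 a bit more concrete, here is a minimal NumPy sketch of the preprocessing. It assumes average pooling for the downsizing and a plain k-means with 512 centroids for the 9-bit colour map; the function names (`downsize`, `fit_palette`, `to_sequence`) and the random "images" are made up for illustration and are not taken from the iGPT code.

```python
import numpy as np

def downsize(image, factor):
    """Average-pool an (H, W, 3) image by an integer factor."""
    h, w, c = image.shape
    return image.reshape(h // factor, factor, w // factor, factor, c).mean(axis=(1, 3))

def squared_distances(pixels, centroids):
    """Pairwise squared distances, shape (n_pixels, n_centroids)."""
    return (
        (pixels ** 2).sum(axis=1)[:, None]
        - 2.0 * pixels @ centroids.T
        + (centroids ** 2).sum(axis=1)[None, :]
    )

def fit_palette(pixels, n_colours=512, n_iters=10, seed=0):
    """Plain k-means over RGB values; 512 centroids = one 9-bit index per pixel."""
    rng = np.random.default_rng(seed)
    centroids = pixels[rng.choice(len(pixels), size=n_colours, replace=False)].copy()
    for _ in range(n_iters):
        assign = squared_distances(pixels, centroids).argmin(axis=1)
        for k in range(n_colours):
            members = pixels[assign == k]
            if len(members) > 0:  # leave empty clusters where they are
                centroids[k] = members.mean(axis=0)
    return centroids

def to_sequence(image, centroids):
    """Replace each pixel by its nearest palette index and flatten to 1D."""
    flat = image.reshape(-1, 3)
    return squared_distances(flat, centroids).argmin(axis=1)

# Toy data: four random 64x64 RGB "images" standing in for a real dataset.
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(4, 64, 64, 3)).astype(np.float64)
small = np.stack([downsize(im, 4) for im in images])   # (4, 16, 16, 3)
palette = fit_palette(small.reshape(-1, 3))             # (512, 3)
sequence = to_sequence(small[0], palette)               # (256,) tokens in [0, 512)
```

The resulting `sequence` is what a next-pixel or masked-pixel objective would be trained on: a 1D array of integer tokens, with no trace of the original 2D layout.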

One surprising result of the linear probe[2] experiments is that representation quality tends to be highest in the middle of the network.

I think this work provides a compelling example of Sutton's "bitter lesson"

Early methods conceived of vision as searching for edges, or generalized cylinders, or in terms of SIFT features. But today all this is discarded. Modern deep-learning neural networks use only the notions of convolution and certain kinds of invariances, and perform much better.

but takes it one step further by discarding knowledge of the 2D structure in images entirely!

Although the iGPT models are 2-30 times larger than ResNet-152, I expect it is only a matter of time before people find ways to make this approach more efficient. In the meantime, it's nice to see that the pre-trained models have been open-sourced and a port to HuggingFace's transformers library is already underway.

Augmenting language models with knowledge retrieval sets a new benchmark for open-domain question answering.

I liked this talk a lot because it takes a non-trivial step towards integrating world knowledge into language models and addresses Gary Marcus' common complaint that data and compute aren't enough to produce Real Intelligence™.

To integrate knowledge into language model pretraining, this talk proposes adding a text retriever that is learned during the training process. Unsurprisingly, this introduces a major computational challenge because the conditional probability now involves a sum over all documents in a corpus $\mathcal{Z}$:

$$ p(y|x) = \sum_{z\in \mathcal{Z}} p(y|x,z)\,p(z|x)\,.$$

To deal with this, the authors compute an embedding for every document in the corpus and then use Maximum Inner Product Search algorithms to find the approximate top $k$ documents. The result is a hybrid model that significantly outperforms other approaches in open-domain question answering.
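As a rough sketch of the top-$k$ trick, the snippet below scores documents by inner product, keeps the $k$ best, and mixes the reader's answer distributions with a softmax over the retrieval scores. It uses exact search where a real system would use an approximate Maximum Inner Product Search index, and everything here (the random embeddings, the `reader_table` lookup standing in for the reader model, the function names) is hypothetical.

```python
import numpy as np

def softmax(x):
    x = x - x.max()  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum()

def retrieve_top_k(query_emb, doc_embs, k):
    """Exact inner product search (a stand-in for approximate MIPS)."""
    scores = doc_embs @ query_emb
    top = np.argpartition(-scores, k - 1)[:k]   # k best, in arbitrary order
    top = top[np.argsort(-scores[top])]         # sort those k by score
    return top, scores[top]

def marginal_answer_probs(query_emb, doc_embs, answer_probs_given_doc, k):
    """Approximate the sum over the corpus by a sum over the top-k documents."""
    top, scores = retrieve_top_k(query_emb, doc_embs, k)
    retrieval_weights = softmax(scores)          # renormalised over the top-k only
    reader_probs = np.stack([answer_probs_given_doc(z) for z in top])  # (k, n_answers)
    return retrieval_weights @ reader_probs

# Toy corpus: 1000 documents, 16-dim embeddings, 5 candidate answers.
rng = np.random.default_rng(0)
doc_embs = rng.normal(size=(1000, 16))
query_emb = rng.normal(size=16)
reader_table = rng.dirichlet(np.ones(5), size=1000)  # hypothetical reader outputs

probs = marginal_answer_probs(query_emb, doc_embs, lambda z: reader_table[z], k=8)
top_docs, top_scores = retrieve_top_k(query_emb, doc_embs, k=8)
```

The cost of the sum now depends on $k$ rather than on the corpus size, which is the whole point of the approximation.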

A clever choice of kernel reduces the computational complexity of attention from $O(N^2)$ to $O(N)$. Generate images 4000x faster than vanilla transformers.

It's refreshing to see a transformer talk that isn't about using a "bonfire worth of GPU-TPU-neuromorphic wafer scale silicon"[4] to break NLP benchmarks. This talk observes that the main bottleneck in vanilla transformer models is the softmax attention computation

$$ V' = \mathrm{softmax} \left(\frac{QK^T}{\sqrt{D}} \right) V $$

whose time and space complexity is $O(N^2)$ for sequence length $N$. To get around this, the authors first use a similarity function to obtain a generalised form of self-attention

$$ V_i' = \frac{\sum_j \mathrm{sim}(Q_i, K_j)V_j}{\sum_j \mathrm{sim}(Q_i, K_j)} $$

which can be simplified via a choice of kernel and matrix associativity:

$$V_i' = \frac{\phi(Q_i)^T\sum_j\phi(K_j)V_j^T}{\phi(Q_i)^T\sum_j\phi(K_j)}\,. $$

The result is a self-attention step that is $O(N)$ because the sums in the above expression can be computed once and reused for every query. In practice, this turns out to be especially powerful for inference, with speed-ups of 4000x reported in the talk!
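The associativity trick is easy to check numerically. The sketch below assumes the positive feature map $\phi(x) = \mathrm{elu}(x) + 1$ (any positive map would do for this illustration) and compares the quadratic form, which builds the full $N \times N$ similarity matrix, with the linear form, which only builds $D \times M$ and $D$-sized sums. The function names are mine, not the authors'.

```python
import numpy as np

def phi(x):
    """elu(x) + 1, a positive feature map (assumed here for illustration)."""
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def quadratic_attention(Q, K, V):
    """Generalised attention with sim(q, k) = phi(q)^T phi(k): O(N^2) in sequence length."""
    sim = phi(Q) @ phi(K).T                          # (N, N)
    return (sim @ V) / sim.sum(axis=1, keepdims=True)

def linear_attention(Q, K, V):
    """Same quantity, but the sums over j are computed once: O(N) in sequence length."""
    KV = phi(K).T @ V                                # (D, M): sum_j phi(K_j) V_j^T
    Z = phi(K).sum(axis=0)                           # (D,):   sum_j phi(K_j)
    return (phi(Q) @ KV) / (phi(Q) @ Z)[:, None]

rng = np.random.default_rng(0)
N, D, M = 128, 16, 32
Q, K, V = rng.normal(size=(N, D)), rng.normal(size=(N, D)), rng.normal(size=(N, M))

out_quadratic = quadratic_attention(Q, K, V)
out_linear = linear_attention(Q, K, V)
```

Both functions return the same values up to floating-point error; only the order of the matrix products differs.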

The authors go on to show that their formulation can also be used to express transformers as RNNs, which might be an interesting way to explore the shortcomings of these large language models.
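To see the RNN view in action, here is a small sketch of causal (autoregressive) linear attention written as a recurrence: a matrix state `S` and a vector state `z` are updated once per timestep, and each output only depends on those two running sums. It again assumes $\phi(x) = \mathrm{elu}(x) + 1$, and the masked quadratic version is included only as a reference to check against.

```python
import numpy as np

def phi(x):
    """elu(x) + 1, a positive feature map (assumed here for illustration)."""
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def causal_attention_quadratic(Q, K, V):
    """Reference: mask out future positions in the full N x N similarity matrix."""
    sim = np.tril(phi(Q) @ phi(K).T)
    return (sim @ V) / sim.sum(axis=1, keepdims=True)

def causal_attention_rnn(Q, K, V):
    """Same output, computed one timestep at a time with a constant-size state."""
    N, D = Q.shape
    M = V.shape[1]
    S = np.zeros((D, M))   # running sum of phi(K_j) V_j^T for j <= i
    z = np.zeros(D)        # running sum of phi(K_j) for j <= i
    out = np.empty((N, M))
    for i in range(N):
        k, q = phi(K[i]), phi(Q[i])
        S += np.outer(k, V[i])
        z += k
        out[i] = (q @ S) / (q @ z)
    return out

rng = np.random.default_rng(1)
N, D, M = 64, 8, 8
Q, K, V = rng.normal(size=(N, D)), rng.normal(size=(N, D)), rng.normal(size=(N, M))

out_reference = causal_attention_quadratic(Q, K, V)
out_rnn = causal_attention_rnn(Q, K, V)
```

Because the state has a fixed size, generating each new token costs the same no matter how long the sequence already is, which is where the large inference speed-ups come from.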

A new benchmark to test zero-shot cross-lingual transfer from English to 39 diverse languages.

In this talk, the authors introduce the XTREME benchmark to evaluate the ability of multilingual representations to generalise across 40 languages and 9 tasks. To evaluate a model in XTREME, the main idea is to follow a three-stage recipe:

  1. Pre-train on a large corpus of multilingual text.
  2. Fine-tune on English data for each task.
  3. Evaluate the model on zero-shot transfer performance, e.g. evaluate the accuracy on a German text classification task.

English is chosen for fine-tuning because it's the language with the most labelled data, and the authors employ a neat trick using Google Translate to generate proxy test sets for the tasks where a pre-existing translation does not exist.

Although not strictly about Transformers, the baseline models for this benchmark are all variants of the Transformer architecture, and the authors find that XLM-R achieves the best zero-shot transfer performance across all languages in each task. What I especially like about XTREME is that the tasks are designed to be trainable on a single GPU for less than a day. This should make it possible for research labs with tight budgets to create competitive models, where the gains in performance are likely to come from architectural design rather than simply scaling up the compute.

I'm excited about this benchmark because I expect it will produce models that have a direct impact on my professional work in Switzerland. With four national languages and a smattering of English, building natural language applications that serve the whole population is a constant challenge.

Time series

High-performance classification for multivariate, irregularly sampled time series.

Time series seems to be the neglected child of machine learning research, so I was excited to see a talk that combines a lot of cool ideas like Deep Sets, attention, and positional encodings in a new architecture. The motivation for this work is based on the observation that:

  • Imputation techniques for sparse or irregularly sampled time series introduce bias or don't make sense at all.[5]
  • Many time series of practical interest are multivariate in nature, and often with unaligned measurements.

The authors note that for time series classification tasks, the order of input measurements is not important (each observation carries its own timestamp) and thus one can reframe the problem as classifying a set of observations. By representing each observation as a tuple $(t_i, z_i, m_i)$ of timestamp $t_i$, observed value $z_i$ and modality indicator $m_i$ (which variable was measured), an entire time series can be written as

$$\mathcal{S} = \{(t_1,z_1,m_1), \ldots , (t_M, z_M, m_M) \}$$

The goal is then to learn a function $f: \mathcal{S} \to \mathbb{R}^C$, which the authors do via the Deep Sets approach to obtain a highly scalable architecture. One aspect I especially liked in this talk is the use of attention to visualise which observations contributed to the model output.
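The set formulation is simple enough to sketch in a few lines. The snippet below is a bare-bones Deep Sets classifier with random, untrained weights: each $(t_i, z_i, m_i)$ tuple is embedded independently, the embeddings are mean-pooled, and a final linear map produces $C$ logits. It deliberately leaves out the positional encoding of time and the attention-based aggregation used in the talk, and all names and sizes are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_modalities, hidden, n_classes = 3, 16, 2

# Random (untrained) weights for the per-observation encoder and the classifier head.
W_h = rng.normal(size=(2 + n_modalities, hidden))
W_g = rng.normal(size=(hidden, n_classes))

def encode(observations):
    """Turn (t, z, m) tuples into an array of shape (M, 2 + n_modalities)."""
    t = np.array([o[0] for o in observations], dtype=float)
    z = np.array([o[1] for o in observations], dtype=float)
    m = np.eye(n_modalities)[[o[2] for o in observations]]  # one-hot modality
    return np.column_stack([t, z, m])

def classify_set(observations):
    """Deep Sets: embed each element, pool with a symmetric function, then classify."""
    h = np.tanh(encode(observations) @ W_h)   # (M, hidden), one row per observation
    pooled = h.mean(axis=0)                   # order of observations does not matter
    return pooled @ W_g                       # (n_classes,) logits

# An irregularly sampled series: (timestamp, value, which variable was measured).
series = [(0.0, 36.6, 0), (0.4, 80.0, 1), (1.7, 37.2, 0), (1.9, 0.97, 2), (3.2, 95.0, 1)]

logits = classify_set(series)
logits_shuffled = classify_set(series[::-1])
```

Shuffling the observations leaves the output unchanged, and series of different lengths need no padding or imputation, which is exactly what makes the approach attractive for unaligned measurements.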

In industry it is quite common for domain experts to have a different mental model on how to interpret the predictions from your model, and visualisations like these could be really handy as a common discussion point. I'm quite excited to see if I can use this approach to tackle some thorny time series problems at work!

A new unsupervised anomaly detection algorithm for IoT devices.

This talk proposes a new technique to distinguish "normal" from "abnormal" events in streams of telemetry data from IoT devices. Like almost every real-world anomaly detection problem, one rarely has training data with labelled anomalies.[6]

The main novelty in this talk is a method to deal with the lack of labels by framing the problem as a binary classification task, where one class contains positive (mostly "normal") samples while the other contains negative samples that are supposed to represent the space of anomalies. A sample ratio parameter $r_s$ controls the ratio of negative to positive sample sizes and acts as a sort of hyperparameter or threshold that is tuned.
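Here is a minimal sketch of how such a "labelled" dataset could be built. It assumes the negative samples are drawn uniformly from a slightly enlarged bounding box around the observed data, which is my simplification and may differ from the paper's sampling scheme; the `margin` parameter and the function name are hypothetical.

```python
import numpy as np

def build_labelled_dataset(positive, sample_ratio, margin=0.1, seed=0):
    """Label observed points 1 and uniformly sampled points 0.

    sample_ratio plays the role of r_s: the number of negative samples
    per positive sample.
    """
    rng = np.random.default_rng(seed)
    lo, hi = positive.min(axis=0), positive.max(axis=0)
    span = hi - lo
    n_negative = int(round(sample_ratio * len(positive)))
    # Assumption: negatives cover a box slightly larger than the observed range.
    negative = rng.uniform(
        lo - margin * span, hi + margin * span, size=(n_negative, positive.shape[1])
    )
    X = np.vstack([positive, negative])
    y = np.concatenate([np.ones(len(positive)), np.zeros(n_negative)])
    return X, y

# Toy telemetry: 500 mostly "normal" readings in 6 dimensions.
rng = np.random.default_rng(42)
telemetry = rng.normal(loc=20.0, scale=2.0, size=(500, 6))

X, y = build_labelled_dataset(telemetry, sample_ratio=3.0)
```

Any off-the-shelf binary classifier can then be trained on `X` and `y`, with its predicted probability of the positive class acting as a "normality" score.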

Although this method will generate false positive and false negative labelling errors, the author notes that the former are rare (by definition) and the latter decay exponentially for high-dimensional time series. Once the "labelled" dataset is created, it is then a simple matter to train a classifier and the talk notes that both neural nets and random forests perform comparably well.

One really neat aspect of this work is that it also introduces a novel way to interpret anomalies for root-cause analysis. The aim here is to figure out which dimensions contribute most to an anomaly score and the talk proposes a method based on integrated gradients. Here the basic idea is to identify which dimensions of the time series must be changed to transform an anomalous point into a normal one.
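To give a flavour of the attribution step, the sketch below implements integrated gradients with a midpoint Riemann sum and numerical gradients, so it works for any scalar scoring function. The toy `anomaly_score`, the choice of a "normal" baseline at the origin, and the function names are all assumptions for illustration; the talk's method picks its baseline and model differently.

```python
import numpy as np

def numerical_grad(f, x, eps=1e-5):
    """Central-difference gradient of a scalar function f at x."""
    grad = np.zeros_like(x)
    for i in range(len(x)):
        step = np.zeros_like(x)
        step[i] = eps
        grad[i] = (f(x + step) - f(x - step)) / (2 * eps)
    return grad

def integrated_gradients(f, x, baseline, n_steps=200):
    """Attribute f(x) - f(baseline) to each dimension along the straight-line path."""
    alphas = (np.arange(n_steps) + 0.5) / n_steps   # midpoints of [0, 1]
    grads = [numerical_grad(f, baseline + a * (x - baseline)) for a in alphas]
    return (x - baseline) * np.mean(grads, axis=0)

def anomaly_score(x):
    """Toy score: close to 0 near the origin, approaching 1 far away from it."""
    return 1.0 - np.exp(-0.5 * np.sum(x ** 2))

normal_point = np.zeros(3)                  # assumed "normal" reference point
anomalous_point = np.array([0.1, 3.0, -0.2])

attributions = integrated_gradients(anomaly_score, anomalous_point, normal_point)
score_gap = anomaly_score(anomalous_point) - anomaly_score(normal_point)
```

In this toy case almost all of the attribution lands on the second dimension, the one that is far from normal, and the attributions sum (approximately) to the difference in score between the anomalous point and the baseline.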

I think the methods in this paper can have a direct impact in my day job and I'm interested to see how it performs on the challenging Numenta Anomaly Benchmark. Since the code is open-sourced, this will be a nice weekend project!


A single architecture creates high-fidelity particle simulations of various interacting materials.

I'm a sucker for flashy demos and this talk from DeepMind didn't disappoint. They propose an "encode-process-decode" architecture to calculate the dynamics of physical systems, where particle states are represented as graphs and a graph neural network learns the particle interactions.

During training, the model predicts each particle's position and velocity one timestep into the future, and these predictions are compared against the ground-truth values of a simulator. Remarkably, this approach generalises to thousands of timesteps at test time, even under different initial conditions and an order of magnitude more particles![3]
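As a very rough illustration of the idea, the sketch below builds a graph by connecting particles within a fixed radius, runs a single round of message passing with random (untrained) weights to predict accelerations, and takes one Euler step. The real model uses learned encoders, several message-passing rounds and a learned decoder; the layer sizes, weights and function names here are invented.

```python
import numpy as np

def build_edges(positions, radius):
    """Connect every pair of distinct particles closer than `radius`."""
    diff = positions[:, None, :] - positions[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    mask = (dist < radius) & ~np.eye(len(positions), dtype=bool)
    receivers, senders = np.nonzero(mask)
    return senders, receivers

def predict_acceleration(positions, velocities, radius, W_edge, W_node):
    """One message-passing round: edge messages are summed at each receiver."""
    senders, receivers = build_edges(positions, radius)
    rel = positions[senders] - positions[receivers]            # relative positions only
    edge_feat = np.column_stack([rel, np.linalg.norm(rel, axis=1)])
    messages = np.tanh(edge_feat @ W_edge)                      # (n_edges, hidden)
    aggregated = np.zeros((len(positions), W_edge.shape[1]))
    np.add.at(aggregated, receivers, messages)                  # sum incoming messages
    return np.column_stack([velocities, aggregated]) @ W_node   # (n_particles, dim)

def euler_step(positions, velocities, acceleration, dt=0.01):
    new_velocities = velocities + dt * acceleration
    return positions + dt * new_velocities, new_velocities

rng = np.random.default_rng(0)
n_particles, dim, hidden = 50, 2, 8
positions = rng.uniform(size=(n_particles, dim))
velocities = rng.normal(scale=0.1, size=(n_particles, dim))
W_edge = rng.normal(size=(dim + 1, hidden))     # untrained, for illustration only
W_node = rng.normal(size=(dim + hidden, dim))

accel = predict_acceleration(positions, velocities, 0.2, W_edge, W_node)
accel_shifted = predict_acceleration(positions + 5.0, velocities, 0.2, W_edge, W_node)
next_positions, next_velocities = euler_step(positions, velocities, accel)
```

Because only relative positions within the radius enter the computation, shifting the whole system leaves the predicted accelerations unchanged, and nothing in the code depends on the number of particles.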

I think this work is a great example of how machine learning can help physicists build better simulations of complex phenomena. It will be interesting to see whether this approach can scale to systems with billions of particles, like those found in dark matter simulations or high-energy collisions at the Large Hadron Collider.

1. Downscaling is needed because naively training on a $224^2 \times 3$ sequence length would blow up the memory of the largest TPU!

2. A linear probe refers to using the model as a feature extractor and passing those features through a linear model like logistic regression.

3. The authors ascribe this generalisation power to the fact that each particle is only aware of local interactions in some 'connectivity radius', so the model is flexible enough to generalise to out-of-distribution inputs.

4. Quote from Stephen Merity's brilliant Single Headed Attention RNN: Stop Thinking With Your Head.

5. For example, in a medical context where a patient's vitals may only be measured if the doctor orders a test.

6. And even if you did, supervised approaches tend to experience 'model rot' quite quickly when dealing with vast streams of data.