Recently, Sylvain Gugger from HuggingFace has created some nice tutorials on using transformers for text classification and named entity recognition. One trick that caught my attention was the use of a data collator in the trainer, which automatically pads the model inputs in a batch to the length of the longest example. This bypasses the need to set a global maximum sequence length, and in practice leads to faster training since we perform fewer redundant computations on the padded tokens and attention masks.

I wanted to use a data collator for both training and error analysis (e.g. by inspecting the top losses of the model). One problem: during training, each batch is collated on the fly, so how do I pad my inputs in subsequent Dataset.map operations?

For sequence classification tasks, the solution I ended up with was to simply grab the data collator from the trainer and use it in my post-processing functions:

data_collator = trainer.data_collator

def processing_function(batch):
    batch = data_collator(batch)
    ...
    return batch


For token classification tasks, there is a dedicated DataCollatorForTokenClassification which expects a list of dicts, where each dict represents a single example in the dataset. Since a Dataset slice returns a dict of lists, we need two more lines to wrangle the data into the expected format:

from transformers import DataCollatorForTokenClassification

data_collator = DataCollatorForTokenClassification(trainer.tokenizer)

def processing_function(batch):
    # convert dict of lists to list of dicts
    features = [dict(zip(batch, t)) for t in zip(*batch.values())]
    batch = data_collator(features)
    ...
    return batch
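To see what that one-liner does, here is a minimal, self-contained sketch with a made-up toy batch (no collator involved, just the dict-of-lists to list-of-dicts conversion):

```python
# A toy batch in the dict-of-lists layout that a Dataset slice returns.
batch = {"input_ids": [[0, 1], [2, 3, 4]], "labels": [0, 1]}

# zip(*batch.values()) walks the examples in parallel, and dict(zip(batch, t))
# pairs each column name with that example's values.
features = [dict(zip(batch, t)) for t in zip(*batch.values())]
print(features)
# [{'input_ids': [0, 1], 'labels': 0}, {'input_ids': [2, 3, 4], 'labels': 1}]
```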


For an end-to-end example, let's grab 1,000 examples from the IMDB dataset:

from datasets import load_dataset

imdb = (load_dataset('imdb', split='train')
        .train_test_split(train_size=800, test_size=200))
imdb

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 800
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 200
    })
})

Next, let's load a pretrained model and its corresponding tokenizer:

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

num_labels = 2
model_name = 'distilbert-base-cased'
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = (AutoModelForSequenceClassification
         .from_pretrained(model_name, num_labels=num_labels)
         .to(device))


Before fine-tuning the model, we need to tokenize and encode the dataset, so let's do that with a simple Dataset.map operation:

def tokenize_and_encode(batch):
    return tokenizer(batch['text'], truncation=True)

imdb_enc = imdb.map(tokenize_and_encode, batched=True)
imdb_enc

DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'input_ids', 'attention_mask'],
        num_rows: 800
    })
    test: Dataset({
        features: ['text', 'label', 'input_ids', 'attention_mask'],
        num_rows: 200
    })
})

The final step is to define the metrics

import numpy as np
from datasets import load_metric

accuracy_score = load_metric('accuracy')

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return accuracy_score.compute(predictions=predictions, references=labels)
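As a quick sanity check of the argmax step, here are some made-up logits for four examples (the accuracy is computed by hand here, purely for illustration, rather than with the metric object):

```python
import numpy as np

# Hypothetical logits for four examples, plus their true labels.
predictions = np.array([[0.2, 0.8], [0.9, 0.1], [0.3, 0.7], [0.6, 0.4]])
labels = np.array([1, 0, 0, 0])

preds = np.argmax(predictions, axis=1)  # [1, 0, 1, 0]
accuracy = float((preds == labels).mean())  # 3 of 4 predictions are correct
print(accuracy)  # 0.75
```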


the arguments for the trainer

from transformers import TrainingArguments

batch_size = 16
logging_steps = len(imdb_enc['train']) // batch_size

training_args = TrainingArguments(
    output_dir="results",
    num_train_epochs=1,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    evaluation_strategy="epoch",
    disable_tqdm=False,
    logging_steps=logging_steps)
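A quick note on where that logging_steps value comes from: with 800 training examples and a batch size of 16, one epoch is 50 optimisation steps, so logging once per epoch means logging every 50 steps:

```python
# One epoch of 800 examples at batch size 16 is 50 optimisation steps,
# so logging_steps = 800 // 16 logs exactly once per epoch.
num_train_examples = 800
batch_size = 16
steps_per_epoch = num_train_examples // batch_size
print(steps_per_epoch)  # 50
```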


and the trainer itself:

Important: The trainer will remove in-place any dataset columns of str type, so in this example imdb_enc loses the text column.

from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=imdb_enc['train'],
    eval_dataset=imdb_enc['test'],
    tokenizer=tokenizer)

trainer.train();

[50/50 00:32, Epoch 1/1]
Epoch Training Loss Validation Loss Accuracy
1 0.390015 0.328747 0.875000


By default, the Trainer class uses the simple default_data_collator to collate batches of dict-like objects, but by passing the tokenizer we get a DataCollatorWithPadding instead:

data_collator = trainer.data_collator
type(data_collator)

transformers.data.data_collator.DataCollatorWithPadding

To see how this collator works, let's pass a dummy batch and observe that both the input_ids and attention_mask are padded as expected:

batch = {'input_ids': [[0,1,2], [0,1,2,3,4,5]]}
data_collator(batch)

{'input_ids': tensor([[0, 1, 2, 0, 0, 0],
                      [0, 1, 2, 3, 4, 5]]),
 'attention_mask': tensor([[1, 1, 1, 0, 0, 0],
                           [1, 1, 1, 1, 1, 1]])}

Finally, we can calculate the loss per example with the following function:

def loss_per_example(batch):
    batch = data_collator(batch)
    input_ids = torch.tensor(batch["input_ids"], device=device)
    attention_mask = torch.tensor(batch["attention_mask"], device=device)
    labels = torch.tensor(batch["labels"], device=device)

    with torch.no_grad():
        output = model(input_ids, attention_mask)
        batch["predicted_label"] = torch.argmax(output.logits, axis=1)

    loss = torch.nn.functional.cross_entropy(
        output.logits, labels, reduction="none")
    batch["loss"] = loss

    # datasets requires list of NumPy array data types
    for k, v in batch.items():
        batch[k] = v.cpu().numpy()

    return batch

losses_ds = imdb_enc['test'].map(
    loss_per_example, batched=True, batch_size=batch_size)
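The key detail above is reduction="none", which returns one loss per example instead of a batch average. Here is a NumPy sketch of the same computation (a hand-rolled, illustrative stand-in for torch.nn.functional.cross_entropy, with made-up logits):

```python
import numpy as np

def per_example_cross_entropy(logits, labels):
    # Softmax cross-entropy with no reduction: one loss value per example.
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels]

logits = np.array([[2.0, 0.0],   # confidently predicts class 0
                   [0.0, 2.0]])  # confidently predicts class 1
labels = np.array([0, 0])        # both true labels are class 0

losses = per_example_cross_entropy(logits, labels)
# the confidently correct first example gets a small loss,
# the confidently wrong second example a large one
print(losses)
```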


It's then a simple matter to convert losses_ds to a pandas.DataFrame and sort by loss to find the examples where the model is most confused:

import pandas as pd
pd.set_option("display.max_colwidth", None)

losses_ds.set_format('pandas')
losses_df = losses_ds[:][['label', 'predicted_label', 'loss']]
# add the text column removed by the trainer
losses_df['text'] = imdb['test']['text']
losses_df.sort_values("loss", ascending=False).head(10)