Recently, Sylvain Gugger from HuggingFace has created some nice tutorials on using transformers for text classification and named entity recognition. One trick that caught my attention was the use of a data collator in the trainer, which automatically pads the model inputs in a batch to the length of the longest example. This bypasses the need to set a global maximum sequence length, and in practice leads to faster training since we perform fewer redundant computations on the padded tokens and attention masks.

I wanted to use a data collator for both training and error analysis (e.g. by inspecting the top losses of the model). One problem: during training, each batch is collated on the fly, so how do I pad my inputs in subsequent Dataset.map operations?

For sequence classification tasks, the solution I ended up with was to simply grab the data collator from the trainer and use it in my post-processing functions:

data_collator = trainer.data_collator

def processing_function(batch):
    batch = data_collator(batch)
    ...
    return batch


For token classification tasks, there is a dedicated DataCollatorForTokenClassification which expects a list of dicts, where each dict represents a single example in the dataset. Since a Dataset slice returns a dict of lists, we need two more lines to wrangle the data into the expected format:

from transformers import DataCollatorForTokenClassification

data_collator = DataCollatorForTokenClassification(trainer.tokenizer)

def processing_function(batch):
    # convert dict of lists to list of dicts
    features = [dict(zip(batch, t)) for t in zip(*batch.values())]
    batch = data_collator(features)
    ...
    return batch
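To see what that one-liner does, here is a minimal, self-contained sketch with a made-up toy batch (no collator involved, just the dict-of-lists to list-of-dicts conversion):

```python
# A toy batch in the dict-of-lists layout that a Dataset slice returns.
batch = {"input_ids": [[0, 1], [2, 3, 4]], "labels": [0, 1]}

# zip(*batch.values()) walks the examples in parallel, and dict(zip(batch, t))
# pairs each column name with that example's values.
features = [dict(zip(batch, t)) for t in zip(*batch.values())]
print(features)
# [{'input_ids': [0, 1], 'labels': 0}, {'input_ids': [2, 3, 4], 'labels': 1}]
```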


For an end-to-end example, let's grab 1,000 examples from the IMDB dataset:

from datasets import load_dataset

imdb = (load_dataset('imdb', split='train')
        .train_test_split(train_size=800, test_size=200))
imdb

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 800
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 200
    })
})

Next, let's load a pretrained model and its corresponding tokenizer:

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

num_labels = 2
model_name = 'distilbert-base-cased'
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = (AutoModelForSequenceClassification
         .from_pretrained(model_name, num_labels=num_labels)
         .to(device))


Before fine-tuning the model, we need to tokenize and encode the dataset, so let's do that with a simple Dataset.map operation:

def tokenize_and_encode(batch):
    return tokenizer(batch['text'], truncation=True)

imdb_enc = imdb.map(tokenize_and_encode, batched=True)
imdb_enc

DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'input_ids', 'attention_mask'],
        num_rows: 800
    })
    test: Dataset({
        features: ['text', 'label', 'input_ids', 'attention_mask'],
        num_rows: 200
    })
})

The final step is to define the metrics

import numpy as np
from datasets import load_metric

accuracy_score = load_metric('accuracy')

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return accuracy_score.compute(predictions=predictions, references=labels)
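As a quick sanity check of the argmax step, here are some made-up logits for four examples (the accuracy is computed by hand here, purely for illustration, rather than with the metric object):

```python
import numpy as np

# Hypothetical logits for four examples, plus their true labels.
predictions = np.array([[0.2, 0.8], [0.9, 0.1], [0.3, 0.7], [0.6, 0.4]])
labels = np.array([1, 0, 0, 0])

preds = np.argmax(predictions, axis=1)  # [1, 0, 1, 0]
accuracy = float((preds == labels).mean())  # 3 of 4 predictions are correct
print(accuracy)  # 0.75
```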


the arguments for the trainer

from transformers import TrainingArguments

batch_size = 16
logging_steps = len(imdb_enc['train']) // batch_size

training_args = TrainingArguments(
    output_dir="results",
    num_train_epochs=1,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    evaluation_strategy="epoch",
    disable_tqdm=False,
    logging_steps=logging_steps)
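A quick note on where that logging_steps value comes from: with 800 training examples and a batch size of 16, one epoch is 50 optimisation steps, so logging once per epoch means logging every 50 steps:

```python
# One epoch of 800 examples at batch size 16 is 50 optimisation steps,
# so logging_steps = 800 // 16 logs exactly once per epoch.
num_train_examples = 800
batch_size = 16
steps_per_epoch = num_train_examples // batch_size
print(steps_per_epoch)  # 50
```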


and the trainer itself:

Important: The trainer will remove in-place any dataset columns of str type, so in this example imdb_enc loses the text column.

from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=imdb_enc['train'],
    eval_dataset=imdb_enc['test'],
    tokenizer=tokenizer)

trainer.train();

[50/50 00:32, Epoch 1/1]
Epoch Training Loss Validation Loss Accuracy
1 0.390015 0.328747 0.875000


By default, the Trainer class uses the simple default_data_collator to collate batches of dict-like objects, but by passing the tokenizer we get a DataCollatorWithPadding instead:

data_collator = trainer.data_collator
type(data_collator)

transformers.data.data_collator.DataCollatorWithPadding

To see how this collator works, let's pass a dummy batch and observe that both the input_ids and attention_mask are padded as expected:

batch = {'input_ids': [[0,1,2], [0,1,2,3,4,5]]}
data_collator(batch)

{'input_ids': tensor([[0, 1, 2, 0, 0, 0],
                      [0, 1, 2, 3, 4, 5]]),
 'attention_mask': tensor([[1, 1, 1, 0, 0, 0],
                           [1, 1, 1, 1, 1, 1]])}

Finally, we can calculate the loss per example with the following function:

def loss_per_example(batch):
    batch = data_collator(batch)
    input_ids = torch.tensor(batch["input_ids"], device=device)
    attention_mask = torch.tensor(batch["attention_mask"], device=device)
    labels = torch.tensor(batch["labels"], device=device)

    with torch.no_grad():
        output = model(input_ids, attention_mask)
        batch["predicted_label"] = torch.argmax(output.logits, axis=1)

    loss = torch.nn.functional.cross_entropy(
        output.logits, labels, reduction="none")
    batch["loss"] = loss

    # datasets requires list of NumPy array data types
    for k, v in batch.items():
        batch[k] = v.cpu().numpy()

    return batch

losses_ds = imdb_enc['test'].map(
    loss_per_example, batched=True, batch_size=batch_size)
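The key detail above is reduction="none", which returns one loss per example instead of a batch average. Here is a NumPy sketch of the same computation (a hand-rolled, illustrative stand-in for torch.nn.functional.cross_entropy, with made-up logits):

```python
import numpy as np

def per_example_cross_entropy(logits, labels):
    # Softmax cross-entropy with no reduction: one loss value per example.
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels]

logits = np.array([[2.0, 0.0],   # confidently predicts class 0
                   [0.0, 2.0]])  # confidently predicts class 1
labels = np.array([0, 0])        # both true labels are class 0

losses = per_example_cross_entropy(logits, labels)
# the confidently correct first example gets a small loss,
# the confidently wrong second example a large one
print(losses)
```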


It's then a simple matter to convert losses_ds to a pandas.DataFrame and sort by loss to find the examples where the model is most confused:

import pandas as pd
pd.set_option("display.max_colwidth", None)

losses_ds.set_format('pandas')
losses_df = losses_ds[:][['label', 'predicted_label', 'loss']]
# add the text column removed by the trainer
losses_df['text'] = imdb['test']['text']
losses_df.sort_values("loss", ascending=False).head(10)