Recovering columns hidden by the 🤗 Trainer
Lately, I've been using the transformers Trainer together with the datasets library, and I was a bit mystified by the disappearance of some columns in the training and validation sets after fine-tuning. It wasn't until I saw Sylvain Gugger's tutorial on question answering that I realised this is by design! Indeed, as noted in the docs for the train_dataset and eval_dataset arguments of the Trainer:
If it is an datasets.Dataset, columns not accepted by the model.forward() method are automatically removed.
A simple one-liner to restore the missing columns is the following:
dataset.set_format(type=dataset.format["type"], columns=list(dataset.features.keys()))
To understand why this works, we can peek inside the relevant Trainer code
??Trainer._remove_unused_columns
and see that we're effectively undoing the final dataset.set_format() operation.
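The gist of that method (a simplified sketch based on the idea above, not the exact transformers source; keep_forward_columns is a hypothetical helper name) is to inspect the signature of model.forward and restrict the dataset to the matching columns, plus the label columns, via set_format:

import inspect

def keep_forward_columns(model, dataset):
    # Collect the argument names that model.forward accepts, plus the label columns
    signature = inspect.signature(model.forward)
    accepted = set(signature.parameters.keys()) | {"label", "label_ids"}
    # Hide every column the model can't consume by narrowing the visible columns
    columns = [name for name in dataset.column_names if name in accepted]
    dataset.set_format(type=dataset.format["type"], columns=columns)

The restore one-liner simply repeats the same set_format call with the full list of feature names, which is why it brings the hidden columns back. (Alternatively, TrainingArguments has a remove_unused_columns flag that disables this behaviour altogether.)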
To see this in action, let's grab 1,000 examples from the CoLA dataset:
from datasets import load_dataset
cola = load_dataset('glue', 'cola', split='train[:1000]')
cola
Here we can see that our dataset has three Dataset.features: sentence, label, and idx. By inspecting the Dataset.format attribute
cola.format
we also see that the type is None.
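As a quick aside (my own check, not from the original post), these feature names are exactly what the restore one-liner feeds back into set_format:

# The full list of column names - this is what the one-liner passes as columns=
print(list(cola.features.keys()))   # ['sentence', 'label', 'idx']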
Next, let's load a pretrained model and its corresponding tokenizer:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
num_labels = 2
model_name = 'distilbert-base-uncased'
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = (AutoModelForSequenceClassification
         .from_pretrained(model_name, num_labels=num_labels)
         .to(device))
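Since the Trainer will keep only the columns whose names match the arguments of model.forward, it's handy to peek at that signature up front (a small aside of my own, not part of the original walkthrough):

import inspect
# Argument names accepted by the model's forward pass - these are the columns the Trainer keeps
print(list(inspect.signature(model.forward).parameters.keys()))  # e.g. ['input_ids', 'attention_mask', ..., 'labels', ...]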
Before fine-tuning the model, we need to tokenize and encode the dataset, so let's do that with a simple Dataset.map operation:
def tokenize_and_encode(batch):
    return tokenizer(batch['sentence'], truncation=True)
cola_enc = cola.map(tokenize_and_encode, batched=True)
cola_enc
Note that the encoding process has added two new Dataset.features to our dataset: attention_mask and input_ids.
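If you want to check this programmatically (my own quick sanity check, not part of the original walkthrough), a diff of the column names before and after encoding does the trick:

# Columns introduced by the tokenizer
print(set(cola_enc.column_names) - set(cola.column_names))  # {'input_ids', 'attention_mask'}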
Since we don't care about evaluation, let's create a minimal trainer and fine-tune the model for one epoch:
from transformers import TrainingArguments, Trainer
batch_size = 16
logging_steps = len(cola_enc) // batch_size
training_args = TrainingArguments(
    output_dir="results",
    num_train_epochs=1,
    per_device_train_batch_size=batch_size,
    disable_tqdm=False,
    logging_steps=logging_steps)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=cola_enc,
    tokenizer=tokenizer)
trainer.train();
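After training, the effect described at the top kicks in: the Trainer has narrowed the dataset's visible columns to the ones model.forward accepts, and the one-liner from the start of the post brings the rest back. A rough sketch of checking and undoing this (my own follow-up, with expected values indicated as comments rather than copied output):

# Only the columns accepted by model.forward (plus the label) remain visible
print(cola_enc.format["columns"])   # e.g. ['input_ids', 'attention_mask', 'label']
# Restore every column with the one-liner from above
cola_enc.set_format(type=cola_enc.format["type"], columns=list(cola_enc.features.keys()))
print(cola_enc.format["columns"])   # sentence, label, idx, input_ids, attention_mask are all back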