Recovering columns hidden by the 🤗 Trainer
Lately, I've been using the transformers Trainer together with the datasets library, and I was a bit mystified by the disappearance of some columns in the training and validation sets after fine-tuning. It wasn't until I saw Sylvain Gugger's tutorial on question answering that I realised this is by design! Indeed, as noted in the docs for the train_dataset and eval_dataset arguments of the Trainer:
If it is a datasets.Dataset, columns not accepted by the model.forward() method are automatically removed.
A simple one-liner to restore the missing columns is the following:
dataset.set_format(type=dataset.format["type"], columns=list(dataset.features.keys()))
To understand why this works, we can peek inside the relevant Trainer code
??Trainer._remove_unused_columns
and see that we're effectively undoing the final dataset.set_format() operation.
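To get a feel for what happens under the hood, here is a simplified sketch of that column-removal logic. This is not the actual implementation (it skips the logging and the args.remove_unused_columns check, and the helper name is my own), but it captures the idea:
import inspect

def _remove_unused_columns_sketch(model, dataset):
    # Collect the argument names that model.forward() accepts
    signature = inspect.signature(model.forward)
    signature_columns = list(signature.parameters.keys())
    # The Trainer also keeps the label columns
    signature_columns += ["label", "label_ids"]
    # Keep only the columns that actually exist in the dataset
    columns = [k for k in signature_columns if k in dataset.column_names]
    # Hide everything else by re-setting the dataset's format
    dataset.set_format(type=dataset.format["type"], columns=columns)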
To see this in action, let's grab 1,000 examples from the CoLA dataset:
from datasets import load_dataset
cola = load_dataset('glue', 'cola', split='train[:1000]')
cola
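In a notebook, the output should look roughly like the following (the exact repr can vary across datasets versions):
Dataset({
    features: ['sentence', 'label', 'idx'],
    num_rows: 1000
})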
Here we can see that our slice of the train split has three Dataset.features: sentence, label, and idx. By inspecting the Dataset.format attribute
cola.format
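which returns a dictionary along these lines (the exact fields may differ slightly between versions):
{'type': None,
 'format_kwargs': {},
 'columns': ['sentence', 'label', 'idx'],
 'output_all_columns': False}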
we also see that the type is None. Next, let's load a pretrained model and its corresponding tokenizer:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
num_labels = 2
model_name = 'distilbert-base-uncased'
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = (AutoModelForSequenceClassification
         .from_pretrained(model_name, num_labels=num_labels)
         .to(device))
Before fine-tuning the model, we need to tokenize and encode the dataset, so let's do that with a simple Dataset.map operation:
def tokenize_and_encode(batch):
    return tokenizer(batch['sentence'], truncation=True)
cola_enc = cola.map(tokenize_and_encode, batched=True)
cola_enc
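The encoded dataset should now look something like this (the column order may differ):
Dataset({
    features: ['sentence', 'label', 'idx', 'input_ids', 'attention_mask'],
    num_rows: 1000
})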
Note that the encoding process has added two new Dataset.features to our dataset: attention_mask and input_ids. Since we don't care about evaluation, let's create a minimal trainer and fine-tune the model for one epoch:
from transformers import TrainingArguments, Trainer
batch_size = 16
logging_steps = len(cola_enc) // batch_size
training_args = TrainingArguments(
    output_dir="results",
    num_train_epochs=1,
    per_device_train_batch_size=batch_size,
    disable_tqdm=False,
    logging_steps=logging_steps)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=cola_enc,
    tokenizer=tokenizer)
trainer.train();
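Finally, we can confirm that the Trainer has hidden the columns that model.forward() doesn't accept, and that our one-liner brings them back (the exact keys and their order may vary):
# Indexing the encoded dataset now only returns the model inputs,
# e.g. dict_keys(['attention_mask', 'input_ids', 'label'])
cola_enc[0].keys()
# Restore the hidden columns with the one-liner from the start of the post
cola_enc.set_format(type=cola_enc.format["type"],
                    columns=list(cola_enc.features.keys()))
# Now 'sentence' and 'idx' are visible again
cola_enc[0].keys()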