Classes and functions for question answering tasks


Most of the classes and functions defined in this module are adapted from HuggingFace's question answering examples, most notably the run_qa.py script.

Compared to the results from HuggingFace's run_qa.py script, this implementation agrees to within 0.5% on the SQuAD v1 dataset:

| Implementation | Exact Match | F1    |
|----------------|-------------|-------|
| HuggingFace    | 81.22       | 88.52 |
| Ours           | 80.82       | 88.22 |

Dataset preprocessing

prepare_train_features[source]

prepare_train_features(examples:Union[str, List[str], List[List[str]]], tokenizer:PreTrainedTokenizer, pad_on_right:bool, max_length:int=384, doc_stride:int=128)

Tokenize and encode training examples in the SQuAD format

prepare_validation_features[source]

prepare_validation_features(examples, tokenizer, pad_on_right, max_length, doc_stride)

Tokenize and encode validation examples in the SQuAD format
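
Both functions operate on batches of SQuAD-format examples, so they can be applied with datasets.Dataset.map in batched mode. The following is a minimal sketch based on the signatures above, assuming pad_on_right is derived from the tokenizer's padding side:

from datasets import load_dataset
from transformers import AutoTokenizer

squad_ds = load_dataset('squad')
tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')
pad_on_right = tokenizer.padding_side == 'right'

# Tokenize in batches; the raw columns are dropped in favour of the encoded features
train_features = squad_ds['train'].map(
    lambda batch: prepare_train_features(batch, tokenizer, pad_on_right),
    batched=True,
    remove_columns=squad_ds['train'].column_names,
)
eval_features = squad_ds['validation'].map(
    lambda batch: prepare_validation_features(batch, tokenizer, pad_on_right, 384, 128),
    batched=True,
    remove_columns=squad_ds['validation'].column_names,
)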

convert_examples_to_features[source]

convert_examples_to_features(dataset, tokenizer, num_train_examples, num_eval_examples, max_length=384, doc_stride=128, seed=42)

Tokenize and encode the training and validation examples in the SQuAD format

from datasets import load_dataset
from transformers import AutoTokenizer

num_train_examples = 800
num_eval_examples = 200
squad_ds = load_dataset('squad')
tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')
train_ds, eval_ds, eval_examples = convert_examples_to_features(squad_ds, tokenizer, num_train_examples, num_eval_examples)
assert eval_examples.num_rows == num_eval_examples
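
Note that long contexts are split into overlapping chunks (controlled by doc_stride), so train_ds and eval_ds may contain more features than the number of raw examples requested, while eval_examples keeps the raw validation examples needed for post-processing. A quick check (the exact counts depend on the tokenizer and data):

# Tokenized feature counts vs. raw validation examples
print(train_ds.num_rows, eval_ds.num_rows, eval_examples.num_rows)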

Metrics

squad_metrics[source]

squad_metrics(p:EvalPrediction)

Compute the Exact Match and F1-score metrics on SQuAD
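
As a rough illustration of the expected contract, assuming squad_metrics follows the run_qa.py recipe where predictions and references are passed as SQuAD-style dictionaries after post-processing the start/end logits (the exact format is an assumption):

from transformers import EvalPrediction

# Hypothetical inputs: one predicted answer and its reference, keyed by example id
preds = [{'id': '5733be284776f41900661182', 'prediction_text': 'Saint Bernadette Soubirous'}]
refs = [{'id': '5733be284776f41900661182',
         'answers': {'answer_start': [515], 'text': ['Saint Bernadette Soubirous']}}]
metrics = squad_metrics(EvalPrediction(predictions=preds, label_ids=refs))
print(metrics)  # expected to contain the Exact Match and F1 values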

Trainer

class QuestionAnsweringTrainingArguments[source]

QuestionAnsweringTrainingArguments(*args, max_length=384, doc_stride=128, version_2_with_negative=False, null_score_diff_threshold=0.0, n_best_size=20, max_answer_length=30, **kwargs) :: TrainingArguments

TrainingArguments is the subset of the arguments we use in our example scripts which relate to the training loop itself.

Using :class:`~transformers.HfArgumentParser` we can turn this class into `argparse <https://docs.python.org/3/library/argparse.html#module-argparse>`__ arguments that can be specified on the command line.

Parameters:
output_dir (:obj:`str`):
    The output directory where the model predictions and checkpoints will be written.
overwrite_output_dir (:obj:`bool`, `optional`, defaults to :obj:`False`):
    If :obj:`True`, overwrite the content of the output directory. Use this to continue training if
    :obj:`output_dir` points to a checkpoint directory.
do_train (:obj:`bool`, `optional`, defaults to :obj:`False`):
    Whether to run training or not. This argument is not directly used by :class:`~transformers.Trainer`, it's
    intended to be used by your training/evaluation scripts instead. See the `example scripts
    <https://github.com/huggingface/transformers/tree/master/examples>`__ for more details.
do_eval (:obj:`bool`, `optional`):
    Whether to run evaluation on the validation set or not. Will be set to :obj:`True` if
    :obj:`evaluation_strategy` is different from :obj:`"no"`. This argument is not directly used by
    :class:`~transformers.Trainer`, it's intended to be used by your training/evaluation scripts instead. See
    the `example scripts <https://github.com/huggingface/transformers/tree/master/examples>`__ for more details.
do_predict (:obj:`bool`, `optional`, defaults to :obj:`False`):
    Whether to run predictions on the test set or not. This argument is not directly used by
    :class:`~transformers.Trainer`, it's intended to be used by your training/evaluation scripts instead. See
    the `example scripts <https://github.com/huggingface/transformers/tree/master/examples>`__ for more details.
evaluation_strategy (:obj:`str` or :class:`~transformers.trainer_utils.EvaluationStrategy`, `optional`, defaults to :obj:`"no"`):
    The evaluation strategy to adopt during training. Possible values are:

        * :obj:`"no"`: No evaluation is done during training.
        * :obj:`"steps"`: Evaluation is done (and logged) every :obj:`eval_steps`.
        * :obj:`"epoch"`: Evaluation is done at the end of each epoch.

prediction_loss_only (:obj:`bool`, `optional`, defaults to `False`):
    When performing evaluation and generating predictions, only returns the loss.
per_device_train_batch_size (:obj:`int`, `optional`, defaults to 8):
    The batch size per GPU/TPU core/CPU for training.
per_device_eval_batch_size (:obj:`int`, `optional`, defaults to 8):
    The batch size per GPU/TPU core/CPU for evaluation.
gradient_accumulation_steps (:obj:`int`, `optional`, defaults to 1):
    Number of update steps to accumulate the gradients for, before performing a backward/update pass.

    .. warning::

        When using gradient accumulation, one step is counted as one step with backward pass. Therefore,
        logging, evaluation, save will be conducted every ``gradient_accumulation_steps * xxx_step`` training
        examples.
eval_accumulation_steps (:obj:`int`, `optional`):
    Number of predictions steps to accumulate the output tensors for, before moving the results to the CPU. If
    left unset, the whole predictions are accumulated on GPU/TPU before being moved to the CPU (faster but
    requires more memory).
learning_rate (:obj:`float`, `optional`, defaults to 5e-5):
    The initial learning rate for Adam.
weight_decay (:obj:`float`, `optional`, defaults to 0):
    The weight decay to apply (if not zero).
adam_beta1 (:obj:`float`, `optional`, defaults to 0.9):
    The beta1 hyperparameter for the Adam optimizer.
adam_beta2 (:obj:`float`, `optional`, defaults to 0.999):
    The beta2 hyperparameter for the Adam optimizer.
adam_epsilon (:obj:`float`, `optional`, defaults to 1e-8):
    The epsilon hyperparameter for the Adam optimizer.
max_grad_norm (:obj:`float`, `optional`, defaults to 1.0):
    Maximum gradient norm (for gradient clipping).
num_train_epochs(:obj:`float`, `optional`, defaults to 3.0):
    Total number of training epochs to perform (if not an integer, the decimal part is treated as the fraction
    of the last epoch to perform before stopping training).
max_steps (:obj:`int`, `optional`, defaults to -1):
    If set to a positive number, the total number of training steps to perform. Overrides
    :obj:`num_train_epochs`.
warmup_steps (:obj:`int`, `optional`, defaults to 0):
    Number of steps used for a linear warmup from 0 to :obj:`learning_rate`.
logging_dir (:obj:`str`, `optional`):
    `TensorBoard <https://www.tensorflow.org/tensorboard>`__ log directory. Will default to
    `runs/**CURRENT_DATETIME_HOSTNAME**`.
logging_first_step (:obj:`bool`, `optional`, defaults to :obj:`False`):
    Whether to log and evaluate the first :obj:`global_step` or not.
logging_steps (:obj:`int`, `optional`, defaults to 500):
    Number of update steps between two logs.
save_steps (:obj:`int`, `optional`, defaults to 500):
    Number of update steps between two checkpoint saves.
save_total_limit (:obj:`int`, `optional`):
    If a value is passed, will limit the total amount of checkpoints. Deletes the older checkpoints in
    :obj:`output_dir`.
no_cuda (:obj:`bool`, `optional`, defaults to :obj:`False`):
    Whether to avoid using CUDA even when it is available.
seed (:obj:`int`, `optional`, defaults to 42):
    Random seed for initialization.
fp16 (:obj:`bool`, `optional`, defaults to :obj:`False`):
    Whether to use 16-bit (mixed) precision training (through NVIDIA Apex) instead of 32-bit training.
fp16_opt_level (:obj:`str`, `optional`, defaults to 'O1'):
    For :obj:`fp16` training, Apex AMP optimization level selected in ['O0', 'O1', 'O2', and 'O3']. See details
    on the `Apex documentation <https://nvidia.github.io/apex/amp.html>`__.
local_rank (:obj:`int`, `optional`, defaults to -1):
    Rank of the process during distributed training.
tpu_num_cores (:obj:`int`, `optional`):
    When training on TPU, the number of TPU cores (automatically passed by launcher script).
debug (:obj:`bool`, `optional`, defaults to :obj:`False`):
    When training on TPU, whether to print debug metrics or not.
dataloader_drop_last (:obj:`bool`, `optional`, defaults to :obj:`False`):
    Whether to drop the last incomplete batch (if the length of the dataset is not divisible by the batch size)
    or not.
eval_steps (:obj:`int`, `optional`):
    Number of update steps between two evaluations if :obj:`evaluation_strategy="steps"`. Will default to the
    same value as :obj:`logging_steps` if not set.
dataloader_num_workers (:obj:`int`, `optional`, defaults to 0):
    Number of subprocesses to use for data loading (PyTorch only). 0 means that the data will be loaded in the
    main process.
past_index (:obj:`int`, `optional`, defaults to -1):
    Some models like :doc:`TransformerXL <../model_doc/transformerxl>` or :doc:`XLNet <../model_doc/xlnet>` can
    make use of the past hidden states for their predictions. If this argument is set to a positive int, the
    ``Trainer`` will use the corresponding output (usually index 2) as the past state and feed it to the model
    at the next training step under the keyword argument ``mems``.
run_name (:obj:`str`, `optional`):
    A descriptor for the run. Typically used for `wandb <https://www.wandb.com/>`_ logging.
disable_tqdm (:obj:`bool`, `optional`):
    Whether or not to disable the tqdm progress bars and table of metrics produced by
    :class:`~transformers.notebook.NotebookTrainingTracker` in Jupyter Notebooks. Will default to :obj:`True`
    if the logging level is set to warn or lower (default), :obj:`False` otherwise.
remove_unused_columns (:obj:`bool`, `optional`, defaults to :obj:`True`):
    If using :obj:`datasets.Dataset` datasets, whether or not to automatically remove the columns unused by the
    model forward method.

    (Note that this behavior is not implemented for :class:`~transformers.TFTrainer` yet.)
label_names (:obj:`List[str]`, `optional`):
    The list of keys in your dictionary of inputs that correspond to the labels.

    Will eventually default to :obj:`["labels"]` except if the model used is one of the
    :obj:`XxxForQuestionAnswering` in which case it will default to :obj:`["start_positions",
    "end_positions"]`.
load_best_model_at_end (:obj:`bool`, `optional`, defaults to :obj:`False`):
    Whether or not to load the best model found during training at the end of training.

    .. note::

        When set to :obj:`True`, the parameters :obj:`save_steps` will be ignored and the model will be saved
        after each evaluation.
metric_for_best_model (:obj:`str`, `optional`):
    Use in conjunction with :obj:`load_best_model_at_end` to specify the metric to use to compare two different
    models. Must be the name of a metric returned by the evaluation with or without the prefix :obj:`"eval_"`.
    Will default to :obj:`"loss"` if unspecified and :obj:`load_best_model_at_end=True` (to use the evaluation
    loss).

    If you set this value, :obj:`greater_is_better` will default to :obj:`True`. Don't forget to set it to
    :obj:`False` if your metric is better when lower.
greater_is_better (:obj:`bool`, `optional`):
    Use in conjunction with :obj:`load_best_model_at_end` and :obj:`metric_for_best_model` to specify if better
    models should have a greater metric or not. Will default to:

    - :obj:`True` if :obj:`metric_for_best_model` is set to a value that isn't :obj:`"loss"` or
      :obj:`"eval_loss"`.
    - :obj:`False` if :obj:`metric_for_best_model` is not set, or set to :obj:`"loss"` or :obj:`"eval_loss"`.
model_parallel (:obj:`bool`, `optional`, defaults to :obj:`False`):
    If there is more than one device, whether to use model parallelism to distribute the model's modules across
    devices or not.
ignore_data_skip (:obj:`bool`, `optional`, defaults to :obj:`False`):
    When resuming training, whether or not to skip the epochs and batches to get the data loading at the same
    stage as in the previous training. If set to :obj:`True`, the training will begin faster (as that skipping
    step can take a long time) but will not yield the same results as the interrupted training would have.
fp16_backend (:obj:`str`, `optional`, defaults to :obj:`"auto"`):
    The backend to use for mixed precision training. Must be one of :obj:`"auto"`, :obj:`"amp"` or
    :obj:`"apex"`. :obj:`"auto"` will use AMP or APEX depending on the PyTorch version detected, while the
    other choices will force the requested backend.
sharded_ddp (:obj:`bool`, `optional`, defaults to :obj:`False`):
    Use Sharded DDP training from `FairScale <https://github.com/facebookresearch/fairscale>`__ (in distributed
    training only). This is an experimental feature.
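
For instance, the QA-specific arguments in the signature above can be combined with the usual TrainingArguments fields. This is an illustrative sketch; the QA-specific values simply restate the defaults from the signature:

args = QuestionAnsweringTrainingArguments(
    output_dir='checkpoints',
    evaluation_strategy='epoch',
    per_device_train_batch_size=12,
    num_train_epochs=2,
    # QA-specific post-processing options (defaults shown in the signature above)
    max_length=384,                  # maximum length of the tokenized question + context
    doc_stride=128,                  # overlap between chunks when a long context is split
    version_2_with_negative=False,   # set to True for SQuAD v2-style unanswerable questions
    null_score_diff_threshold=0.0,   # threshold for predicting the null answer (SQuAD v2)
    n_best_size=20,                  # number of candidate start/end pairs to consider
    max_answer_length=30,            # maximum length of a predicted answer span
)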

class QuestionAnsweringTrainer[source]

QuestionAnsweringTrainer(*args, eval_examples=None, **kwargs) :: Trainer

Trainer is a simple but feature-complete training and eval loop for PyTorch, optimized for 🤗 Transformers.

Args:
model (:class:`~transformers.PreTrainedModel` or :obj:`torch.nn.Module`, `optional`):
    The model to train, evaluate or use for predictions. If not provided, a ``model_init`` must be passed.

    .. note::

        :class:`~transformers.Trainer` is optimized to work with the :class:`~transformers.PreTrainedModel`
        provided by the library. You can still use your own models defined as :obj:`torch.nn.Module` as long as
        they work the same way as the 🤗 Transformers models.
args (:class:`~transformers.TrainingArguments`, `optional`):
    The arguments to tweak for training. Will default to a basic instance of
    :class:`~transformers.TrainingArguments` with the ``output_dir`` set to a directory named `tmp_trainer` in
    the current directory if not provided.
data_collator (:obj:`DataCollator`, `optional`):
    The function to use to form a batch from a list of elements of :obj:`train_dataset` or :obj:`eval_dataset`.
    Will default to :func:`~transformers.default_data_collator` if no ``tokenizer`` is provided, an instance of
    :func:`~transformers.DataCollatorWithPadding` otherwise.
train_dataset (:obj:`torch.utils.data.dataset.Dataset`, `optional`):
    The dataset to use for training. If it is an :obj:`datasets.Dataset`, columns not accepted by the
    ``model.forward()`` method are automatically removed.
eval_dataset (:obj:`torch.utils.data.dataset.Dataset`, `optional`):
     The dataset to use for evaluation. If it is an :obj:`datasets.Dataset`, columns not accepted by the
     ``model.forward()`` method are automatically removed.
tokenizer (:class:`PreTrainedTokenizerBase`, `optional`):
    The tokenizer used to preprocess the data. If provided, will be used to automatically pad the inputs to the
    maximum length when batching inputs, and it will be saved along the model to make it easier to rerun an
    interrupted training or reuse the fine-tuned model.
model_init (:obj:`Callable[[], PreTrainedModel]`, `optional`):
    A function that instantiates the model to be used. If provided, each call to
    :meth:`~transformers.Trainer.train` will start from a new instance of the model as given by this function.

    The function may have zero arguments, or a single one containing the optuna/Ray Tune trial object, to be
    able to choose different architectures according to hyperparameters (such as layer count, sizes of inner
    layers, dropout probabilities, etc.).
compute_metrics (:obj:`Callable[[EvalPrediction], Dict]`, `optional`):
    The function that will be used to compute metrics at evaluation. Must take a
    :class:`~transformers.EvalPrediction` and return a dictionary mapping strings to metric values.
callbacks (List of :obj:`~transformers.TrainerCallback`, `optional`):
    A list of callbacks to customize the training loop. Will add those to the list of default callbacks
    detailed in :doc:`here <callback>`.

    If you want to remove one of the default callbacks used, use the :meth:`Trainer.remove_callback` method.
optimizers (:obj:`Tuple[torch.optim.Optimizer, torch.optim.lr_scheduler.LambdaLR]`, `optional`): A tuple
    containing the optimizer and the scheduler to use. Will default to an instance of
    :class:`~transformers.AdamW` on your model and a scheduler given by
    :func:`~transformers.get_linear_schedule_with_warmup` controlled by :obj:`args`.

Usage

The following example shows how the classes and functions in this module can be combined to fine-tune on the SQuAD v1 dataset. The first thing we need to do is grab the dataset:

from datasets import load_dataset

squad = load_dataset('squad')
squad
DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 87599
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 10570
    })
})

For each example, the key information is contained in the context, question, and answers fields:

squad['train'][0]
{'answers': {'answer_start': [515], 'text': ['Saint Bernadette Soubirous']},
 'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.',
 'id': '5733be284776f41900661182',
 'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?',
 'title': 'University_of_Notre_Dame'}

Next we need to tokenize and encode these texts. The following code does the job:

from transformers import AutoTokenizer

model_checkpoint = 'bert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

num_train_examples = 1600
num_eval_examples = 320
train_ds, eval_ds, eval_examples = convert_examples_to_features(squad, tokenizer, num_train_examples, num_eval_examples)

The final step is to configure and instantiate the trainer using the same settings as those described in the transformers examples. We'll use the model_init argument to ensure that the model is initialised with the same random weights:

import torch
from transformers import AutoModelForQuestionAnswering

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Running on device: {device}")

def model_init():
    return AutoModelForQuestionAnswering.from_pretrained(model_checkpoint).to(device)
Running on device: cuda

Then we just need to specify the hyperparameters; since we pass the tokenizer to the trainer, a data collator that pads the inputs is created automatically:

batch_size = 12
learning_rate = 3e-5
num_train_epochs = 2
logging_steps = len(train_ds) // batch_size

args = QuestionAnsweringTrainingArguments(
    output_dir='checkpoints',
    evaluation_strategy='epoch',
    num_train_epochs=num_train_epochs,
    per_device_train_batch_size=batch_size,
    learning_rate=learning_rate,
    logging_steps=logging_steps,
)

trainer = QuestionAnsweringTrainer(
    args=args,
    model_init=model_init,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    eval_examples=eval_examples,
    tokenizer=tokenizer
)

and perform the fine-tuning:

trainer.train();
[14754/14754 2:58:54, Epoch 2/2]
| Epoch | Training Loss | Validation Loss | Exact Match | F1        |
|-------|---------------|-----------------|-------------|-----------|
| 1     | 1.266106      | No log          | 79.309366   | 86.817847 |
| 2     | 0.720876      | No log          | 80.823084   | 88.228499 |
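
Once training has finished, the metrics can be recomputed on the validation set at any time; assuming evaluate() follows the run_qa.py recipe of post-processing the start/end logits against eval_examples before scoring:

# Re-run evaluation on the validation set passed to the trainer
metrics = trainer.evaluate()
print(metrics)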