A very good first issue IMO!
See #4829 (comment)
We should update the eval dataset to pick one start_position per example (or the most frequent one).
Optionally, use the huggingface/nlp library to get the eval dataset, and hook it into the Trainer.
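
For whoever picks this up, here is a rough sketch of the "most frequent start_position" idea. It assumes each eval example carries a list of candidate answer starts in a SQuAD-style `answers` field. The helper name `pick_start_position` and the sample record are made up for illustration and are not part of the existing code or the Trainer API.

```python
from collections import Counter


def pick_start_position(answer_starts):
    """Return the most frequent start position; ties go to the one seen first.

    Hypothetical helper, not part of the existing codebase.
    """
    if not answer_starts:
        return None  # no annotated answer for this example
    return Counter(answer_starts).most_common(1)[0][0]


# Made-up record in an assumed SQuAD-style layout (several annotators per question).
example = {
    "answers": {
        "text": ["Denver Broncos", "Denver Broncos", "Broncos"],
        "answer_start": [177, 177, 184],
    }
}

start = pick_start_position(example["answers"]["answer_start"])
```

Taking the first start_position instead would just be `answer_starts[0]`. The counter version only matters when annotators disagree.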
Also referenced in #6997 (comment)