Hi @kashif,
We met at ICLR in Vienna and chatted about generative evaluation in the DPO Trainer.
I promised to send you the patch I used in my work to make DPO support generative evaluation.
Here is my patched version of the trainer:
https://github.com/prompteus/calc-x/blob/master/gadgets/dpo_trainer.py
Here is the version of the trainer I was editing to make it work:
https://github.com/huggingface/trl/blob/v0.7.9/trl/trainer/dpo_trainer.py
The diff between those two files shows the minimal changes needed to support generative eval. Specifically:
- inheriting from `Seq2SeqTrainer` instead of `Trainer` (and swapping `TrainingArguments` -> `Seq2SeqTrainingArguments`)
- adding `input_ids` to the data collator
- changing `prediction_step`
With generative eval, the validation dataset typically contains only prompts and reference (gold) answers, but the DPO trainer expects validation examples as triples (prompt, chosen, rejected), so I just set rejected to an empty string. Handling this directly in the DPO trainer would probably be better, but I didn't do that.
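A minimal sketch of that workaround, assuming the validation data starts out as prompt/answer pairs and that the gold answer goes into the chosen field. The helper name `to_dpo_eval_records` and the `answer` key are made up for illustration and are not part of TRL or of the patched trainer:

```python
from typing import Dict, List


def to_dpo_eval_records(examples: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Turn prompt/answer pairs into (prompt, chosen, rejected) triples.

    Assumption: the reference answer is used as `chosen`, and `rejected`
    is only a placeholder (empty string) so the expected columns exist.
    """
    return [
        {"prompt": ex["prompt"], "chosen": ex["answer"], "rejected": ""}
        for ex in examples
    ]


# Hypothetical validation examples with only prompts and gold answers.
eval_pairs = [
    {"prompt": "What is 2 + 3?", "answer": "5"},
    {"prompt": "What is 7 * 6?", "answer": "42"},
]

eval_records = to_dpo_eval_records(eval_pairs)
```

The resulting list of dicts could then be wrapped in a dataset object (for example with `datasets.Dataset.from_list(eval_records)`) and passed as the evaluation dataset. Any preference metrics computed against the empty rejected string are not meaningful; only the generated outputs should be compared with the gold answers.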
Feel free to ask for any clarifications.
I am not sending a PR since it would require merging my changes with the past 4 months of upstream changes (and updating other trainers as well), and I'm currently preparing for my degree finals and have little spare time.