Feature request
Hi,
I'd like to propose adding VESPO (https://huggingface.co/papers/2602.10693) as a supported `loss_type` in `GRPOTrainer`.
It's quite distinct from GRPO and closer to SAPO, with a smooth trust region.
- Instead of differentiating through the IS ratio, the gradient relies directly on the current logprobs (like vanilla policy gradient). The IS ratio is instead passed through a nonlinear function and detached, so it acts as a pure scaling factor.
- Advantage calculation is the same as in GRPO.
- Compatible with TIS/MIS.
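To make the first point concrete, here is a rough sketch in plain Python of what "detached scaling factor" means for the gradient. Everything named here is an assumption for illustration: `placeholder_gate` is a hypothetical stand-in and not the nonlinear function from the paper, `grad_coefficients` is not a TRL function, and the ratio is taken per token only for simplicity.

```python
import math


def placeholder_gate(ratio: float) -> float:
    """Hypothetical smooth gate: equals 1 at ratio == 1 and decays on both sides.

    This is NOT the function from the VESPO paper, only a stand-in.
    """
    return ratio * math.exp(1.0 - ratio)


def grad_coefficients(logp_new, logp_old, advantages):
    """Return, per token, the factor multiplying grad(logp_new) in the loss gradient.

    Assumed surrogate: loss_t = -stopgrad(gate(ratio_t)) * A_t * logp_new_t,
    so the gate is a constant weight and d(loss_t)/d(logp_new_t) = -gate * A_t.
    """
    coeffs = []
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)  # importance-sampling ratio
        weight = placeholder_gate(ratio)   # treated as a constant ("detached")
        coeffs.append(-weight * adv)
    return coeffs
```

With this placeholder, an on-policy token (ratio of 1) gets the plain policy-gradient coefficient `-A`, while tokens whose ratio drifts far from 1 are smoothly down-weighted rather than hard-clipped.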
The main highlight, imo, is stable performance under heavy policy staleness and training-inference mismatch, especially with MoEs.
Before opening a PR, I wanted to know whether there is interest in having this variant in TRL.
Motivation
I thought you might be interested in this one @qgallouedec because they're getting quite nice results with sparse architectures specifically, even without routing replay, see some handpicked pics below:
Your contribution
yes