feat: Variational Sequence-Level Soft Policy Optimization (VESPO) #5196

@casinca

Feature request

Hi,

I'd like to propose adding VESPO (https://huggingface.co/papers/2602.10693) as a supported loss_type in GRPOTrainer.

It's quite distinct from GRPO and closer to SAPO, with a smooth trust region.

  • Instead of differentiating through the IS ratio, the gradient relies directly on the current logprobs (like vanilla policy gradient). The IS ratio is passed through a nonlinear function and detached, so it acts as a pure scaling factor.

  • Advantages calculation is the same as GRPO.

  • Compatible with TIS/MIS.
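
To make the first two bullets concrete, here is a rough sketch in plain Python, not the paper's implementation. `placeholder_kernel` is a hypothetical stand-in for the VESPO nonlinear function (I'm not reproducing the real one here), the sequence-level ratio is assumed from the method's name, and the token sum, population std, and `eps` value are assumptions too. The only point is that the weight is treated as a constant, so the gradient flows through the current logprobs alone.

```python
import math
import statistics


def grpo_advantages(rewards, eps=1e-4):
    # Group-relative advantages as in GRPO: center and scale rewards within one group.
    # Population std and this eps value are assumptions made for the sketch.
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def sequence_ratio(new_logps, old_logps):
    # Sequence-level IS ratio: exp of the summed per-token log-prob differences.
    return math.exp(sum(n - o for n, o in zip(new_logps, old_logps)))


def placeholder_kernel(ratio):
    # HYPOTHETICAL smooth reweighting, NOT the function from the VESPO paper.
    # It equals 1 at ratio == 1 and decays smoothly as the ratio drifts away.
    return math.exp(-(math.log(ratio) ** 2))


def surrogate_loss(new_logps, old_logps, advantage):
    # The weight plays the role of a detached scaling factor: it is computed
    # from the ratio but treated as a constant when differentiating.
    weight = placeholder_kernel(sequence_ratio(new_logps, old_logps))
    return -weight * advantage * sum(new_logps)


def grad_wrt_logps(new_logps, old_logps, advantage):
    # With the weight held constant, d(loss)/d(logp_t) = -weight * advantage
    # for every token, i.e. a vanilla policy-gradient term rescaled by the weight.
    weight = placeholder_kernel(sequence_ratio(new_logps, old_logps))
    return [-weight * advantage for _ in new_logps]
```

In an autodiff framework the same effect would come from detaching the weight before multiplying it with the logprobs. When the policy is on-policy (ratio of 1) the placeholder weight is 1 and the update reduces to plain policy gradient with GRPO advantages; as the ratio drifts, the weight shrinks smoothly instead of being hard-clipped.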

The main highlight, imo, is stable performance under heavy policy staleness and training-inference mismatch, especially with MoEs.

First, I wanted to know whether there is interest in having this variant in TRL before opening a PR.

Motivation

I thought you might be interested in this one @qgallouedec because they're getting quite nice results with sparse architectures specifically, even without routing replay, see some handpicked pics below:

Image

Your contribution

yes
