[Feature] Env Mask Support in custom rollout func #5673

@EternalNova

Description

I'm trying to train a multi-turn code-repair RL environment with a custom rollout_func, similar to the setup in the RLEF paper (https://arxiv.org/abs/2410.02089).
The problem is that after some time the training collapses: the reward drops, the model can't answer properly after the first turn, etc.
After some research and fixes (setting gspo + beta > 0 helped a little, but not much), the problem seems to be with the env mask, which distinguishes model tokens from environment/feedback/tool-result tokens for masking. I'm fairly sure it is not being applied in Unsloth's patched GRPOTrainer.
In the pure TRL version it seems to be applied properly: https://github.com/huggingface/trl/blob/main/trl/trainer/grpo_trainer.py#L2444
However, I can't see it being applied in the patched version: https://github.com/unslothai/unsloth/blob/main/unsloth/models/rl_replacements.py#L1500
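
For illustration, here is a minimal sketch of what applying an env mask to the loss means. This is not the TRL or Unsloth code: the function name `masked_token_loss` is hypothetical, plain Python lists stand in for tensors, and the convention that 1 marks a model-generated token and 0 marks an environment token is an assumption.

```python
def masked_token_loss(per_token_loss, completion_mask, env_mask=None):
    """Average per-token losses over the tokens that should be trained on.

    All arguments are lists of rows (one row per sequence) of equal shape.
    completion_mask: 1 for real completion tokens, 0 for padding.
    env_mask (assumed convention): 1 for model tokens, 0 for env/tool tokens.
    """
    if env_mask is None:
        # Without an env mask, environment tokens are trained on as if
        # the model had generated them.
        loss_mask = completion_mask
    else:
        # Element-wise product: a token counts only if it is a real
        # completion token AND was produced by the model.
        loss_mask = [
            [c * e for c, e in zip(c_row, e_row)]
            for c_row, e_row in zip(completion_mask, env_mask)
        ]

    total = sum(
        loss * m
        for l_row, m_row in zip(per_token_loss, loss_mask)
        for loss, m in zip(l_row, m_row)
    )
    count = sum(m for row in loss_mask for m in row)
    return total / max(count, 1)  # guard against an all-zero mask


# One sequence: model token, tool-result token, model token, padding.
per_token_loss = [[1.0, 5.0, 3.0, 4.0]]
completion_mask = [[1, 1, 1, 0]]
env_mask = [[1, 0, 1, 1]]

print(masked_token_loss(per_token_loss, completion_mask))            # 3.0
print(masked_token_loss(per_token_loss, completion_mask, env_mask))  # 2.0
```

In this toy example the tool-result token has the largest loss, so leaving the env mask out shifts the averaged loss from 2.0 to 3.0; in a multi-turn rollout the environment tokens can make up a large share of the sequence.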

Metadata

Assignees

No one assigned

    Labels

    feature request (Feature request pending on roadmap)
