[Feature] Env Mask Support in custom rollout func #5673

I’m trying to train a multi-turn code-repair RL environment with a custom rollout_func, similar to the setup in the RLEF paper (https://arxiv.org/abs/2410.02089).
The problem is that after some time training collapses: the reward drops and the model can't answer properly after the 1st turn.
After some research and fixes (switching to GSPO and setting beta > 0 helped a bit, but not much), the problem seems to be with the env mask, which separates model-generated tokens from environment/feedback/tool-result tokens so that the latter can be masked. I'm pretty sure it is not being applied in Unsloth's patched GRPOTrainer?
In the pure TRL version it seems to be applied properly: https://github.com/huggingface/trl/blob/main/trl/trainer/grpo_trainer.py#L2444
I can't see it being applied in the patched version: https://github.com/unslothai/unsloth/blob/main/unsloth/models/rl_replacements.py#L1500
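
To make the expected behavior concrete, here is a minimal pure-Python sketch of what I'd expect an env mask to do in a multi-turn rollout. This is not TRL's or Unsloth's actual code: `build_env_mask`, `masked_mean_loss`, the token ids, and the loss values are all made up for illustration, and I'm assuming the mask is simply multiplied into the completion mask before averaging.

```python
from typing import List, Tuple


def build_env_mask(
    segments: List[Tuple[List[int], bool]],
) -> Tuple[List[int], List[int]]:
    """Flatten (token_ids, is_model_generated) segments into ids and a 0/1 mask.

    Hypothetical helper: 1 marks a model token, 0 marks an environment token.
    """
    ids: List[int] = []
    mask: List[int] = []
    for token_ids, from_model in segments:
        ids.extend(token_ids)
        mask.extend([1 if from_model else 0] * len(token_ids))
    return ids, mask


def masked_mean_loss(
    per_token_loss: List[float],
    completion_mask: List[int],
    env_mask: List[int],
) -> float:
    """Average the loss over tokens that are both non-padding and model-generated."""
    weights = [c * e for c, e in zip(completion_mask, env_mask)]
    total = sum(loss * w for loss, w in zip(per_token_loss, weights))
    return total / max(sum(weights), 1)  # guard against an all-masked sequence


# One rollout: model patch, environment feedback, revised model patch.
segments = [
    ([11, 12, 13], True),   # turn 1: tokens sampled from the model
    ([90, 91], False),      # test feedback injected by the environment
    ([21, 22], True),       # turn 2: tokens sampled from the model
]
ids, env_mask = build_env_mask(segments)
completion_mask = [1] * len(ids)  # no padding in this toy example

# Made-up per-token losses; the environment tokens carry a large loss
# because the model never generated them.
per_token_loss = [0.5, 0.5, 0.5, 4.0, 4.0, 0.5, 0.5]

with_mask = masked_mean_loss(per_token_loss, completion_mask, env_mask)
without_mask = masked_mean_loss(per_token_loss, completion_mask, [1] * len(ids))
print(env_mask, with_mask, without_mask)
```

In this toy case the environment tokens dominate the unmasked average, which is the kind of thing I suspect is happening in my runs: if the mask is dropped, the policy is being trained on feedback text it never produced.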
