GRPOTrainer example works with trl but generates "noise" with unsloth #1844

Hi,
I'm running a simple example of GRPOTrainer in plain trl and it runs fine (using the very same conda env I use for unsloth):

[grpo_example2.txt](https://github.com/user-attachments/files/19007584/grpo_example2.txt)

After MANY iterations the text becomes garbage, but I think that is reasonable given the reward function used.


I tried to port this to unsloth. It runs, but the model generates "noise" after the very first fine-tuning iteration:

[prova_grpo.txt](https://github.com/user-attachments/files/19007652/prova_grpo.txt)

The first completion is fine:

reward_function completions:  I got blamed, and the girl is in the same classes, for what i didn't do.

The following ones are "noise":

reward_function completions:  back.Peeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee
reward_function completions: .Pee.Pee est.Peeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee
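
To spot such completions in a long log without reading each one, a check like the one below could be called from inside the reward function. This is only a minimal sketch: `longest_char_run`, `looks_degenerate`, and the threshold of 10 are hypothetical names and values chosen for illustration, not part of trl or unsloth, and the `bad` string only approximates the output above.

```python
import re


def longest_char_run(text: str) -> int:
    """Return the length of the longest run of a single repeated character."""
    longest = 0
    # (.)\1* matches one character followed by any number of repeats of it.
    for match in re.finditer(r"(.)\1*", text, flags=re.DOTALL):
        longest = max(longest, len(match.group(0)))
    return longest


def looks_degenerate(completion: str, max_run: int = 10) -> bool:
    """Flag a completion whose longest character run exceeds max_run (assumed threshold)."""
    return longest_char_run(completion) > max_run


good = " I got blamed, and the girl is in the same classes, for what i didn't do."
bad = " back.P" + "e" * 50  # approximation of the noisy completion

print(looks_degenerate(good))
print(looks_degenerate(bad))
```

Logging the training step at which this first returns true would show whether the collapse really happens right after the first update.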

Environment details and full log:

[log.txt](https://github.com/user-attachments/files/19007734/log.txt)

This might be related to https://github.com/unslothai/unsloth/issues/1836, but I'm already using 3.11.11.

See also https://github.com/unslothai/unsloth/issues/1672: I tried 2025.2.12, but the behavior is still the same.

I also tried unsloth/llama-3-8b-bnb-4bit, with the same results.


What am I doing wrong?

Thanks
