GRPOTrainer example works with trl but generates "noise" with unsloth #1844

@nottrz

Hi,
I'm running a simple GRPOTrainer example in plain trl and it runs fine (in the very same conda env I use for unsloth):

grpo_example2.txt

After many iterations the text becomes garbage, but I think that is reasonable given the reward function used.

I tried to port this to unsloth. It runs, but the model generates "noise" after the very first fine-tuning iteration:

prova_grpo.txt

First completion is fine:

reward_function completions: I got blamed, and the girl is in the same classes, for what i didn't do.

The following ones are "noise":

reward_function completions: back.Peeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee
reward_function completions: .Pee.Pee est.Peeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee
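
For anyone trying to reproduce this, one quick way to flag completions like the ones above is to measure the longest run of a single repeated character. This is only a minimal sketch in plain Python: the helper names `longest_char_run` and `looks_degenerate` and the threshold of 10 are hypothetical choices for illustration, not part of trl or unsloth, and the second sample string is a shortened stand-in for the logged output.

```python
from itertools import groupby


def longest_char_run(text: str) -> int:
    """Return the length of the longest run of one repeated character."""
    return max((sum(1 for _ in group) for _, group in groupby(text)), default=0)


def looks_degenerate(completion: str, max_run: int = 10) -> bool:
    """Flag a completion whose longest single-character run exceeds max_run.

    max_run=10 is an arbitrary illustrative threshold, not a tuned value.
    """
    return longest_char_run(completion) > max_run


completions = [
    "I got blamed, and the girl is in the same classes, for what i didn't do.",
    "back.P" + "e" * 60,  # stand-in for the repeated-"e" output in the log
]
flags = [looks_degenerate(c) for c in completions]
```

Calling a check like this from inside the reward function (next to the existing print of the completions) would show at which step the output first collapses, without changing the reward itself.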

Environment details and full log:

log.txt

This might be related to #1836, but I'm already using 3.11.11.

Also, regarding #1672: I tried 2025.2.12, but it's still the same.

I also tried unsloth/llama-3-8b-bnb-4bit with the same results.

What am I doing wrong?

Thanks
