Hi,
I'm running a simple example of GRPOTrainer in plain trl and it runs fine (using the very same conda env I use for unsloth):
grpo_example2.txt
After MANY iterations the text becomes garbage, but I think that is reasonable given the reward function used.
I tried to port this to unsloth. It runs, but the model generates "noise" after the very first fine-tuning iteration:
prova_grpo.txt
The first completion is fine:
reward_function completions: I got blamed, and the girl is in the same classes, for what i didn't do.
The following ones are "noise":
reward_function completions: back.Peeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee
reward_function completions: .Pee.Pee est.Peeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee
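To make the collapse easier to spot in the logs, a check like the one below could be called from inside the reward function. This is only a minimal sketch: `looks_degenerate` and its `max_run` threshold are hypothetical names I made up for illustration, not part of trl or unsloth, and it only catches the repeated-character pattern shown above.

```python
import re

# Hypothetical helper for illustration; not part of trl or unsloth.
def looks_degenerate(text: str, max_run: int = 10) -> bool:
    """Return True if any single character repeats more than max_run times in a row."""
    # (.) captures one character; \1{max_run,} requires at least max_run more copies of it.
    pattern = r"(.)\1{" + str(max_run) + r",}"
    return re.search(pattern, text) is not None

good = "I got blamed, and the girl is in the same classes, for what i didn't do."
bad = "back.P" + "e" * 60

print(looks_degenerate(good), looks_degenerate(bad))
```

Logging this flag per completion would show at which step the outputs first degrade.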
Environment details and full log:
log.txt
This might be related to #1836, but I'm already using 3.11.11.
Also, regarding #1672: I tried 2025.2.12, but it's still the same.
I also tried unsloth/llama-3-8b-bnb-4bit, with the same results.
What am I doing wrong?
Thanks