Llama 3 family of models does not seem to work with RewardTrainer

### Reproduction

While training [Llama3.1-8B](https://huggingface.co/meta-llama/Meta-Llama-3-8B) with the `RewardTrainer` on my own problem, I could not get it to converge no matter what I tried. Simply switching Llama3.1 for Qwen2.5 (I tried sizes from 0.5B to 7B) made the model converge without issue. I kept all hyperparameters the same and used the same data (unfortunately I cannot share it).

To reproduce, I ran the official [`reward_modeling.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/reward_modeling.py) example script with Qwen2-0.5B vs. Llama3.2-1B (both instruct):

```bash
# Qwen training job
accelerate launch \
    --config_file="./conf/accelerate_configs/multi_gpu.yaml" \
    --num_processes 8 \
    examples/scripts/reward_modeling.py \
    --model_name_or_path Qwen/Qwen2-0.5B-Instruct \
    --dataset_name trl-lib/ultrafeedback_binarized \
    --output_dir output/Qwen2-0.5B-Reward \
    --per_device_train_batch_size 8 \
    --num_train_epochs 1 \
    --gradient_checkpointing True \
    --learning_rate 1.0e-5 \
    --logging_steps 25 \
    --eval_strategy steps \
    --eval_steps 50 \
    --max_length 2048

# Llama training job
accelerate launch \
    --config_file="./conf/accelerate_configs/multi_gpu.yaml" \
    --num_processes 8 \
    examples/scripts/reward_modeling.py \
    --model_name_or_path meta-llama/Llama-3.2-1B-Instruct \
    --dataset_name trl-lib/ultrafeedback_binarized \
    --output_dir output/Llama-3.2-1B-Reward \
    --per_device_train_batch_size 8 \
    --num_train_epochs 1 \
    --gradient_checkpointing True \
    --learning_rate 1.0e-5 \
    --logging_steps 25 \
    --eval_strategy steps \
    --eval_steps 50 \
    --max_length 2048
```

Both were trained on 1 node of 8xA6000s with the following accelerate config:

```yaml
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: all
machine_rank: 0
main_training_function: main
mixed_precision: 'bf16'
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
```

> [!NOTE]
> I did have to add the line `tokenizer.pad_token = tokenizer.eos_token` in `reward_modeling.py` right after the tokenizer and model initialization, similar to how it's done in [`sft.py`](https://github.com/huggingface/trl/blob/bbdd6db17c49db813695d0a8bc0da7bf6b1bb88e/trl/scripts/sft.py#L87), because Llama does not have a pad token. (Auxiliary point: maybe the `reward_modeling.py` script should do this when there is no pad token? Happy to open a PR if so.)
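
For reference, the pad-token fallback described in the note could look roughly like the sketch below. The helper name `ensure_pad_token` is hypothetical and not part of TRL. The sketch also copies the pad id into `model.config`, on the assumption that the model is a sequence-classification head, which reads `config.pad_token_id` to locate the last non-padding token.

```python
def ensure_pad_token(tokenizer, model):
    """Fall back to EOS as the pad token when the tokenizer defines none.

    Hypothetical helper (not part of TRL). Returns True if the fallback
    was applied to the tokenizer, False if it already had a pad token.
    """
    applied = False
    if tokenizer.pad_token is None:
        # Same fallback as the one-line fix described in the note above.
        tokenizer.pad_token = tokenizer.eos_token
        applied = True
    # Assumption: the model is a sequence-classification head, which uses
    # config.pad_token_id to find the last non-padding token in each row.
    if getattr(model.config, "pad_token_id", None) is None:
        model.config.pad_token_id = tokenizer.pad_token_id
    return applied
```

It would be called once, right after the tokenizer and model are loaded and before the trainer is built. Note that with EOS reused as padding, any EOS token that already appears inside a sequence becomes indistinguishable from padding.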

Sure enough, I see the same convergence issue with Llama in this setup:

<img width="332" alt="Image" src="https://github.com/user-attachments/assets/e4c739d6-a026-4ff3-a46c-6242c3d9c0ff" />
<img width="732" alt="Image" src="https://github.com/user-attachments/assets/517e7802-4cd6-43de-bf77-45c845fa9f4d" />

I don't know if this is a known issue, but I wanted to flag it in case it is (and someone knows the fix), or in case it isn't and I am doing something dumb that someone would be kind enough to point out!


### System Info

- Platform: Linux-5.15.0-83-generic-x86_64-with-glibc2.35
- Python version: 3.10.12
- PyTorch version: 2.5.1
- CUDA device(s): not available
- Transformers version: 4.48.2
- Accelerate version: 1.0.1
- Accelerate config: not found
- Datasets version: 3.2.0
- HF Hub version: 0.25.2
- TRL version: 0.14.0
- bitsandbytes version: 0.45.1
- DeepSpeed version: not installed
- Diffusers version: not installed
- Liger-Kernel version: not installed
- LLM-Blender version: not installed
- OpenAI version: 1.60.2
- PEFT version: 0.14.0

### Checklist

- [x] I have checked that my issue isn't already filed (see [open issues](https://github.com/huggingface/trl/issues?q=is%3Aissue))
- [x] I have included my system information
- [x] Any code provided is minimal, complete, and reproducible ([more on MREs](https://docs.github.com/en/get-started/writing-on-github/working-with-advanced-formatting/creating-and-highlighting-code-blocks))
- [x] Any code provided is properly formatted in code blocks, (no screenshot, [more on code blocks](https://docs.github.com/en/get-started/writing-on-github/working-with-advanced-formatting/creating-and-highlighting-code-blocks))
- [x] Any traceback provided is complete
