Llama 3 family of models does not seem to work with RewardTrainer #2758

@JohnGiorgi

Reproduction

While training Llama3.1-8B with the RewardTrainer on my own problem, I could not get it to converge no matter what I tried. Simply swapping Llama3.1 for Qwen2.5 (I tried sizes from 0.5B to 7B) made the model converge without issue. I kept all hyperparameters the same and used the same data (unfortunately I cannot share it).

To reproduce, I ran the official reward_modeling.py example script with Qwen2-0.5B vs. Llama3.2-1B (both instruct):

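# Qwen training job
# (script path is an assumption: examples/scripts/reward_modeling.py in the TRL repo)
accelerate launch \
    --config_file="./conf/accelerate_configs/multi_gpu.yaml" \
    --num_processes 8 \
    examples/scripts/reward_modeling.py \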
    --model_name_or_path Qwen/Qwen2-0.5B-Instruct \
    --dataset_name trl-lib/ultrafeedback_binarized \
    --output_dir output/Qwen2-0.5B-Reward \
    --per_device_train_batch_size 8 \
    --num_train_epochs 1 \
    --gradient_checkpointing True \
    --learning_rate 1.0e-5 \
    --logging_steps 25 \
    --eval_strategy steps \
    --eval_steps 50 \
    --max_length 2048

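# Llama training job
# (script path is an assumption: examples/scripts/reward_modeling.py in the TRL repo)
accelerate launch \
    --config_file="./conf/accelerate_configs/multi_gpu.yaml" \
    --num_processes 8 \
    examples/scripts/reward_modeling.py \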
    --model_name_or_path meta-llama/Llama-3.2-1B-Instruct \
    --dataset_name trl-lib/ultrafeedback_binarized \
    --output_dir output/Llama-3.2-1B-Reward \
    --per_device_train_batch_size 8 \
    --num_train_epochs 1 \
    --gradient_checkpointing True \
    --learning_rate 1.0e-5 \
    --logging_steps 25 \
    --eval_strategy steps \
    --eval_steps 50 \
    --max_length 2048

Both were trained on 1 node of 8xA6000s with the following accelerate config:

compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: all
machine_rank: 0
main_training_function: main
mixed_precision: 'bf16'
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

Note

I did have to add the line tokenizer.pad_token = tokenizer.eos_token in reward_modeling.py right after the tokenizer and model initialization, similar to how it's done in sft.py, because Llama does not have a pad token. (Auxiliary point: maybe the reward_modeling.py script should do this when there is no pad token? Happy to open a PR if so.)
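A minimal sketch of the guard suggested in the note, written against any objects that expose the same attributes as a Transformers tokenizer and model (pad_token, eos_token, pad_token_id, config). The helper name ensure_pad_token is hypothetical and not part of TRL. Mirroring the id onto model.config is an extra step beyond the one-line change described in the note; it is included on the assumption that the sequence-classification head reads config.pad_token_id to find the last non-padding token of each sequence.

```python
def ensure_pad_token(tokenizer, model):
    """Give the tokenizer a pad token if it has none, then mirror its id on the model config.

    Hypothetical helper for illustration; not part of TRL or Transformers.
    """
    if tokenizer.pad_token is None:
        # Same fallback as described in the note: reuse EOS as the padding token.
        tokenizer.pad_token = tokenizer.eos_token
    # Assumption: the classification head uses config.pad_token_id to locate the
    # last non-padding token, so keep it in sync with the tokenizer.
    model.config.pad_token_id = tokenizer.pad_token_id
    return tokenizer, model
```

In reward_modeling.py this would be called once, right after the tokenizer and model are created. Whether it has any bearing on the convergence problem reported here is untested.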

Sure enough, I see the same issues with convergence:

[Two images attached in the original issue]
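For context on what "converging" means here, the following is a minimal sketch of the pairwise objective a reward model is trained on: the mean of -log(sigmoid(r_chosen - r_rejected)) over preference pairs. The function names are hypothetical, and the accuracy definition is a simplified stand-in rather than the trainer's own metric code. A model that cannot separate chosen from rejected responses has a margin near zero, so its loss sits near ln 2 ≈ 0.693 and its accuracy near 0.5.

```python
import math


def pairwise_reward_loss(chosen_rewards, rejected_rewards):
    """Mean of -log(sigmoid(r_chosen - r_rejected)) over all pairs."""
    losses = []
    for r_chosen, r_rejected in zip(chosen_rewards, rejected_rewards):
        margin = r_chosen - r_rejected
        # -log(sigmoid(m)) == log(1 + exp(-m)), computed in a numerically stable form.
        losses.append(max(-margin, 0.0) + math.log1p(math.exp(-abs(margin))))
    return sum(losses) / len(losses)


def pairwise_accuracy(chosen_rewards, rejected_rewards):
    """Fraction of pairs where the chosen response gets the strictly higher reward."""
    correct = sum(c > r for c, r in zip(chosen_rewards, rejected_rewards))
    return correct / len(chosen_rewards)
```

Comparing the logged loss against ln 2 is a quick way to tell a run that is learning slowly from one that is not learning at all.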

I don't know if this is a known issue, but I wanted to flag it in case it is (and someone knows the fix), or in case it isn't and I am doing something dumb that someone would be kind enough to point out!

System Info

  • Platform: Linux-5.15.0-83-generic-x86_64-with-glibc2.35
  • Python version: 3.10.12
  • PyTorch version: 2.5.1
  • CUDA device(s): not available
  • Transformers version: 4.48.2
  • Accelerate version: 1.0.1
  • Accelerate config: not found
  • Datasets version: 3.2.0
  • HF Hub version: 0.25.2
  • TRL version: 0.14.0
  • bitsandbytes version: 0.45.1
  • DeepSpeed version: not installed
  • Diffusers version: not installed
  • Liger-Kernel version: not installed
  • LLM-Blender version: not installed
  • OpenAI version: 1.60.2
  • PEFT version: 0.14.0

Checklist

  • I have checked that my issue isn't already filed (see open issues)
  • I have included my system information
  • Any code provided is minimal, complete, and reproducible (more on MREs)
  • Any code provided is properly formatted in code blocks, (no screenshot, more on code blocks)
  • Any traceback provided is complete
