Feature request
I would like to request support for padding-free reward model training in the Hugging Face TRL library, specifically in RewardTrainer.
TRL's SFT trainer already supports padding-free training via position_ids, which allows efficient fine-tuning on variable-length sequences. RewardTrainer, however, still relies on padded inputs.
It would be very helpful if RewardTrainer supported a similar mechanism, accepting user-supplied position_ids and attention_mask, so that training incurs no padding overhead.
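To make the request concrete, here is a minimal sketch of what a padding-free batch looks like. It is an illustration only, not TRL's actual collator: `pack_without_padding` is a hypothetical helper, and the token ids are made up. The sequences are concatenated into a single row, and position_ids restart at 0 at the start of each sequence so the boundaries can be recovered.

```python
from itertools import chain


def pack_without_padding(sequences):
    """Hypothetical helper: flatten token-id lists into one row.

    position_ids restart at 0 for every sequence, which marks the
    sequence boundaries without needing any pad tokens.
    """
    input_ids = list(chain.from_iterable(sequences))
    position_ids = list(chain.from_iterable(range(len(s)) for s in sequences))
    return {"input_ids": input_ids, "position_ids": position_ids}


# Three made-up tokenized prompt + response sequences of different lengths.
batch = [[101, 7, 8, 9], [101, 5], [101, 3, 4]]
packed = pack_without_padding(batch)

# Tokens processed when padding every sequence to the longest one.
padded_tokens = len(batch) * max(len(s) for s in batch)
# Tokens processed when packing without padding.
packed_tokens = len(packed["input_ids"])

print(packed["position_ids"])
print(padded_tokens, packed_tokens)
```

In this toy batch the padded layout would process 12 tokens and the packed layout 9; the gap grows with the spread in sequence lengths.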
Motivation
Reward modeling datasets typically consist of prompt + response pairs of varying lengths. Padding these to a uniform length wastes memory and compute on pad tokens, especially with long sequences and large batch sizes.
Bringing the SFT trainer's padding-free capability to RewardTrainer would streamline fine-tuning pipelines and reduce this waste. This matters most in large-scale setups and when training models that use FlashAttention.
Your contribution
Yes, I’d be happy to help by contributing a PR.
However, I believe this feature may require changes not only to TRL but also to Hugging Face Transformers, specifically in how forward() is implemented for SequenceClassification models. Many of these currently return only pooled_logits, which makes it difficult to apply the position-aware token-level masking or per-sequence selection that padding-free reward modeling needs.
I’m happy to coordinate with the Transformers team or align with ongoing architectural changes to ensure compatibility.
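As a rough sketch of the per-sequence selection mentioned above: if the model exposed one score per token for the packed row, the reward for each sequence could be read at its last token, which sits just before position_ids resets to 0. The helper names and the scores below are hypothetical, and this is an assumption about how the selection could work, not how Transformers or TRL currently implement it.

```python
def last_token_indices(position_ids):
    """Hypothetical helper: index of the final token of each packed sequence.

    A sequence ends right before position_ids resets to 0,
    or at the end of the packed row.
    """
    n = len(position_ids)
    return [i for i in range(n) if i + 1 == n or position_ids[i + 1] == 0]


def sequence_rewards(token_scores, position_ids):
    """Pick one score per packed sequence, taken at its last token."""
    return [token_scores[i] for i in last_token_indices(position_ids)]


# Packed row holding three sequences of lengths 4, 2 and 3.
position_ids = [0, 1, 2, 3, 0, 1, 0, 1, 2]
# Made-up per-token scores, standing in for un-pooled model output.
token_scores = [0.1, 0.2, 0.3, 0.9, 0.0, -0.4, 0.5, 0.6, 0.7]

rewards = sequence_rewards(token_scores, position_ids)
print(rewards)  # [0.9, -0.4, 0.7]
```

With only a pooled output for the single packed row, this per-sequence lookup is not possible, which is why access to token-level scores would be needed.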