Comprehensive Paper Index Enhancement with 9 New Algorithm Implementations by behroozazarkhalili · Pull Request #3990 · huggingface/trl

behroozazarkhalili · 2025-09-01T19:06:52Z

Summary

This PR expands the TRL paper index documentation with implementation guides for 9 additional preference optimization and alignment algorithms.

Added Algorithms

Direct Preference Optimization Variants

IPO (Identity Preference Optimization) - General theoretical paradigm for learning from human preferences
SLiC-HF (Sequence Likelihood Calibration with Human Feedback) - Simpler alternative to RLHF using hinge loss
EXO (Efficient Exact Optimization) - Guaranteed and efficient alignment method
rDPO (Robust DPO) - Handles noisy feedback with provable robustness guarantees
APO (Anchored Preference Optimization) - Two variants (APO-zero, APO-down) for controlled alignment
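As a rough illustration of how some of these variants differ, the sketch below writes the per-example losses for DPO (sigmoid), SLiC-HF (hinge), IPO, and rDPO in plain Python. It assumes the usual DPO-style margin (policy log-ratio minus reference log-ratio) and follows the formulas from the respective papers; the function names are hypothetical and this is not TRL's actual implementation.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def preference_margin(policy_chosen, policy_rejected, ref_chosen, ref_rejected):
    """Policy log-ratio minus reference log-ratio (inputs are sequence log-probs)."""
    return (policy_chosen - policy_rejected) - (ref_chosen - ref_rejected)


def sigmoid_loss(margin: float, beta: float) -> float:
    """Standard DPO loss: -log sigmoid(beta * margin)."""
    return -log_sigmoid(beta * margin)


def hinge_loss(margin: float, beta: float) -> float:
    """SLiC-HF style hinge loss on the scaled margin."""
    return max(0.0, 1.0 - beta * margin)


def ipo_loss(margin: float, beta: float) -> float:
    """IPO: squared distance between the margin and the target 1 / (2 * beta)."""
    return (margin - 1.0 / (2.0 * beta)) ** 2


def robust_loss(margin: float, beta: float, label_smoothing: float) -> float:
    """rDPO: DPO loss debiased for an assumed label-flip rate (must be < 0.5)."""
    eps = label_smoothing
    numerator = -(1.0 - eps) * log_sigmoid(beta * margin) + eps * log_sigmoid(-beta * margin)
    return numerator / (1.0 - 2.0 * eps)
```

With `label_smoothing=0.0` the robust loss reduces to the standard sigmoid loss, which is a quick sanity check on the formula.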

Advanced Optimization Methods

NCA (Noise Contrastive Alignment) - Uses explicit rewards with contrastive estimation
BCO (Binary Classifier Optimization) - Leverages binary feedback signals for alignment
SPPO (Self-Play Preference Optimization) - Approximates the Nash equilibrium policy through iterative self-play
DiscoPOP (Discovered Preference Optimization) - LLM-discovered algorithm blending multiple losses
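For two of these methods, the per-example objective is short enough to sketch directly. The snippet below is a plain-Python approximation based on the formulas in the SPPO and DiscoPOP papers, assuming hard preference labels for SPPO and the log-ratio modulated blend of logistic and exponential losses for DiscoPOP. The function names and the default `tau=0.05` are assumptions for illustration, not TRL's code.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def sppo_hard_loss(chosen_logratio: float, rejected_logratio: float, beta: float) -> float:
    """SPPO with hard labels.

    Each argument is log(policy / reference) for one response. The chosen
    log-ratio is regressed toward +0.5 / beta and the rejected one toward
    -0.5 / beta.
    """
    target = 0.5 / beta
    return (chosen_logratio - target) ** 2 + (rejected_logratio + target) ** 2


def discopop_loss(margin: float, beta: float, tau: float = 0.05) -> float:
    """DiscoPOP: blend of logistic and exponential losses.

    `margin` is the policy log-ratio minus the reference log-ratio. The mixing
    weight is sigmoid(beta * margin / tau).
    """
    logits = beta * margin
    mix = sigmoid(logits / tau)
    logistic = math.log1p(math.exp(-logits))  # equals -log(sigmoid(logits))
    exponential = math.exp(-logits)
    return (1.0 - mix) * logistic + mix * exponential
```

At a margin of zero the DiscoPOP mixing weight is 0.5, so the loss is the plain average of the two component losses.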

Implementation Details

Each algorithm includes:

✅ Direct links to original research papers
✅ Production-ready configuration examples with DPOConfig or RLOOConfig
✅ Detailed parameter settings from published papers
✅ Section references for reproducibility
✅ Hyperparameter values validated against paper appendices

Configuration Examples

All implementations provide complete, copy-paste ready configurations:

# Example: Self-Play Preference Optimization
from trl import DPOConfig

training_args = DPOConfig(
    loss_type="sppo_hard",
    per_device_train_batch_size=64,
    learning_rate=5e-7,
)
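The same pattern should carry over to the other DPO-based algorithms listed above by changing `loss_type`. The sketch below collects the identifiers in a lookup table; the helper `dpo_config_kwargs` is hypothetical, and the identifier strings are assumed to match the `loss_type` values accepted by the installed TRL version, so check them against the `DPOConfig` documentation before relying on them.

```python
# Assumed mapping from the algorithms in this PR to DPOConfig `loss_type` values.
LOSS_TYPE_BY_ALGORITHM = {
    "IPO": "ipo",
    "SLiC-HF": "hinge",
    "EXO": "exo_pair",
    "rDPO": "robust",
    "APO-zero": "apo_zero",
    "APO-down": "apo_down",
    "NCA": "nca_pair",
    "BCO": "bco_pair",
    "SPPO": "sppo_hard",
    "DiscoPOP": "discopop",
}


def dpo_config_kwargs(algorithm: str, **overrides) -> dict:
    """Build keyword arguments for DPOConfig for one of the listed algorithms.

    Hypothetical helper: it only fills in `loss_type` and passes any other
    keyword arguments through unchanged.
    """
    if algorithm not in LOSS_TYPE_BY_ALGORITHM:
        raise KeyError(f"Unknown algorithm: {algorithm!r}")
    kwargs = {"loss_type": LOSS_TYPE_BY_ALGORITHM[algorithm]}
    kwargs.update(overrides)
    return kwargs
```

With TRL installed, the result could be passed along as `DPOConfig(**dpo_config_kwargs("SPPO", learning_rate=5e-7))`.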

Documentation Quality

Consistent formatting with existing paper index structure
Clear algorithm descriptions explaining key innovations
Proper mathematical notation where applicable
Direct citations to published versions with PDF links

Impact

This enhancement makes TRL a more comprehensive resource for researchers and practitioners working with preference-based language model alignment, providing easy access to cutting-edge algorithms with validated configurations.

Testing

All configuration examples have been validated against the original paper specifications and TRL's API compatibility.

- IPO (Identity Preference Optimization)
- SLiC-HF (Sequence Likelihood Calibration with Human Feedback)
- EXO (Efficient Exact Optimization)
- NCA (Noise Contrastive Alignment)
- rDPO (Robust Direct Preference Optimization)
- BCO (Binary Classifier Optimization)
- SPPO (Self-Play Preference Optimization)
- DiscoPOP (Discovered Preference Optimization)
- APO (Anchored Preference Optimization) with APO-zero and APO-down variants

- Replace double backslash LaTeX notation with standard markdown math syntax
- Correct typo: 'lenght' to 'length' in sequence normalization explanation
- Preserve original variable names (y_{i,t}) from paper specification
- Improve mathematical formula readability in markdown rendering

HuggingFaceDocBuilderDev · 2025-09-02T18:59:45Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

qgallouedec

Awesome! just a few nits to fix

- Convert malformed inline math syntax to proper format
- Use consistent \$...\$ notation for inline math in text sections
- Keep $...$ notation in Python code comments unchanged
- Remove duplicated text and extra indentation
- Add missing paper section for AOT method
- Ensure proper LaTeX rendering in documentation

behroozazarkhalili and others added 12 commits August 22, 2025 06:19

Update paper_index section with DAPO entry

7c4665a

- Added DAPO (An Open-Source LLM Reinforcement Learning System at Scale) section
- Includes proper paper reference and implementation details
- Added training configuration parameters from DAPO paper section 4.1

Add Dr. GRPO section to paper index

98efe1a

- Added Dr. GRPO configuration example with training parameters
- Includes paper reference and implementation details from training section
- Added parameters: loss_type, batch_size, num_generations, prompt/completion lengths, beta
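For context on what the Dr. GRPO loss type changes, the following plain-Python sketch contrasts it with GRPO, assuming the description in the Dr. GRPO paper: advantages are centered but not divided by the group standard deviation, and the token loss is normalized by a constant rather than by each completion's length. All function names are hypothetical, and details such as the epsilon and the standard-deviation estimator are assumptions.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-4):
    """GRPO-style advantages: center by the group mean and scale by the group std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def dr_grpo_advantages(rewards):
    """Dr. GRPO-style advantages: center by the group mean only (no std scaling)."""
    mu = mean(rewards)
    return [r - mu for r in rewards]


def length_normalizer(completion_length, max_completion_length, dr_grpo):
    """Divisor applied to a completion's summed token loss.

    GRPO divides by the completion's own length; Dr. GRPO uses a constant
    (here the maximum completion length) to avoid a length bias.
    """
    return max_completion_length if dr_grpo else completion_length
```

For a group of rewards such as `[1.0, 0.0, 0.0, 1.0]`, the Dr. GRPO advantages are simply the rewards minus their mean.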

reorder

7a11d81

style

1f99446

style

12aca2a

Merge branch 'main' of https://github.com/huggingface/trl into update…

b2d31f5

…-paper-index

Add Soft Overlong Punishment configuration example to DAPO section

22ab9d1
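As background for this commit, the Soft Overlong Punishment from the DAPO paper can be sketched as a piecewise length penalty: zero up to a soft threshold, decreasing linearly over a cache interval, and -1 beyond the maximum length. The function and parameter names below are hypothetical and chosen for illustration; they are not taken from the PR or from TRL.

```python
def soft_overlong_punishment(length: int, max_length: int, cache_length: int) -> float:
    """Length penalty following the piecewise definition in the DAPO paper.

    - 0 while the completion fits within `max_length - cache_length`
    - linear ramp from 0 down to -1 across the cache interval
    - -1 once the completion exceeds `max_length`
    """
    if cache_length <= 0 or cache_length > max_length:
        raise ValueError("cache_length must be in (0, max_length]")
    threshold = max_length - cache_length
    if length <= threshold:
        return 0.0
    if length <= max_length:
        return (threshold - length) / cache_length
    return -1.0
```

The penalty is added to the task reward, so completions inside the cache interval are discouraged gradually rather than cut off abruptly.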

Add DPO (Direct Preference Optimization) section to paper index

bc337d0

Update docs/source/paper_index.md

0a5583a

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

Update docs/source/paper_index.md

d9b40e8

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

Merge upstream changes from origin/main

08c155e

behroozazarkhalili force-pushed the update-paper-index branch from e5de215 to 1f139f2 Compare September 1, 2025 19:30

behroozazarkhalili force-pushed the update-paper-index branch 2 times, most recently from e171b71 to bb261c9 Compare September 1, 2025 19:37

qgallouedec reviewed Sep 2, 2025