Comprehensive Paper Index Enhancement with 9 New Algorithm Implementations #3990

Merged
behroozazarkhalili merged 17 commits into huggingface:main from behroozazarkhalili:update-paper-index on Sep 4, 2025
Conversation

@behroozazarkhalili

Collaborator

Summary

This PR expands the TRL paper index (docs/source/paper_index.md) with implementation guides for 9 additional preference optimization and alignment algorithms.

Added Algorithms

Direct Preference Optimization Variants

  • IPO (Identity Preference Optimization) - General theoretical paradigm for learning from human preferences
  • SLiC-HF (Sequence Likelihood Calibration) - Simpler alternative to RLHF using hinge loss
  • EXO (Efficient Exact Optimization) - Guaranteed and efficient alignment method
  • rDPO (Robust DPO) - Handles noisy feedback with provable robustness guarantees
  • APO (Anchored Preference Optimization) - Two variants (APO-zero, APO-down) for controlled alignment
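
To make the differences between some of these variants concrete, here is a minimal sketch of the per-pair loss for standard DPO, IPO, SLiC-HF (hinge), and rDPO, written from the formulas in the respective papers. It is plain Python rather than TRL's actual implementation; the function names and the `h` argument (the policy-versus-reference log-ratio margin between chosen and rejected completions) are assumptions made for illustration.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def preference_loss(h: float, beta: float = 0.1, loss_type: str = "sigmoid",
                    label_smoothing: float = 0.0) -> float:
    """Loss for a single (chosen, rejected) pair.

    h is the log-ratio margin, taken as given here:
    (log pi(chosen) - log pi(rejected)) - (log ref(chosen) - log ref(rejected)).
    """
    if loss_type == "sigmoid":  # standard DPO
        return -log_sigmoid(beta * h)
    if loss_type == "ipo":  # IPO: squared distance of the margin to 1 / (2 * beta)
        return (h - 1.0 / (2.0 * beta)) ** 2
    if loss_type == "hinge":  # SLiC-HF: hinge loss on the scaled margin
        return max(0.0, 1.0 - beta * h)
    if loss_type == "robust":  # rDPO: de-biased loss for a label flip rate eps < 0.5
        eps = label_smoothing
        return (-(1.0 - eps) * log_sigmoid(beta * h)
                + eps * log_sigmoid(-beta * h)) / (1.0 - 2.0 * eps)
    raise ValueError(f"unknown loss_type: {loss_type}")
```

With a zero margin the DPO loss equals log 2, the IPO loss vanishes exactly when the margin reaches 1 / (2 * beta), and the rDPO loss reduces to standard DPO when the assumed flip rate is 0.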

Advanced Optimization Methods

  • NCA (Noise Contrastive Alignment) - Uses explicit rewards with contrastive estimation
  • BCO (Binary Classifier Optimization) - Leverages binary feedback signals for alignment
  • SPPO (Self-Play Preference Optimization) - Targets the Nash equilibrium policy through iterative self-play
  • DiscoPOP (Discovered Preference Optimization) - LLM-discovered algorithm blending multiple losses

Implementation Details

Each algorithm includes:

  • ✅ Direct links to original research papers
  • ✅ Production-ready configuration examples with DPOConfig or RLOOConfig
  • ✅ Detailed parameter settings from published papers
  • ✅ Section references for reproducibility
  • ✅ Hyperparameter values validated against paper appendices

Configuration Examples

Each entry provides a copy-paste-ready configuration, for example:

# Example: Self-Play Preference Optimization
from trl import DPOConfig

training_args = DPOConfig(
    loss_type="sppo_hard",
    per_device_train_batch_size=64,
    learning_rate=5e-7,
)
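
The same pattern extends to the other DPO-family entries by switching `loss_type`. The sketch below collects the mapping in one place. The `loss_type` strings reflect my reading of TRL's DPO loss options and should be checked against the installed version; `LOSS_TYPES` and `dpo_kwargs` are hypothetical names, not part of TRL.

```python
# Paper name -> DPOConfig loss_type string (assumed; verify against your TRL version).
LOSS_TYPES = {
    "IPO": "ipo",
    "SLiC-HF": "hinge",
    "EXO": "exo_pair",
    "rDPO": "robust",
    "APO-zero": "apo_zero",
    "APO-down": "apo_down",
    "NCA": "nca_pair",
    "BCO": "bco_pair",
    "SPPO": "sppo_hard",
    "DiscoPOP": "discopop",
}


def dpo_kwargs(algorithm: str, **overrides) -> dict:
    """Hypothetical helper: build keyword arguments for DPOConfig."""
    if algorithm not in LOSS_TYPES:
        raise KeyError(f"no loss_type known for {algorithm!r}")
    kwargs = {"loss_type": LOSS_TYPES[algorithm]}
    kwargs.update(overrides)  # e.g. batch size and learning rate from the paper
    return kwargs


kwargs = dpo_kwargs("SPPO", per_device_train_batch_size=64, learning_rate=5e-7)
# With TRL installed, the dictionary can be unpacked into the config:
#   from trl import DPOConfig
#   training_args = DPOConfig(**kwargs)
```

Keeping the mapping as plain data makes it easy to diff against the paper index when new loss types are added.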

Documentation Quality

  • Consistent formatting with existing paper index structure
  • Clear algorithm descriptions explaining key innovations
  • Proper mathematical notation where applicable
  • Direct citations to published versions with PDF links

Impact

This enhancement makes TRL a more comprehensive resource for researchers and practitioners working with preference-based language model alignment, providing easy access to cutting-edge algorithms with validated configurations.

Testing

All configuration examples have been checked against the original papers' specifications and for compatibility with TRL's API.

behroozazarkhalili and others added 12 commits August 22, 2025 06:19
- Added DAPO (An Open-Source LLM Reinforcement Learning System at Scale) section
- Includes proper paper reference and implementation details
- Added training configuration parameters from DAPO paper section 4.1
- Added Dr. GRPO configuration example with training parameters
- Includes paper reference and implementation details from training section
- Added parameters: loss_type, batch_size, num_generations, prompt/completion lengths, beta
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
- IPO (Identity Preference Optimization)
- SLiC-HF (Sequence Likelihood Calibration with Human Feedback)
- EXO (Efficient Exact Optimization)
- NCA (Noise Contrastive Alignment)
- rDPO (Robust Direct Preference Optimization)
- BCO (Binary Classifier Optimization)
- SPPO (Self-Play Preference Optimization)
- DiscoPOP (Discovered Preference Optimization)
- APO (Anchored Preference Optimization) with APO-zero and APO-down variants
- Replace double backslash LaTeX notation with standard markdown math syntax
- Correct typo: 'lenght' to 'length' in sequence normalization explanation
- Preserve original variable names (y_{i,t}) from paper specification
- Improve mathematical formula readability in markdown rendering
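
The DAPO and Dr. GRPO commits above both concern how per-token losses are normalized by sequence length. As a rough illustration of why that choice matters, the following sketch compares three aggregation schemes on a toy batch. The mode names (`sequence_mean`, `token_mean`, `constant`) are hypothetical labels chosen for this example, not TRL option names, and the code is a simplification rather than the library's implementation.

```python
def aggregate(per_token_loss, mode="sequence_mean", max_completion_length=None):
    """Reduce per-token losses to a scalar.

    per_token_loss: list of lists, one inner list per completion (no padding).
    """
    n_seq = len(per_token_loss)
    total = sum(sum(seq) for seq in per_token_loss)
    if mode == "sequence_mean":
        # Average inside each sequence first, then across sequences
        # (GRPO-style): tokens in long completions get a smaller weight.
        return sum(sum(seq) / len(seq) for seq in per_token_loss) / n_seq
    if mode == "token_mean":
        # Average over all tokens in the batch (DAPO-style token-level loss).
        return total / sum(len(seq) for seq in per_token_loss)
    if mode == "constant":
        # Divide by a fixed budget instead of the observed lengths
        # (Dr. GRPO-style), which removes the length-dependent weighting.
        if max_completion_length is None:
            raise ValueError("max_completion_length is required for 'constant'")
        return total / (n_seq * max_completion_length)
    raise ValueError(f"unknown mode: {mode}")


# Toy batch: a short completion with low loss and a long one with high loss.
losses = [[1.0, 1.0], [4.0, 4.0, 4.0, 4.0]]
seq_mean = aggregate(losses, "sequence_mean")
tok_mean = aggregate(losses, "token_mean")
const = aggregate(losses, "constant", max_completion_length=4)
```

On this batch the three schemes give different scalars because the long completion contributes more tokens; which behaviour is preferable depends on the paper being reproduced.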
@behroozazarkhalili behroozazarkhalili force-pushed the update-paper-index branch 2 times, most recently from e171b71 to bb261c9 Compare September 1, 2025 19:37
@HuggingFaceDocBuilderDev


The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

13 comment threads on docs/source/paper_index.md (12 outdated)

@qgallouedec qgallouedec left a comment

Member

Awesome! Just a few nits to fix.

- Convert malformed inline math syntax to proper format
- Use consistent \\(...\\) notation for inline math in text sections
- Keep $...$ notation in Python code comments unchanged
- Remove duplicated text and extra indentation
- Add missing paper section for AOT method
- Ensure proper LaTeX rendering in documentation
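
The notation rule in this commit (inline math as \\(...\\) in text, `$...$` left alone inside code) could be applied mechanically. The sketch below is one possible way to do it, assuming fenced code blocks are delimited by lines starting with three backticks; `convert_inline_math` is a hypothetical helper, not a script from this PR, and it would also rewrite prose that uses two dollar signs on one line for currency.

```python
import re

FENCE = "`" * 3  # code-fence marker, built this way to keep the example readable

# A single-dollar span on one line; the lookarounds skip $$...$$ display math.
INLINE_MATH = re.compile(r"(?<!\$)\$([^$\n]+?)\$(?!\$)")


def convert_inline_math(markdown: str) -> str:
    """Rewrite $...$ as \\\\(...\\\\) outside fenced code blocks."""
    out = []
    in_fence = False
    for line in markdown.split("\n"):
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence  # entering or leaving a code block
            out.append(line)
        elif in_fence:
            out.append(line)  # code (including comments) stays untouched
        else:
            # A callable replacement is inserted literally, so the two
            # backslashes before each parenthesis are preserved as-is.
            out.append(INLINE_MATH.sub(
                lambda m: "\\\\(" + m.group(1) + "\\\\)", line))
    return "\n".join(out)


source = ("loss is $x^2$ here\n" + FENCE + "python\n"
          "# keep $y$ in code\n" + FENCE)
converted = convert_inline_math(source)
```

Running a helper like this over the whole file before review would catch mixed notation early, though the output still needs a manual look for edge cases.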
2 comment threads on docs/source/paper_index.md (both outdated)
@behroozazarkhalili behroozazarkhalili enabled auto-merge (squash) September 4, 2025 01:34
@qgallouedec qgallouedec self-requested a review September 4, 2025 01:59
@behroozazarkhalili behroozazarkhalili merged commit 6799160 into huggingface:main Sep 4, 2025
1 check passed
@qgallouedec qgallouedec mentioned this pull request Oct 30, 2025
55 tasks

3 participants