Comprehensive Paper Index Enhancement with 9 New Algorithm Implementations #3990
Merged
behroozazarkhalili merged 17 commits on Sep 4, 2025
- Added DAPO (An Open-Source LLM Reinforcement Learning System at Scale) section
- Includes proper paper reference and implementation details
- Added training configuration parameters from DAPO paper section 4.1
- Added Dr. GRPO configuration example with training parameters
- Includes paper reference and implementation details from training section
- Added parameters: loss_type, batch_size, num_generations, prompt/completion lengths, beta
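The parameters named in the commit message above could map onto configuration fields roughly as follows. This is a sketch under assumptions: the key names are taken to be `GRPOConfig` arguments in the TRL version this PR targets, the numeric values are illustrative placeholders (not the values from the paper or from the PR), and `build_config` is a hypothetical helper introduced only for this example.

```python
# Sketch only: numeric values are illustrative placeholders, NOT the paper's settings.
# Key names are assumed to match GRPOConfig arguments; verify against your TRL version.
DR_GRPO_KWARGS = {
    "loss_type": "dr_grpo",            # selects the Dr. GRPO loss variant
    "per_device_train_batch_size": 8,  # "batch_size" in the commit message
    "num_generations": 8,              # completions sampled per prompt
    "max_prompt_length": 1024,         # prompt length cap, in tokens
    "max_completion_length": 1024,     # completion length cap, in tokens
    "beta": 0.0,                       # KL coefficient; 0.0 disables the KL term
}


def build_config(config_cls, **overrides):
    """Call config_cls with the sketch values, letting overrides replace them.

    config_cls is any callable accepting keyword arguments, for example
    trl.GRPOConfig when TRL is installed (hypothetical helper).
    """
    return config_cls(**{**DR_GRPO_KWARGS, **overrides})


if __name__ == "__main__":
    # Using plain dict as the "config class" keeps the demo free of dependencies.
    print(build_config(dict, beta=0.04))
```

With TRL installed, passing `GRPOConfig` as `config_cls` together with an `output_dir` override should yield a config object, provided the installed version still accepts all of these arguments.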
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
- IPO (Identity Preference Optimization)
- SLiC-HF (Sequence Likelihood Calibration with Human Feedback)
- EXO (Efficient Exact Optimization)
- NCA (Noise Contrastive Alignment)
- rDPO (Robust Direct Preference Optimization)
- BCO (Binary Classifier Optimization)
- SPPO (Self-Play Preference Optimization)
- DiscoPOP (Discovered Preference Optimization)
- APO (Anchored Preference Optimization) with APO-zero and APO-down variants
Force-pushed from e5de215 to 1f139f2
- Replace double backslash LaTeX notation with standard markdown math syntax
- Correct typo: 'lenght' to 'length' in sequence normalization explanation
- Preserve original variable names (y_{i,t}) from paper specification
- Improve mathematical formula readability in markdown rendering
Force-pushed from e171b71 to bb261c9
qgallouedec reviewed Sep 2, 2025
qgallouedec (Member) requested changes Sep 2, 2025
Awesome! Just a few nits to fix.
- Convert malformed inline math syntax to proper format
- Use consistent \\(...\\) notation for inline math in text sections
- Keep $...$ notation in Python code comments unchanged
- Remove duplicated text and extra indentation
- Add missing paper section for AOT method
- Ensure proper LaTeX rendering in documentation
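The inline-math rule in this commit (use \\(...\\) in text, leave $...$ inside Python code untouched) can be illustrated with a small script. This is only a sketch of one way to apply such a rule, not the tooling used in the PR: `convert_inline_math` is a hypothetical helper, and it assumes code appears in triple-backtick fences and that inline math never spans a line break.

```python
import re

# Matches single-dollar inline math such as $r_t$, but not $$...$$ display math.
INLINE_MATH = re.compile(r"(?<!\$)\$([^$\n]+?)\$(?!\$)")


def convert_inline_math(markdown: str) -> str:
    """Rewrite $...$ as \\\\(...\\\\) outside fenced code blocks (hypothetical helper)."""
    converted = []
    in_fence = False
    for line in markdown.split("\n"):
        if line.lstrip().startswith("```"):
            # A fence line toggles code mode and is copied through unchanged.
            in_fence = not in_fence
            converted.append(line)
        elif in_fence:
            # Inside a code block: keep $...$ in comments exactly as written.
            converted.append(line)
        else:
            # A function replacement avoids backslash-escape handling in re.sub.
            converted.append(
                INLINE_MATH.sub(lambda m: "\\\\(" + m.group(1) + "\\\\)", line)
            )
    return "\n".join(converted)


if __name__ == "__main__":
    sample = "The ratio $r_t$ is clipped.\n```python\n# uses $r_t$ here\n```"
    print(convert_inline_math(sample))
```

Processing line by line keeps the fence tracking simple, at the cost of ignoring math that wraps across lines.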
Force-pushed from 5eb7821 to a3c2eff
qgallouedec reviewed Sep 3, 2025
Force-pushed from a3c2eff to c7a0509
qgallouedec approved these changes Sep 4, 2025
Summary
This PR expands the TRL paper index documentation with implementation guides for 9 additional preference optimization and alignment algorithms.
Added Algorithms
Direct Preference Optimization Variants
Advanced Optimization Methods
Implementation Details
Each algorithm includes:
- DPOConfig or RLOOConfig
Configuration Examples
All implementations provide complete, copy-paste ready configurations:
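As a hedged illustration of what such a configuration might look like, the sketch below maps the algorithms listed in this PR to `loss_type` strings. The strings are an assumption based on TRL's `DPOConfig` documentation rather than on the text of this PR, `dpo_kwargs` is a hypothetical helper, and the default `beta` is a placeholder, so check everything against the installed TRL version and the relevant paper.

```python
# Assumed mapping from the algorithms in this PR to DPOConfig loss_type values.
# These strings are not quoted from the PR; verify them against your TRL version.
LOSS_TYPES = {
    "IPO": "ipo",
    "SLiC-HF": "hinge",
    "EXO": "exo_pair",
    "NCA": "nca_pair",
    "rDPO": "robust",
    "BCO": "bco_pair",
    "SPPO": "sppo_hard",
    "DiscoPOP": "discopop",
    "APO-zero": "apo_zero",
    "APO-down": "apo_down",
}


def dpo_kwargs(algorithm: str, beta: float = 0.1, **extra) -> dict:
    """Build keyword arguments for a DPO-style config (hypothetical helper).

    beta=0.1 is a placeholder default, not a value recommended by any paper.
    """
    if algorithm not in LOSS_TYPES:
        known = ", ".join(sorted(LOSS_TYPES))
        raise ValueError(f"Unknown algorithm {algorithm!r}; expected one of: {known}")
    return {"loss_type": LOSS_TYPES[algorithm], "beta": beta, **extra}


if __name__ == "__main__":
    for name in LOSS_TYPES:
        print(name, dpo_kwargs(name))
```

With TRL installed, the returned dictionary can be unpacked into `DPOConfig` together with an `output_dir` argument. Keeping the mapping in one table makes it easy to compare against the loss types a given TRL release actually supports.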
Documentation Quality
Impact
This change makes the TRL paper index a more complete resource for researchers and practitioners working on preference-based language model alignment, giving each listed algorithm a paper reference and a configuration example.
Testing
All configuration examples have been validated against the original paper specifications and TRL's API compatibility.