Skip to content

[megatron] chore: refactor to use Megatron-Bridge new APIs#6335

Merged
wuxibin89 merged 11 commits into
verl-project:mainfrom
HollowMan6:refactor_mbridge
May 21, 2026
Merged

[megatron] chore: refactor to use Megatron-Bridge new APIs#6335
wuxibin89 merged 11 commits into
verl-project:mainfrom
HollowMan6:refactor_mbridge

Conversation

@HollowMan6

@HollowMan6 HollowMan6 commented May 13, 2026

Copy link
Copy Markdown
Collaborator

What does this PR do?

Refactor to use Megatron-Bridge new helper APIs:

Checklist Before Starting

  • Search for similar PRs. Paste at least one query link here: ...
  • Format the PR title as [{modules}] {type}: {description} (This will be checked by the CI)
    • {modules} include fsdp, megatron, veomni, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data, cfg, reward, fully_async, one_step_off
    • If this PR involves multiple modules, separate them with , like [megatron, fsdp, doc]
    • {type} is in feat, fix, refactor, chore, test
    • If this PR breaks any API (CLI arguments, config, function signature, etc.), add [BREAKING] to the beginning of the title.
    • Example: [BREAKING][fsdp, megatron] feat: dynamic batching

Test

For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

API and Usage Example

Demonstrate how the API changes if any, and provide usage example(s) if possible.

# Add code snippet or script demonstrating how to use this

Design & Code Changes

Demonstrate the high-level design if this PR is complex, and list the specific changes.

Checklist Before Submitting

Important

Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request refactors the integration with megatron-bridge by utilizing the integration module for PEFT setup, DDP configuration, and provider configuration, which simplifies the code by removing manual class instantiations and attribute assignments. A review comment identifies that the refactor in transformer_impl.py omitted critical default configurations for the attention backend and MoE settings, potentially leading to performance regressions or incorrect model behavior.

Comment thread verl/workers/engine/megatron/transformer_impl.py
Signed-off-by: Hollow Man <hollowman@opensuse.org>
@HollowMan6 HollowMan6 marked this pull request as ready for review May 14, 2026 07:06
@HollowMan6 HollowMan6 requested a review from ISEEKYAN as a code owner May 14, 2026 07:06
Copilot AI review requested due to automatic review settings May 14, 2026 07:06
@HollowMan6 HollowMan6 requested a review from vermouth1992 as a code owner May 14, 2026 07:06
@HollowMan6 HollowMan6 marked this pull request as draft May 14, 2026 07:07

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Refactors the Megatron backend integration to rely on Megatron-Bridge’s newer helper APIs (provider configuration, PEFT creation/application, and DDP config creation), reducing local duplication and aligning more closely with upstream Megatron-Bridge behavior.

Changes:

  • Switch Megatron-Bridge provider setup to provider.configure(...) with overrides + pre-finalize hooks.
  • Replace in-repo PEFT object construction / adapter checkpoint loading with megatron.bridge.training.integration helpers.
  • Remove direct re-exports of Megatron-Bridge PEFT classes from verl.models.mcore.bridge.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

File Description
verl/workers/engine/megatron/transformer_impl.py Uses Megatron-Bridge provider.configure(...) and centralizes overrides/hooks.
verl/workers/config/megatron_peft.py Builds PEFT objects via integration.create_peft(...) instead of local class wiring.
verl/utils/megatron_utils.py Uses Megatron-Bridge integration helpers for PEFT hook creation and DDP config creation.
verl/models/mcore/bridge.py Removes PEFT class imports/exports now delegated to Megatron-Bridge integration.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread verl/workers/engine/megatron/transformer_impl.py
Comment thread verl/utils/megatron_utils.py Outdated
Signed-off-by: Hollow Man <hollowman@opensuse.org>
Signed-off-by: Hollow Man <hollowman@opensuse.org>
Signed-off-by: Hollow Man <hollowman@opensuse.org>
Signed-off-by: Hollow Man <hollowman@opensuse.org>
Signed-off-by: Hollow Man <hollowman@opensuse.org>
Signed-off-by: Hollow Man <hollowman@opensuse.org>
Signed-off-by: Hollow Man <hollowman@opensuse.org>
@HollowMan6 HollowMan6 marked this pull request as ready for review May 20, 2026 03:32
@HollowMan6 HollowMan6 requested a review from Copilot May 20, 2026 03:33
@HollowMan6

Copy link
Copy Markdown
Collaborator Author

/gemini review

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 16 out of 16 changed files in this pull request and generated 3 comments.

Comment thread verl/utils/megatron_utils.py Outdated
Comment thread verl/workers/engine/megatron/transformer_impl.py
Comment thread verl/utils/modelopt/vllm_modelopt_patch.py

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request refactors the Megatron backend by offloading custom logic for PEFT, checkpointing, and model wrapping to the upstream megatron-bridge package. It also updates the vLLM integration for NVFP4/Marlin weight processing. Two critical issues were identified in verl/utils/modelopt/vllm_modelopt_patch.py where potential ZeroDivisionError exceptions could occur during scale factor calculations if weights or input scales are zero.

Comment thread verl/utils/modelopt/vllm_modelopt_patch.py
Comment thread verl/utils/modelopt/vllm_modelopt_patch.py
Signed-off-by: Hollow Man <hollowman@opensuse.org>
Signed-off-by: Hollow Man <hollowman@opensuse.org>
Signed-off-by: Hollow Man <hollowman@opensuse.org>
@HollowMan6 HollowMan6 requested a review from wuxibin89 May 20, 2026 20:42
@wuxibin89 wuxibin89 merged commit bc3f3bf into verl-project:main May 21, 2026
109 of 199 checks passed
@HollowMan6 HollowMan6 deleted the refactor_mbridge branch May 21, 2026 04:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants