[megatron] chore: refactor to use Megatron-Bridge new APIs#6335
Conversation
There was a problem hiding this comment.
Code Review
This pull request refactors the integration with megatron-bridge by utilizing the integration module for PEFT setup, DDP configuration, and provider configuration, which simplifies the code by removing manual class instantiations and attribute assignments. A review comment identifies that the refactor in transformer_impl.py omitted critical default configurations for the attention backend and MoE settings, potentially leading to performance regressions or incorrect model behavior.
Signed-off-by: Hollow Man <hollowman@opensuse.org>
48bcc32 to
0a95930
Compare
There was a problem hiding this comment.
Pull request overview
Refactors the Megatron backend integration to rely on Megatron-Bridge’s newer helper APIs (provider configuration, PEFT creation/application, and DDP config creation), reducing local duplication and aligning more closely with upstream Megatron-Bridge behavior.
Changes:
- Switch Megatron-Bridge provider setup to
provider.configure(...)with overrides + pre-finalize hooks. - Replace in-repo PEFT object construction / adapter checkpoint loading with
megatron.bridge.training.integrationhelpers. - Remove direct re-exports of Megatron-Bridge PEFT classes from
verl.models.mcore.bridge.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
verl/workers/engine/megatron/transformer_impl.py |
Uses Megatron-Bridge provider.configure(...) and centralizes overrides/hooks. |
verl/workers/config/megatron_peft.py |
Builds PEFT objects via integration.create_peft(...) instead of local class wiring. |
verl/utils/megatron_utils.py |
Uses Megatron-Bridge integration helpers for PEFT hook creation and DDP config creation. |
verl/models/mcore/bridge.py |
Removes PEFT class imports/exports now delegated to Megatron-Bridge integration. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Signed-off-by: Hollow Man <hollowman@opensuse.org>
Signed-off-by: Hollow Man <hollowman@opensuse.org>
Signed-off-by: Hollow Man <hollowman@opensuse.org>
Signed-off-by: Hollow Man <hollowman@opensuse.org>
fa62ebf to
04542a1
Compare
Signed-off-by: Hollow Man <hollowman@opensuse.org>
04542a1 to
890a648
Compare
Signed-off-by: Hollow Man <hollowman@opensuse.org>
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request refactors the Megatron backend by offloading custom logic for PEFT, checkpointing, and model wrapping to the upstream megatron-bridge package. It also updates the vLLM integration for NVFP4/Marlin weight processing. Two critical issues were identified in verl/utils/modelopt/vllm_modelopt_patch.py where potential ZeroDivisionError exceptions could occur during scale factor calculations if weights or input scales are zero.
Signed-off-by: Hollow Man <hollowman@opensuse.org>
Signed-off-by: Hollow Man <hollowman@opensuse.org>
ff114fd to
269710d
Compare
What does this PR do?
Refactor to use Megatron-Bridge new helper APIs:
Checklist Before Starting
[{modules}] {type}: {description}(This will be checked by the CI){modules}includefsdp,megatron,veomni,sglang,vllm,rollout,trainer,ci,training_utils,recipe,hardware,deployment,ray,worker,single_controller,misc,perf,model,algo,env,tool,ckpt,doc,data,cfg,reward,fully_async,one_step_off,like[megatron, fsdp, doc]{type}is infeat,fix,refactor,chore,test[BREAKING]to the beginning of the title.[BREAKING][fsdp, megatron] feat: dynamic batchingTest
API and Usage Example
# Add code snippet or script demonstrating how to use thisDesign & Code Changes
Checklist Before Submitting
Important
Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=alwaysci-requestchannel in theverlSlack workspace. (If not accessible, please try the Feishu group (飞书群).)recipesubmodule, please also update the reference to the submodule commit viagit submodule update --remoteorcd recipe && git pull origin main.