Add external trainer integration helpers by HollowMan6 · Pull Request #3813 · NVIDIA-NeMo/Megatron-Bridge

HollowMan6 · 2026-05-13T22:43:25Z

This is intended to be additive. Existing Bridge training paths continue to use their current setup/checkpoint flows. This PR does not rename or add public aliases for existing private checkpoint helpers.

See the corresponding refactor on the verl side: verl-project/verl#6335

What does this PR do?

This PR adds small, composable integration helpers for external training loops that use Megatron-Bridge but own their own rollout/trainer/checkpoint scheduling. External frameworks such as verl need Bridge for provider construction, PEFT setup, DDP config construction, adapter loading, and checkpoint path wrappers, but they do not want a full lifecycle setup function. This change keeps the API low-level and additive.

Changelog

Added ModelProviderMixin.configure(...) to apply dtype, model-parallel sizes from initialized parallel_state, caller overrides, pre-finalize hooks, and finalization in one provider-owned method.
Added megatron.bridge.training.integration with helper functions for:
- PEFT object creation
- PEFT pre-wrap hook creation
- PEFT adapter checkpoint loading
- DDP config construction
Added unit coverage for provider configuration and the new integration helpers.
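
The order of operations described for ModelProviderMixin.configure(...) can be sketched with a stand-in class. This is only an illustration under stated assumptions: `ToyProvider`, its fields, and the keyword names `dtype`, `parallel_sizes`, `overrides`, and `pre_finalize_hooks` are hypothetical and not taken from the Bridge source, and the parallel sizes are passed in as a plain dict instead of being read from an initialized parallel_state.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ToyProvider:
    """Hypothetical stand-in for a model provider; not the Bridge class."""

    params_dtype: str = "float32"
    tensor_model_parallel_size: int = 1
    pipeline_model_parallel_size: int = 1
    hidden_size: int = 1024
    finalized: bool = False

    def finalize(self) -> None:
        # A real provider would validate and derive settings here.
        self.finalized = True

    def configure(
        self,
        dtype: Optional[str] = None,
        parallel_sizes: Optional[dict[str, int]] = None,
        overrides: Optional[dict[str, object]] = None,
        pre_finalize_hooks: tuple[Callable[["ToyProvider"], None], ...] = (),
    ) -> "ToyProvider":
        # 1. Apply the requested dtype, if any.
        if dtype is not None:
            self.params_dtype = dtype
        # 2. Apply model-parallel sizes (assumed to come from the caller here).
        for name, size in (parallel_sizes or {}).items():
            setattr(self, name, size)
        # 3. Apply caller overrides, rejecting fields the provider does not have.
        for name, value in (overrides or {}).items():
            if not hasattr(self, name):
                raise AttributeError(f"unknown provider field: {name}")
            setattr(self, name, value)
        # 4. Run pre-finalize hooks in the order given.
        for hook in pre_finalize_hooks:
            hook(self)
        # 5. Finalize last, so hooks and overrides are all visible to it.
        self.finalize()
        return self


provider = ToyProvider().configure(
    dtype="bfloat16",
    parallel_sizes={"tensor_model_parallel_size": 2},
    overrides={"hidden_size": 2048},
    pre_finalize_hooks=(lambda p: setattr(p, "hidden_size", p.hidden_size * 2),),
)
```

Keeping finalization as the last step means an external trainer makes a single call and gets back a provider whose overrides and hooks have already been applied.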

GitHub Actions CI

See the CI section in the Contributing doc for how to trigger the CI. An NVIDIA developer will need to approve and trigger the CI for external contributors.

Before your PR is "Ready for review"

Pre checks:

Make sure you read and followed Contributor guidelines
Did you write any new necessary tests?
Did you add or update any necessary documentation?
Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
- Reviewer: Does the PR have correct import guards for all optional libraries?

If you haven't finished some of the above items, you can still open a "Draft" PR.

Additional Information

Related to # (issue)

copy-pr-bot · 2026-05-13T22:43:28Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

Copilot

Pull request overview

This PR adds low-level, composable integration helpers intended for external training loops (e.g., verl) that want to reuse Megatron-Bridge pieces (provider config, PEFT setup, DDP config, adapter loading) without adopting Bridge’s full training lifecycle.

Changes:

Added ModelProviderMixin.configure(...) to apply dtype and model-parallel sizes (from initialized parallel_state), apply caller overrides, run pre-finalize hooks, and finalize the provider.
Added megatron.bridge.training.integration with helper utilities for PEFT creation/hooks, PEFT adapter checkpoint loading, and DDP config construction.
Added unit tests covering the new provider configuration behavior and integration helpers.
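
A pre-wrap hook of the kind mentioned above is commonly built as a closure that captures a PEFT object and is called on the model chunks before they are wrapped for distributed training. The sketch below shows only that general pattern. `ToyModule`, `ToyLoRA`, and `make_pre_wrap_hook` are hypothetical names for illustration, and nothing here reflects the actual signatures in the new module.

```python
from typing import Callable


class ToyModule:
    """Hypothetical model chunk; stands in for a real model module."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.adapters: list[str] = []


class ToyLoRA:
    """Hypothetical PEFT object that tags a module with an adapter name."""

    def __init__(self, rank: int) -> None:
        self.rank = rank

    def apply(self, module: ToyModule) -> ToyModule:
        module.adapters.append(f"lora_r{self.rank}")
        return module


def make_pre_wrap_hook(peft: ToyLoRA) -> Callable[[list[ToyModule]], list[ToyModule]]:
    """Return a hook that applies `peft` to every chunk before wrapping."""

    def hook(chunks: list[ToyModule]) -> list[ToyModule]:
        # The external trainer decides when to call this; the hook only transforms.
        return [peft.apply(chunk) for chunk in chunks]

    return hook


hook = make_pre_wrap_hook(ToyLoRA(rank=8))
chunks = hook([ToyModule("stage0"), ToyModule("stage1")])
```

Returning a plain callable leaves scheduling to the external trainer, which fits the stated goal of low-level helpers without a full lifecycle setup function.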

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

File	Description
`src/megatron/bridge/models/model_provider.py`	Adds `ModelProviderMixin.configure(...)` helper for external integration workflows.
`src/megatron/bridge/training/integration.py`	New integration helper module for PEFT/DDP setup and adapter checkpoint loading.
`tests/unit_tests/models/test_model_provider_mixin.py`	Adds unit tests for provider `configure(...)`.
`tests/unit_tests/training/test_integration.py`	Adds unit tests for the new integration helper functions.

Comments suppressed due to low confidence (1)

src/megatron/bridge/training/integration.py:213

_to_torch_dtype() uses dict indexing, so unsupported dtype inputs raise a KeyError. Since this is a public integration helper, raise a ValueError (or similar) with a clear message listing supported dtype encodings instead of leaking KeyError.

def _to_torch_dtype(dtype: torch.dtype | str | int | None) -> torch.dtype | None:
    if dtype is None or isinstance(dtype, torch.dtype):
        return dtype
    return {
        16: torch.float16,
        "16": torch.float16,
        "fp16": torch.float16,
        "float16": torch.float16,
        32: torch.float32,
        "32": torch.float32,
        "fp32": torch.float32,
        "float32": torch.float32,
        "bf16": torch.bfloat16,
        "bfloat16": torch.bfloat16,
    }[dtype]
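
One way to act on this suggestion is to do the lookup explicitly and raise a ValueError that lists the accepted encodings. The sketch below is deliberately torch-free so it runs anywhere: plain strings stand in for the torch.dtype values, and `resolve_dtype_name` is a hypothetical name, not the helper in the PR.

```python
from typing import Optional, Union

# Same encodings as the reviewed helper; string names stand in for torch dtypes.
_DTYPE_ALIASES: dict[Union[str, int], str] = {
    16: "float16",
    "16": "float16",
    "fp16": "float16",
    "float16": "float16",
    32: "float32",
    "32": "float32",
    "fp32": "float32",
    "float32": "float32",
    "bf16": "bfloat16",
    "bfloat16": "bfloat16",
}


def resolve_dtype_name(dtype: Union[str, int, None]) -> Optional[str]:
    """Map a dtype encoding to its canonical name, or raise ValueError."""
    if dtype is None:
        return None
    try:
        return _DTYPE_ALIASES[dtype]
    except (KeyError, TypeError):
        # TypeError covers unhashable inputs such as lists.
        supported = ", ".join(repr(key) for key in _DTYPE_ALIASES)
        raise ValueError(
            f"Unsupported dtype {dtype!r}; expected one of: {supported}"
        ) from None
```

With this shape, a call such as `resolve_dtype_name("fp8")` fails with a message naming the bad value and the supported encodings instead of a bare KeyError.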

Copilot

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

This PR adds small, composable integration helpers for external training loops that use Megatron-Bridge but own their own rollout/trainer/checkpoint scheduling. External frameworks such as verl need Bridge for provider construction, PEFT setup, DDP config construction, adapter loading, and checkpoint path wrappers, but they do not want a full lifecycle setup function. This change keeps the API low-level and additive.

Added ModelProviderMixin.configure(...) to apply dtype, model-parallel sizes from initialized parallel_state, caller overrides, pre-finalize hooks, and finalization in one provider-owned method.
Added megatron.bridge.training.integration with helper functions for:
- provider configuration
- PEFT object creation
- PEFT pre-wrap hook creation
- PEFT adapter checkpoint loading
- model state dict generation
- DDP config construction
- explicit-path checkpoint save/load wrappers
- post-setup model config finalization
Added unit coverage for provider configuration and the new integration helpers.

This is intended to be additive. Existing Bridge training paths continue to use their current setup/checkpoint flows. This PR does not rename or add public aliases for existing private checkpoint helpers.

Signed-off-by: Hollow Man <hollowman@opensuse.org>

HollowMan6 mentioned this pull request May 13, 2026

[megatron] chore: refactor to use Megatron-Bridge new APIs verl-project/verl#6335

Merged

8 tasks

HollowMan6 force-pushed the refactor_rl branch from 65db659 to c43216f Compare May 14, 2026 06:32

HollowMan6 marked this pull request as ready for review May 14, 2026 06:38

Copilot AI review requested due to automatic review settings May 14, 2026 06:38

Copilot started reviewing on behalf of HollowMan6 May 14, 2026 06:39 View session

Copilot AI reviewed May 14, 2026

Comment thread src/megatron/bridge/models/model_provider.py Outdated

Comment thread src/megatron/bridge/training/integration.py Outdated

Comment thread src/megatron/bridge/training/integration.py Outdated

yaoyu-33 added the area:training (Training loop, callbacks, and runtime integration), feature (New capabilities, enhancements, or enablement work), and needs-review (PR is ready for code review and waiting on a reviewer) labels May 14, 2026

HollowMan6 force-pushed the refactor_rl branch from c43216f to f557891 Compare May 14, 2026 07:03

HollowMan6 requested a review from Copilot May 14, 2026 22:46

Copilot started reviewing on behalf of HollowMan6 May 14, 2026 22:46 View session

Copilot AI reviewed May 14, 2026

Comment thread src/megatron/bridge/training/integration.py Outdated

HollowMan6 added 7 commits May 19, 2026 08:51

Address review feedback

abf9daa

Signed-off-by: Hollow Man <hollowman@opensuse.org>

Value model and freeze moe helper

a12f5ce

Signed-off-by: Hollow Man <hollowman@opensuse.org>

Address claude review

51d49da

Signed-off-by: Hollow Man <hollowman@opensuse.org>

Rename provider configure into apply_overrides_and_finalize and clean up

2927535

Signed-off-by: Hollow Man <hollowman@opensuse.org>

Move integration.py to corresponding modules utils.py

63095f5

Signed-off-by: Hollow Man <hollowman@opensuse.org>

Clean up

1aceb64

Signed-off-by: Hollow Man <hollowman@opensuse.org>

HollowMan6 force-pushed the refactor_rl branch from bc5748d to 1aceb64 Compare May 19, 2026 15:51

Merge branch 'main' into refactor_rl

04b0940

yaoyu-33 approved these changes May 20, 2026

cuichenx mentioned this pull request May 26, 2026

[NeMo FW 26.06 Release] MBridge v0.5.0 Roadmap #3754

Open

yaoyu-33 merged 8 commits into NVIDIA-NeMo:main from HollowMan6:refactor_rl
