Add external trainer integration helpers #3813

Merged: yaoyu-33 merged 8 commits into NVIDIA-NeMo:main from HollowMan6:refactor_rl on May 20, 2026
Conversation

@HollowMan6 (Member) commented May 13, 2026

This is intended to be additive. Existing Bridge training paths continue to use their current setup/checkpoint flows. This PR does not rename or add public aliases for existing private checkpoint helpers.

See the corresponding refactor on the verl side: verl-project/verl#6335

What does this PR do?

This PR adds small, composable integration helpers for external training loops that use Megatron-Bridge but own their own rollout/trainer/checkpoint scheduling. External frameworks such as verl need Bridge for provider construction, PEFT setup, DDP config construction, adapter loading, and checkpoint path wrappers, but they do not want a full lifecycle setup function. This change keeps the API low-level and additive.

Changelog

  • Added ModelProviderMixin.configure(...) to apply dtype, model-parallel sizes from initialized parallel_state, caller overrides, pre-finalize hooks, and finalization in one provider-owned method.
  • Added megatron.bridge.training.integration with helper functions for:
    • PEFT object creation
    • PEFT pre-wrap hook creation
    • PEFT adapter checkpoint loading
    • DDP config construction
  • Added unit coverage for provider configuration and the new integration helpers.
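
The ordering described for ModelProviderMixin.configure(...) can be pictured with a toy provider. This is only a sketch under stated assumptions: ToyProvider, its fields, and the parameter names (dtype, tp_size, pp_size, overrides, pre_finalize_hooks) are hypothetical stand-ins, not the signature added by this PR. The parallel sizes are also passed in directly here, whereas the PR reads them from an initialized parallel_state.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable, Mapping, Optional


@dataclass
class ToyProvider:
    """Hypothetical stand-in for a Bridge model provider (not the real class)."""

    params_dtype: str = "float32"
    tensor_model_parallel_size: int = 1
    pipeline_model_parallel_size: int = 1
    hidden_size: int = 1024
    finalized: bool = False

    def finalize(self) -> None:
        # Stand-in for provider finalization: validate, then mark as final.
        if self.hidden_size % self.tensor_model_parallel_size != 0:
            raise ValueError("hidden_size must be divisible by the TP size")
        self.finalized = True

    def configure(
        self,
        *,
        dtype: Optional[str] = None,
        tp_size: int = 1,
        pp_size: int = 1,
        overrides: Optional[Mapping[str, Any]] = None,
        pre_finalize_hooks: Iterable[Callable[["ToyProvider"], None]] = (),
    ) -> "ToyProvider":
        # 1. Apply the dtype, if the caller supplied one.
        if dtype is not None:
            self.params_dtype = dtype
        # 2. Apply model-parallel sizes (assumed to be passed in by the caller).
        self.tensor_model_parallel_size = tp_size
        self.pipeline_model_parallel_size = pp_size
        # 3. Apply caller overrides, rejecting unknown field names.
        for name, value in (overrides or {}).items():
            if not hasattr(self, name):
                raise AttributeError(f"unknown provider field: {name!r}")
            setattr(self, name, value)
        # 4. Run pre-finalize hooks in the order given.
        for hook in pre_finalize_hooks:
            hook(self)
        # 5. Finalize last, so validation sees the fully configured provider.
        self.finalize()
        return self


def halve_hidden_size(provider: ToyProvider) -> None:
    # Example hook: runs after overrides and before finalization.
    provider.hidden_size //= 2


provider = ToyProvider().configure(
    dtype="bfloat16",
    tp_size=2,
    overrides={"hidden_size": 4096},
    pre_finalize_hooks=[halve_hidden_size],
)
```

Because the hook runs after the overrides, it sees hidden_size already set to 4096, and finalization validates the value the hook leaves behind.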

GitHub Actions CI

See the CI section in the Contributing doc for how to trigger the CI. An NVIDIA developer will need to approve and trigger the CI for external contributors.

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

If you haven't finished some of the above items, you can still open a "Draft" PR.

Additional Information

  • Related to # (issue)

@copy-pr-bot (Bot) commented May 13, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

Copilot AI left a comment

Pull request overview

This PR adds low-level, composable integration helpers intended for external training loops (e.g., verl) that want to reuse Megatron-Bridge pieces (provider config, PEFT setup, DDP config, adapter loading) without adopting Bridge’s full training lifecycle.

Changes:

  • Added ModelProviderMixin.configure(...) to apply dtype and model-parallel sizes (from initialized parallel_state), apply caller overrides, run pre-finalize hooks, and finalize the provider.
  • Added megatron.bridge.training.integration with helper utilities for PEFT creation/hooks, PEFT adapter checkpoint loading, and DDP config construction.
  • Added unit tests covering the new provider configuration behavior and integration helpers.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| src/megatron/bridge/models/model_provider.py | Adds ModelProviderMixin.configure(...) helper for external integration workflows. |
| src/megatron/bridge/training/integration.py | New integration helper module for PEFT/DDP setup and adapter checkpoint loading. |
| tests/unit_tests/models/test_model_provider_mixin.py | Adds unit tests for provider configure(...). |
| tests/unit_tests/training/test_integration.py | Adds unit tests for the new integration helper functions. |

Comments suppressed due to low confidence (1)

src/megatron/bridge/training/integration.py:213

  • _to_torch_dtype() uses dict indexing, so unsupported dtype inputs raise a KeyError. Since this is a public integration helper, raise a ValueError (or similar) with a clear message listing supported dtype encodings instead of leaking KeyError.
def _to_torch_dtype(dtype: torch.dtype | str | int | None) -> torch.dtype | None:
    if dtype is None or isinstance(dtype, torch.dtype):
        return dtype
    return {
        16: torch.float16,
        "16": torch.float16,
        "fp16": torch.float16,
        "float16": torch.float16,
        32: torch.float32,
        "32": torch.float32,
        "fp32": torch.float32,
        "float32": torch.float32,
        "bf16": torch.bfloat16,
        "bfloat16": torch.bfloat16,
    }[dtype]
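
One way the suggestion above could look is sketched here; whether the merged code does this is not shown in this conversation. To keep the sketch runnable without PyTorch installed, the table values are dtype names as strings (an assumption for illustration), where the real helper maps to torch.dtype objects. The names _DTYPE_ALIASES and to_dtype_name are hypothetical.

```python
from typing import Optional, Union

# Stand-in alias table: values are dtype names so the sketch runs without
# PyTorch; the helper quoted above maps the same keys to torch.dtype objects.
_DTYPE_ALIASES = {
    16: "float16",
    "16": "float16",
    "fp16": "float16",
    "float16": "float16",
    32: "float32",
    "32": "float32",
    "fp32": "float32",
    "float32": "float32",
    "bf16": "bfloat16",
    "bfloat16": "bfloat16",
}


def to_dtype_name(dtype: Union[str, int, None]) -> Optional[str]:
    """Resolve a dtype alias, raising ValueError for unsupported input."""
    if dtype is None:
        return None
    try:
        return _DTYPE_ALIASES[dtype]
    except (KeyError, TypeError):
        # KeyError: unknown alias. TypeError: unhashable input such as a list.
        supported = ", ".join(repr(key) for key in _DTYPE_ALIASES)
        raise ValueError(
            f"Unsupported dtype {dtype!r}; expected one of: {supported}"
        ) from None
```

Listing the supported encodings in the message lets a caller in an external framework see the accepted spellings without reading the helper's source.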

Comment thread src/megatron/bridge/models/model_provider.py Outdated
Comment thread src/megatron/bridge/training/integration.py Outdated
Comment thread src/megatron/bridge/training/integration.py Outdated
@yaoyu-33 added the area:training, feature, and needs-review labels May 14, 2026
@HollowMan6 HollowMan6 requested a review from Copilot May 14, 2026 22:46

Copilot AI left a comment

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

Comment thread src/megatron/bridge/training/integration.py Outdated
This PR adds small, composable integration helpers for external training loops that use Megatron-Bridge but own their own rollout/trainer/checkpoint scheduling.
External frameworks such as verl need Bridge for provider construction, PEFT setup, DDP config construction, adapter loading, and checkpoint path wrappers, but they do not want a full lifecycle setup function. This change keeps the API low-level and additive.

- Added ModelProviderMixin.configure(...) to apply dtype, model-parallel sizes from initialized parallel_state, caller overrides, pre-finalize hooks, and finalization in one provider-owned method.
- Added megatron.bridge.training.integration with helper functions for:
  - provider configuration
  - PEFT object creation
  - PEFT pre-wrap hook creation
  - PEFT adapter checkpoint loading
  - model state dict generation
  - DDP config construction
  - explicit-path checkpoint save/load wrappers
  - post-setup model config finalization
- Added unit coverage for provider configuration and the new integration helpers.

This is intended to be additive. Existing Bridge training paths continue to use their current setup/checkpoint flows. This PR does not rename or add public aliases for existing private checkpoint helpers.

Signed-off-by: Hollow Man <hollowman@opensuse.org>
Labels

  • area:training (Training loop, callbacks, and runtime integration)
  • feature (New capabilities, enhancements, or enablement work)
  • needs-review (PR is ready for code review and waiting on a reviewer)

3 participants