[mimo] Support bridge fan-out for variable modality tokens by liding-nv · Pull Request #5062 · NVIDIA/Megatron-LM

liding-nv · 2026-05-29T15:37:55Z

Summary

Non-colocated bridge fan-out previously sent equal batch-dim slices to every peer. For VLM/AVLM data the per-sample modality-token counts vary, so the LLM received the wrong embeddings. Tag encoder outputs with per-sample split sizes and have the bridge honor them when fanning activations/gradients out.
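To make the failure mode concrete, here is a minimal sketch (not the PR's code) of how per-sample modality-token counts differ from equal slicing. The token id `IMAGE_TOKEN_ID` and both helper names are hypothetical, and plain Python lists stand in for tensors.

```python
from typing import List, Sequence

IMAGE_TOKEN_ID = 32000  # hypothetical placeholder id, chosen only for illustration


def modality_token_counts(
    input_ids: Sequence[Sequence[int]], special_token_id: int
) -> List[int]:
    """Count placeholder tokens per sample; each count is that sample's row span."""
    return [sum(1 for tok in sample if tok == special_token_id) for sample in input_ids]


def equal_split_sizes(total_rows: int, num_peers: int) -> List[int]:
    """Equal slicing along dim 0, ignoring where one sample ends and the next begins."""
    base, rem = divmod(total_rows, num_peers)
    return [base + (1 if i < rem else 0) for i in range(num_peers)]


input_ids = [
    [1, IMAGE_TOKEN_ID, IMAGE_TOKEN_ID, IMAGE_TOKEN_ID, 5],  # 3 image tokens
    [1, IMAGE_TOKEN_ID, 4, 6, 2],                            # 1 image token
]

per_sample = modality_token_counts(input_ids, IMAGE_TOKEN_ID)
equal = equal_split_sizes(sum(per_sample), num_peers=2)

print("per-sample sizes:", per_sample)
print("equal sizes:     ", equal)
```

With two peers, equal slicing hands the second peer one embedding row that belongs to the first sample, which is the mismatch described above.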

MimoModel._attach_modality_split_sizes annotates flat encoder outputs with _mimo_bridge_split_sizes derived from special_token_ids.
MimoModel._empty_encoder_output returns a (0, hidden_size) placeholder when a text-only sample reaches a non-colocated encoder rank; _has_encoder_tokens guards against silent loss of modality data.
BridgeCommunicator._split_tensor_at_batch_dim consumes _mimo_bridge_split_sizes (per-peer or per-sample) with sum/length checks.
BridgeCommunicator._communicate_shapes accepts a per-peer tensor list via the new _as_per_peer_tensors helper.
schedules.backward_step_multimodule unwraps single-element tensor lists so the per-peer paths feed back cleanly.
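The split step in the list above could look roughly like the sketch below. It is an illustration under stated assumptions, not the body of `_split_tensor_at_batch_dim`: the name `split_rows` is hypothetical, lists stand in for tensors, and grouping per-sample sizes contiguously per peer is an assumed convention.

```python
from typing import List, Sequence


def split_rows(rows: Sequence, split_sizes: Sequence[int], num_splits: int) -> List[list]:
    """Split `rows` along dim 0 into `num_splits` chunks using explicit sizes.

    `split_sizes` holds either one entry per peer or one entry per sample.
    Per-sample sizes are grouped contiguously per peer (an assumption of
    this sketch).
    """
    if num_splits == 1:
        # Short-circuit: a single peer receives everything.
        return [list(rows)]
    if sum(split_sizes) != len(rows):
        raise ValueError(
            f"split sizes sum to {sum(split_sizes)}, but there are {len(rows)} rows"
        )
    if len(split_sizes) % num_splits != 0:
        raise ValueError(
            f"{len(split_sizes)} split sizes cannot be grouped into {num_splits} peers"
        )
    group = len(split_sizes) // num_splits
    peer_sizes = [sum(split_sizes[i * group:(i + 1) * group]) for i in range(num_splits)]

    chunks, start = [], 0
    for size in peer_sizes:
        chunks.append(list(rows[start:start + size]))
        start += size
    return chunks


# Per-peer sizes: the first peer gets 3 rows, the second gets 1.
print(split_rows(["a0", "a1", "a2", "b0"], [3, 1], num_splits=2))
# Per-sample sizes for 4 samples over 2 peers; a text-only sample contributes 0 rows.
print(split_rows(list(range(6)), [3, 0, 1, 2], num_splits=2))
```

A size of zero yields an empty chunk, which is the list analogue of the `(0, hidden_size)` placeholder mentioned above.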

Tests

CPU-only unit tests added inside existing files:

test_bridge_communicator.py::TestBridgeCommunicatorSplitMetadata — _split_tensor_at_batch_dim sum-mismatch + num_splits=1 short-circuit; _as_per_peer_tensors broadcast / pass-through / count-mismatch.
test_mimo_model.py::TestMimoModelFanoutHelpers — _attach_modality_split_sizes happy path + skip branches; _has_encoder_tokens presence/absence.
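For readers unfamiliar with the three `_as_per_peer_tensors` cases named in the tests, here is one plausible reading of them as a standalone sketch. The function `as_per_peer` is hypothetical and is not the helper from the PR; it only illustrates what broadcast, pass-through, and count-mismatch could mean.

```python
from typing import Any, List


def as_per_peer(value: Any, num_peers: int) -> List[Any]:
    """Normalize `value` to a list with exactly one entry per peer."""
    if isinstance(value, (list, tuple)):
        if len(value) != num_peers:
            # Count mismatch: a per-peer list must match the peer count.
            raise ValueError(f"expected {num_peers} entries, got {len(value)}")
        # Pass-through: already one entry per peer.
        return list(value)
    # Broadcast: a single object is reused for every peer.
    return [value] * num_peers


print(as_per_peer("t", 3))
print(as_per_peer(["t0", "t1"], 2))
```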

Signed-off-by: Li Ding <liding@nvidia.com>

copy-pr-bot · 2026-05-29T15:37:59Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

liding-nv · 2026-05-29T15:38:20Z

/ok to test ffe26d5


liding-nv · 2026-05-29T15:51:13Z

/ok to test 6006d40

yashaswikarnati · 2026-05-29T16:47:03Z

    # Apply grad scaling if needed (for last stage only).
    for module_name in output_tensor.keys():
-        if output_tensor_grad[module_name] is None and config.grad_scale_func is not None:
+        output_tensor_grad_module = _unwrap_single_tensor_list(output_tensor_grad[module_name])


Maybe I'm missing something, but would we need this change in schedules?

Yes. We hit this in the Qwen3.5-27B MIMO validation run: language TP4/PP4/DP2, image encoder TP1/PP1/DP1, with variable_seq_lengths=True.

In that path, get_tensor_shapes() returns [()]. The regular P2P communicator then returns a single-element grad list, so the multimodule dict can contain:

{"language": [grad_tensor]}

backward_step_multimodule() already unwraps this shape for input_tensor, but not for output_tensor_grad. Without this, we pass [grad_tensor] into backward instead of grad_tensor.

So this is just making multimodule backward handle the same single-tensor-list convention as the existing single-module schedule.
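The convention described in this reply can be sketched as follows. This is an assumption-labelled stand-in, not the `_unwrap_single_tensor_list` from the diff: the function name without the leading underscore is hypothetical, and strings stand in for gradient tensors.

```python
from typing import Any, Dict


def unwrap_single_tensor_list(value: Any) -> Any:
    """Return the sole element of a one-element list or tuple; otherwise return `value`."""
    if isinstance(value, (list, tuple)) and len(value) == 1:
        return value[0]
    return value


# The shape described above: the P2P communicator returns a one-element grad list.
output_tensor_grad: Dict[str, Any] = {"language": ["grad_tensor"], "images": None}

unwrapped = {name: unwrap_single_tensor_list(grad) for name, grad in output_tensor_grad.items()}
print(unwrapped)
```

`None` and multi-element lists pass through unchanged, so the later `is None` check for grad scaling still sees the same value.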


liding-nv · 2026-06-01T02:25:36Z

/ok to test 8b29dc9


liding-nv · 2026-06-01T02:33:36Z

/ok to test fe27b34

svcnvidia-nemo-ci · 2026-06-01T20:50:16Z

🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/26781230312


liding-nv added 2 commits May 29, 2026 08:15

support mimo bridge fanout for variable modality tokens

2395425

Signed-off-by: Li Ding <liding@nvidia.com>

tests

ffe26d5

Signed-off-by: Li Ding <liding@nvidia.com>

liding-nv requested a review from yashaswikarnati May 29, 2026 15:38

copy-pr-bot Bot temporarily deployed to public May 29, 2026 15:38 Inactive

copy-pr-bot Bot temporarily deployed to public May 29, 2026 15:42 Inactive

lint

6006d40

Signed-off-by: Li Ding <liding@nvidia.com>

copy-pr-bot Bot temporarily deployed to public May 29, 2026 15:50 Inactive

copy-pr-bot Bot temporarily deployed to public May 29, 2026 15:51 Inactive

copy-pr-bot Bot temporarily deployed to test May 29, 2026 15:52 Inactive

copy-pr-bot Bot temporarily deployed to public May 29, 2026 15:55 Inactive

liding-nv marked this pull request as ready for review May 29, 2026 16:00

liding-nv requested review from a team as code owners May 29, 2026 16:00

svcnvidia-nemo-ci requested a review from a team May 29, 2026 16:01

svcnvidia-nemo-ci added the complexity: medium label May 29, 2026

copy-pr-bot Bot temporarily deployed to public May 29, 2026 16:04 Inactive

yashaswikarnati reviewed May 29, 2026


assert only fanout works

172f47b

Signed-off-by: Li Ding <liding@nvidia.com>

update

8b29dc9

Signed-off-by: Li Ding <liding@nvidia.com>

copy-pr-bot Bot temporarily deployed to public June 1, 2026 02:26 Inactive

copy-pr-bot Bot temporarily deployed to public June 1, 2026 02:29 Inactive

update

fe27b34

Signed-off-by: Li Ding <liding@nvidia.com>

copy-pr-bot Bot temporarily deployed to public June 1, 2026 02:34 Inactive

copy-pr-bot Bot temporarily deployed to test June 1, 2026 02:34 Inactive

copy-pr-bot Bot temporarily deployed to public June 1, 2026 02:37 Inactive

copy-pr-bot Bot temporarily deployed to public June 1, 2026 02:38 Inactive

copy-pr-bot Bot temporarily deployed to public June 1, 2026 02:45 Inactive

yashaswikarnati added the core_r0.16.0 label (cherry-pick label for the core_r0.16.0 release branch) Jun 1, 2026

jaredcasper approved these changes Jun 1, 2026


yaoyu-33 approved these changes Jun 1, 2026


svcnvidia-nemo-ci added the Approved label (all necessary approvals have been made) and removed the Final Review label (PR is in the "final review" stage) Jun 1, 2026

ko3n1g added this pull request to the merge queue Jun 1, 2026

Merged via the queue into NVIDIA:main with commit de030fc Jun 1, 2026
83 of 85 checks passed

cuichenx mentioned this pull request Jun 2, 2026

[NeMo FW 26.06 Release] MBridge v0.5.0 Roadmap NVIDIA-NeMo/Megatron-Bridge#3754

Open

yashaswikarnati mentioned this pull request Jun 4, 2026

Apply MIMO SP/CP sharding with explicit groups and enable THD in non-colocated path #5150

Merged

copy-pr-bot Bot pushed a commit that referenced this pull request Jun 12, 2026

[mimo] Support bridge fan-out for variable modality tokens (#5062)

1467256

Signed-off-by: Li Ding <liding@nvidia.com>


ko3n1g merged 9 commits into NVIDIA:main from liding-nv:mimo-fanout-var-tokens


6 participants
