
[dev] [5/5] Qwen3.5 support: Qwen3.5-VL training example #4751

Merged
Victarry merged 12 commits into NVIDIA:dev from wplf:feat/qwen35-vl-example
May 29, 2026

Conversation

@wplf

@wplf wplf commented May 12, 2026

Member

Qwen3.5 support series

This is part of a 5-PR series adding Qwen3.5-VL support, split for review clarity.

Dev PRs (this series):

Main PRs (corresponding mirrors):


Summary

Adds a standalone VLM training playground under examples/multimodal_dev/ with Qwen3.5-VL end-to-end.

Model-agnostic harness

  • pretrain_multimodal.py entry point and MODEL_REGISTRY so a new architecture is just a registry entry + backing module.
  • models/base.py, forward_step.py, arguments.py, data/ (mock + CORD-V2 dataset with THD pack/pad in collate).
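
The registry idea can be sketched roughly as follows. This is a minimal illustration, not the code in pretrain_multimodal.py: the `register_model` and `build` helpers, the `"model_provider"` key, and the `toy_vlm` entry are hypothetical, and only the `"dataset_providers"` key is mentioned elsewhere in this PR.

```python
from typing import Any, Callable, Dict

# Hypothetical registry shape; only the "dataset_providers" key appears in this PR.
MODEL_REGISTRY: Dict[str, Dict[str, Any]] = {}


def register_model(
    name: str,
    model_provider: Callable[..., Any],
    dataset_providers: Dict[str, Callable[..., Any]],
) -> None:
    """Add one architecture to the registry, refusing duplicate names."""
    if name in MODEL_REGISTRY:
        raise KeyError(f"model '{name}' is already registered")
    MODEL_REGISTRY[name] = {
        "model_provider": model_provider,
        "dataset_providers": dataset_providers,
    }


def build(name: str, dataset: str, **kwargs: Any):
    """Look up an architecture and build (model, dataset) from its providers."""
    entry = MODEL_REGISTRY[name]
    return entry["model_provider"](**kwargs), entry["dataset_providers"][dataset]()


# Toy stand-ins for a real backing module.
register_model(
    "toy_vlm",
    model_provider=lambda hidden_size=8: {"arch": "toy_vlm", "hidden_size": hidden_size},
    dataset_providers={"mock": lambda: ["sample-0", "sample-1"]},
)

model, data = build("toy_vlm", "mock", hidden_size=16)
```

With this shape, the entry point only needs the registry key; everything architecture-specific stays in the backing module.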

Qwen3.5-VL

  • Full model: models/qwen35_vl/ — vision encoder, MRoPE (pre-computed for THD), decoder, factory, specs, configurations for proxy / 9B / 397B-A17B variants.
  • Run script + README, plus tests: tests/test_mrope_parity.py, test_cp_correctness.py, test_cp_support.py, test_cp_thd_correctness.py, test_thd_correctness.py, test_thd_e2e.py.

One-line training infra change

  • megatron/training/datasets/data_samplers.py: enable the vanilla-collate torch DataLoader path when the new arg use_vanilla_collate_fn is set (needed for CORD-V2 under BSHD).

Dependency

This example sets mrope_interleaved=True in its TransformerConfig and relies on the core MRoPE interleaved layout introduced in #4750. The diff here is self-contained (only examples/ + the 1-line data_samplers.py change), but the example won't run end-to-end until #4750 merges.

Functionality support

Resume training loss curve
[Figure: resume-training loss curve]

CP / THD correctness verification

tests/test_cp_thd_correctness.py runs CP=1 and CP=4 in a single torchrun --nproc-per-node 4 invocation (in-process destroy_model_parallel + initialize_model_parallel between phases, weights pinned by a state_dict snapshot, identical inputs via a seeded torch.Generator). Loss aggregated via AllReduce(SUM) on (num, den); grad_norm aggregated via AllReduce(SUM) of gradients on the CP group then divided by cp_size, so each rank holds the CP-mean gradient that matches CP=1's backward on the full-batch mean loss.
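
To see why the loss is reduced as a (num, den) pair rather than as per-rank means, here is a small single-process simulation. It is a sketch under simplifying assumptions: the `all_reduce_sum` helper only mimics AllReduce(SUM) over a list of per-rank tuples, the token losses are made up, and the contiguous split stands in for whatever sharding the real CP path uses.

```python
from typing import List, Sequence, Tuple


def local_num_den(losses: Sequence[float], mask: Sequence[int]) -> Tuple[float, int]:
    """Per-rank partial sums: masked loss total and number of counted tokens."""
    num = sum(loss * m for loss, m in zip(losses, mask))
    den = sum(mask)
    return num, den


def all_reduce_sum(partials: List[Tuple[float, int]]) -> Tuple[float, int]:
    """Stand-in for AllReduce(SUM): element-wise sum over all ranks."""
    return sum(p[0] for p in partials), sum(p[1] for p in partials)


per_token_loss = [0.5, 1.5, 2.0, 4.0, 1.0, 3.0, 2.5, 0.5]  # made-up values
loss_mask = [1, 1, 0, 1, 1, 1, 0, 1]
cp_size = 4
chunk = len(per_token_loss) // cp_size

# Each simulated CP rank sees one contiguous slice of the sequence.
partials = [
    local_num_den(
        per_token_loss[r * chunk:(r + 1) * chunk],
        loss_mask[r * chunk:(r + 1) * chunk],
    )
    for r in range(cp_size)
]

num, den = all_reduce_sum(partials)
cp_loss = num / den

full_num, full_den = local_num_den(per_token_loss, loss_mask)
full_loss = full_num / full_den

# Averaging per-rank means weights ranks equally, not tokens, so it drifts.
naive_loss = sum(n / d for n, d in partials) / cp_size

print(cp_loss, full_loss, naive_loss)
```

Summing numerators and denominators separately reproduces the full-batch masked mean even when ranks hold different numbers of unmasked tokens, which is the property the CP=1 vs CP=4 comparison relies on.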

Default config (B=2, S=64, H=256, L=2, bf16):

| Test | CP=1 | CP=4 | abs diff | rel diff |
|---|---|---|---|---|
| BSHD loss | 7.03250265 | 7.03217983 | 3.23e-04 | 4.59e-05 |
| BSHD grad_norm | 4.84910854 | 4.84744710 | 1.66e-03 | 3.43e-04 |
| THD loss | 7.03250265 | 7.03241825 | 8.44e-05 | 1.20e-05 |
| THD grad_norm | 4.84910839 | 4.84912564 | 1.73e-05 | 3.56e-06 |
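
The two diff columns can be recomputed from the printed values. The sketch below assumes the relative diff is the absolute diff divided by the CP=1 value, which is consistent with the numbers in the table; because it starts from the rounded figures shown here, the last printed digit may differ slightly from the test's own output.

```python
from typing import Dict, Tuple


def abs_rel_diff(reference: float, other: float) -> Tuple[float, float]:
    """Absolute difference and difference relative to the reference value."""
    abs_diff = abs(reference - other)
    return abs_diff, abs_diff / abs(reference)


# (CP=1, CP=4) pairs copied from the table above.
rows: Dict[str, Tuple[float, float]] = {
    "BSHD loss": (7.03250265, 7.03217983),
    "BSHD grad_norm": (4.84910854, 4.84744710),
    "THD loss": (7.03250265, 7.03241825),
    "THD grad_norm": (4.84910839, 4.84912564),
}

for name, (cp1, cp4) in rows.items():
    abs_diff, rel_diff = abs_rel_diff(cp1, cp4)
    print(f"{name:<15} abs={abs_diff:.2e} rel={rel_diff:.2e}")
```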

Cross-check: BSHD CP=1 loss ≡ THD CP=1 loss = 7.03250265, and the BSHD and THD CP=1 grad_norm values (4.84910854 vs 4.84910839) agree to 6 decimal places. Equal-length sequences make the two attention paths mathematically identical, so the CP=1 grad_norm match confirms BSHD/THD parity at the gradient level as well.

Checkpoint conversion (HF → Megatron-FSDP DTensor)

The example consumes a Megatron-FSDP DTensor checkpoint, converted from the HuggingFace release via Megatron-Bridge.

Setup — clone Bridge and pin its 3rdparty/Megatron-LM submodule to this branch:

git clone --recurse-submodules https://github.com/NVIDIA-NeMo/Megatron-Bridge.git
cd Megatron-Bridge/3rdparty/Megatron-LM
git remote add wplf https://github.com/wplf/Megatron-LM.git
git fetch wplf feat/qwen35-vl-example
git checkout feat/qwen35-vl-example
cd ../..

Convert (single 8×H100 node, EP=8 / TP=CP=1; --hf-model can be any Qwen3.5 variant, e.g. Qwen/Qwen3.5-35B-A3B):

PYTHONPATH=./src:./3rdparty/Megatron-LM/ \
  torchrun --nproc_per_node=8 \
  examples/conversion/mfsdp/convert_checkpoints_fsdp.py import \
  --hf-model Qwen/Qwen3.5-35B-A3B \
  --megatron-path ${WORKSPACE}/models/Qwen/Qwen3.5-35B-A3B-fsdp \
  --ckpt-format fsdp_dtensor \
  --ep 8

HF weights are auto-fetched on first run via huggingface_hub. Adjust --tp / --cp / --ep to match the training topology (must satisfy WORLD_SIZE % (TP*CP*EP) == 0).
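
The divisibility constraint can be checked before launching. The helper below is a hypothetical pre-flight check (the name `is_valid_topology` is not part of the conversion script); it encodes only the condition stated above.

```python
def is_valid_topology(world_size: int, tp: int = 1, cp: int = 1, ep: int = 1) -> bool:
    """Return True when WORLD_SIZE % (TP * CP * EP) == 0."""
    if min(world_size, tp, cp, ep) < 1:
        raise ValueError("world_size, tp, cp and ep must all be positive")
    return world_size % (tp * cp * ep) == 0


# The conversion command above: one 8-GPU node with EP=8, TP=CP=1.
print(is_valid_topology(8, tp=1, cp=1, ep=8))

# Adding TP=2 on the same node needs a product of 16, which 8 ranks cannot satisfy.
print(is_valid_topology(8, tp=2, cp=1, ep=8))
```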

Output

${WORKSPACE}/models/Qwen/Qwen3.5-35B-A3B-fsdp/
├── iter_0000000/
│   ├── __0_0.distcp .. __7_0.distcp   # FSDP DTensor shards, one per rank (~18 GB each for 35B-A3B)
│   ├── .metadata
│   ├── run_config.yaml
│   └── train_state.pt
├── latest_checkpointed_iteration.txt
└── latest_train_state.pt

Bridge dependency — requires NVIDIA-NeMo/Megatron-Bridge#3987 (skip tokenizer save in convert_checkpoints_fsdp.py). Without that fix the checkpoint is still written correctly but the script exits non-zero after save with AttributeError: 'TokenizerConfig' object has no attribute 'make_vocab_size_divisible_by' against this branch's megatron.core.tokenizers.utils.build_tokenizer.

Risk

  • All new files under examples/multimodal_dev/.
  • data_samplers.py change is fully backwards-compatible: behavior is unchanged unless use_vanilla_collate_fn is explicitly set.

Test plan

  • pytest examples/multimodal_dev/tests/ passes.
  • scripts/run_qwen35_vl.sh proxy variant trains a few steps on mock data.
  • CORD-V2 dataset loads with --use-vanilla-collate-fn and trains a few steps.
  • torchrun --nproc-per-node 4 examples/multimodal_dev/tests/test_cp_thd_correctness.py — CP=1 vs CP=4 BSHD/THD loss + grad_norm within tolerance.

🤖 Generated with Claude Code

@copy-pr-bot

copy-pr-bot Bot commented May 12, 2026


This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@wplf wplf added the Run tests label May 12, 2026
@wplf wplf changed the title from "feat(examples/multimodal_dev): add Qwen3.5-VL training example" to "[dev] [5/5] Qwen3.5 support: Qwen3.5-VL training example" May 12, 2026
@wplf wplf force-pushed the feat/qwen35-vl-example branch from 0c608f0 to 9d9392a Compare May 13, 2026 03:10
@wplf wplf force-pushed the feat/qwen35-vl-example branch from 9d9392a to 62a7890 Compare May 13, 2026 10:24
@Victarry Victarry self-requested a review May 19, 2026 04:57
@wplf wplf marked this pull request as ready for review May 19, 2026 08:28
@wplf wplf requested review from a team as code owners May 19, 2026 08:28
Adds a standalone VLM training playground under
``examples/multimodal_dev/`` with Qwen3.5-VL end-to-end.

Highlights
- Model-agnostic entry point (``pretrain_multimodal.py``) with a
  ``MODEL_REGISTRY`` so adding a new architecture is just a registry
  entry plus a backing module.
- Qwen3.5-VL model: vision encoder, MRoPE, decoder, factory, specs,
  configurations covering proxy / 9B / 397B-A17B variants.
- Datasets: mock data and CORD-V2 VLM dataset, with THD pack/pad in the
  collate function.
- THD + CP support consolidated in ``forward_step.py`` and the model
  layer (uses MRoPE THD pre-computation and ``cu_seqlens_q_padded`` CP
  partitioning).
- Run script + README, plus tests for MRoPE parity, CP correctness, CP
  support, and THD correctness / e2e.
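
As a rough illustration of THD packing with padded cumulative lengths, the sketch below concatenates variable-length samples into one token stream and tracks both the real and the padded offsets. It is not the collate function from this PR: `pack_thd` is a hypothetical name, and padding each sequence to a multiple of `2 * cp_size` is an assumption about the per-sequence divisibility that CP partitioning needs.

```python
from typing import List, Tuple


def pack_thd(
    seqs: List[List[int]], pad_id: int, multiple: int
) -> Tuple[List[int], List[int], List[int]]:
    """Pack sequences back to back, padding each one up to `multiple` tokens.

    Returns the packed tokens, cumulative real lengths (cu_seqlens) and
    cumulative padded lengths (cu_seqlens_padded).
    """
    tokens: List[int] = []
    cu_seqlens = [0]
    cu_seqlens_padded = [0]
    for seq in seqs:
        padded_len = -(-len(seq) // multiple) * multiple  # round up to a multiple
        tokens.extend(seq + [pad_id] * (padded_len - len(seq)))
        cu_seqlens.append(cu_seqlens[-1] + len(seq))
        cu_seqlens_padded.append(cu_seqlens_padded[-1] + padded_len)
    return tokens, cu_seqlens, cu_seqlens_padded


cp_size = 2  # assumed context-parallel size for the example
tokens, cu, cu_padded = pack_thd(
    [[11, 12, 13], [21, 22, 23, 24, 25]], pad_id=0, multiple=2 * cp_size
)
print(tokens)
print(cu, cu_padded)
```

Keeping both offset lists lets attention ignore the pad tokens while every padded sequence can still be split evenly across CP ranks.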

Also gates the torch DataLoader vanilla-collate path on the new
``use_vanilla_collate_fn`` arg (one-line change to
``megatron/training/datasets/data_samplers.py``) so CORD-V2 works under
BSHD.

Functional dependency: the new model arch sets ``mrope_interleaved=True``
in its config and relies on the core MRoPE interleaved layout introduced
in a separate PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Co-Authored-By: BestJuly <19769279+BestJuly@users.noreply.github.com>
@Victarry Victarry force-pushed the feat/qwen35-vl-example branch from 62a7890 to 0915f61 Compare May 20, 2026 02:00
wplf and others added 2 commits May 20, 2026 07:27
… preprocessing

Fixes the following issues in vlm_dataset.py, found by review against Megatron-Bridge's
qwen2_5_collate_fn reference implementation.

- loss_mask off-by-one (Bug 1): the previous mask was built on input_ids
  while labels were shifted, dropping the image->text supervision signal
  at the boundary. Now masks structural tokens on the shifted labels and
  also shifts loss_mask itself left by 1.
- missing SFT prompt masking (Bug 2): user-turn and chat-template tokens
  were trained on. Now uses backward substring token search (mirroring
  create_multiturn_loss_mask_by_search) to unmask only the assistant
  answer span.
- seq_length not enforced (Bug 3): long CORD-V2 samples could overflow.
  Now end-truncates input_ids in __getitem__ with a warning.
- unsafe pad_token_id fallback (Bug 4): falling back to 0 silently masked
  a real vocab token. Now falls back to EOS and raises if neither is set.
- silent image_token_id miss (Bug 6): fallback could return None, causing
  dataset / model disagreement. Now raises ValueError.
- stale docstrings (Bug 8): updated Qwen2.5-VL / --image-size references
  to Qwen3.5-VL / --total-seq-length.
- narrow skipped_tokens set (Bug 14): vision_start/end, im_start/end,
  video_pad, endoftext were not masked on labels. Now uses
  tok.all_special_ids union {pad_id, image_token_id}.
- lost Qwen-VL dynamic resolution (Bugs 15/17/19): fixed-square resize
  removed; conversation content carries the image object;
  qwen_vl_utils.process_vision_info extracts images; processor is called
  with min_pixels / max_pixels.
- pixel_values bf16 conversion (Bug 18): moved from forward_step into the
  dataset so per-step dtype checks become no-ops.
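
Two of the fixes above (the shifted loss mask and the pad-token fallback) can be illustrated with a small sketch. This is not the code from the dataset module: the function names and token ids are hypothetical, and SFT prompt masking is deliberately left out.

```python
from typing import List, Optional, Set, Tuple


def resolve_pad_token_id(pad_token_id: Optional[int], eos_token_id: Optional[int]) -> int:
    """Prefer pad, fall back to EOS, and refuse to guess when neither is set."""
    if pad_token_id is not None:
        return pad_token_id
    if eos_token_id is not None:
        return eos_token_id
    raise ValueError("tokenizer defines neither pad_token_id nor eos_token_id")


def shift_labels_and_mask(
    input_ids: List[int], skipped_ids: Set[int], pad_id: int
) -> Tuple[List[int], List[int]]:
    """Next-token labels, with the loss mask computed on the shifted labels."""
    labels = input_ids[1:] + [pad_id]  # labels[i] is the token that follows input_ids[i]
    loss_mask = [0 if tok in skipped_ids else 1 for tok in labels]
    return labels, loss_mask


IMAGE_ID, PAD_ID = 900, 0  # made-up ids for illustration
input_ids = [IMAGE_ID, IMAGE_ID, 5, 6, 7]
labels, loss_mask = shift_labels_and_mask(input_ids, {IMAGE_ID, PAD_ID}, PAD_ID)

# A mask built on input_ids instead would zero position 1, where the input is
# the last image token and the target is the first text token.
mask_on_inputs = [0 if tok in {IMAGE_ID, PAD_ID} else 1 for tok in input_ids]

print(labels, loss_mask, mask_on_inputs)
```

In this toy case the label-based mask keeps the image-to-text boundary position and drops the trailing pad target, while the input-based mask does the opposite.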

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- raise --manual-gc-interval 5 → 50 to cut GC pause frequency on long runs.
- enable --moe-permute-fusion and --moe-router-fusion in the MoE branch
  (no-op for dense variants since MOE_ARGS is gated on NUM_EXPERTS>0).
- enable grad-accumulation fusion under FSDP by dropping
  --no-gradient-accumulation-fusion from FSDP_ARGS.
- add --log-timers-to-tensorboard and --log-params-norm to surface timer
  breakdown and parameter L2 norm in TB/wandb.
- drop the hardcoded CKPT_LOAD path from the in-script example invocations
  so the comment reflects from-scratch CP correctness runs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@wplf wplf force-pushed the feat/qwen35-vl-example branch from 36286c5 to 6d2e13c Compare May 20, 2026 15:39
Update the 'Copyright (c) 2025, NVIDIA CORPORATION' line to 2026 across
all newly-added Python files under examples/multimodal_dev/ for the
Qwen3.5-VL training example.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@Victarry

Contributor

While running the PR's examples/multimodal_dev/tests tests, I hit a collection failure on the original branch: test_thd_e2e.py and test_thd_correctness.py import _pack_batch from examples.multimodal_dev.forward_step, but forward_step.py does not define/export _pack_batch.

Could you add that helper or update the tests to use the intended packing API?

@BestJuly

Contributor

Please add the latest checkpoint conversion guide so users can resume from an HF checkpoint and run. We previously recorded the steps in this issue, and there should be some updates by now.

@wplf

wplf commented May 25, 2026

Member Author

OK, I'm working on the UT of CP now. Checkpoint conversion guide may be finished this afternoon.

@wplf

wplf commented May 25, 2026

Member Author

/ok to test 89ccadf

Document the HF -> Megatron-FSDP DTensor conversion path needed before
pretraining from pretrained weights: setup (clone Bridge, pin its
3rdparty/Megatron-LM submodule to this branch), the `torchrun
convert_checkpoints_fsdp.py import` command with EP=8 default topology,
expected output layout, and the open Bridge dependency
(NVIDIA-NeMo/Megatron-Bridge#3987) to skip the post-save tokenizer
build that otherwise crashes on this branch.
@wplf

wplf commented May 27, 2026

Member Author

/ok to test 334d4e1

@cryoco

cryoco commented May 27, 2026


May I ask which TE version the Qwen3.5 support requires?

@wplf

wplf commented May 27, 2026

Member Author

FYI, the TE dependency is fairly loose.

I'm using the latest TE with a cherry-pick from TE PR 2932 on gb.
If you're using Hopper, please ensure your cuDNN version is greater than 9.19.0.
Using Hopper with fused attention + THD + cuDNN below 9.19.0 will cause NaN issues during THD training.

@cryoco

cryoco commented May 27, 2026


Thanks. BTW, are there any performance benchmarks on Hopper or Blackwell yet?

@wplf

wplf commented May 27, 2026

Member Author

We are still working on it.

@Victarry Victarry left a comment

Contributor

Thanks for this PR!
I reproduced the E2E pipeline from weight conversion to model training, and it worked well. Left a few comments.

Comment thread examples/multimodal_dev/data/cord_v2.py
Comment thread examples/multimodal_dev/scripts/run_qwen35_vl.sh
Comment thread examples/multimodal_dev/scripts/run_qwen35_vl.sh
Comment thread examples/multimodal_dev/scripts/run_qwen35_vl.sh Outdated
Comment thread examples/multimodal_dev/models/qwen35_vl/vision_encoder.py Outdated
Comment thread examples/multimodal_dev/models/qwen35_vl/mrope.py
wplf added a commit to wplf/Megatron-LM that referenced this pull request May 28, 2026
Resolves the inline comments from @Victarry's PR review on NVIDIA#4751.

* vision_encoder.py — patch merger GELU was `approximate='tanh'` while
  the in-code NOTE acknowledged HF uses `approximate='none'`. Switched
  to `approximate='none'` to match the official Qwen3VLVisionPatchMerger
  numerics for HF -> Megatron checkpoint parity.

* pretrain_multimodal.py — added an explicit guard against
  `--pipeline-model-parallel-size > 1`. The model_provider builds the
  full model on every rank and ignores pre_process / post_process
  stage flags, so PP>1 would silently break Megatron's pipeline-parallel
  contract. Fail fast instead.

* scripts/run_qwen35_vl.sh — three fixes:
    1. `EP` now defaults to 1 (was 2). MoE variants must opt in via
       the environment override.
    2. After the variant case block, fail fast if
       `NUM_EXPERTS=0 && EP>1` so a dense run such as
       `MODEL_VARIANT=9b ./run_qwen35_vl.sh` no longer trips Megatron's
       arg validation downstream.
    3. `--moe-router-force-load-balancing` was unconditionally added to
       GPT_MODEL_ARGS (and therefore enabled even when no MoE args
       were emitted). It is now gated behind `FORCE_LOAD_BALANCING=1`,
       defaults off, and is appended to MOE_ARGS only when MoE is
       active. Real finetuning runs no longer freeze router routing
       decisions by default.

* data/{vlm_dataset.py -> cord_v2.py} + models/__init__.py — renamed
  the CORD-V2-specific module from the generic-sounding
  `vlm_dataset.py` to `cord_v2.py`, updated the model registry path
  string accordingly, and added an "Adding another VLM dataset" section
  to the module docstring documenting the per-dataset module +
  `MODEL_REGISTRY["..."]["dataset_providers"]` registration pattern.

* models/qwen35_vl/mrope.py — added a performance note on the
  `_build_sample_mrope_positions` helper documenting the
  `.tolist()` / `.item()` GPU<->CPU sync points and CUDA-graph
  incompatibility, and the precompute-in-collate / cache-by-shape
  follow-up plan. Behavior preserved here pending a follow-up data
  pipeline change.

The other tests-import comment (test_thd_*.py importing `_pack_batch`)
is already addressed on this branch: the helper is now named
`pack_or_pad_batch` and the tests import that symbol.
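
The fail-fast checks described in this commit can be sketched in Python as below. This is only an illustration of the logic: the real guards live in pretrain_multimodal.py and in the bash run script, the function names here are hypothetical, and `--num-experts` is included as an assumed part of the MoE argument list.

```python
from typing import List


def validate_parallel_config(pp_size: int, ep_size: int, num_experts: int) -> None:
    """Reject configurations the example cannot run correctly."""
    if pp_size > 1:
        raise ValueError(
            "pipeline-model-parallel-size > 1 is not supported: "
            "the model provider builds the full model on every rank"
        )
    if num_experts == 0 and ep_size > 1:
        raise ValueError("EP > 1 requires an MoE variant (NUM_EXPERTS > 0)")


def build_moe_args(num_experts: int, force_load_balancing: bool = False) -> List[str]:
    """Emit MoE flags only for MoE variants; forced balancing is opt-in."""
    if num_experts == 0:
        return []
    moe_args = ["--num-experts", str(num_experts)]
    if force_load_balancing:
        moe_args.append("--moe-router-force-load-balancing")
    return moe_args


validate_parallel_config(pp_size=1, ep_size=1, num_experts=0)  # dense run: accepted
print(build_moe_args(num_experts=0, force_load_balancing=True))
print(build_moe_args(num_experts=8, force_load_balancing=True))
```

Raising before launch turns a confusing downstream argument-validation failure into an immediate, explicit error.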
wplf added 2 commits May 27, 2026 20:35
New test ``tests/test_vision_patch_merger_parity.py`` verifies the
Megatron patch merger against an inlined verbatim copy of
HuggingFace ``Qwen3VLVisionPatchMerger`` (``use_postshuffle_norm=False``
branch from ``transformers/src/transformers/models/qwen3_vl/modeling_qwen3_vl.py``).
The HF reference is inlined so the test has no runtime dependency on
the ``transformers`` package.

The test copies HF state-dict tensors into the Megatron module (TP=1,
1:1 mapping), runs both on the same random input, and asserts
``torch.testing.assert_close`` on the logits in fp32 and bf16:

  [torch.float32] shape=(16, 3584) max_abs_diff=2.551e-05 (atol=1e-4)
  [torch.bfloat16] shape=(16, 3584) max_abs_diff=3.906e-03 (atol=5e-2)

The fp32 residual is structural (TE LayerNorm vs nn.LayerNorm use
different fused reduction orders) and the bf16 figure is at the
arithmetic floor for a two-layer MLP. This pins the GELU
``approximate='none'`` fix (commit 8aace7b) against future
regressions.

Run with::

    torchrun --nproc_per_node=1 \
        examples/multimodal_dev/tests/test_vision_patch_merger_parity.py
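
For context on why the GELU variant matters for parity, the standalone sketch below compares the exact (erf-based) GELU, which corresponds to `approximate='none'`, with the tanh approximation. It uses only the standard library and scalar inputs, so it shows the size of the per-element gap rather than reproducing the parity test.

```python
import math


def gelu_exact(x: float) -> float:
    """GELU via the Gaussian CDF (the approximate='none' form)."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_tanh(x: float) -> float:
    """GELU via the tanh approximation (the approximate='tanh' form)."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


# Scan a grid and report the largest per-element disagreement.
grid = [i / 100.0 for i in range(-500, 501)]
max_gap = max(abs(gelu_exact(x) - gelu_tanh(x)) for x in grid)
print(f"max |exact - tanh| on [-5, 5]: {max_gap:.3e}")
```

The gap is small but not zero, so a module using the tanh form cannot match a reference that uses the exact form bit for bit.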
@wplf wplf force-pushed the feat/qwen35-vl-example branch from 3248b34 to 4dfdd06 Compare May 28, 2026 03:36
@wplf

wplf commented May 28, 2026

Member Author

/ok to test df54ae7

@copy-pr-bot

copy-pr-bot Bot commented May 28, 2026


/ok to test df54ae7

@wplf, there was an error processing your request: E2

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/2/

@wplf

wplf commented May 28, 2026

Member Author

/ok to test 4dfdd06

@Victarry Victarry added this pull request to the merge queue May 29, 2026
@svcnvidia-nemo-ci


🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/26629578247

Merged via the queue into NVIDIA:dev with commit 58f3e67 May 29, 2026
184 of 185 checks passed
wplf added a commit to wplf/Megatron-LM that referenced this pull request Jun 4, 2026
…shadow PR

Adopt the merged dev [5/5] shadow PR NVIDIA#4751 (commit 58f3e67) verbatim for
examples/multimodal_dev/ — it carries newer bug fixes:
- replace data/vlm_dataset.py with data/cord_v2.py
- add tests/_helpers.py, tests/test_cp_thd_correctness.py,
  tests/test_vision_patch_merger_parity.py
- sync the remaining 23 example files to NVIDIA#4751's content

data_samplers.py is intentionally NOT changed to match NVIDIA#4751: main uses
args.hybrid_context_parallel whereas dev uses args.dynamic_context_parallel
(the arg was renamed across branches), so NVIDIA#4756's existing line is the correct
main adaptation.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
wplf added a commit to wplf/Megatron-LM that referenced this pull request Jun 12, 2026
Squash of the fused-mRoPE work (Add Qwen3.5 MRoPE fusion benchmark
support; Fix THD mRoPE CP fallback consistency; mRoPE THD review
cleanup; enforce per-sequence CP divisibility on the fused THD launch
path; unit-test coverage for real Qwen3.5-VL shapes).

Adds a fused mRoPE kernel (megatron/core/fusions/fused_mrope.py) with an
is_fused_mrope_available() gate, raw-mrope-freqs plumbing through
rope_utils / rotary_pos_embedding / gpt_model / attention, the
transformer_config + arguments toggles, and tests/unit_tests/fusions/
test_fused_mrope.py. Core only: the examples/multimodal_dev integration
is intentionally dropped because that example is already upstream (NVIDIA#4751)
and has diverged from this branch's copy.

Co-Authored-By: Li Tao <litao@nvidia.com>