[dev] [5/5] Qwen3.5 support: Qwen3.5-VL training example by wplf · Pull Request #4751 · NVIDIA/Megatron-LM

wplf · 2026-05-12T06:56:56Z

Qwen3.5 support series

This is part of a 5-PR series adding Qwen3.5-VL support, split for review clarity.

Dev PRs (this series):

[1/5] MTP packed-seq CP+THD fix — [Dev] fix(mtp): use padded cu_seqlens in MTP roll for THD with CP #4494
[2/5] FSDP DTensor Bridge checkpoint compatibility — [dev] [2/5] Qwen3.5 support: FSDP DTensor Bridge checkpoint compatibility #4748
[3/5] SharedExpertMLP meta init — [dev] [3/5] Qwen3.5 support: SharedExpertMLP meta init #4749
[4/5] Interleaved MRoPE layout — [dev] [4/5] Qwen3.5 support: Interleaved MRoPE layout #4750
[5/5] Qwen3.5-VL training example — [dev] [5/5] Qwen3.5 support: Qwen3.5-VL training example #4751 ← this PR

Main PRs (corresponding mirrors):

Summary

Adds a standalone VLM training playground under examples/multimodal_dev/ with Qwen3.5-VL end-to-end.

Model-agnostic harness

pretrain_multimodal.py entry point and MODEL_REGISTRY so a new architecture is just a registry entry + backing module.
models/base.py, forward_step.py, arguments.py, data/ (mock + CORD-V2 dataset with THD pack/pad in collate).
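
The registry idea above can be sketched as follows. This is a minimal illustration, not the PR's code: the entry keys (`model_provider`, `dataset_providers`), the `module:attr` path-string convention, and the `resolve` / `get_model_provider` helpers are assumptions, and standard-library targets stand in for real model modules.

```python
import importlib

# Hypothetical registry layout: each architecture maps to "module:attribute"
# path strings that are resolved lazily, so adding a model is one new entry.
MODEL_REGISTRY = {
    "toy_model": {
        # Stand-ins from the standard library; a real entry would point at
        # the backing module of the architecture.
        "model_provider": "collections:OrderedDict",
        "dataset_providers": {"mock": "itertools:count"},
    },
}


def resolve(path: str):
    """Import 'package.module:attr' and return the named attribute."""
    module_name, _, attr = path.partition(":")
    if not module_name or not attr:
        raise ValueError(f"expected 'module:attr', got {path!r}")
    return getattr(importlib.import_module(module_name), attr)


def get_model_provider(name: str):
    """Look up an architecture by name and resolve its provider."""
    if name not in MODEL_REGISTRY:
        known = ", ".join(sorted(MODEL_REGISTRY))
        raise KeyError(f"unknown model {name!r}; registered: {known}")
    return resolve(MODEL_REGISTRY[name]["model_provider"])


provider = get_model_provider("toy_model")
```

Resolving path strings lazily keeps the entry point free of imports for architectures that are not selected on the command line.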

Qwen3.5-VL

Full model: models/qwen35_vl/ — vision encoder, MRoPE (pre-computed for THD), decoder, factory, specs, configurations for proxy / 9B / 397B-A17B variants.
Run script + README, plus tests: tests/test_mrope_parity.py, test_cp_correctness.py, test_cp_support.py, test_cp_thd_correctness.py, test_thd_correctness.py, test_thd_e2e.py.

One-line training infra change

megatron/training/datasets/data_samplers.py: enable the vanilla-collate torch DataLoader path when the new arg use_vanilla_collate_fn is set (needed for CORD-V2 under BSHD).

Dependency

This example sets mrope_interleaved=True in its TransformerConfig and relies on the core MRoPE interleaved layout introduced in #4750. The diff here is self-contained (only examples/ + the 1-line data_samplers.py change), but the example won't run end-to-end until #4750 merges.

Functionality support

Resume training loss curve

CP / THD correctness verification

tests/test_cp_thd_correctness.py runs CP=1 and CP=4 in a single torchrun --nproc-per-node 4 invocation (in-process destroy_model_parallel + initialize_model_parallel between phases, weights pinned by a state_dict snapshot, identical inputs via a seeded torch.Generator). Loss aggregated via AllReduce(SUM) on (num, den); grad_norm aggregated via AllReduce(SUM) of gradients on the CP group then divided by cp_size, so each rank holds the CP-mean gradient that matches CP=1's backward on the full-batch mean loss.
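
To illustrate why the loss is reduced as a (num, den) pair rather than as per-rank means, here is a small pure-Python simulation. It is a sketch under assumptions: the `all_reduce_sum` function only mimics AllReduce(SUM) on a list of per-rank values, and the toy losses and masks are made up.

```python
def local_num_den(token_losses, loss_mask):
    """Per-rank partial sums: masked loss total and count of supervised tokens."""
    num = sum(loss * mask for loss, mask in zip(token_losses, loss_mask))
    den = sum(loss_mask)
    return num, den


def all_reduce_sum(pairs):
    """Stand-in for AllReduce(SUM) over the CP group (no real communication)."""
    return sum(p[0] for p in pairs), sum(p[1] for p in pairs)


# Toy full-sequence data, then split across two simulated CP ranks.
full_losses = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]
full_mask = [1, 1, 1, 0, 1, 0]
shards = [
    (full_losses[0:3], full_mask[0:3]),
    (full_losses[3:6], full_mask[3:6]),
]

pairs = [local_num_den(losses, mask) for losses, mask in shards]
num, den = all_reduce_sum(pairs)
cp_loss = num / den

# Reference: the same quantity computed on the unsplit sequence (CP=1).
ref_num, ref_den = local_num_den(full_losses, full_mask)
ref_loss = ref_num / ref_den

# Averaging per-rank means weights ranks equally and gives a different value
# whenever ranks hold different numbers of supervised tokens.
mean_of_means = sum(n / d for n, d in pairs) / len(pairs)

print(cp_loss, ref_loss, mean_of_means)
```

Here the summed pair reproduces the CP=1 value, while the mean of per-rank means does not, because the two shards carry three and one supervised tokens respectively.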

Default config (B=2, S=64, H=256, L=2, bf16):

| Test | CP=1 | CP=4 | abs diff | rel diff |
| --- | --- | --- | --- | --- |
| BSHD loss | 7.03250265 | 7.03217983 | 3.23e-04 | 4.59e-05 |
| BSHD grad_norm | 4.84910854 | 4.84744710 | 1.66e-03 | 3.43e-04 |
| THD loss | 7.03250265 | 7.03241825 | 8.44e-05 | 1.20e-05 |
| THD grad_norm | 4.84910839 | 4.84912564 | 1.73e-05 | 3.56e-06 |

Cross-check: BSHD CP=1 loss ≡ THD CP=1 loss = 7.03250265, and BSHD CP=1 grad_norm ≈ THD CP=1 grad_norm to 6 decimals (4.84910854 vs 4.84910839, a difference of about 1.5e-07). Equal-length sequences make the two attention paths mathematically identical, so the CP=1 grad_norm match confirms BSHD/THD parity at the gradient level as well.
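
The abs diff and rel diff columns can be recomputed from the CP=1 and CP=4 values reported above. This is a small sketch; taking the CP=1 value as the reference for the relative difference is an assumption, since the PR description does not state which value is the denominator.

```python
# CP=1 and CP=4 values copied from the results reported above.
rows = {
    "BSHD loss": (7.03250265, 7.03217983),
    "BSHD grad_norm": (4.84910854, 4.84744710),
    "THD loss": (7.03250265, 7.03241825),
    "THD grad_norm": (4.84910839, 4.84912564),
}


def diffs(cp1: float, cp4: float):
    """Absolute difference and difference relative to the CP=1 value."""
    abs_diff = abs(cp1 - cp4)
    return abs_diff, abs_diff / abs(cp1)


results = {name: diffs(cp1, cp4) for name, (cp1, cp4) in rows.items()}
for name, (abs_diff, rel_diff) in results.items():
    print(f"{name:<15} abs={abs_diff:.2e} rel={rel_diff:.2e}")

# BSHD vs THD at CP=1: how far apart the two grad_norm values are.
cross_check = abs(rows["BSHD grad_norm"][0] - rows["THD grad_norm"][0])
print(f"CP=1 BSHD vs THD grad_norm gap: {cross_check:.1e}")
```

The printed values should agree with the reported columns up to rounding of the last digit.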

Checkpoint conversion (HF → Megatron-FSDP DTensor)

The example consumes a Megatron-FSDP DTensor checkpoint, converted from the HuggingFace release via Megatron-Bridge.

Setup — clone Bridge and pin its 3rdparty/Megatron-LM submodule to this branch:

git clone --recurse-submodules https://github.com/NVIDIA-NeMo/Megatron-Bridge.git
cd Megatron-Bridge/3rdparty/Megatron-LM
git remote add wplf https://github.com/wplf/Megatron-LM.git
git fetch wplf feat/qwen35-vl-example
git checkout feat/qwen35-vl-example
cd ../..

Convert (single 8×H100 node, EP=8 / TP=CP=1; --hf-model can be any Qwen3.5 variant, e.g. Qwen/Qwen3.5-35B-A3B):

PYTHONPATH=./src:./3rdparty/Megatron-LM/ \
  torchrun --nproc_per_node=8 \
  examples/conversion/mfsdp/convert_checkpoints_fsdp.py import \
  --hf-model Qwen/Qwen3.5-35B-A3B \
  --megatron-path ${WORKSPACE}/models/Qwen/Qwen3.5-35B-A3B-fsdp \
  --ckpt-format fsdp_dtensor \
  --ep 8

HF weights are auto-fetched on first run via huggingface_hub. Adjust --tp / --cp / --ep to match the training topology (must satisfy WORLD_SIZE % (TP*CP*EP) == 0).
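
The divisibility constraint above can be checked before launching a conversion. The sketch below is illustrative only: `check_topology` is a hypothetical helper, not part of the conversion script, and it only encodes the stated rule WORLD_SIZE % (TP*CP*EP) == 0.

```python
def check_topology(world_size: int, tp: int = 1, cp: int = 1, ep: int = 1) -> int:
    """Return WORLD_SIZE // (TP*CP*EP), or raise if the rule is violated."""
    for name, value in (("world_size", world_size), ("tp", tp), ("cp", cp), ("ep", ep)):
        if value < 1:
            raise ValueError(f"{name} must be >= 1, got {value}")
    product = tp * cp * ep
    if world_size % product != 0:
        raise ValueError(
            f"WORLD_SIZE={world_size} is not divisible by TP*CP*EP={product}"
        )
    return world_size // product


# The topology used in the conversion command above: 8 ranks, EP=8, TP=CP=1.
remaining = check_topology(world_size=8, tp=1, cp=1, ep=8)
print(remaining)
```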

Output

${WORKSPACE}/models/Qwen/Qwen3.5-35B-A3B-fsdp/
├── iter_0000000/
│   ├── __0_0.distcp .. __7_0.distcp   # FSDP DTensor shards, one per rank (~18 GB each for 35B-A3B)
│   ├── .metadata
│   ├── run_config.yaml
│   └── train_state.pt
├── latest_checkpointed_iteration.txt
└── latest_train_state.pt
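
A quick way to sanity-check a converted checkpoint against the layout above is to list what is missing. This is a hedged sketch: `missing_checkpoint_files` is a hypothetical helper, and the shard naming pattern `__{rank}_0.distcp` is inferred from the listing rather than from the converter's documentation.

```python
from pathlib import Path


def missing_checkpoint_files(root, iteration: int = 0, num_ranks: int = 8):
    """Return expected-but-absent paths, relative to the checkpoint root."""
    root = Path(root)
    iter_dir = root / f"iter_{iteration:07d}"
    # One shard per rank; the naming pattern is an assumption based on the listing.
    expected = [iter_dir / f"__{rank}_0.distcp" for rank in range(num_ranks)]
    expected += [
        iter_dir / ".metadata",
        iter_dir / "run_config.yaml",
        iter_dir / "train_state.pt",
        root / "latest_checkpointed_iteration.txt",
        root / "latest_train_state.pt",
    ]
    return [p.relative_to(root).as_posix() for p in expected if not p.exists()]
```

An empty list means every file from the listing is present; it says nothing about whether the shard contents are valid.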

Bridge dependency — requires NVIDIA-NeMo/Megatron-Bridge#3987 (skip tokenizer save in convert_checkpoints_fsdp.py). Without that fix the checkpoint is still written correctly but the script exits non-zero after save with AttributeError: 'TokenizerConfig' object has no attribute 'make_vocab_size_divisible_by' against this branch's megatron.core.tokenizers.utils.build_tokenizer.

Risk

All new files under examples/multimodal_dev/.
data_samplers.py change is fully backwards-compatible: behavior is unchanged unless use_vanilla_collate_fn is explicitly set.

Test plan

pytest examples/multimodal_dev/tests/ passes.
scripts/run_qwen35_vl.sh proxy variant trains a few steps on mock data.
CORD-V2 dataset loads with --use-vanilla-collate-fn and trains a few steps.
torchrun --nproc-per-node 4 examples/multimodal_dev/tests/test_cp_thd_correctness.py — CP=1 vs CP=4 BSHD/THD loss + grad_norm within tolerance.

🤖 Generated with Claude Code

copy-pr-bot · 2026-05-12T06:57:00Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Adds a standalone VLM training playground under ``examples/multimodal_dev/`` with Qwen3.5-VL end-to-end.

Highlights
- Model-agnostic entry point (``pretrain_multimodal.py``) with a ``MODEL_REGISTRY`` so adding a new architecture is just a registry entry plus a backing module.
- Qwen3.5-VL model: vision encoder, MRoPE, decoder, factory, specs, configurations covering proxy / 9B / 397B-A17B variants.
- Datasets: mock data and CORD-V2 VLM dataset, with THD pack/pad in the collate function.
- THD + CP support consolidated in ``forward_step.py`` and the model layer (uses MRoPE THD pre-computation and ``cu_seqlens_q_padded`` CP partitioning).
- Run script + README, plus tests for MRoPE parity, CP correctness, CP support, and THD correctness / e2e.

Also gates the torch DataLoader vanilla-collate path on the new ``use_vanilla_collate_fn`` arg (one-line change to ``megatron/training/datasets/data_samplers.py``) so CORD-V2 works under BSHD.

Functional dependency: the new model arch sets ``mrope_interleaved=True`` in its config and relies on the core MRoPE interleaved layout introduced in a separate PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: BestJuly <19769279+BestJuly@users.noreply.github.com>
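
The relationship between packed sequence offsets and their padded counterparts can be sketched in plain Python. This is an illustration, not the example's collate code: `build_cu_seqlens` is a hypothetical helper, and padding each sequence to a multiple of 2*cp_size when CP > 1 is an assumption about the CP partitioning requirement.

```python
from itertools import accumulate


def build_cu_seqlens(seq_lens, cp_size: int = 1):
    """Return (cu_seqlens, cu_seqlens_padded) for sequences packed back to back."""
    # Assumption: with CP > 1 each sequence is padded to a multiple of
    # 2 * cp_size so it can be split evenly across ranks; no padding for CP=1.
    multiple = 2 * cp_size if cp_size > 1 else 1
    padded_lens = [-(-n // multiple) * multiple for n in seq_lens]  # ceil to multiple
    cu_seqlens = [0] + list(accumulate(seq_lens))
    cu_seqlens_padded = [0] + list(accumulate(padded_lens))
    return cu_seqlens, cu_seqlens_padded


cu, cu_padded = build_cu_seqlens([5, 12, 7], cp_size=4)
print(cu)         # offsets of the real tokens
print(cu_padded)  # offsets of each sequence's slot in the packed buffer
```

With CP=4 the three sequences occupy slots of 8, 16, and 8 tokens, so the padded offsets diverge from the real ones after the first sequence.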

… preprocessing

Fixes 8 issues in vlm_dataset.py found by review against Megatron-Bridge's qwen2_5_collate_fn reference implementation.

- loss_mask off-by-one (Bug 1): the previous mask was built on input_ids while labels were shifted, dropping the image->text supervision signal at the boundary. Now masks structural tokens on the shifted labels and also shifts loss_mask itself left by 1.
- missing SFT prompt masking (Bug 2): user-turn and chat-template tokens were trained on. Now uses backward substring token search (mirroring create_multiturn_loss_mask_by_search) to unmask only the assistant answer span.
- seq_length not enforced (Bug 3): long CORD-V2 samples could overflow. Now end-truncates input_ids in __getitem__ with a warning.
- unsafe pad_token_id fallback (Bug 4): falling back to 0 silently masked a real vocab token. Now falls back to EOS and raises if neither is set.
- silent image_token_id miss (Bug 6): fallback could return None, causing dataset / model disagreement. Now raises ValueError.
- stale docstrings (Bug 8): updated Qwen2.5-VL / --image-size references to Qwen3.5-VL / --total-seq-length.
- narrow skipped_tokens set (Bug 14): vision_start/end, im_start/end, video_pad, endoftext were not masked on labels. Now uses tok.all_special_ids union {pad_id, image_token_id}.
- lost Qwen-VL dynamic resolution (Bugs 15/17/19): fixed-square resize removed; conversation content carries the image object; qwen_vl_utils.process_vision_info extracts images; processor is called with min_pixels / max_pixels.
- pixel_values bf16 conversion (Bug 18): moved from forward_step into the dataset so per-step dtype checks become no-ops.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
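
The loss_mask off-by-one (Bug 1) can be illustrated with toy token ids. This is a sketch of the general technique, not the dataset's code: the token ids, the `IGNORE_INDEX` value, and the `build_labels_and_mask` helper are all assumptions.

```python
IGNORE_INDEX = -100  # conventional "do not supervise" label; an assumption here


def build_labels_and_mask(input_ids, skipped_ids, pad_id):
    """Next-token labels plus a loss mask aligned with those labels."""
    # Position t predicts token t+1, so labels are input_ids shifted left by one.
    labels = input_ids[1:] + [pad_id]
    # Decide supervision from the label at each position, not from the input.
    loss_mask = [0 if tok in skipped_ids or tok == pad_id else 1 for tok in labels]
    labels = [tok if keep else IGNORE_INDEX for tok, keep in zip(labels, loss_mask)]
    return labels, loss_mask


IMG, PAD = 900, 0  # made-up ids for an image placeholder and padding
input_ids = [IMG, IMG, 11, 12, 13]

labels, loss_mask = build_labels_and_mask(input_ids, skipped_ids={IMG}, pad_id=PAD)

# The buggy variant masks on input_ids while labels are already shifted.
buggy_mask = [0 if tok == IMG else 1 for tok in input_ids]

print(labels)      # supervised targets after shifting
print(loss_mask)   # mask aligned with the shifted labels
print(buggy_mask)  # mask misaligned by one position
```

At position 1 the input is an image token but the target is the first text token. The aligned mask keeps that position, while the input-based mask drops it and instead supervises the final position, whose target is padding.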

- raise --manual-gc-interval 5 → 50 to cut GC pause frequency on long runs.
- enable --moe-permute-fusion and --moe-router-fusion in the MoE branch (no-op for dense variants since MOE_ARGS is gated on NUM_EXPERTS>0).
- enable grad-accumulation fusion under FSDP by dropping --no-gradient-accumulation-fusion from FSDP_ARGS.
- add --log-timers-to-tensorboard and --log-params-norm to surface timer breakdown and parameter L2 norm in TB/wandb.
- drop the hardcoded CKPT_LOAD path from the in-script example invocations so the comment reflects from-scratch CP correctness runs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Update the 'Copyright (c) 2025, NVIDIA CORPORATION' line to 2026 across all newly-added Python files under examples/multimodal_dev/ for the Qwen3.5-VL training example. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Victarry · 2026-05-24T06:54:57Z

While running the PR's examples/multimodal_dev/tests tests, I hit a collection failure on the original branch: test_thd_e2e.py and test_thd_correctness.py import _pack_batch from examples.multimodal_dev.forward_step, but forward_step.py does not define/export _pack_batch.

Could you add that helper or update the tests to use the intended packing API?

BestJuly · 2026-05-25T04:02:00Z

Please add the latest checkpoint conversion guide so users can resume from an HF checkpoint and run. Previously we recorded the steps in this issue, and there should be some updates by now.

wplf · 2026-05-25T04:06:20Z

OK, I'm working on the CP unit tests now. The checkpoint conversion guide may be finished this afternoon.

wplf · 2026-05-25T08:37:34Z

/ok to test 89ccadf

Document the HF -> Megatron-FSDP DTensor conversion path needed before pretraining from pretrained weights: setup (clone Bridge, pin its 3rdparty/Megatron-LM submodule to this branch), the `torchrun convert_checkpoints_fsdp.py import` command with EP=8 default topology, expected output layout, and the open Bridge dependency (NVIDIA-NeMo/Megatron-Bridge#3987) to skip the post-save tokenizer build that otherwise crashes on this branch.

wplf · 2026-05-27T06:23:33Z

/ok to test 334d4e1

cryoco · 2026-05-27T07:30:13Z

May I ask which TE version the Qwen3.5 support corresponds to?

wplf · 2026-05-27T07:40:27Z

FYI, the TE dependency is fairly loose.

I'm using the latest TE with a cherry-pick of TE PR 2932 on gb.
If you're using Hopper, please make sure your cuDNN version is greater than 9.19.0.
Hopper with fused attention + THD and cuDNN below 9.19.0 causes NaN issues during THD training.
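
The cuDNN caveat above could be turned into a pre-flight check along these lines. This is a sketch with hypothetical helper names; it flags only versions below 9.19.0, since the comment does not settle whether 9.19.0 itself is safe, and the caller is assumed to supply the version string and GPU family.

```python
KNOWN_BAD_BELOW = (9, 19, 0)  # threshold taken from the comment above


def parse_version(text: str):
    """Turn 'major.minor.patch' into a 3-tuple of ints, padding missing parts."""
    parts = text.strip().split(".")
    parts = (parts + ["0", "0"])[:3]
    return tuple(int(p) for p in parts)


def thd_fused_attention_at_risk(cudnn_version: str, is_hopper: bool) -> bool:
    """True when the reported NaN condition (Hopper, cuDNN < 9.19.0) applies."""
    return is_hopper and parse_version(cudnn_version) < KNOWN_BAD_BELOW


for version in ("9.18.1", "9.19.0", "9.20"):
    print(version, thd_fused_attention_at_risk(version, is_hopper=True))
```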

cryoco · 2026-05-27T11:21:53Z

Thanks. BTW, is there any current performance benchmark on Hopper or Blackwell?

wplf · 2026-05-27T12:22:03Z

We are still working on it.

Victarry

Thanks for this PR!
I reproduced the E2E pipeline from weight conversion to model training, and it worked well. I left a few comments.

@Victarry

Resolves the inline comments from @Victarry's PR review on NVIDIA#4751.

* vision_encoder.py: patch merger GELU was `approximate='tanh'` while the in-code NOTE acknowledged HF uses `approximate='none'`. Switched to `approximate='none'` to match the official Qwen3VLVisionPatchMerger numerics for HF -> Megatron checkpoint parity.
* pretrain_multimodal.py: added an explicit guard against `--pipeline-model-parallel-size > 1`. The model_provider builds the full model on every rank and ignores pre_process / post_process stage flags, so PP>1 would silently break Megatron's pipeline-parallel contract. Fail fast instead.
* scripts/run_qwen35_vl.sh: three fixes:
  1. `EP` now defaults to 1 (was 2). MoE variants must opt in via the environment override.
  2. After the variant case block, fail fast if `NUM_EXPERTS=0 && EP>1` so a dense run such as `MODEL_VARIANT=9b ./run_qwen35_vl.sh` no longer trips Megatron's arg validation downstream.
  3. `--moe-router-force-load-balancing` was unconditionally added to GPT_MODEL_ARGS (and therefore enabled even when no MoE args were emitted). It is now gated behind `FORCE_LOAD_BALANCING=1`, defaults off, and is appended to MOE_ARGS only when MoE is active. Real finetuning runs no longer freeze router routing decisions by default.
* data/{vlm_dataset.py -> cord_v2.py} + models/__init__.py: renamed the CORD-V2-specific module from the generic-sounding `vlm_dataset.py` to `cord_v2.py`, updated the model registry path string accordingly, and added an "Adding another VLM dataset" section to the module docstring documenting the per-dataset module + `MODEL_REGISTRY["..."]["dataset_providers"]` registration pattern.
* models/qwen35_vl/mrope.py: added a performance note on the `_build_sample_mrope_positions` helper documenting the `.tolist()` / `.item()` GPU<->CPU sync points and CUDA-graph incompatibility, and the precompute-in-collate / cache-by-shape follow-up plan. Behavior preserved here pending a follow-up data pipeline change.

The other tests-import comment (test_thd_*.py importing `_pack_batch`) is already addressed on this branch: the helper is now named `pack_or_pad_batch` and the tests import that symbol.
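
The two fail-fast guards described above (PP>1, and a dense model with EP>1) amount to simple argument checks. The sketch below restates them in Python for illustration; `validate_parallelism` and its parameter names are hypothetical, and the real checks live in `pretrain_multimodal.py` and the shell run script respectively.

```python
def validate_parallelism(pipeline_parallel_size: int,
                         expert_parallel_size: int,
                         num_experts: int) -> None:
    """Raise early on configurations the example does not support."""
    if pipeline_parallel_size > 1:
        # The model provider builds the full model on every rank, so
        # pipeline stages would not be split correctly.
        raise ValueError(
            f"pipeline parallelism is not supported (got PP={pipeline_parallel_size})"
        )
    if num_experts == 0 and expert_parallel_size > 1:
        # A dense model has no experts to shard across an EP group.
        raise ValueError(
            f"EP={expert_parallel_size} requires an MoE variant (num_experts=0)"
        )


# A dense run with default EP and an MoE run with EP>1 both pass.
validate_parallelism(pipeline_parallel_size=1, expert_parallel_size=1, num_experts=0)
validate_parallelism(pipeline_parallel_size=1, expert_parallel_size=8, num_experts=64)
```

Failing at argument-parsing time gives a clearer message than an error raised later inside model construction.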

New test ``tests/test_vision_patch_merger_parity.py`` verifies the Megatron patch merger against an inlined verbatim copy of HuggingFace ``Qwen3VLVisionPatchMerger`` (``use_postshuffle_norm=False`` branch from ``transformers/src/transformers/models/qwen3_vl/modeling_qwen3_vl.py``). The HF reference is inlined so the test has no runtime dependency on the ``transformers`` package.

The test copies HF state-dict tensors into the Megatron module (TP=1, 1:1 mapping), runs both on the same random input, and asserts ``torch.testing.assert_close`` on the logits in fp32 and bf16:

    [torch.float32]  shape=(16, 3584) max_abs_diff=2.551e-05 (atol=1e-4)
    [torch.bfloat16] shape=(16, 3584) max_abs_diff=3.906e-03 (atol=5e-2)

The fp32 residual is structural (TE LayerNorm vs nn.LayerNorm use different fused reduction orders) and the bf16 figure is at the arithmetic floor for a two-layer MLP. This pins the GELU ``approximate='none'`` fix (commit 8aace7b) against future regressions.

Run with::

    torchrun --nproc_per_node=1 \
        examples/multimodal_dev/tests/test_vision_patch_merger_parity.py
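
To show why the GELU variant matters for parity, the two formulas can be compared with the standard library alone. This is an illustrative sketch on the activation in isolation, using the standard erf-based and tanh-based definitions; the sample grid is arbitrary, and the result does not translate directly into the merger's output error.

```python
import math


def gelu_exact(x: float) -> float:
    """GELU via the Gaussian CDF (the approximate='none' form)."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_tanh(x: float) -> float:
    """GELU via the tanh approximation (the approximate='tanh' form)."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


# Largest pointwise gap on a grid over [-4, 4].
xs = [i / 100.0 for i in range(-400, 401)]
max_gap = max(abs(gelu_exact(x) - gelu_tanh(x)) for x in xs)
print(f"max |exact - tanh| on [-4, 4]: {max_gap:.2e}")
```

On this grid the pointwise gap is small but exceeds the 1e-4 fp32 tolerance quoted above, which is consistent with the approximation choice being visible to a tight parity test.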

wplf · 2026-05-28T03:41:32Z

/ok to test df54ae7

copy-pr-bot · 2026-05-28T03:41:36Z

/ok to test df54ae7

@wplf, there was an error processing your request: E2

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/2/

wplf · 2026-05-28T03:42:35Z

/ok to test 4dfdd06

svcnvidia-nemo-ci · 2026-05-29T09:30:18Z

🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/26629578247

…shadow PR

Adopt the merged dev [5/5] shadow PR NVIDIA#4751 (commit 58f3e67) verbatim for examples/multimodal_dev/; it carries newer bug fixes:
- replace data/vlm_dataset.py with data/cord_v2.py
- add tests/_helpers.py, tests/test_cp_thd_correctness.py, tests/test_vision_patch_merger_parity.py
- sync the remaining 23 example files to NVIDIA#4751's content

data_samplers.py is intentionally NOT changed to match NVIDIA#4751: main uses args.hybrid_context_parallel whereas dev uses args.dynamic_context_parallel (the arg was renamed across branches), so NVIDIA#4756's existing line is the correct main adaptation.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Squash of the fused-mRoPE work (Add Qwen3.5 MRoPE fusion benchmark support; Fix THD mRoPE CP fallback consistency; mRoPE THD review cleanup; enforce per-sequence CP divisibility on the fused THD launch path; unit-test coverage for real Qwen3.5-VL shapes). Adds a fused mRoPE kernel (megatron/core/fusions/fused_mrope.py) with an is_fused_mrope_available() gate, raw-mrope-freqs plumbing through rope_utils / rotary_pos_embedding / gpt_model / attention, the transformer_config + arguments toggles, and tests/unit_tests/fusions/ test_fused_mrope.py. Core only: the examples/multimodal_dev integration is intentionally dropped because that example is already upstream (NVIDIA#4751) and has diverged from this branch's copy. Co-Authored-By: Li Tao <litao@nvidia.com>

wplf added the Run tests label May 12, 2026

wplf changed the title ~~feat(examples/multimodal_dev): add Qwen3.5-VL training example~~ [dev] [5/5] Qwen3.5 support: Qwen3.5-VL training example May 12, 2026

wplf force-pushed the feat/qwen35-vl-example branch from 0c608f0 to 9d9392a Compare May 13, 2026 03:10

wplf mentioned this pull request May 13, 2026

[dev] [follow-up] Qwen3.5 support: MoE aux loss padding_mask #4776

Merged

wplf force-pushed the feat/qwen35-vl-example branch from 9d9392a to 62a7890 Compare May 13, 2026 10:24

Victarry self-requested a review May 19, 2026 04:57

Victarry mentioned this pull request May 19, 2026

[ROADMAP][2026 Q2] Megatron Core MoE Roadmap #4815

Open

71 tasks

wplf marked this pull request as ready for review May 19, 2026 08:28

wplf requested review from a team as code owners May 19, 2026 08:28

svcnvidia-nemo-ci added the complexity: high label May 19, 2026

Victarry force-pushed the feat/qwen35-vl-example branch from 62a7890 to 0915f61 Compare May 20, 2026 02:00

wplf and others added 2 commits May 20, 2026 07:27

wplf force-pushed the feat/qwen35-vl-example branch from 36286c5 to 6d2e13c Compare May 20, 2026 15:39

copy-pr-bot Bot temporarily deployed to test May 25, 2026 08:38 Inactive

copy-pr-bot Bot temporarily deployed to test May 27, 2026 06:24 Inactive

Victarry reviewed May 27, 2026

wplf mentioned this pull request May 28, 2026

examples/multimodal_dev: docs + Victarry review fixes + vision patch merger parity test wplf/Megatron-LM#3

Closed

wplf added 2 commits May 27, 2026 20:35

wplf force-pushed the feat/qwen35-vl-example branch from 3248b34 to 4dfdd06 Compare May 28, 2026 03:36

copy-pr-bot Bot temporarily deployed to test May 28, 2026 03:43 Inactive

Victarry approved these changes May 28, 2026

Victarry added this pull request to the merge queue May 29, 2026

Merged via the queue into NVIDIA:dev with commit 58f3e67 May 29, 2026
184 of 185 checks passed

wplf mentioned this pull request Jun 3, 2026

Merge GDN conv fusion (fused pre-gated-delta-rule kernels) wplf/Megatron-LM#13

Closed

This was referenced Jun 11, 2026

don't merge #5284

Draft

feat(fusions): fused mRoPE for Qwen3.5-VL #5294

Open

5 participants
