[dev] [5/5] Qwen3.5 support: Qwen3.5-VL training example #4751
0c608f0 to 9d9392a
9d9392a to 62a7890
Adds a standalone VLM training playground under ``examples/multimodal_dev/`` with Qwen3.5-VL end-to-end.

Highlights
- Model-agnostic entry point (``pretrain_multimodal.py``) with a ``MODEL_REGISTRY`` so adding a new architecture is just a registry entry plus a backing module.
- Qwen3.5-VL model: vision encoder, MRoPE, decoder, factory, specs, configurations covering proxy / 9B / 397B-A17B variants.
- Datasets: mock data and CORD-V2 VLM dataset, with THD pack/pad in the collate function.
- THD + CP support consolidated in ``forward_step.py`` and the model layer (uses MRoPE THD pre-computation and ``cu_seqlens_q_padded`` CP partitioning).
- Run script + README, plus tests for MRoPE parity, CP correctness, CP support, and THD correctness / e2e.

Also gates the torch DataLoader vanilla-collate path on the new ``use_vanilla_collate_fn`` arg (one-line change to ``megatron/training/datasets/data_samplers.py``) so CORD-V2 works under BSHD.

Functional dependency: the new model arch sets ``mrope_interleaved=True`` in its config and relies on the core MRoPE interleaved layout introduced in a separate PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: BestJuly <19769279+BestJuly@users.noreply.github.com>
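The registry idea in the highlights can be illustrated with a minimal sketch. It assumes a dict keyed by architecture name whose values are `module:attr` import paths. The key name `model_provider`, the `resolve` helper, and the `toy_arch` entry are hypothetical and are not taken from `pretrain_multimodal.py`; a stdlib target is used so the sketch runs without Megatron installed.

```python
import importlib
from typing import Any, Callable, Dict

# Hypothetical registry: architecture name -> dotted paths to its backing code.
# "json:dumps" is a stand-in target so the example runs with the stdlib only.
MODEL_REGISTRY: Dict[str, Dict[str, str]] = {
    "toy_arch": {"model_provider": "json:dumps"},
}


def resolve(dotted: str) -> Callable[..., Any]:
    """Lazily import 'package.module:attr' and return the attribute."""
    module_name, _, attr = dotted.partition(":")
    if not attr:
        raise ValueError(f"expected 'module:attr', got {dotted!r}")
    return getattr(importlib.import_module(module_name), attr)


def get_model_provider(arch: str) -> Callable[..., Any]:
    """Look up an architecture and return its provider callable."""
    try:
        entry = MODEL_REGISTRY[arch]
    except KeyError:
        known = ", ".join(sorted(MODEL_REGISTRY))
        raise KeyError(f"unknown architecture {arch!r}; registered: {known}") from None
    return resolve(entry["model_provider"])
```

With this layout, adding an architecture means adding one dict entry plus the module it points to; the entry point itself does not change.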
62a7890 to 0915f61
… preprocessing
Fixes 8 issues in vlm_dataset.py found by review against Megatron-Bridge's
qwen2_5_collate_fn reference implementation.
- loss_mask off-by-one (Bug 1): the previous mask was built on input_ids
while labels were shifted, dropping the image->text supervision signal
at the boundary. Now masks structural tokens on the shifted labels and
also shifts loss_mask itself left by 1.
- missing SFT prompt masking (Bug 2): user-turn and chat-template tokens
were trained on. Now uses backward substring token search (mirroring
create_multiturn_loss_mask_by_search) to unmask only the assistant
answer span.
- seq_length not enforced (Bug 3): long CORD-V2 samples could overflow.
Now end-truncates input_ids in __getitem__ with a warning.
- unsafe pad_token_id fallback (Bug 4): falling back to 0 silently masked
a real vocab token. Now falls back to EOS and raises if neither is set.
- silent image_token_id miss (Bug 6): fallback could return None, causing
dataset / model disagreement. Now raises ValueError.
- stale docstrings (Bug 8): updated Qwen2.5-VL / --image-size references
to Qwen3.5-VL / --total-seq-length.
- narrow skipped_tokens set (Bug 14): vision_start/end, im_start/end,
video_pad, endoftext were not masked on labels. Now uses
tok.all_special_ids union {pad_id, image_token_id}.
- lost Qwen-VL dynamic resolution (Bugs 15/17/19): fixed-square resize
removed; conversation content carries the image object;
qwen_vl_utils.process_vision_info extracts images; processor is called
with min_pixels / max_pixels.
- pixel_values bf16 conversion (Bug 18): moved from forward_step into the
dataset so per-step dtype checks become no-ops.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
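The loss-mask and pad-token fixes above can be sketched in plain Python. This is an illustration under stated assumptions, not the code in `vlm_dataset.py`: the function names, the token IDs, and the convention that `answer_mask` marks assistant-answer positions on `input_ids` are all hypothetical.

```python
from typing import List, Optional, Set, Tuple


def resolve_pad_id(pad_id: Optional[int], eos_id: Optional[int]) -> int:
    """Prefer pad_id, fall back to EOS, and refuse to silently use 0."""
    if pad_id is not None:
        return pad_id
    if eos_id is not None:
        return eos_id
    raise ValueError("tokenizer defines neither pad_token_id nor eos_token_id")


def shift_labels_and_mask(
    input_ids: List[int],
    answer_mask: List[int],
    skipped_ids: Set[int],
    pad_id: int,
) -> Tuple[List[int], List[int]]:
    """Build next-token labels and a loss mask aligned with those labels.

    labels[t] is input_ids[t + 1], so the mask is shifted left by one as well:
    position t is supervised only if its *target* token belongs to the answer
    span and is not a structural token.
    """
    labels = input_ids[1:] + [pad_id]
    shifted_answer = answer_mask[1:] + [0]
    loss_mask = [
        1 if keep and tok not in skipped_ids else 0
        for tok, keep in zip(labels, shifted_answer)
    ]
    return labels, loss_mask


# Hypothetical IDs: 10 = prompt token, 900 = image token, 11/12 = answer, 2 = EOS.
PAD = resolve_pad_id(None, 2)
ids = [10, 900, 900, 11, 12, 2]
answer = [0, 0, 0, 1, 1, 1]
labels, loss_mask = shift_labels_and_mask(ids, answer, {900, 2, PAD}, PAD)
```

In this toy sample the last image-token position (index 2) is supervised, because its target is the first answer token. That is the image-to-text boundary the unshifted mask would have dropped.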
- raise --manual-gc-interval 5 → 50 to cut GC pause frequency on long runs.
- enable --moe-permute-fusion and --moe-router-fusion in the MoE branch (no-op for dense variants since MOE_ARGS is gated on NUM_EXPERTS>0).
- enable grad-accumulation fusion under FSDP by dropping --no-gradient-accumulation-fusion from FSDP_ARGS.
- add --log-timers-to-tensorboard and --log-params-norm to surface timer breakdown and parameter L2 norm in TB/wandb.
- drop the hardcoded CKPT_LOAD path from the in-script example invocations so the comment reflects from-scratch CP correctness runs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
36286c5 to 6d2e13c
Update the 'Copyright (c) 2025, NVIDIA CORPORATION' line to 2026 across all newly-added Python files under examples/multimodal_dev/ for the Qwen3.5-VL training example. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
While running the PR's Could you add that helper or update the tests to use the intended packing API?
Please add the latest checkpoint conversion guide so users can resume from an HF checkpoint and run. We previously recorded the steps in this issue, and there should be some updates by now.
OK, I'm working on the unit tests for CP now. The checkpoint conversion guide may be finished this afternoon.
/ok to test 89ccadf
Document the HF -> Megatron-FSDP DTensor conversion path needed before pretraining from pretrained weights: setup (clone Bridge, pin its 3rdparty/Megatron-LM submodule to this branch), the `torchrun convert_checkpoints_fsdp.py import` command with EP=8 default topology, expected output layout, and the open Bridge dependency (NVIDIA-NeMo/Megatron-Bridge#3987) to skip the post-save tokenizer build that otherwise crashes on this branch.
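The PR description states that the conversion topology must satisfy `WORLD_SIZE % (TP*CP*EP) == 0`. A minimal pre-flight check, assuming only that constraint, might look like the following. The function name is hypothetical and is not part of Megatron-Bridge or the conversion script.

```python
def check_conversion_topology(world_size: int, tp: int = 1, cp: int = 1, ep: int = 1) -> None:
    """Raise if WORLD_SIZE is not divisible by TP * CP * EP."""
    for name, value in (("world_size", world_size), ("tp", tp), ("cp", cp), ("ep", ep)):
        if value < 1:
            raise ValueError(f"{name} must be >= 1, got {value}")
    group = tp * cp * ep
    if world_size % group != 0:
        raise ValueError(
            f"WORLD_SIZE={world_size} is not divisible by TP*CP*EP={group} "
            f"(tp={tp}, cp={cp}, ep={ep})"
        )


# The documented default: one 8-GPU node with EP=8 and TP=CP=1.
check_conversion_topology(world_size=8, ep=8)
```

Running a check like this before launching `torchrun` surfaces a bad topology immediately rather than after the HF weights have been fetched.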
/ok to test 334d4e1
May I ask which TE version corresponds to the Qwen3.5 support?
I'm using the latest TE with a cherry-pick from TE PR 2932 on GB.
Thanks. BTW, is there any current performance benchmark on Hopper or Blackwell?
We are still working on it.
Victarry left a comment
Thanks for this PR!
Reproduced the E2E pipeline from weight conversion to model training, and it worked well. Left a few comments.
Resolves the inline comments from @Victarry's PR review on NVIDIA#4751.

* vision_encoder.py: patch merger GELU was `approximate='tanh'` while the in-code NOTE acknowledged HF uses `approximate='none'`. Switched to `approximate='none'` to match the official Qwen3VLVisionPatchMerger numerics for HF -> Megatron checkpoint parity.
* pretrain_multimodal.py: added an explicit guard against `--pipeline-model-parallel-size > 1`. The model_provider builds the full model on every rank and ignores pre_process / post_process stage flags, so PP>1 would silently break Megatron's pipeline-parallel contract. Fail fast instead.
* scripts/run_qwen35_vl.sh: three fixes:
  1. `EP` now defaults to 1 (was 2). MoE variants must opt in via the environment override.
  2. After the variant case block, fail fast if `NUM_EXPERTS=0 && EP>1` so a dense run such as `MODEL_VARIANT=9b ./run_qwen35_vl.sh` no longer trips Megatron's arg validation downstream.
  3. `--moe-router-force-load-balancing` was unconditionally added to GPT_MODEL_ARGS (and therefore enabled even when no MoE args were emitted). It is now gated behind `FORCE_LOAD_BALANCING=1`, defaults off, and is appended to MOE_ARGS only when MoE is active. Real finetuning runs no longer freeze router routing decisions by default.
* data/{vlm_dataset.py -> cord_v2.py} + models/__init__.py: renamed the CORD-V2-specific module from the generic-sounding `vlm_dataset.py` to `cord_v2.py`, updated the model registry path string accordingly, and added an "Adding another VLM dataset" section to the module docstring documenting the per-dataset module + `MODEL_REGISTRY["..."]["dataset_providers"]` registration pattern.
* models/qwen35_vl/mrope.py: added a performance note on the `_build_sample_mrope_positions` helper documenting the `.tolist()` / `.item()` GPU<->CPU sync points and CUDA-graph incompatibility, and the precompute-in-collate / cache-by-shape follow-up plan. Behavior preserved here pending a follow-up data pipeline change.

The other tests-import comment (test_thd_*.py importing `_pack_batch`) is already addressed on this branch: the helper is now named `pack_or_pad_batch` and the tests import that symbol.
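The fail-fast rules described above can be restated as a short Python sketch. It mirrors the described behavior of the shell script and the entry-point guard, and is not their actual code: the function names are hypothetical, and the emitted flag list is reduced to the flags named in this PR plus Megatron-LM's `--num-experts` and `--expert-model-parallel-size`.

```python
from typing import List


def check_pipeline_parallel(pp_size: int) -> None:
    """Reject PP > 1: the example builds the full model on every rank."""
    if pp_size > 1:
        raise ValueError(
            f"--pipeline-model-parallel-size={pp_size} is not supported by this example"
        )


def build_moe_args(num_experts: int, ep: int = 1, force_load_balancing: bool = False) -> List[str]:
    """Emit MoE flags only for MoE variants; dense runs get an empty list."""
    if num_experts == 0:
        if ep > 1:
            raise ValueError(f"EP={ep} requires an MoE variant, but NUM_EXPERTS=0")
        return []
    args = [
        "--num-experts", str(num_experts),
        "--expert-model-parallel-size", str(ep),
    ]
    # Off by default so real finetuning runs keep learned routing decisions.
    if force_load_balancing:
        args.append("--moe-router-force-load-balancing")
    return args


dense_args = build_moe_args(num_experts=0)
moe_args = build_moe_args(num_experts=8, ep=8)
```

Keeping the force-load-balancing flag inside the MoE branch is what prevents it from leaking into dense runs, which was the original defect.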
New test ``tests/test_vision_patch_merger_parity.py`` verifies the Megatron patch merger against an inlined verbatim copy of HuggingFace ``Qwen3VLVisionPatchMerger`` (``use_postshuffle_norm=False`` branch from ``transformers/src/transformers/models/qwen3_vl/modeling_qwen3_vl.py``). The HF reference is inlined so the test has no runtime dependency on the ``transformers`` package.

The test copies HF state-dict tensors into the Megatron module (TP=1, 1:1 mapping), runs both on the same random input, and asserts ``torch.testing.assert_close`` on the logits in fp32 and bf16:

    [torch.float32]  shape=(16, 3584) max_abs_diff=2.551e-05 (atol=1e-4)
    [torch.bfloat16] shape=(16, 3584) max_abs_diff=3.906e-03 (atol=5e-2)

The fp32 residual is structural (TE LayerNorm vs nn.LayerNorm use different fused reduction orders) and the bf16 figure is at the arithmetic floor for a two-layer MLP. This pins the GELU ``approximate='none'`` fix (commit 8aace7b) against future regressions.

Run with::

    torchrun --nproc_per_node=1 \
        examples/multimodal_dev/tests/test_vision_patch_merger_parity.py
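To see why the GELU variant matters for parity, the two formulas can be compared in plain Python. This is a standalone illustration using the standard definitions that PyTorch documents for `approximate='none'` and `approximate='tanh'`. It is not the parity test itself, and the sampling grid is an arbitrary choice.

```python
import math


def gelu_exact(x: float) -> float:
    """GELU as x * Phi(x), the form used with approximate='none'."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_tanh(x: float) -> float:
    """Tanh approximation of GELU, the form used with approximate='tanh'."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


# Sample [-6, 6] in steps of 0.01 and record the largest disagreement.
xs = [i / 100.0 for i in range(-600, 601)]
max_abs_diff = max(abs(gelu_exact(x) - gelu_tanh(x)) for x in xs)
print(f"max |exact - tanh| on [-6, 6]: {max_abs_diff:.3e}")
```

The disagreement is small but non-zero, so a mismatch in the `approximate` setting is a plausible source of a systematic offset that an `atol=1e-4` fp32 comparison would flag.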
3248b34 to 4dfdd06
/ok to test df54ae7
@wplf, there was an error processing your request: See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/2/
/ok to test 4dfdd06
🔄 Merge queue validation started! You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/26629578247
…shadow PR

Adopt the merged dev [5/5] shadow PR NVIDIA#4751 (commit 58f3e67) verbatim for examples/multimodal_dev/; it carries newer bug fixes:
- replace data/vlm_dataset.py with data/cord_v2.py
- add tests/_helpers.py, tests/test_cp_thd_correctness.py, tests/test_vision_patch_merger_parity.py
- sync the remaining 23 example files to NVIDIA#4751's content

data_samplers.py is intentionally NOT changed to match NVIDIA#4751: main uses args.hybrid_context_parallel whereas dev uses args.dynamic_context_parallel (the arg was renamed across branches), so NVIDIA#4756's existing line is the correct main adaptation.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Squash of the fused-mRoPE work (Add Qwen3.5 MRoPE fusion benchmark support; Fix THD mRoPE CP fallback consistency; mRoPE THD review cleanup; enforce per-sequence CP divisibility on the fused THD launch path; unit-test coverage for real Qwen3.5-VL shapes).

Adds a fused mRoPE kernel (megatron/core/fusions/fused_mrope.py) with an is_fused_mrope_available() gate, raw-mrope-freqs plumbing through rope_utils / rotary_pos_embedding / gpt_model / attention, the transformer_config + arguments toggles, and tests/unit_tests/fusions/test_fused_mrope.py.

Core only: the examples/multimodal_dev integration is intentionally dropped because that example is already upstream (NVIDIA#4751) and has diverged from this branch's copy.

Co-Authored-By: Li Tao <litao@nvidia.com>
Qwen3.5 support series
This is part of a 5-PR series adding Qwen3.5-VL support, split for review clarity.
Dev PRs (this series):
Main PRs (corresponding mirrors):
Summary
Adds a standalone VLM training playground under `examples/multimodal_dev/` with Qwen3.5-VL end-to-end.

Model-agnostic harness
- `pretrain_multimodal.py` entry point and `MODEL_REGISTRY` so a new architecture is just a registry entry + backing module.
- `models/base.py`, `forward_step.py`, `arguments.py`, `data/` (mock + CORD-V2 dataset with THD pack/pad in collate).

Qwen3.5-VL
- `models/qwen35_vl/`: vision encoder, MRoPE (pre-computed for THD), decoder, factory, specs, configurations for proxy / 9B / 397B-A17B variants.
- `tests/test_mrope_parity.py`, `test_cp_correctness.py`, `test_cp_support.py`, `test_cp_thd_correctness.py`, `test_thd_correctness.py`, `test_thd_e2e.py`.

One-line training infra change
- `megatron/training/datasets/data_samplers.py`: enable the vanilla-collate torch DataLoader path when the new arg `use_vanilla_collate_fn` is set (needed for CORD-V2 under BSHD).

Dependency
This example sets `mrope_interleaved=True` in its `TransformerConfig` and relies on the core MRoPE interleaved layout introduced in #4750. The diff here is self-contained (only `examples/` + the 1-line `data_samplers.py` change), but the example won't run end-to-end until #4750 merges.

Functionality support
Resume training loss curve

CP / THD correctness verification
`tests/test_cp_thd_correctness.py` runs CP=1 and CP=4 in a single `torchrun --nproc-per-node 4` invocation (in-process `destroy_model_parallel` + `initialize_model_parallel` between phases, weights pinned by a `state_dict` snapshot, identical inputs via a seeded `torch.Generator`). Loss is aggregated via `AllReduce(SUM)` on `(num, den)`; grad_norm is aggregated via `AllReduce(SUM)` of gradients on the CP group, then divided by `cp_size`, so each rank holds the CP-mean gradient that matches CP=1's backward on the full-batch mean loss.

Default config (B=2, S=64, H=256, L=2, bf16):

Cross-check: BSHD CP=1 loss ≡ THD CP=1 loss = 7.03250265, and BSHD CP=1 grad_norm ≈ THD CP=1 grad_norm to 7 decimals. Equal-length sequences make the two attention paths mathematically identical, so the CP=1 grad_norm match confirms BSHD/THD parity at the gradient level as well.

Checkpoint conversion (HF → Megatron-FSDP DTensor)
The example consumes a Megatron-FSDP DTensor checkpoint, converted from the HuggingFace release via Megatron-Bridge.
Setup: clone Bridge and pin its `3rdparty/Megatron-LM` submodule to this branch:

Convert (single 8×H100 node, EP=8 / TP=CP=1; `--hf-model` can be any Qwen3.5 variant, e.g. `Qwen/Qwen3.5-35B-A3B`):

    PYTHONPATH=./src:./3rdparty/Megatron-LM/ \
    torchrun --nproc_per_node=8 \
        examples/conversion/mfsdp/convert_checkpoints_fsdp.py import \
        --hf-model Qwen/Qwen3.5-35B-A3B \
        --megatron-path ${WORKSPACE}/models/Qwen/Qwen3.5-35B-A3B-fsdp \
        --ckpt-format fsdp_dtensor \
        --ep 8

HF weights are auto-fetched on first run via `huggingface_hub`. Adjust `--tp` / `--cp` / `--ep` to match the training topology (must satisfy `WORLD_SIZE % (TP*CP*EP) == 0`).

Output

Bridge dependency: requires NVIDIA-NeMo/Megatron-Bridge#3987 (skip tokenizer save in `convert_checkpoints_fsdp.py`). Without that fix the checkpoint is still written correctly, but the script exits non-zero after save with `AttributeError: 'TokenizerConfig' object has no attribute 'make_vocab_size_divisible_by'` against this branch's `megatron.core.tokenizers.utils.build_tokenizer`.

Risk
- `examples/multimodal_dev/`.
- `data_samplers.py` change is fully backwards-compatible: behavior is unchanged unless `use_vanilla_collate_fn` is explicitly set.

Test plan
- `pytest examples/multimodal_dev/tests/` passes.
- `scripts/run_qwen35_vl.sh` proxy variant trains a few steps on mock data.
- `--use-vanilla-collate-fn` and trains a few steps.
- `torchrun --nproc-per-node 4 examples/multimodal_dev/tests/test_cp_thd_correctness.py`: CP=1 vs CP=4 BSHD/THD loss + grad_norm within tolerance.

🤖 Generated with Claude Code
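As a note on the CP / THD correctness verification section, the `(num, den)` loss aggregation can be simulated without any distributed runtime. The sketch assumes `num` is the summed per-token loss and `den` is the count of unmasked tokens on each rank, which the description does not spell out; the summation in `aggregate_loss` stands in for `AllReduce(SUM)`.

```python
from typing import List, Tuple


def aggregate_loss(per_rank: List[Tuple[float, int]]) -> float:
    """Emulate AllReduce(SUM) on (num, den), then divide once."""
    num = sum(n for n, _ in per_rank)
    den = sum(d for _, d in per_rank)
    return num / den


# Unmasked token losses of one batch, landing unevenly on two simulated CP ranks
# (ranks hold equal sequence chunks, but the number of unmasked tokens can differ).
token_losses = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]
shards = [token_losses[:4], token_losses[4:]]

per_rank = [(sum(s), len(s)) for s in shards]
global_loss = aggregate_loss(per_rank)

# Averaging per-rank means instead weights each rank equally and is biased.
naive_loss = sum(sum(s) / len(s) for s in shards) / len(shards)
reference = sum(token_losses) / len(token_losses)
```

Here `global_loss` equals the single-rank `reference`, while `naive_loss` does not. Summing numerator and denominator separately is what makes the CP=4 loss comparable to CP=1.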