[model] Add initial Qwen3-Omni support in Megatron-Bridge#2831
[model] Add initial Qwen3-Omni support in Megatron-Bridge#2831hbhflw2000 wants to merge 14 commits into
Conversation
Signed-off-by: Lianglipeng <lianglipeng@didiglobal.com> Signed-off-by: hbhflw2000 <417911774@qq.com>
Signed-off-by: Lianglipeng <lianglipeng@didiglobal.com> Signed-off-by: hbhflw2000 <417911774@qq.com>
Signed-off-by: Lianglipeng <lianglipeng@didiglobal.com> Signed-off-by: hbhflw2000 <417911774@qq.com>
Signed-off-by: hbhflw2000 <417911774@qq.com>
Signed-off-by: hbhflw2000 <417911774@qq.com>
📝 WalkthroughWalkthroughThis pull request adds comprehensive support for the Qwen3-Omni multimodal model to the Megatron-Bridge framework. It introduces model wrappers, bridge/provider implementations, multimodal RoPE position indexing logic, transformer configurations, and extensive test coverage for vision and audio modalities alongside existing language capabilities. Changes
Sequence Diagram(s)sequenceDiagram
participant Input as Input Data
participant TM as Qwen3OmniThinkerModel
participant VE as Vision Encoder
participant AE as Audio Encoder
participant LM as Language Model
participant Output as Output
Input->>TM: forward(input_ids, pixel_values, input_features, ...)
rect rgba(100, 150, 200, 0.5)
Note over TM,AE: Feature Extraction Phase
alt has vision features
TM->>VE: get_image_features(pixel_values, image_grid_thw)
VE-->>TM: image_embeddings
TM->>VE: get_video_features(pixel_values_videos, video_grid_thw)
VE-->>TM: video_embeddings
end
alt has audio features
TM->>AE: get_audio_features(input_features, feature_attention_mask)
AE-->>TM: audio_embeddings
end
end
rect rgba(150, 100, 200, 0.5)
Note over TM,LM: Embedding Fusion & Position Encoding Phase
TM->>TM: compute position_ids via get_rope_index()
TM->>TM: build inputs_embeds from input_ids
TM->>TM: merge image/video/audio embeddings
TM->>TM: apply attention masks and spatial merging
end
rect rgba(100, 200, 150, 0.5)
Note over TM,LM: Language Model Forward Phase
TM->>LM: forward(decoder_input, position_ids, attention_mask, ...)
LM-->>TM: logits
end
TM-->>Output: logits
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes The review demands careful attention across multiple heterogeneous components: multimodal forward flow logic in the thinker model, intricate RoPE position indexing for image/video/audio grid-based positioning, bridge parameter mapping registries, provider configuration propagation, and comprehensive test coverage spanning unit and functional domains. While individual files follow consistent patterns, the cross-file dependencies and complex multimodal fusion semantics require substantive reasoning per component area. Possibly related PRs
Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 2 | ❌ 2❌ Failed checks (2 warnings)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 17
🧹 Nitpick comments (1)
src/megatron/bridge/models/qwen_omni/modeling_qwen3_omni/model.py (1)
33-107: Add Google-style docstrings to the exported wrapper API.
Qwen3OmniModelis a new public entrypoint, but the class and its delegated methods do not currently describe Args, Returns, or Raises in a Sphinx-parseable way.As per coding guidelines, "Use Google style docstrings for classes and functions (parseable by Sphinx), including Args, Returns, and Raises sections."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/megatron/bridge/models/qwen_omni/modeling_qwen3_omni/model.py` around lines 33 - 107, The Qwen3OmniModel class and its public methods (__init__, shared_embedding_or_output_weight, set_input_tensor, freeze, forward) lack Google-style docstrings; add Sphinx-parseable docstrings to the class and each public method describing Args, Returns, and Raises (where applicable), documenting parameter types/semantics (e.g., input_ids: torch.Tensor, position_ids: torch.Tensor | None, inference_params: InferenceParams | None), the return value of forward (torch.Tensor), and any exceptions set_input_tensor or freeze may raise; ensure the class-level docstring explains that this is a wrapper delegating to Qwen3OmniThinkerModel and notes which methods delegate to thinker so generated docs are clear.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/models/vlm/qwen3-omni.md`:
- Around line 7-42: Add a new "Usage" subsection under the top of qwen3-omni.md
showing a minimal HF↔Megatron workflow for Qwen3-Omni-30B-A3B-Instruct (one-line
commands for Hugging Face -> Megatron bridge conversion and Megatron -> Hugging
Face export and a short example of running the single-GPU smoke validation), and
add a "Known limitations" subsection that explicitly lists unsupported
runtime/validation paths mentioned in the PR: inference_params, packed
sequences, sequence-parallel vision, and multi-rank/distributed validation;
ensure both headings are concise, call out the model name
Qwen3-Omni-30B-A3B-Instruct, and reference the bridge/export/validation steps
already described so readers know where to find full commands.
In `@src/megatron/bridge/models/qwen_omni/modeling_qwen3_omni/rope.py`:
- Around line 83-87: The current logic computing video_nums and detecting
subsequent vision blocks misses cases where vision blocks are represented by
audio_start_token_id (use_audio_in_video=True), causing ed_vision_start to skip
the combined branch; update the detection to treat audio-start tokens as valid
vision-block markers: when computing video_nums (and where code checks
input_tokens for presence of image_token_id/video_token_id) include
audio_start_token_id as an alternative marker, and when scanning for
ed_vision_start or looping over vision blocks (references: video_nums,
vision_tokens, use_audio_in_video, audio_start_token_id, video_token_id,
image_token_id, input_tokens, ed_vision_start, vision_start_token_id) ensure the
loop/condition tests for any of audio_start_token_id OR video_token_id OR
image_token_id so audio-in-video blocks are not skipped.
- Around line 64-65: The current multimodal branch only handles image/video and
skips pure-audio, causing the cumsum fallback to assert on missing
attention_mask; update the conditional logic in the block that sets
total_input_ids (and the similar block around the 215-220 region) to also detect
audio-only inputs (e.g., presence of audio grid/length variables used in this
module) and perform the audio-length expansion of total_input_ids/positions
before falling back to the plain cumsum path; ensure the fallback that asserts
on attention_mask runs only after all modality-specific expansions (image,
video, audio) are attempted so pure-audio requests produce sequential positions
instead of hitting the assert.
In `@src/megatron/bridge/models/qwen_omni/modeling_qwen3_omni/thinker_model.py`:
- Around line 315-323: The zip over deepstack_visual_embeds and
video_embeds_multiscale should fail fast on unequal lengths: update the loop
that builds joint_embeds (the for image_embed, video_embed in
zip(deepstack_visual_embeds, video_embeds_multiscale) block) to call zip with
strict=True (i.e., zip(deepstack_visual_embeds, video_embeds_multiscale,
strict=True)) so a mismatch raises immediately instead of silently truncating;
keep the rest of the logic (image_mask_joint/video_mask_joint, embed_joint
creation, and appending to joint_embeds) unchanged.
- Around line 243-261: The code currently only calls get_rope_index when visual
inputs exist, causing audio-only batches to use
_build_text_only_mrope_position_ids and ignore
position_id_per_seconds/seconds_per_chunk; update the conditional that sets
position_ids in the block that references pixel_values and pixel_values_videos
so it also detects audio inputs (e.g., audio_feature_lengths or audio-related
tensors like audio_feature_lengths/audio_seqlens) and calls get_rope_index when
audio is present; modify the branch around position_ids assignment (the logic
using _build_text_only_mrope_position_ids, get_rope_index, image_grid_thw,
video_grid_thw, audio_feature_lengths, position_id_per_seconds,
seconds_per_chunk) to ensure audio-only batches route through get_rope_index so
temporal rope indices for audio are applied.
In `@src/megatron/bridge/models/qwen_omni/qwen3_omni_bridge.py`:
- Around line 40-45: The provider_bridge function currently accepts
talker_config and code2wav_config but only maps thinker-side weights, which can
silently drop audio-output stacks for checkpoints like
Qwen3OmniMoeForConditionalGeneration; update provider_bridge (the method that
inspects hf_pretrained.config, thinker_config, talker_config, code2wav_config,
and text_config) to detect when talker_config or code2wav_config indicate audio
output (e.g., an enable_audio_output flag or equivalent) and immediately raise a
clear exception identifying that full-Omni audio-output checkpoints are not
supported, instead of proceeding with partial conversion.
- Around line 51-86: The provider instantiation ignores HF config fields used
later at runtime—populate language_max_sequence_length (used by Qwen3VLGPTModel
in Qwen3OmniModelProvider.provide) and spatial_merge_size (used by
Qwen3OmniThinkerModel.forward via self.config.spatial_merge_size) from the
HF/text or thinker configs when creating Qwen3OmniModelProvider; update the
Qwen3OmniModelProvider call site to pass
language_max_sequence_length=getattr(text_config,
"language_max_sequence_length", text_config.max_position_embeddings) and
spatial_merge_size=getattr(thinker_config, "spatial_merge_size",
<reasonable-default>) so the runtime config matches the checkpoint’s actual
context window and vision merge geometry.
In `@src/megatron/bridge/models/qwen_omni/qwen3_omni_provider.py`:
- Around line 67-93: The provide method forwards
pre_process/post_process/vp_stage as None instead of deriving the provider's
pipeline-stage defaults, causing Qwen3OmniModel/Qwen3OmniThinkerModel to skip
tower construction; fix by resolving those defaults before constructing
Qwen3OmniModel: compute resolved_pre_process and resolved_post_process (derive
from vp_stage or the provider's pipeline-stage default logic used elsewhere) so
they are explicit booleans (not None), and pass those resolved values into the
Qwen3OmniModel constructor (keep the same call to model.freeze as-is).
In `@src/megatron/bridge/models/qwen_vl/qwen35_vl_provider.py`:
- Around line 63-71: The except block currently only catches
ModuleNotFoundError, which misses ImportError when the module exists but the
symbol is absent; update the exception handler for the import of
get_transformer_block_with_experimental_attention_variant_spec to catch both
ModuleNotFoundError and ImportError (i.e., except (ModuleNotFoundError,
ImportError):) and keep the fallback function
get_transformer_block_with_experimental_attention_variant_spec that raises
NotImplementedError unchanged.
In `@src/megatron/bridge/models/qwen/qwen_provider.py`:
- Around line 25-34: The compatibility guard only catches ModuleNotFoundError so
an ImportError (module present but missing the symbol) still escapes; update the
except clause around the import of
get_transformer_block_with_experimental_attention_variant_spec from
megatron.core.models.gpt.experimental_attention_variant_module_specs to catch
both ModuleNotFoundError and ImportError, preserving the fallback definition of
get_transformer_block_with_experimental_attention_variant_spec that raises
NotImplementedError.
In `@tests/functional_tests/models/qwen_omni/test_qwen3_omni_conversion.py`:
- Around line 96-100: The subprocess invocation using subprocess.run(cmd,
capture_output=True, text=True, cwd=repo_root, env=env) can hang; add a timeout
(e.g., timeout=300) to the subprocess.run call and wrap the call in a try/except
that catches subprocess.TimeoutExpired and other non-zero return cases, printing
captured stdout/stderr and calling pytest.fail(...) (instead of assert False)
with a clear message on timeout or non-zero returncode; reference the
subprocess.run invocation and the variables cmd, result, repo_root, env in your
changes.
In `@tests/functional_tests/models/qwen_omni/test_qwen3_omni_smoke.py`:
- Around line 48-79: The test currently mutates process-global state in
_init_dist and _init_model_parallel (calls to dist.init_process_group,
parallel_state.initialize_model_parallel and setting os.environ MASTER_PORT), so
move the Megatron smoke path into a separate subprocess: create a small helper
entrypoint (or script) that performs the logic in _init_dist and
_init_model_parallel, use subprocess.run (or multiprocessing.Process with spawn)
to execute that entrypoint from the pytest worker, pass any needed env via the
subprocess env (use a dynamically chosen free port instead of a fixed
MASTER_PORT), assert the subprocess exit code and capture stdout/stderr for
failures, and remove direct calls to dist.init_process_group /
parallel_state.initialize_model_parallel from the main test process to avoid
polluting later tests.
In `@tests/functional_tests/models/qwen_omni/utils.py`:
- Around line 72-110: The function create_qwen3_omni_smoke_model has a TOCTOU
race around the exists/rmtree/mkdir sequence; serialize model creation using the
existing SMOKE_LOCK_DIR to prevent two workers from simultaneously
building/deleting the same output_dir. Wrap the cache-check and build logic in
an exclusive lock (create or acquire a lock file inside SMOKE_LOCK_DIR or use a
FileLock), so only one process builds: check if output exists after acquiring
the lock, if not build into a temporary directory (e.g.,
output_dir.with_suffix(".tmp") or ·output_dir + ".tmp"·), write config and
weights there, then atomically rename/replace the temp dir to output_dir, and
finally release the lock; ensure the lock acquisition/release is in try/finally
and do not remove output_dir if it already exists after reacquiring the lock.
Use the symbols create_qwen3_omni_smoke_model and SMOKE_LOCK_DIR to locate where
to add the lock and temp-dir logic.
In `@tests/unit_tests/models/qwen_omni/conftest.py`:
- Around line 18-26: The current autouse fixture clear_lru_cache in
tests/unit_tests/models/qwen_omni/conftest.py unconditionally becomes a no-op;
change it to attempt to import or call the repository-wide clear_lru_cache
behavior and only skip if that import fails — otherwise retain the
clear-before/clear-after semantics around yield so tests that call
read_run_config or read_train_state still get a cleared LRU cache; implement
this by importing the upstream fixture or its clear function inside the conftest
fixture (referencing clear_lru_cache) and delegating to it (or wrapping its
setup/teardown) when available, falling back to the current yield-only noop only
on ImportError or AttributeError.
In `@tests/unit_tests/models/qwen_omni/modeling_qwen3_omni/test_omni_model.py`:
- Around line 98-115: The setup_class currently hardcodes MASTER_PORT="29501",
causing port conflicts for concurrent test runs; modify setup_class to pick an
available ephemeral port (e.g., bind to port 0 on a temporary socket to discover
a free port) and set os.environ["MASTER_PORT"] to that value before calling
dist.init_process_group (keep referencing setup_class and
dist.init_process_group), and implement a matching teardown_class that restores
any previous MASTER_ADDR/MASTER_PORT/RANK/LOCAL_RANK/WORLD_SIZE environment
variables (or removes the ones you set) and calls dist.destroy_process_group if
initialized to clean up the session; ensure you preserve any pre-existing env
values so teardown_class can restore them.
- Around line 175-182: The tests build a Transformer Engine layer spec
unconditionally via _make_layer_spec using
get_gpt_layer_with_transformer_engine_spec (which requires CUDA), causing
CPU-only runs to fail; fix by either (A) adding `@pytest.mark.run_only_on`("gpu")
to the affected test functions so they skip when CUDA is unavailable, or (B)
change _make_layer_spec to detect CUDA availability and return
get_gpt_layer_local_spec() when CUDA is not present (mimic the pattern from
gpt_builder.py/gpt_provider.py); update only the _make_layer_spec implementation
or annotate the test functions accordingly and ensure references remain to
_make_layer_spec, get_gpt_layer_with_transformer_engine_spec, and
get_gpt_layer_local_spec.
In `@tests/unit_tests/models/qwen_omni/test_qwen3_omni_provider.py`:
- Around line 19-22: This test module is missing the module-level pytest unit
marker; add "import pytest" and set "pytestmark = pytest.mark.unit" near the top
alongside the existing import (i.e., next to the "from
megatron.bridge.models.qwen_omni import Qwen3OmniModelProvider") so the
TestQwen3OmniModelProvider class is discovered by marker-based test selection.
---
Nitpick comments:
In `@src/megatron/bridge/models/qwen_omni/modeling_qwen3_omni/model.py`:
- Around line 33-107: The Qwen3OmniModel class and its public methods (__init__,
shared_embedding_or_output_weight, set_input_tensor, freeze, forward) lack
Google-style docstrings; add Sphinx-parseable docstrings to the class and each
public method describing Args, Returns, and Raises (where applicable),
documenting parameter types/semantics (e.g., input_ids: torch.Tensor,
position_ids: torch.Tensor | None, inference_params: InferenceParams | None),
the return value of forward (torch.Tensor), and any exceptions set_input_tensor
or freeze may raise; ensure the class-level docstring explains that this is a
wrapper delegating to Qwen3OmniThinkerModel and notes which methods delegate to
thinker so generated docs are clear.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: c0051edc-104e-4136-b6ed-e2820150003b
📒 Files selected for processing (22)
docs/models/vlm/index.mddocs/models/vlm/qwen3-omni.mdsrc/megatron/bridge/models/__init__.pysrc/megatron/bridge/models/qwen/qwen_provider.pysrc/megatron/bridge/models/qwen_omni/__init__.pysrc/megatron/bridge/models/qwen_omni/modeling_qwen3_omni/__init__.pysrc/megatron/bridge/models/qwen_omni/modeling_qwen3_omni/model.pysrc/megatron/bridge/models/qwen_omni/modeling_qwen3_omni/rope.pysrc/megatron/bridge/models/qwen_omni/modeling_qwen3_omni/thinker_model.pysrc/megatron/bridge/models/qwen_omni/modeling_qwen3_omni/transformer_config.pysrc/megatron/bridge/models/qwen_omni/qwen3_omni_bridge.pysrc/megatron/bridge/models/qwen_omni/qwen3_omni_provider.pysrc/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/rope.pysrc/megatron/bridge/models/qwen_vl/qwen35_vl_provider.pytests/functional_tests/models/qwen_omni/__init__.pytests/functional_tests/models/qwen_omni/test_qwen3_omni_conversion.pytests/functional_tests/models/qwen_omni/test_qwen3_omni_smoke.pytests/functional_tests/models/qwen_omni/utils.pytests/unit_tests/models/qwen_omni/conftest.pytests/unit_tests/models/qwen_omni/modeling_qwen3_omni/test_omni_model.pytests/unit_tests/models/qwen_omni/test_qwen3_omni_bridge.pytests/unit_tests/models/qwen_omni/test_qwen3_omni_provider.py
Signed-off-by: hbhflw2000 <417911774@qq.com>
Signed-off-by: hbhflw2000 <417911774@qq.com>
|
/claude review |
| ) -> torch.Tensor: | ||
| if feature_attention_mask is not None: | ||
| audio_feature_lengths = torch.sum(feature_attention_mask, dim=1) | ||
| input_features = input_features.permute(0, 2, 1)[feature_attention_mask.bool()].permute(1, 0) |
There was a problem hiding this comment.
Nit/potential bug: When feature_attention_mask is not None, line 202 does .permute(0, 2, 1)[mask].permute(1, 0) — but after boolean indexing, the result is 2D (num_selected, num_mel_bins), not 3D, so the second .permute(1, 0) transposes to (num_mel_bins, num_selected). This matches the HF upstream behavior, but it's worth double-checking that self.audio_model expects (feature_dim, time) layout rather than (time, feature_dim) after masking.
There was a problem hiding this comment.
This is intentional and matches the HF Qwen3-Omni thinker implementation.
The masked audio features are reshaped into the layout expected by the reused HF audio encoder path. We kept this behavior aligned with upstream, and the current audio unit/smoke validation did not show a shape mismatch on this path.
| # See the License for the specific language governing permissions and | ||
| # limitations under the License. | ||
|
|
||
| """Functional tests for local Qwen3-Omni smoke-checkpoint creation and conversion.""" |
There was a problem hiding this comment.
need L0 bash script to actual run the tests, otherwise this test would not run. Check latest contribute.md
There was a problem hiding this comment.
Thanks for calling this out.
I followed up on the latest CONTRIBUTING.md guidance and wired this through the functional CI path using an L2_Launch_*.sh launcher, which is the same pattern currently used by Qwen VL / Qwen3.5 VL model functional tests.
Specifically, I added:
tests/functional_tests/L2_Launch_models_qwen3_omni.sh- the corresponding
cicd-functional-testsmatrix entry in.github/workflows/cicd-main.yml
So this test is now connected to the functional CI workflow rather than remaining only as a standalone local pytest file.
I also validated the path locally with:
- a local CI-equivalent toy functional conversion run
- local full-model multimodal inference
- local smoke HF -> Megatron -> HF conversion/inference
Signed-off-by: hbhflw2000 <417911774@qq.com>
Signed-off-by: hbhflw2000 <417911774@qq.com>
Signed-off-by: hbhflw2000 <417911774@qq.com>
Signed-off-by: hbhflw2000 <417911774@qq.com>
|
Thanks again for the detailed review. I’ve addressed the review feedback in the latest updates, including the example scripts, functional CI launcher integration, and the related code/doc cleanups. I also re-ran the local validation paths for the toy functional conversion flow and the local multimodal smoke paths. Whenever you have time, this should be ready for another review pass. |
|
@hbhflw2000 we had a functional test refactor recently. Could you please rename L2_Launch_models_qwen3_omni.sh -> L0_Launch_models_qwen3_omni.sh and resolve the conflict in cicd-main.yml by adding L0_Launch_models_qwen3_omni.sh to the current list of tests? Thanks |
Signed-off-by: hbhflw2000 <417911774@qq.com>
|
Thanks for the update. I’ve renamed the launcher to I’ve also forward-ported the branch onto the current upstream Qwen layout in Please take another look when convenient, especially at the current mergeability / review state. |
Signed-off-by: hbhflw2000 <417911774@qq.com>
|
/ok to test 0c1ff02 |
@suiyoubi, there was an error processing your request: See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/2/ |
|
Thanks again for the previous review. It looks like the last From my side, I believe all author-side follow-ups have been addressed. If there is anything still pending for Could you help retry the CI when convenient? Thanks! |
What does this PR do?
This PR adds initial Qwen3-Omni support to Megatron Bridge, including:
The current scope focuses on checkpoint conversion and thinker-side multimodal runtime enablement.
Summary
New Features
position_embedding_type="mrope"in the provider/bridge path.Implementation
Added the following components under
src/megatron/bridge/models/qwen_omni/:Tests
Added unit tests for:
Added functional tests for:
Documentation
docs/models/vlm/qwen3-omni.md.README.mdfrom smoke artifactsValidation
Local test runs
Observed results
Current Scope
This PR currently covers:
Not included in this PR
The following are out of scope for this initial support PR:
inference_paramsNotes
CI
See the CI section in the contributing documentation for how to trigger CI.
For external contributors, an NVIDIA developer must approve and trigger the CI workflow.
Pre-review Checklist
If some items are still in progress, this PR can remain in Draft until ready.
Summary by CodeRabbit
Release Notes
New Features
Documentation
Tests