Skip to content

[model] Add initial Qwen3-Omni support in Megatron-Bridge#2831

Closed
hbhflw2000 wants to merge 14 commits into
NVIDIA-NeMo:mainfrom
hbhflw2000:omni3
Closed

[model] Add initial Qwen3-Omni support in Megatron-Bridge#2831
hbhflw2000 wants to merge 14 commits into
NVIDIA-NeMo:mainfrom
hbhflw2000:omni3

Conversation

@hbhflw2000

@hbhflw2000 hbhflw2000 commented Mar 17, 2026

Copy link
Copy Markdown
Contributor

What does this PR do?

This PR adds initial Qwen3-Omni support to Megatron Bridge, including:

  • Hugging Face ↔ Megatron checkpoint conversion
  • Thinker-side multimodal runtime support for text, image, video, and audio
  • Unit and functional test coverage
  • Model documentation

The current scope focuses on checkpoint conversion and thinker-side multimodal runtime enablement.


Summary

New Features

  • Added Qwen3-Omni model support to Megatron Bridge.
  • Implemented Hugging Face ↔ Megatron conversion for Qwen3-Omni checkpoints.
  • Added thinker-side multimodal runtime support covering:
    • text
    • image
    • video
    • audio
  • Reused the existing Qwen3-VL language and vision path where applicable, and integrated the Hugging Face Qwen3-Omni vision/audio towers.
  • Added multimodal RoPE handling and Megatron runtime compatibility fixes for Qwen3-Omni.
  • Fixed mRoPE config propagation by setting position_embedding_type="mrope" in the provider/bridge path.
  • Registered Qwen3-Omni in Bridge model exports and AutoBridge support.

Implementation

Added the following components under src/megatron/bridge/models/qwen_omni/:

  • provider
  • bridge
  • transformer config
  • model wrapper
  • thinker runtime

Tests

Added unit tests for:

  • provider
  • bridge
  • model

Added functional tests for:

  • smoke checkpoint creation
  • single-GPU HF → Megatron → HF roundtrip conversion
  • Hugging Face multimodal end-to-end smoke
  • Megatron multimodal end-to-end smoke

Documentation

  • Added Qwen3-Omni model documentation under docs/models/vlm/qwen3-omni.md.
  • Updated docs and test utilities to be more upstream-friendly:
    • removed copied upstream README.md from smoke artifacts
    • allowed local asset paths to be overridden via environment variables
    • replaced development-plan style docs with model support documentation

Validation

Local test runs

PYTHONPATH=src pytest tests/unit_tests/models/qwen_omni/test_qwen3_omni_provider.py \
  tests/unit_tests/models/qwen_omni/test_qwen3_omni_bridge.py \
  tests/unit_tests/models/qwen_omni/modeling_qwen3_omni/test_omni_model.py -q

PYTHONPATH=src pytest tests/functional_tests/models/qwen_omni -q

Observed results

  • Targeted unit tests: 11/11 passed
  • Qwen3-Omni functional suite: 4/4 passed

Current Scope

This PR currently covers:

  • checkpoint conversion
  • thinker-side multimodal runtime support
  • basic single-rank functional validation

Not included in this PR

The following are out of scope for this initial support PR:

  • Megatron inference with inference_params
  • packed sequence support
  • vision runtime with sequence parallel
  • distributed parallelism validation beyond the current single-rank functional coverage

Notes

  • Functional smoke tests use user-provided local assets.
  • Asset, model, and cache paths can be overridden through environment variables.

CI

See the CI section in the contributing documentation for how to trigger CI.

For external contributors, an NVIDIA developer must approve and trigger the CI workflow.


Pre-review Checklist

  • Read and followed the contributor guidelines
  • Added necessary tests
  • Added or updated necessary documentation
  • Does this PR affect optional-install components? (for example: Numba, Pynini, Apex)
  • Reviewer: verify correct import guards for optional libraries

If some items are still in progress, this PR can remain in Draft until ready.

Summary by CodeRabbit

Release Notes

  • New Features

    • Added Qwen3-Omni multimodal model support with integrated audio, vision, and text processing capabilities.
    • Implemented model-to-Megatron conversion bridge and provider for Qwen3-Omni.
    • Added multimodal rotary position embedding support for accurate token positioning across modalities.
  • Documentation

    • Added comprehensive Qwen3-Omni model documentation covering architecture, implementation details, and supported workflows.
  • Tests

    • Added functional and unit tests for Qwen3-Omni model validation and conversion.

Signed-off-by: Lianglipeng <lianglipeng@didiglobal.com>
Signed-off-by: hbhflw2000 <417911774@qq.com>
Signed-off-by: Lianglipeng <lianglipeng@didiglobal.com>
Signed-off-by: hbhflw2000 <417911774@qq.com>
Signed-off-by: Lianglipeng <lianglipeng@didiglobal.com>
Signed-off-by: hbhflw2000 <417911774@qq.com>
Signed-off-by: hbhflw2000 <417911774@qq.com>
Signed-off-by: hbhflw2000 <417911774@qq.com>
@copy-pr-bot

copy-pr-bot Bot commented Mar 17, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@hbhflw2000 hbhflw2000 changed the title Omni3 [model,test] Add initial Qwen3-Omni support in Megatron-Bridge Mar 17, 2026
@coderabbitai

coderabbitai Bot commented Mar 17, 2026

Copy link
Copy Markdown
Contributor
📝 Walkthrough

Walkthrough

This pull request adds comprehensive support for the Qwen3-Omni multimodal model to the Megatron-Bridge framework. It introduces model wrappers, bridge/provider implementations, multimodal RoPE position indexing logic, transformer configurations, and extensive test coverage for vision and audio modalities alongside existing language capabilities.

Changes

Cohort / File(s) Summary
Documentation
docs/models/vlm/index.md, docs/models/vlm/qwen3-omni.md
Added Qwen3-Omni documentation entry and new model documentation file describing architecture, checkpoint mappings, supported workflows, and validation coverage.
Core Model Integration
src/megatron/bridge/models/__init__.py
Exported three new public entities (Qwen3OmniBridge, Qwen3OmniModel, Qwen3OmniModelProvider) for Qwen3-Omni support under the Omni Models section.
Qwen3-Omni Model Implementations
src/megatron/bridge/models/qwen_omni/__init__.py, src/megatron/bridge/models/qwen_omni/modeling_qwen3_omni/__init__.py, src/megatron/bridge/models/qwen_omni/modeling_qwen3_omni/model.py
Introduced Qwen3OmniModel wrapper class delegating to Qwen3OmniThinkerModel with support for language, thinker, talker, and code2wav components; package-level exports for model access.
Transformer Configuration & RoPE
src/megatron/bridge/models/qwen_omni/modeling_qwen3_omni/transformer_config.py, src/megatron/bridge/models/qwen_omni/modeling_qwen3_omni/rope.py
Added Qwen3OmniTransformerConfig dataclass with multimodal token IDs and audio/vision parameters; implemented complex multimodal RoPE position indexing logic supporting image, video, and audio modalities with grid-based spatial-temporal positioning.
Multimodal Thinker Model
src/megatron/bridge/models/qwen_omni/modeling_qwen3_omni/thinker_model.py
Introduced Qwen3OmniThinkerModel orchestrating text, vision, and audio processing with multimodal embedding fusion, feature alignment, and MRope-based position encoding; supports selective freezing of sub-components.
Bridge & Provider
src/megatron/bridge/models/qwen_omni/qwen3_omni_bridge.py, src/megatron/bridge/models/qwen_omni/qwen3_omni_provider.py
Implemented Qwen3OmniBridge connecting HF Qwen3OmniMoeForConditionalGeneration to Megatron format with parameter mapping registry; added Qwen3OmniModelProvider dataclass extending MoE provider with multimodal configuration knobs and model instantiation logic.
Qwen Provider Fallbacks
src/megatron/bridge/models/qwen/qwen_provider.py, src/megatron/bridge/models/qwen_vl/qwen35_vl_provider.py, src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/rope.py
Replaced direct imports of experimental attention variant specs with guarded try/except blocks and fallback functions raising NotImplementedError; added get_rotary_seq_len helper for inference context compatibility.
Test Infrastructure
tests/unit_tests/models/qwen_omni/conftest.py, tests/functional_tests/models/qwen_omni/__init__.py
Added pytest configuration overriding global cache-clearing fixture and test module initialization.
Functional Tests
tests/functional_tests/models/qwen_omni/test_qwen3_omni_conversion.py, tests/functional_tests/models/qwen_omni/test_qwen3_omni_smoke.py, tests/functional_tests/models/qwen_omni/utils.py
Implemented end-to-end conversion and smoke tests validating HF and Megatron model compatibility; added utility functions for creating minimal smoke models and preparing real sample inputs with vision/audio data.
Unit Tests
tests/unit_tests/models/qwen_omni/modeling_qwen3_omni/test_omni_model.py, tests/unit_tests/models/qwen_omni/test_qwen3_omni_bridge.py, tests/unit_tests/models/qwen_omni/test_qwen3_omni_provider.py
Added comprehensive unit tests for Qwen3OmniModel multimodal forward paths (text, image, audio), bridge configuration and dtype handling, and provider initialization with custom thinker configurations.

Sequence Diagram(s)

sequenceDiagram
    participant Input as Input Data
    participant TM as Qwen3OmniThinkerModel
    participant VE as Vision Encoder
    participant AE as Audio Encoder
    participant LM as Language Model
    participant Output as Output

    Input->>TM: forward(input_ids, pixel_values, input_features, ...)
    
    rect rgba(100, 150, 200, 0.5)
    Note over TM,AE: Feature Extraction Phase
    
    alt has vision features
        TM->>VE: get_image_features(pixel_values, image_grid_thw)
        VE-->>TM: image_embeddings
        TM->>VE: get_video_features(pixel_values_videos, video_grid_thw)
        VE-->>TM: video_embeddings
    end
    
    alt has audio features
        TM->>AE: get_audio_features(input_features, feature_attention_mask)
        AE-->>TM: audio_embeddings
    end
    end
    
    rect rgba(150, 100, 200, 0.5)
    Note over TM,LM: Embedding Fusion & Position Encoding Phase
    
    TM->>TM: compute position_ids via get_rope_index()
    TM->>TM: build inputs_embeds from input_ids
    TM->>TM: merge image/video/audio embeddings
    TM->>TM: apply attention masks and spatial merging
    end
    
    rect rgba(100, 200, 150, 0.5)
    Note over TM,LM: Language Model Forward Phase
    
    TM->>LM: forward(decoder_input, position_ids, attention_mask, ...)
    LM-->>TM: logits
    end
    
    TM-->>Output: logits
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

The review demands careful attention across multiple heterogeneous components: multimodal forward flow logic in the thinker model, intricate RoPE position indexing for image/video/audio grid-based positioning, bridge parameter mapping registries, provider configuration propagation, and comprehensive test coverage spanning unit and functional domains. While individual files follow consistent patterns, the cross-file dependencies and complex multimodal fusion semantics require substantive reasoning per component area.

Possibly related PRs

  • #2634 (Qwen2.5-Omni multimodal bridge/provider): Introduces analogous "Omni" model family structure for a different Qwen variant with similar multimodal architecture, bridge/provider/rope/config modules, and test patterns.

Suggested labels

community-request

Suggested reviewers

  • yaoyu-33
  • ko3n1g
  • chtruong814
🚥 Pre-merge checks | ✅ 2 | ❌ 2

❌ Failed checks (2 warnings)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 16.28% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Test Results For Major Changes ⚠️ Warning PR introduces major Qwen3-Omni multimodal support (~1000 lines) but no concrete test execution results, logs, or validation outputs are documented despite PR objectives stating they should be provided. Add test execution results to PR description: unit/functional test outputs, checkpoint conversion results, end-to-end smoke test results with multimodal inputs, and verification that review comment code issues are fixed.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The PR title '[model] Add initial Qwen3-Omni support in Megatron-Bridge' accurately summarizes the main change: introducing Qwen3-Omni model support to the Megatron-Bridge framework with provider, bridge, config, and runtime components.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 17

🧹 Nitpick comments (1)
src/megatron/bridge/models/qwen_omni/modeling_qwen3_omni/model.py (1)

33-107: Add Google-style docstrings to the exported wrapper API.

Qwen3OmniModel is a new public entrypoint, but the class and its delegated methods do not currently describe Args, Returns, or Raises in a Sphinx-parseable way.

As per coding guidelines, "Use Google style docstrings for classes and functions (parseable by Sphinx), including Args, Returns, and Raises sections."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/megatron/bridge/models/qwen_omni/modeling_qwen3_omni/model.py` around
lines 33 - 107, The Qwen3OmniModel class and its public methods (__init__,
shared_embedding_or_output_weight, set_input_tensor, freeze, forward) lack
Google-style docstrings; add Sphinx-parseable docstrings to the class and each
public method describing Args, Returns, and Raises (where applicable),
documenting parameter types/semantics (e.g., input_ids: torch.Tensor,
position_ids: torch.Tensor | None, inference_params: InferenceParams | None),
the return value of forward (torch.Tensor), and any exceptions set_input_tensor
or freeze may raise; ensure the class-level docstring explains that this is a
wrapper delegating to Qwen3OmniThinkerModel and notes which methods delegate to
thinker so generated docs are clear.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/models/vlm/qwen3-omni.md`:
- Around line 7-42: Add a new "Usage" subsection under the top of qwen3-omni.md
showing a minimal HF↔Megatron workflow for Qwen3-Omni-30B-A3B-Instruct (one-line
commands for Hugging Face -> Megatron bridge conversion and Megatron -> Hugging
Face export and a short example of running the single-GPU smoke validation), and
add a "Known limitations" subsection that explicitly lists unsupported
runtime/validation paths mentioned in the PR: inference_params, packed
sequences, sequence-parallel vision, and multi-rank/distributed validation;
ensure both headings are concise, call out the model name
Qwen3-Omni-30B-A3B-Instruct, and reference the bridge/export/validation steps
already described so readers know where to find full commands.

In `@src/megatron/bridge/models/qwen_omni/modeling_qwen3_omni/rope.py`:
- Around line 83-87: The current logic computing video_nums and detecting
subsequent vision blocks misses cases where vision blocks are represented by
audio_start_token_id (use_audio_in_video=True), causing ed_vision_start to skip
the combined branch; update the detection to treat audio-start tokens as valid
vision-block markers: when computing video_nums (and where code checks
input_tokens for presence of image_token_id/video_token_id) include
audio_start_token_id as an alternative marker, and when scanning for
ed_vision_start or looping over vision blocks (references: video_nums,
vision_tokens, use_audio_in_video, audio_start_token_id, video_token_id,
image_token_id, input_tokens, ed_vision_start, vision_start_token_id) ensure the
loop/condition tests for any of audio_start_token_id OR video_token_id OR
image_token_id so audio-in-video blocks are not skipped.
- Around line 64-65: The current multimodal branch only handles image/video and
skips pure-audio, causing the cumsum fallback to assert on missing
attention_mask; update the conditional logic in the block that sets
total_input_ids (and the similar block around the 215-220 region) to also detect
audio-only inputs (e.g., presence of audio grid/length variables used in this
module) and perform the audio-length expansion of total_input_ids/positions
before falling back to the plain cumsum path; ensure the fallback that asserts
on attention_mask runs only after all modality-specific expansions (image,
video, audio) are attempted so pure-audio requests produce sequential positions
instead of hitting the assert.

In `@src/megatron/bridge/models/qwen_omni/modeling_qwen3_omni/thinker_model.py`:
- Around line 315-323: The zip over deepstack_visual_embeds and
video_embeds_multiscale should fail fast on unequal lengths: update the loop
that builds joint_embeds (the for image_embed, video_embed in
zip(deepstack_visual_embeds, video_embeds_multiscale) block) to call zip with
strict=True (i.e., zip(deepstack_visual_embeds, video_embeds_multiscale,
strict=True)) so a mismatch raises immediately instead of silently truncating;
keep the rest of the logic (image_mask_joint/video_mask_joint, embed_joint
creation, and appending to joint_embeds) unchanged.
- Around line 243-261: The code currently only calls get_rope_index when visual
inputs exist, causing audio-only batches to use
_build_text_only_mrope_position_ids and ignore
position_id_per_seconds/seconds_per_chunk; update the conditional that sets
position_ids in the block that references pixel_values and pixel_values_videos
so it also detects audio inputs (e.g., audio_feature_lengths or audio-related
tensors like audio_feature_lengths/audio_seqlens) and calls get_rope_index when
audio is present; modify the branch around position_ids assignment (the logic
using _build_text_only_mrope_position_ids, get_rope_index, image_grid_thw,
video_grid_thw, audio_feature_lengths, position_id_per_seconds,
seconds_per_chunk) to ensure audio-only batches route through get_rope_index so
temporal rope indices for audio are applied.

In `@src/megatron/bridge/models/qwen_omni/qwen3_omni_bridge.py`:
- Around line 40-45: The provider_bridge function currently accepts
talker_config and code2wav_config but only maps thinker-side weights, which can
silently drop audio-output stacks for checkpoints like
Qwen3OmniMoeForConditionalGeneration; update provider_bridge (the method that
inspects hf_pretrained.config, thinker_config, talker_config, code2wav_config,
and text_config) to detect when talker_config or code2wav_config indicate audio
output (e.g., an enable_audio_output flag or equivalent) and immediately raise a
clear exception identifying that full-Omni audio-output checkpoints are not
supported, instead of proceeding with partial conversion.
- Around line 51-86: The provider instantiation ignores HF config fields used
later at runtime—populate language_max_sequence_length (used by Qwen3VLGPTModel
in Qwen3OmniModelProvider.provide) and spatial_merge_size (used by
Qwen3OmniThinkerModel.forward via self.config.spatial_merge_size) from the
HF/text or thinker configs when creating Qwen3OmniModelProvider; update the
Qwen3OmniModelProvider call site to pass
language_max_sequence_length=getattr(text_config,
"language_max_sequence_length", text_config.max_position_embeddings) and
spatial_merge_size=getattr(thinker_config, "spatial_merge_size",
<reasonable-default>) so the runtime config matches the checkpoint’s actual
context window and vision merge geometry.

In `@src/megatron/bridge/models/qwen_omni/qwen3_omni_provider.py`:
- Around line 67-93: The provide method forwards
pre_process/post_process/vp_stage as None instead of deriving the provider's
pipeline-stage defaults, causing Qwen3OmniModel/Qwen3OmniThinkerModel to skip
tower construction; fix by resolving those defaults before constructing
Qwen3OmniModel: compute resolved_pre_process and resolved_post_process (derive
from vp_stage or the provider's pipeline-stage default logic used elsewhere) so
they are explicit booleans (not None), and pass those resolved values into the
Qwen3OmniModel constructor (keep the same call to model.freeze as-is).

In `@src/megatron/bridge/models/qwen_vl/qwen35_vl_provider.py`:
- Around line 63-71: The except block currently only catches
ModuleNotFoundError, which misses ImportError when the module exists but the
symbol is absent; update the exception handler for the import of
get_transformer_block_with_experimental_attention_variant_spec to catch both
ModuleNotFoundError and ImportError (i.e., except (ModuleNotFoundError,
ImportError):) and keep the fallback function
get_transformer_block_with_experimental_attention_variant_spec that raises
NotImplementedError unchanged.

In `@src/megatron/bridge/models/qwen/qwen_provider.py`:
- Around line 25-34: The compatibility guard only catches ModuleNotFoundError so
an ImportError (module present but missing the symbol) still escapes; update the
except clause around the import of
get_transformer_block_with_experimental_attention_variant_spec from
megatron.core.models.gpt.experimental_attention_variant_module_specs to catch
both ModuleNotFoundError and ImportError, preserving the fallback definition of
get_transformer_block_with_experimental_attention_variant_spec that raises
NotImplementedError.

In `@tests/functional_tests/models/qwen_omni/test_qwen3_omni_conversion.py`:
- Around line 96-100: The subprocess invocation using subprocess.run(cmd,
capture_output=True, text=True, cwd=repo_root, env=env) can hang; add a timeout
(e.g., timeout=300) to the subprocess.run call and wrap the call in a try/except
that catches subprocess.TimeoutExpired and other non-zero return cases, printing
captured stdout/stderr and calling pytest.fail(...) (instead of assert False)
with a clear message on timeout or non-zero returncode; reference the
subprocess.run invocation and the variables cmd, result, repo_root, env in your
changes.

In `@tests/functional_tests/models/qwen_omni/test_qwen3_omni_smoke.py`:
- Around line 48-79: The test currently mutates process-global state in
_init_dist and _init_model_parallel (calls to dist.init_process_group,
parallel_state.initialize_model_parallel and setting os.environ MASTER_PORT), so
move the Megatron smoke path into a separate subprocess: create a small helper
entrypoint (or script) that performs the logic in _init_dist and
_init_model_parallel, use subprocess.run (or multiprocessing.Process with spawn)
to execute that entrypoint from the pytest worker, pass any needed env via the
subprocess env (use a dynamically chosen free port instead of a fixed
MASTER_PORT), assert the subprocess exit code and capture stdout/stderr for
failures, and remove direct calls to dist.init_process_group /
parallel_state.initialize_model_parallel from the main test process to avoid
polluting later tests.

In `@tests/functional_tests/models/qwen_omni/utils.py`:
- Around line 72-110: The function create_qwen3_omni_smoke_model has a TOCTOU
race around the exists/rmtree/mkdir sequence; serialize model creation using the
existing SMOKE_LOCK_DIR to prevent two workers from simultaneously
building/deleting the same output_dir. Wrap the cache-check and build logic in
an exclusive lock (create or acquire a lock file inside SMOKE_LOCK_DIR or use a
FileLock), so only one process builds: check if output exists after acquiring
the lock, if not build into a temporary directory (e.g.,
output_dir.with_suffix(".tmp") or ·output_dir + ".tmp"·), write config and
weights there, then atomically rename/replace the temp dir to output_dir, and
finally release the lock; ensure the lock acquisition/release is in try/finally
and do not remove output_dir if it already exists after reacquiring the lock.
Use the symbols create_qwen3_omni_smoke_model and SMOKE_LOCK_DIR to locate where
to add the lock and temp-dir logic.

In `@tests/unit_tests/models/qwen_omni/conftest.py`:
- Around line 18-26: The current autouse fixture clear_lru_cache in
tests/unit_tests/models/qwen_omni/conftest.py unconditionally becomes a no-op;
change it to attempt to import or call the repository-wide clear_lru_cache
behavior and only skip if that import fails — otherwise retain the
clear-before/clear-after semantics around yield so tests that call
read_run_config or read_train_state still get a cleared LRU cache; implement
this by importing the upstream fixture or its clear function inside the conftest
fixture (referencing clear_lru_cache) and delegating to it (or wrapping its
setup/teardown) when available, falling back to the current yield-only noop only
on ImportError or AttributeError.

In `@tests/unit_tests/models/qwen_omni/modeling_qwen3_omni/test_omni_model.py`:
- Around line 98-115: The setup_class currently hardcodes MASTER_PORT="29501",
causing port conflicts for concurrent test runs; modify setup_class to pick an
available ephemeral port (e.g., bind to port 0 on a temporary socket to discover
a free port) and set os.environ["MASTER_PORT"] to that value before calling
dist.init_process_group (keep referencing setup_class and
dist.init_process_group), and implement a matching teardown_class that restores
any previous MASTER_ADDR/MASTER_PORT/RANK/LOCAL_RANK/WORLD_SIZE environment
variables (or removes the ones you set) and calls dist.destroy_process_group if
initialized to clean up the session; ensure you preserve any pre-existing env
values so teardown_class can restore them.
- Around line 175-182: The tests build a Transformer Engine layer spec
unconditionally via _make_layer_spec using
get_gpt_layer_with_transformer_engine_spec (which requires CUDA), causing
CPU-only runs to fail; fix by either (A) adding `@pytest.mark.run_only_on`("gpu")
to the affected test functions so they skip when CUDA is unavailable, or (B)
change _make_layer_spec to detect CUDA availability and return
get_gpt_layer_local_spec() when CUDA is not present (mimic the pattern from
gpt_builder.py/gpt_provider.py); update only the _make_layer_spec implementation
or annotate the test functions accordingly and ensure references remain to
_make_layer_spec, get_gpt_layer_with_transformer_engine_spec, and
get_gpt_layer_local_spec.

In `@tests/unit_tests/models/qwen_omni/test_qwen3_omni_provider.py`:
- Around line 19-22: This test module is missing the module-level pytest unit
marker; add "import pytest" and set "pytestmark = pytest.mark.unit" near the top
alongside the existing import (i.e., next to the "from
megatron.bridge.models.qwen_omni import Qwen3OmniModelProvider") so the
TestQwen3OmniModelProvider class is discovered by marker-based test selection.

---

Nitpick comments:
In `@src/megatron/bridge/models/qwen_omni/modeling_qwen3_omni/model.py`:
- Around line 33-107: The Qwen3OmniModel class and its public methods (__init__,
shared_embedding_or_output_weight, set_input_tensor, freeze, forward) lack
Google-style docstrings; add Sphinx-parseable docstrings to the class and each
public method describing Args, Returns, and Raises (where applicable),
documenting parameter types/semantics (e.g., input_ids: torch.Tensor,
position_ids: torch.Tensor | None, inference_params: InferenceParams | None),
the return value of forward (torch.Tensor), and any exceptions set_input_tensor
or freeze may raise; ensure the class-level docstring explains that this is a
wrapper delegating to Qwen3OmniThinkerModel and notes which methods delegate to
thinker so generated docs are clear.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: c0051edc-104e-4136-b6ed-e2820150003b

📥 Commits

Reviewing files that changed from the base of the PR and between 57d10ad and 2ba2be9.

📒 Files selected for processing (22)
  • docs/models/vlm/index.md
  • docs/models/vlm/qwen3-omni.md
  • src/megatron/bridge/models/__init__.py
  • src/megatron/bridge/models/qwen/qwen_provider.py
  • src/megatron/bridge/models/qwen_omni/__init__.py
  • src/megatron/bridge/models/qwen_omni/modeling_qwen3_omni/__init__.py
  • src/megatron/bridge/models/qwen_omni/modeling_qwen3_omni/model.py
  • src/megatron/bridge/models/qwen_omni/modeling_qwen3_omni/rope.py
  • src/megatron/bridge/models/qwen_omni/modeling_qwen3_omni/thinker_model.py
  • src/megatron/bridge/models/qwen_omni/modeling_qwen3_omni/transformer_config.py
  • src/megatron/bridge/models/qwen_omni/qwen3_omni_bridge.py
  • src/megatron/bridge/models/qwen_omni/qwen3_omni_provider.py
  • src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/rope.py
  • src/megatron/bridge/models/qwen_vl/qwen35_vl_provider.py
  • tests/functional_tests/models/qwen_omni/__init__.py
  • tests/functional_tests/models/qwen_omni/test_qwen3_omni_conversion.py
  • tests/functional_tests/models/qwen_omni/test_qwen3_omni_smoke.py
  • tests/functional_tests/models/qwen_omni/utils.py
  • tests/unit_tests/models/qwen_omni/conftest.py
  • tests/unit_tests/models/qwen_omni/modeling_qwen3_omni/test_omni_model.py
  • tests/unit_tests/models/qwen_omni/test_qwen3_omni_bridge.py
  • tests/unit_tests/models/qwen_omni/test_qwen3_omni_provider.py

Comment thread docs/models/vlm/qwen3-omni.md
Comment thread src/megatron/bridge/models/qwen_omni/modeling_qwen3_omni/rope.py Outdated
Comment thread src/megatron/bridge/models/qwen_omni/modeling_qwen3_omni/rope.py
Comment thread tests/functional_tests/models/qwen_omni/utils.py Outdated
Comment thread tests/unit_tests/models/qwen_omni/conftest.py
Comment thread tests/unit_tests/models/qwen_omni/test_qwen3_omni_provider.py
Signed-off-by: hbhflw2000 <417911774@qq.com>
Signed-off-by: hbhflw2000 <417911774@qq.com>
@yaoyu-33

Copy link
Copy Markdown
Contributor

/claude review

@yaoyu-33 yaoyu-33 added needs-follow-up area:model Model implementations and HF bridge logic labels Mar 18, 2026
Comment thread docs/models/vlm/qwen3-omni.md
Comment thread src/megatron/bridge/models/qwen_omni/modeling_qwen3_omni/rope.py Outdated
Comment on lines +199 to +202
) -> torch.Tensor:
if feature_attention_mask is not None:
audio_feature_lengths = torch.sum(feature_attention_mask, dim=1)
input_features = input_features.permute(0, 2, 1)[feature_attention_mask.bool()].permute(1, 0)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit/potential bug: When feature_attention_mask is not None, line 202 does .permute(0, 2, 1)[mask].permute(1, 0) — but after boolean indexing, the result is 2D (num_selected, num_mel_bins), not 3D, so the second .permute(1, 0) transposes to (num_mel_bins, num_selected). This matches the HF upstream behavior, but it's worth double-checking that self.audio_model expects (feature_dim, time) layout rather than (time, feature_dim) after masking.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is intentional and matches the HF Qwen3-Omni thinker implementation.
The masked audio features are reshaped into the layout expected by the reused HF audio encoder path. We kept this behavior aligned with upstream, and the current audio unit/smoke validation did not show a shape mismatch on this path.

Comment thread docs/models/vlm/qwen3-omni.md Outdated
Comment thread docs/models/vlm/qwen3-omni.md Outdated
# See the License for the specific language governing permissions and
# limitations under the License.

"""Functional tests for local Qwen3-Omni smoke-checkpoint creation and conversion."""

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need L0 bash script to actual run the tests, otherwise this test would not run. Check latest contribute.md

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for calling this out.

I followed up on the latest CONTRIBUTING.md guidance and wired this through the functional CI path using an L2_Launch_*.sh launcher, which is the same pattern currently used by Qwen VL / Qwen3.5 VL model functional tests.

Specifically, I added:

  • tests/functional_tests/L2_Launch_models_qwen3_omni.sh
  • the corresponding cicd-functional-tests matrix entry in .github/workflows/cicd-main.yml

So this test is now connected to the functional CI workflow rather than remaining only as a standalone local pytest file.

I also validated the path locally with:

  • a local CI-equivalent toy functional conversion run
  • local full-model multimodal inference
  • local smoke HF -> Megatron -> HF conversion/inference

Signed-off-by: hbhflw2000 <417911774@qq.com>
Signed-off-by: hbhflw2000 <417911774@qq.com>
Signed-off-by: hbhflw2000 <417911774@qq.com>
Signed-off-by: hbhflw2000 <417911774@qq.com>
@hbhflw2000 hbhflw2000 requested a review from a team as a code owner March 19, 2026 09:20
@hbhflw2000

Copy link
Copy Markdown
Contributor Author

Thanks again for the detailed review.

I’ve addressed the review feedback in the latest updates, including the example scripts, functional CI launcher integration, and the related code/doc cleanups. I also re-ran the local validation paths for the toy functional conversion flow and the local multimodal smoke paths.

Whenever you have time, this should be ready for another review pass.

@cuichenx

Copy link
Copy Markdown
Contributor

@hbhflw2000 we had a functional test refactor recently. Could you please rename L2_Launch_models_qwen3_omni.sh -> L0_Launch_models_qwen3_omni.sh and resolve the conflict in cicd-main.yml by adding L0_Launch_models_qwen3_omni.sh to the current list of tests? Thanks

Signed-off-by: hbhflw2000 <417911774@qq.com>
@hbhflw2000

hbhflw2000 commented Mar 23, 2026

Copy link
Copy Markdown
Contributor Author

Thanks for the update.

I’ve renamed the launcher to L0_Launch_models_qwen3_omni.sh and updated cicd-main.yml to add L0_Launch_models_qwen3_omni.sh to the current functional test list.

I’ve also forward-ported the branch onto the current upstream Qwen layout in 31e2295b, so from my side this should have addressed the layout drift and CI launcher alignment issues as well.

Please take another look when convenient, especially at the current mergeability / review state.

Signed-off-by: hbhflw2000 <417911774@qq.com>
@suiyoubi

Copy link
Copy Markdown
Contributor

/ok to test 0c1ff02

@copy-pr-bot

copy-pr-bot Bot commented Mar 30, 2026

Copy link
Copy Markdown

/ok to test 0c1ff02

@suiyoubi, there was an error processing your request: E2

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/2/

@hbhflw2000

Copy link
Copy Markdown
Contributor Author

Thanks again for the previous review.

It looks like the last /ok to test hit an infra issue (E2), so the CI workflows didn’t actually start.

From my side, I believe all author-side follow-ups have been addressed. If there is anything still pending for needs-author, please let me know and I’ll update it right away.

Could you help retry the CI when convenient? Thanks!

@hbhflw2000 hbhflw2000 changed the title [model,test] Add initial Qwen3-Omni support in Megatron-Bridge [model] Add initial Qwen3-Omni support in Megatron-Bridge Apr 15, 2026
@hbhflw2000

Copy link
Copy Markdown
Contributor Author

Closing this in favor of #3317.

#3317 builds on this work and extends it to training, so I’ll keep the review there to avoid duplicate effort.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:model Model implementations and HF bridge logic community-request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants