[model, recipe, examples] feat: add Nemotron-3 Nano Omni support by cuichenx · Pull Request #3760 · NVIDIA-NeMo/Megatron-Bridge

cuichenx · 2026-05-08T23:30:46Z

Summary

Adds end-to-end support for Nemotron-3 Nano Omni (30B-A3B MoE multimodal: MoE Mamba/attention hybrid LM + RADIO vision tower + Parakeet sound encoder), targeting HF architecture NemotronH_Nano_Omni_Reasoning_V3.
New bridge / provider / sound encoder under src/megatron/bridge/models/nemotron_omni/, recipe under src/megatron/bridge/recipes/nemotron_omni/, forward step at src/megatron/bridge/training/nemotron_omni_step.py, Energon task encoder for chat-ML samples with raw-waveform / mel audio, and supporting glue (collate fn, valor32k_avqa maker, audiohandler decoder, packing toggle on EnergonProvider).
New examples/models/vlm/nemotron_3_omni/ directory with conversion script, single- / multi-modality inference, slurm SFT/LoRA scripts, data-prep scripts, and evaluation scripts.

Test plan

Locally verified end-to-end against nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16 on an 8 × H100 80GB node:

Notes:

The [ssm] and [audio] extras (mamba-ssm, causal-conv1d, librosa) are required at install time; decord is needed for video sampling.
Full-parameter SFT does not fit on a single 8×H100 node (Adam fp32 state OOMs at optimizer init); use the 2-node slurm script or freeze_language_model=True for single-node runs.
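A rough back-of-the-envelope estimate may help explain the OOM note above. The sketch below is not a measurement. It assumes about 30e9 total parameters (read off the "30B-A3B" name), bf16 weights, fp32 gradients, and an fp32 master copy plus two fp32 Adam moments, all sharded perfectly across the 8 GPUs. The real layout depends on the parallelism and optimizer settings in the recipe.

```python
# Back-of-the-envelope estimate; every byte count here is an assumption.
PARAMS = 30e9                 # ~30B total parameters, from the "30B-A3B" name
GPUS, GPU_MEM_GB = 8, 80      # one 8 x H100 80GB node

BYTES_PER_PARAM = {
    "bf16 weights": 2,
    "fp32 gradients": 4,        # assumed: gradients kept in fp32
    "fp32 master weights": 4,   # assumed mixed-precision Adam layout
    "fp32 Adam first moment": 4,
    "fp32 Adam second moment": 4,
}


def total_gb(bytes_per_param, params=PARAMS):
    """Total model + optimizer state in GB (1 GB = 1e9 bytes)."""
    return sum(bytes_per_param.values()) * params / 1e9


needed_gb = total_gb(BYTES_PER_PARAM)
available_gb = GPUS * GPU_MEM_GB
headroom_per_gpu_gb = (available_gb - needed_gb) / GPUS

print(f"state: {needed_gb:.0f} GB of {available_gb} GB")
print(f"headroom per GPU: {headroom_per_gpu_gb:.1f} GB")
```

Under these assumptions the static state alone takes 540 GB of the 640 GB on the node, leaving about 12.5 GB per GPU for activations, communication buffers, and the CUDA context. That is consistent with full-parameter SFT needing a second node or a frozen language model.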

🤖 Generated with Claude Code

Adds end-to-end support for Nemotron-3 Nano Omni (30B-A3B MoE multimodal: MoE Mamba/attention hybrid LM + RADIO vision tower + Parakeet sound encoder), targeting HF architecture NemotronH_Nano_Omni_Reasoning_V3:

- Bridge + provider + sound encoder under src/megatron/bridge/models/nemotron_omni/
- Recipe (CORD-V2 SFT/PEFT, VALOR32K-AVQA SFT/PEFT) under src/megatron/bridge/recipes/nemotron_omni/
- Forward step under src/megatron/bridge/training/nemotron_omni_step.py
- Energon task encoder for chat-ML samples with raw-waveform/mel audio
- VLM dataset glue: nemotron_omni_collate_fn, valor32k_avqa maker, audiohandler decoder, packing toggle on EnergonProvider
- Examples under examples/models/vlm/nemotron_3_omni/: README, conversion script, single- and multi-modality inference, slurm SFT/LoRA scripts, data-prep scripts, evaluation scripts

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Chen Cui <chcui@nvidia.com>

copy-pr-bot · 2026-05-08T23:30:50Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

… for video inference

Two small clarifications in the Nemotron-3 Nano Omni example README, based on a fresh end-to-end verification run:

- Checkpoint Conversion → Export: call out that --trust-remote-code is required for the export step, not just import. The exporter loads the HF config, which references the custom modeling module shipped with NemotronH_Nano_Omni_Reasoning_V3.
- Inference: add a callout that the video modes (rows 2 and 4) need `decord` installed, since it is not pulled in by any pyproject extra.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Chen Cui <chcui@nvidia.com>

claude · 2026-05-08T23:35:10Z

Light Code Review

Critical

encode_batch() drops packing metadata (nemotron_omni_task_encoder.py): cu_seqlens, cu_seqlens_unpadded, cu_seqlens_argmin, and max_seqlen are computed in batch() and stored on NemotronOmniTaskBatch, but encode_batch() does not forward them to the training step dict. When pack_sequences=True on the Energon path, the model will silently treat packed sequences as a single long sequence with wrong attention masking. See inline comment for fix.
Debug print left in nemotron_omni_step.py:270-273: import os as _os + env-gated print() for NOMNI_DEBUG_TILES. Should be removed before merge.
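To illustrate the first item, here is a minimal sketch of how packing metadata marks sequence boundaries and how it could be forwarded into the step dict. `PackedBatchSketch`, `pack`, and `encode_batch_sketch` are hypothetical stand-ins, not the code in nemotron_omni_task_encoder.py. Only `cu_seqlens` and `max_seqlen` are modeled; `cu_seqlens_unpadded` and `cu_seqlens_argmin` would be forwarded the same way.

```python
from dataclasses import dataclass
from itertools import accumulate
from typing import List, Optional


@dataclass
class PackedBatchSketch:
    """Hypothetical stand-in for NemotronOmniTaskBatch."""
    tokens: List[int]
    cu_seqlens: Optional[List[int]] = None
    max_seqlen: Optional[int] = None


def pack(sequences):
    """Concatenate sequences and record their cumulative boundaries."""
    lens = [len(s) for s in sequences]
    return PackedBatchSketch(
        tokens=[t for s in sequences for t in s],
        cu_seqlens=[0, *accumulate(lens)],  # [0, l0, l0+l1, ...]
        max_seqlen=max(lens),
    )


def encode_batch_sketch(batch):
    """Build the dict handed to the training step, forwarding packing metadata."""
    out = {"tokens": batch.tokens}
    for key in ("cu_seqlens", "max_seqlen"):
        value = getattr(batch, key)
        if value is not None:  # only present when sequences were packed
            out[key] = value
    return out


step_inputs = encode_batch_sketch(pack([[1, 2, 3, 4], [5, 6], [7, 8, 9]]))
print(step_inputs["cu_seqlens"], step_inputs["max_seqlen"])  # [0, 4, 6, 9] 4
```

Without the boundaries `[0, 4, 6, 9]`, the step only sees nine tokens in a row and has no way to tell that they belong to three separate samples.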

Minor

Bare print() in modeling_nemotron_omni.py:42-47: freeze() uses bare print -- should use logger or print_rank_0() per project rules.
Copyright year: nemotron_omni_sound.py, hf_to_megatron_generate_nemotron_omni.py, cord_v2_inference.py, and valor32k_avqa_inference.py use Copyright (c) 2025 but project convention is 2026.
Missing copyright header: src/megatron/bridge/models/nemotron_omni/__init__.py has no NVIDIA copyright block.

Missing test coverage

This PR adds a new model family (bridge, provider, recipe, task encoder, collate, forward step) with no unit or functional tests. Per the adding-model-support guidelines, the following are expected:

Unit tests (tests/unit_tests/models/nemotron_omni/): test_nemotron_omni_bridge.py (mock HF config, verify provider_bridge() mapping and mapping_registry() coverage), test_nemotron_omni_provider.py (verify provider defaults, freeze logic, sound encoder construction)
Functional tests (tests/functional_tests/models/nemotron_omni/): test_nemotron_omni_conversion.py (toy model HF/Megatron roundtrip)
Recipe unit tests: monkeypatched AutoBridge, verify ConfigContainer structure for each recipe function

Suggested test cases

No perf tests impacted.

claude · 2026-05-12T03:17:33Z

Code Review — Nemotron-3 Nano Omni

Critical: Debug code left in production

src/megatron/bridge/training/nemotron_omni_step.py:271-273 has debug instrumentation that should not ship:

import os as _os
if _os.environ.get("NOMNI_DEBUG_TILES") == "1":
    print(f"[DEBUG step] num_image_tiles=...")

This is an inline import os, a bare print(), and an env-var-gated debug block in the hot training path. Please remove the three lines entirely.

Bare `print()` instead of logger

src/megatron/bridge/models/nemotron_omni/modeling_nemotron_omni.py:42,46 uses bare print() for freeze logging:

print(f"Freezing sound_model.{name}")
...
print(f"Freezing sound_projection.{name}")

Per project guidelines (CLAUDE.md): "NEVER use bare print() — use logging.getLogger(__name__) or print_rank_0()." Replace with logger.info(...) or print_rank_0(...) to match the rest of the codebase.
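A minimal sketch of the logger-based variant, using only the standard library. `freeze_named_params` is a hypothetical helper, and `SimpleNamespace` objects stand in for torch parameters so the snippet runs without a GPU stack. The actual `freeze()` in modeling_nemotron_omni.py may be structured differently.

```python
import logging
from types import SimpleNamespace

logger = logging.getLogger(__name__)


def freeze_named_params(prefix, named_params):
    """Disable gradients for each (name, param) pair and log instead of printing."""
    frozen = []
    for name, param in named_params:
        param.requires_grad = False
        # Lazy %-formatting: the message is only built if INFO is enabled.
        logger.info("Freezing %s.%s", prefix, name)
        frozen.append(f"{prefix}.{name}")
    return frozen


# Stand-ins for parameters; real code would iterate module.named_parameters().
sound_params = [
    ("encoder.weight", SimpleNamespace(requires_grad=True)),
    ("encoder.bias", SimpleNamespace(requires_grad=True)),
]
frozen_names = freeze_named_params("sound_model", sound_params)
```

The same pattern applies to the `sound_projection` loop. If rank-0-only output is preferred, `print_rank_0(...)` could be used in place of `logger.info(...)`.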

Duplicate HTTP request in `load_image()`

examples/conversion/hf_to_megatron_generate_nemotron_omni.py:304-306:

response = requests.get(image_path)
response.raise_for_status()
return Image.open(requests.get(image_path, stream=True).raw)

The image is fetched twice — once for the status check and once for Image.open. Use a single request:

response = requests.get(image_path, stream=True)
response.raise_for_status()
return Image.open(response.raw)

Missing tests

The PR adds a full VLM model (bridge, provider, model class, training step, task encoder, collate, recipes) but contains no unit or functional tests. Per the adding-model-support skill Phase 4:

Unit tests (tests/unit_tests/models/nemotron_omni/): mock HF config → verify provider_bridge() field mapping, mapping_registry() coverage, sound encoder freeze logic, and the NemotronOmniModelProvider custom fields.
Functional tests (tests/functional_tests/models/nemotron_omni/): toy model HF↔Megatron roundtrip (at minimum TP=1,PP=1).

At bare minimum, a bridge unit test with a mocked config would catch regressions in the provider mapping and weight registry.

Suggested test cases

| Test | Type | What it covers |
| --- | --- | --- |
| `test_nemotron_omni_bridge.py::TestProviderBridge::test_provider_type` | unit | `provider_bridge()` returns `NemotronOmniModelProvider` |
| `test_nemotron_omni_bridge.py::TestProviderBridge::test_moe_fields` | unit | MoE config fields (`num_moe_experts`, `moe_router_topk`, shared expert) mapped correctly |
| `test_nemotron_omni_bridge.py::TestProviderBridge::test_sound_fields` | unit | Sound encoder config (`sound_config`, `freeze_sound_model`) propagated to provider |
| `test_nemotron_omni_bridge.py::TestProviderBridge::test_tie_word_embeddings_from_top_level` | unit | `share_embeddings_and_output_weights` read from top-level HF config, not `text_config` |
| `test_nemotron_omni_bridge.py::TestMappingRegistry::test_has_sound_encoder_mappings` | unit | `mapping_registry()` includes sound encoder `ReplicatedMapping` entries |
| `test_nemotron_omni_bridge.py::TestMappingRegistry::test_has_temporal_embedder_mappings` | unit | Temporal video embedder weights are mapped |
| `test_nemotron_omni_provider.py::TestFreeze::test_freeze_sound_model` | unit | `freeze(freeze_sound_model=True)` sets `requires_grad=False` on sound params |
| `test_nemotron_omni_conversion.py::TestRoundtrip::test_tp1_pp1` | functional (GPU) | Toy model roundtrip HF→Megatron→HF at TP=1,PP=1 |
| No perf tests impacted | — | No performance configs were added or modified |

🤖 Generated with Claude Code

cuichenx · 2026-05-13T22:25:36Z

/ok to test 77739ca

cuichenx · 2026-05-15T21:43:37Z

/ok to test dcae350

Signed-off-by: Chen Cui <chcui@nvidia.com>

cuichenx · 2026-05-16T00:53:23Z

/ok to test 4dfe1ad

Signed-off-by: Chen Cui <chcui@nvidia.com>

Signed-off-by: Chen Cui <chcui@nvidia.com>

# Conflicts:
#	examples/models/nemotron/nemotron_3_omni/README.md
#	examples/models/qwen/nemotron_3_omni/conversion.sh
#	examples/models/qwen/nemotron_3_omni/cord_v2_inference.py
#	examples/models/qwen/nemotron_3_omni/hf_to_megatron_generate_nemotron_omni.py
#	examples/models/qwen/nemotron_3_omni/inference.sh
#	examples/models/qwen/nemotron_3_omni/slurm_peft_cord_v2.sh
#	examples/models/qwen/nemotron_3_omni/slurm_peft_valor32k_avqa.sh
#	examples/models/qwen/nemotron_3_omni/slurm_sft_cord_v2.sh
#	examples/models/qwen/nemotron_3_omni/slurm_sft_valor32k_avqa.sh
#	examples/models/qwen/nemotron_3_omni/valor32k_avqa_inference.py

cuichenx · 2026-05-19T22:12:58Z

/ok to test 060d951