
[model, recipe, examples] feat: add Nemotron-3 Nano Omni support #3760

Merged
cuichenx merged 16 commits into main from chcui/nemotron_3_omni_pr
May 20, 2026

Conversation

@cuichenx

@cuichenx cuichenx commented May 8, 2026

Contributor

Summary

  • Adds end-to-end support for Nemotron-3 Nano Omni (30B-A3B MoE multimodal: MoE Mamba/attention hybrid LM + RADIO vision tower + Parakeet sound encoder), targeting HF architecture NemotronH_Nano_Omni_Reasoning_V3.
  • New bridge / provider / sound encoder under src/megatron/bridge/models/nemotron_omni/, recipe under src/megatron/bridge/recipes/nemotron_omni/, forward step at src/megatron/bridge/training/nemotron_omni_step.py, Energon task encoder for chat-ML samples with raw-waveform / mel audio, and supporting glue (collate fn, valor32k_avqa maker, audiohandler decoder, packing toggle on EnergonProvider).
  • New examples/models/vlm/nemotron_3_omni/ directory with conversion script, single- / multi-modality inference, slurm SFT/LoRA scripts, data-prep scripts, and evaluation scripts.

Test plan

Locally verified end-to-end against nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16 on an 8 × H100 80GB node:

  • HF → Megatron import (33B params, 7517 tensors, low-memory save) — ✅
  • Megatron → HF export (--not-strict, 4 expected-missing tensors regenerated from config) — ✅
  • HF↔Megatron multi-GPU roundtrip (TP=2 EP=2) — ✅ all weights match
  • Inference: image+text (1 GPU, from converted ckpt) — ✅ detailed H100 spec table description
  • Inference: image+text (1 GPU, on-the-fly conversion) — ✅
  • Inference: video+text (8 GPU, TP=4 EP=4) — ✅ 15 frames · 1.12s, plant/flower description (requires decord)
  • Inference: audio+text (1 GPU) — ✅ exact transcription match
  • Inference: video+audio+text (8 GPU, TP=4 EP=2) — ✅ combined description
  • Image SFT smoke test — CORD-V2, 20 iters, frozen LM (single-node 8×H100) — ✅ lm loss 1.108→1.090, ~2.06 s/iter, peak ~36.7 GiB/GPU
  • Image PEFT smoke test — CORD-V2 LoRA, 20 iters — ✅ lm loss 1.022→0.558, ~2.35 s/iter, peak ~20.3 GiB/GPU
  • Audio SFT smoke test — CV17, 10 iters (in flight at PR open)

Notes:

  • The [ssm] and [audio] extras (mamba-ssm, causal-conv1d, librosa) are required at install time; decord is needed for video sampling.
  • Full-parameter SFT does not fit on a single 8×H100 node (Adam fp32 state OOMs at optimizer init); use the 2-node slurm script or freeze_language_model=True for single-node runs.
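The single-node OOM noted above can be motivated with a rough memory budget. The sketch below is only a back-of-the-envelope estimate: the byte counts per parameter (bf16 weights, fp32 gradients, and Adam holding an fp32 master copy plus two fp32 moments) are assumptions rather than values taken from this PR, and the helper name `training_state_gib` is hypothetical. It ignores activations, fragmentation, and uneven sharding across TP/EP ranks.

```python
# Rough training-state budget for full-parameter SFT.
# Assumptions (not from the PR): 2 bytes/param for bf16 weights, 4 bytes/param
# for fp32 gradients, 12 bytes/param for Adam (fp32 master copy + two moments).
GIB = 1024**3


def training_state_gib(num_params, weight_bytes=2, grad_bytes=4, optim_bytes=12):
    """Return approximate memory in GiB for weights, gradients, and Adam state."""
    return {
        "weights": num_params * weight_bytes / GIB,
        "grads": num_params * grad_bytes / GIB,
        "optimizer": num_params * optim_bytes / GIB,
    }


budget = training_state_gib(33e9)  # 33B params, as reported by the import step
total = sum(budget.values())
node_capacity = 8 * 80  # 8 x H100 80GB, treated as GiB for simplicity

print({name: round(gib, 1) for name, gib in budget.items()})
print(f"training state ~{total:.0f} GiB of ~{node_capacity} GiB, before activations")
```

Under these assumptions the optimizer state alone is the largest term, which is consistent with the failure showing up at optimizer init, and freezing the language model removes most of the gradient and optimizer terms.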

🤖 Generated with Claude Code

Adds end-to-end support for Nemotron-3 Nano Omni (30B-A3B MoE multimodal:
MoE Mamba/attention hybrid LM + RADIO vision tower + Parakeet sound
encoder), targeting HF architecture NemotronH_Nano_Omni_Reasoning_V3:

- Bridge + provider + sound encoder under
  src/megatron/bridge/models/nemotron_omni/
- Recipe (CORD-V2 SFT/PEFT, VALOR32K-AVQA SFT/PEFT) under
  src/megatron/bridge/recipes/nemotron_omni/
- Forward step under src/megatron/bridge/training/nemotron_omni_step.py
- Energon task encoder for chat-ML samples with raw-waveform/mel audio
- VLM dataset glue: nemotron_omni_collate_fn, valor32k_avqa maker,
  audiohandler decoder, packing toggle on EnergonProvider
- Examples under examples/models/vlm/nemotron_3_omni/: README, conversion
  script, single- and multi-modality inference, slurm SFT/LoRA scripts,
  data-prep scripts, evaluation scripts

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Chen Cui <chcui@nvidia.com>
@copy-pr-bot

copy-pr-bot Bot commented May 8, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Comment thread src/megatron/bridge/data/energon/nemotron_omni_task_encoder.py
… for video inference

Two small clarifications in the Nemotron-3 Nano Omni example README,
based on a fresh end-to-end verification run:

- Checkpoint Conversion → Export: call out that --trust-remote-code is
  required for the export step, not just import. The exporter loads the
  HF config, which references the custom modeling module shipped with
  NemotronH_Nano_Omni_Reasoning_V3.
- Inference: add a callout that the video modes (rows 2 and 4) need
  `decord` installed, since it is not pulled in by any pyproject extra.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Chen Cui <chcui@nvidia.com>
Comment thread src/megatron/bridge/training/nemotron_omni_step.py Outdated
Comment thread src/megatron/bridge/models/nemotron_omni/modeling_nemotron_omni.py Outdated
Comment thread src/megatron/bridge/models/nemotron_omni/nemotron_omni_sound.py Outdated
Comment thread src/megatron/bridge/models/nemotron_omni/__init__.py
@claude

claude Bot commented May 8, 2026

Contributor

Light Code Review

Critical

  1. encode_batch() drops packing metadata (nemotron_omni_task_encoder.py): cu_seqlens, cu_seqlens_unpadded, cu_seqlens_argmin, and max_seqlen are computed in batch() and stored on NemotronOmniTaskBatch, but encode_batch() does not forward them to the training step dict. When pack_sequences=True on the Energon path, the model will silently treat packed sequences as a single long sequence with wrong attention masking. See inline comment for fix.

  2. Debug print left in nemotron_omni_step.py:270-273: import os as _os + env-gated print() for NOMNI_DEBUG_TILES. Should be removed before merge.
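To illustrate the first critical item, here is a minimal sketch of forwarding packing metadata from a batch object into the step dict. It is not the fix from the inline comment: `PackedBatch` is a hypothetical stand-in for NemotronOmniTaskBatch, plain lists replace tensors, and only the four field names come from the review.

```python
from dataclasses import dataclass
from itertools import accumulate
from typing import List, Optional


@dataclass
class PackedBatch:
    """Hypothetical stand-in for NemotronOmniTaskBatch (packing fields only)."""

    tokens: List[int]
    cu_seqlens: Optional[List[int]] = None
    cu_seqlens_unpadded: Optional[List[int]] = None
    cu_seqlens_argmin: Optional[int] = None
    max_seqlen: Optional[int] = None


PACKING_KEYS = ("cu_seqlens", "cu_seqlens_unpadded", "cu_seqlens_argmin", "max_seqlen")


def cumulative_seqlens(lengths: List[int]) -> List[int]:
    """Boundaries of each packed sequence: [0, l0, l0 + l1, ...]."""
    return [0, *accumulate(lengths)]


def encode_batch(batch: PackedBatch) -> dict:
    """Build the step dict, forwarding packing metadata whenever it is set."""
    out = {"tokens": batch.tokens}
    for key in PACKING_KEYS:
        value = getattr(batch, key)
        if value is not None:
            out[key] = value
    return out


lengths = [3, 5, 2]  # three samples packed into one row
batch = PackedBatch(
    tokens=list(range(sum(lengths))),
    cu_seqlens=cumulative_seqlens(lengths),
    max_seqlen=max(lengths),
)
step_dict = encode_batch(batch)
print(step_dict["cu_seqlens"], step_dict["max_seqlen"])
```

Without the boundaries in `cu_seqlens`, the consumer has no way to tell where one packed sample ends and the next begins, which is the masking problem the review describes.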

Minor

  1. Bare print() in modeling_nemotron_omni.py:42-47: freeze() uses bare print -- should use logger or print_rank_0() per project rules.

  2. Copyright year: nemotron_omni_sound.py, hf_to_megatron_generate_nemotron_omni.py, cord_v2_inference.py, and valor32k_avqa_inference.py use Copyright (c) 2025 but project convention is 2026.

  3. Missing copyright header: src/megatron/bridge/models/nemotron_omni/__init__.py has no NVIDIA copyright block.

Missing test coverage

This PR adds a new model family (bridge, provider, recipe, task encoder, collate, forward step) with no unit or functional tests. Per the adding-model-support guidelines, the following are expected:

  • Unit tests (tests/unit_tests/models/nemotron_omni/): test_nemotron_omni_bridge.py (mock HF config, verify provider_bridge() mapping and mapping_registry() coverage), test_nemotron_omni_provider.py (verify provider defaults, freeze logic, sound encoder construction)
  • Functional tests (tests/functional_tests/models/nemotron_omni/): test_nemotron_omni_conversion.py (toy model HF/Megatron roundtrip)
  • Recipe unit tests: monkeypatched AutoBridge, verify ConfigContainer structure for each recipe function

Suggested test cases

No perf tests impacted.

@yaoyu-33 yaoyu-33 added labels on May 11, 2026: area:model (Model implementations and HF bridge logic), blocked (Work cannot move forward until an external dependency is cleared), feature (New capabilities, enhancements, or enablement work), high-complexity (Harder to merge: prone to conflicts and needs additional test coverage)
@cuichenx cuichenx marked this pull request as draft May 11, 2026 20:55
@cuichenx cuichenx marked this pull request as ready for review May 12, 2026 03:12
@claude

claude Bot commented May 12, 2026

Contributor

Code Review — Nemotron-3 Nano Omni

Critical: Debug code left in production

src/megatron/bridge/training/nemotron_omni_step.py:271-273 has debug instrumentation that should not ship:

import os as _os
if _os.environ.get("NOMNI_DEBUG_TILES") == "1":
    print(f"[DEBUG step] num_image_tiles=...")

This is an inline import os, a bare print(), and an env-var-gated debug block in the hot training path. Please remove the three lines entirely.

Bare print() instead of logger

src/megatron/bridge/models/nemotron_omni/modeling_nemotron_omni.py:42,46 uses bare print() for freeze logging:

print(f"Freezing sound_model.{name}")
...
print(f"Freezing sound_projection.{name}")

Per project guidelines (CLAUDE.md): "NEVER use bare print() — use logging.getLogger(__name__) or print_rank_0()." Replace with logger.info(...) or print_rank_0(...) to match the rest of the codebase.
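A minimal sketch of the logger-based variant follows. The helper `freeze_named_parameters` and the `SimpleNamespace` stand-ins are hypothetical and exist only to keep the example runnable without the model; the actual freeze() body in this PR may be structured differently.

```python
import logging
from types import SimpleNamespace

logger = logging.getLogger(__name__)


def freeze_named_parameters(prefix, named_params):
    """Disable gradients for each parameter and log via the module logger."""
    frozen = []
    for name, param in named_params:
        param.requires_grad = False
        # Lazy %-formatting: the message is only built if INFO is enabled.
        logger.info("Freezing %s.%s", prefix, name)
        frozen.append(f"{prefix}.{name}")
    return frozen


# Stand-in parameters; real code would iterate module.named_parameters().
params = [
    ("encoder.weight", SimpleNamespace(requires_grad=True)),
    ("encoder.bias", SimpleNamespace(requires_grad=True)),
]
frozen = freeze_named_parameters("sound_model", params)
```

Routing the message through `logging` lets the caller control verbosity and avoids one line per rank on multi-GPU runs when the logger is configured accordingly.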

Duplicate HTTP request in load_image()

examples/conversion/hf_to_megatron_generate_nemotron_omni.py:304-306:

response = requests.get(image_path)
response.raise_for_status()
return Image.open(requests.get(image_path, stream=True).raw)

The image is fetched twice — once for the status check and once for Image.open. Use a single request:

response = requests.get(image_path, stream=True)
response.raise_for_status()
return Image.open(response.raw)

Missing tests

The PR adds a full VLM model (bridge, provider, model class, training step, task encoder, collate, recipes) but contains no unit or functional tests. Per the adding-model-support skill Phase 4:

  • Unit tests (tests/unit_tests/models/nemotron_omni/): mock HF config → verify provider_bridge() field mapping, mapping_registry() coverage, sound encoder freeze logic, and the NemotronOmniModelProvider custom fields.
  • Functional tests (tests/functional_tests/models/nemotron_omni/): toy model HF↔Megatron roundtrip (at minimum TP=1,PP=1).

At bare minimum, a bridge unit test with a mocked config would catch regressions in the provider mapping and weight registry.


Suggested test cases

| Test | Type | What it covers |
| --- | --- | --- |
| test_nemotron_omni_bridge.py::TestProviderBridge::test_provider_type | unit | provider_bridge() returns NemotronOmniModelProvider |
| test_nemotron_omni_bridge.py::TestProviderBridge::test_moe_fields | unit | MoE config fields (num_moe_experts, moe_router_topk, shared expert) mapped correctly |
| test_nemotron_omni_bridge.py::TestProviderBridge::test_sound_fields | unit | Sound encoder config (sound_config, freeze_sound_model) propagated to provider |
| test_nemotron_omni_bridge.py::TestProviderBridge::test_tie_word_embeddings_from_top_level | unit | share_embeddings_and_output_weights read from top-level HF config, not text_config |
| test_nemotron_omni_bridge.py::TestMappingRegistry::test_has_sound_encoder_mappings | unit | mapping_registry() includes sound encoder ReplicatedMapping entries |
| test_nemotron_omni_bridge.py::TestMappingRegistry::test_has_temporal_embedder_mappings | unit | Temporal video embedder weights are mapped |
| test_nemotron_omni_provider.py::TestFreeze::test_freeze_sound_model | unit | freeze(freeze_sound_model=True) sets requires_grad=False on sound params |
| test_nemotron_omni_conversion.py::TestRoundtrip::test_tp1_pp1 | functional (GPU) | Toy model roundtrip HF→Megatron→HF at TP=1,PP=1 |
| No perf tests impacted | | No performance configs were added or modified |
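As an illustration of the mocked-config style these cases call for, the sketch below shows one possible shape for test_tie_word_embeddings_from_top_level. The function `provider_kwargs_from_hf_config` is a hypothetical stand-in for the mapping done inside provider_bridge(); a real test would call the bridge itself and inspect the returned provider.

```python
from types import SimpleNamespace


def provider_kwargs_from_hf_config(hf_config):
    """Hypothetical stand-in for the field mapping done by provider_bridge().

    Reads tie_word_embeddings from the top-level config, not from text_config.
    """
    return {
        "share_embeddings_and_output_weights": bool(
            getattr(hf_config, "tie_word_embeddings", False)
        ),
    }


def test_tie_word_embeddings_from_top_level():
    # The top-level and nested values disagree on purpose, so a mapping that
    # reads the nested text_config value would fail this assertion.
    hf_config = SimpleNamespace(
        tie_word_embeddings=True,
        text_config=SimpleNamespace(tie_word_embeddings=False),
    )
    kwargs = provider_kwargs_from_hf_config(hf_config)
    assert kwargs["share_embeddings_and_output_weights"] is True


test_tie_word_embeddings_from_top_level()
```

Making the two config levels disagree is the design choice that matters here: if both held the same value, the test would pass regardless of which one the bridge reads.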

🤖 Generated with Claude Code

Comment thread src/megatron/bridge/training/nemotron_omni_step.py Outdated
Comment thread src/megatron/bridge/models/nemotron_omni/modeling_nemotron_omni.py Outdated
Comment thread examples/conversion/hf_to_megatron_generate_nemotron_omni.py Outdated
@cuichenx cuichenx marked this pull request as draft May 12, 2026 04:13
@cuichenx cuichenx marked this pull request as ready for review May 13, 2026 22:24
@cuichenx cuichenx removed the blocked label (Work cannot move forward until an external dependency is cleared) May 13, 2026
@cuichenx

Contributor Author

/ok to test 77739ca

@cuichenx

Contributor Author

/ok to test dcae350

Signed-off-by: Chen Cui <chcui@nvidia.com>
@cuichenx

Contributor Author

/ok to test 4dfe1ad

cuichenx added 2 commits May 19, 2026 14:55
Signed-off-by: Chen Cui <chcui@nvidia.com>
Signed-off-by: Chen Cui <chcui@nvidia.com>

# Conflicts:
#	examples/models/nemotron/nemotron_3_omni/README.md
#	examples/models/qwen/nemotron_3_omni/conversion.sh
#	examples/models/qwen/nemotron_3_omni/cord_v2_inference.py
#	examples/models/qwen/nemotron_3_omni/hf_to_megatron_generate_nemotron_omni.py
#	examples/models/qwen/nemotron_3_omni/inference.sh
#	examples/models/qwen/nemotron_3_omni/slurm_peft_cord_v2.sh
#	examples/models/qwen/nemotron_3_omni/slurm_peft_valor32k_avqa.sh
#	examples/models/qwen/nemotron_3_omni/slurm_sft_cord_v2.sh
#	examples/models/qwen/nemotron_3_omni/slurm_sft_valor32k_avqa.sh
#	examples/models/qwen/nemotron_3_omni/valor32k_avqa_inference.py
@cuichenx

Contributor Author

/ok to test 060d951

@cuichenx cuichenx merged commit e86c289 into main May 20, 2026
173 of 176 checks passed
@cuichenx cuichenx deleted the chcui/nemotron_3_omni_pr branch May 20, 2026 02:03
@cuichenx cuichenx mentioned this pull request Jun 4, 2026

Labels

area:model (Model implementations and HF bridge logic), feature (New capabilities, enhancements, or enablement work), high-complexity (Harder to merge: prone to conflicts and needs additional test coverage), ready-to-merge (PR is approved, current, and only waiting for CI to pass before merge)

2 participants