feat(mimo): add Whisper audio encoder to LLaVA MIMO training pipeline by kamran-nvidia · Pull Request #3520 · NVIDIA-NeMo/Megatron-Bridge

kamran-nvidia · 2026-04-24T21:22:16Z

What does this PR do?

Adds end-to-end audio modality support to the MIMO LLaVA example, alongside the existing CLIP vision branch. Specifically:

Changelog

Whisper encoder: new examples/models/megatron_mimo/whisper/ subpackage with a Megatron-style Whisper encoder (whisper_model.py, whisper_layer_specs.py).
HF → Megatron conversion + verification: convert_hf_whisper_to_megatron.py emits per-TP-rank checkpoints; verify_whisper_conversion.py checks numerical parity against HuggingFace Whisper.
Audio-aware MIMO training script: megatron_mimo_training_llava_audio.py wires the Whisper encoder into the MIMO LLaVA stack (vision + audio + Vicuna-7B LLM), with matching launch scripts (run_hetero_llava_audio.sh, run_hetero_llava_audio_parallelism_tests.sh).
Dataset preparation: synthesize_llava_pretrain_audio.py synthesizes TTS audio (NeMo FastPitch + HiFiGAN) for LLaVA-Pretrain and emits an augmented JSON; prepare_llava_pretrain_audio.sh shards/merges the run.
Dataset hook: small change in src/megatron/bridge/data/megatron_mimo/dataset.py to surface the new audio field.
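
The per-TP-rank conversion described above can be illustrated with a minimal sketch. It is not the code in convert_hf_whisper_to_megatron.py. It assumes the fused QKV layout groups rows head by head as [q, k, v], that shards are contiguous and equally sized, and that a missing K-projection bias is filled with zeros. All function names here are hypothetical.

```python
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def interleave_qkv(q: Sequence[T], k: Sequence[T], v: Sequence[T], num_heads: int) -> List[T]:
    """Fuse separate q/k/v rows into one list ordered head by head as [q_h, k_h, v_h].

    Assumed layout for illustration only; the real converter may differ.
    """
    assert len(q) == len(k) == len(v), "q/k/v must have the same number of rows"
    assert len(q) % num_heads == 0, "rows must divide evenly across heads"
    head_dim = len(q) // num_heads
    fused: List[T] = []
    for h in range(num_heads):
        rows = slice(h * head_dim, (h + 1) * head_dim)
        fused += list(q[rows]) + list(k[rows]) + list(v[rows])
    return fused


def bias_or_zeros(bias: Optional[Sequence[float]], size: int) -> List[float]:
    """Return the bias, or explicit zeros when the source projection has no bias."""
    if bias is None:
        return [0.0] * size
    assert len(bias) == size, "bias length must match the projection size"
    return list(bias)


def split_for_tp(fused: Sequence[T], tp_size: int) -> List[List[T]]:
    """Split fused rows into contiguous, equally sized shards, one per TP rank."""
    assert len(fused) % tp_size == 0, "fused rows must divide evenly across TP ranks"
    per_rank = len(fused) // tp_size
    return [list(fused[r * per_rank:(r + 1) * per_rank]) for r in range(tp_size)]


# Labels stand in for weight rows: 2 heads, head_dim 2.
q_rows = ["q0", "q1", "q2", "q3"]
k_rows = ["k0", "k1", "k2", "k3"]
v_rows = ["v0", "v1", "v2", "v3"]
fused_rows = interleave_qkv(q_rows, k_rows, v_rows, num_heads=2)
shards = split_for_tp(fused_rows, tp_size=2)
for rank, shard in enumerate(shards):
    print(rank, shard)
```

With this ordering, a contiguous split hands each rank complete heads (its own q, k, and v rows) as long as the head count is divisible by the TP size, which is why the interleave happens before sharding.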

Also from this PR (#3531):

Adds an opt-in --deterministic flag to the MIMO LLaVA and LLaVA-audio training scripts (megatron_mimo_training_llava.py, megatron_mimo_training_llava_audio.py). When set, vision, language, audio, and projection configs run in FP32 with unfused attention, loss fusion disabled, and full activation recompute so runs are bitwise reproducible.
Threads the deterministic argument through the relevant config and model spec helpers so each sub-module is configured consistently.
Updates the launcher shell scripts under examples/models/megatron_mimo/ (run_hetero_llava.sh, run_hetero_llava_audio.sh, and the three *_parallelism_tests*.sh variants) to forward --deterministic and export the env vars required for deterministic execution (e.g. cuBLAS / NCCL settings).
No behavior change when the flag is omitted; default training paths are untouched.
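
A rough sketch of how an opt-in determinism switch like this can be threaded through per-module configs is shown below. It is not the PR's implementation. The config keys and helper names are hypothetical, and the two environment variables (CUBLAS_WORKSPACE_CONFIG, NCCL_ALGO) are assumptions about what "cuBLAS / NCCL settings" might mean; the launch scripts may export different ones.

```python
import argparse
from typing import Dict, List, Optional

# Sub-modules named in the description above.
MODULES = ("vision", "language", "audio", "projection")


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Sketch of an opt-in determinism flag")
    parser.add_argument(
        "--deterministic",
        action="store_true",
        help="Trade speed for reproducible runs (off by default).",
    )
    return parser.parse_args(argv)


def deterministic_overrides(enabled: bool) -> Dict[str, object]:
    """Config overrides applied only when the flag is set (hypothetical key names)."""
    if not enabled:
        return {}
    return {
        "params_dtype": "float32",        # run in FP32
        "fused_attention": False,         # unfused attention
        "loss_fusion": False,             # loss fusion disabled
        "recompute_granularity": "full",  # full activation recompute
    }


def deterministic_env(enabled: bool) -> Dict[str, str]:
    """Environment variables a launcher might export (assumed, not taken from the PR)."""
    if not enabled:
        return {}
    return {"CUBLAS_WORKSPACE_CONFIG": ":4096:8", "NCCL_ALGO": "Ring"}


def build_configs(base: Dict[str, object], enabled: bool) -> Dict[str, Dict[str, object]]:
    """Apply the same overrides to every sub-module so they stay consistent."""
    overrides = deterministic_overrides(enabled)
    return {name: {**base, **overrides} for name in MODULES}


args = parse_args(["--deterministic"])
configs = build_configs({"params_dtype": "bfloat16"}, args.deterministic)
print(configs["audio"])
print(deterministic_env(args.deterministic))
```

Returning an empty override set when the flag is off keeps the default path byte-for-byte identical to the pre-flag behavior, which matches the "no behavior change" claim.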

GitHub Actions CI

See the CI section in the Contributing doc for how to trigger the CI. An NVIDIA developer will need to approve and trigger the CI for external contributors.

Before your PR is "Ready for review"

Pre checks:

Make sure you read and followed Contributor guidelines
Did you write any new necessary tests?
Did you add or update any necessary documentation?
Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
- Reviewer: Does the PR have correct import guards for all optional libraries?

If you haven't finished some of the above items, you can still open a "Draft" PR.

Additional Information

Related to # (issue)

- Introduced `whisper` module with `WhisperEncoder` for Megatron.
- Implemented `convert_hf_whisper_to_megatron.py` to convert HuggingFace Whisper weights to Megatron format.
- Added verification script `verify_whisper_conversion.py` to compare outputs between HuggingFace and Megatron models.
- Created `whisper_layer_specs.py` to define layer specifications for Whisper encoder.
- Updated dataset processing in `dataset.py` to handle audio feature extractors with sampling rate.

Signed-off-by: Kamran Jafari <kjafarisadeg@nvidia.com>

copy-pr-bot · 2026-04-24T21:22:20Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

kamran-nvidia · 2026-04-24T21:23:22Z

/ok to test 9c558ff

kamran-nvidia · 2026-04-24T21:31:01Z

/ok to test 05f22f5

kamran-nvidia · 2026-04-24T21:31:08Z

/claude review

claude

Light Review Summary

2 bugs, 1 minor issue, 1 typo, 1 coverage gap:

Typo — whisper/__init__.py has copyright year 2025; all other new files use 2026.
Bug — whisper_model.py:21: bare except: silently catches SystemExit/KeyboardInterrupt. Should be except ImportError:.
Bug — megatron_mimo_training_llava_audio.py:316-317: audio sample rate from sf.read() is discarded. Files not at 16 kHz will silently produce wrong Whisper features.
Minor — convert_hf_whisper_to_megatron.py:167-170: the silent zero-fill for missing K-proj bias is correct for current Whisper variants but fragile — worth an explicit assert or comment.
Test coverage — This PR adds ~2600 lines of new model, conversion, and training code with no unit tests. The PR checklist boxes for tests and docs are unchecked. At minimum, the conversion round-trip (convert_hf_whisper_to_megatron → load_megatron_whisper_weights → shape/value check) and the QKV interleave logic should have unit tests.
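
The sample-rate issue can be sketched as follows. `sf.read()` from the soundfile package returns a `(data, samplerate)` pair, so the rate is available to check. The helper names below are hypothetical, the 16 kHz target comes from the review comment, and the linear-interpolation resampler is a deliberately crude stand-in for a proper resampling library; this is not the fix applied in the PR.

```python
from typing import List, Sequence

WHISPER_SAMPLE_RATE = 16_000  # rate named in the review comment


def resample_linear(samples: Sequence[float], src_rate: int, dst_rate: int) -> List[float]:
    """Crude linear-interpolation resampler (illustration only, no anti-aliasing)."""
    if src_rate == dst_rate or not samples:
        return list(samples)
    n_out = int(round(len(samples) * dst_rate / src_rate))
    last = len(samples) - 1
    out: List[float] = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate  # position in the source signal
        lo = min(int(pos), last)
        hi = min(lo + 1, last)
        frac = pos - lo
        out.append(samples[lo] * (1.0 - frac) + samples[hi] * frac)
    return out


def to_whisper_rate(
    samples: Sequence[float], sample_rate: int, allow_resample: bool = True
) -> List[float]:
    """Return audio at the expected rate, or fail loudly instead of silently passing it on."""
    if sample_rate == WHISPER_SAMPLE_RATE:
        return list(samples)
    if not allow_resample:
        raise ValueError(
            f"expected {WHISPER_SAMPLE_RATE} Hz audio, got {sample_rate} Hz"
        )
    return resample_linear(samples, sample_rate, WHISPER_SAMPLE_RATE)


# In a real loader the pair would come from something like: data, sr = sf.read(path)
data, sr = [0.0] * 10, 8_000
fixed = to_whisper_rate(data, sr)
print(len(data), "->", len(fixed))
```

Either branch makes the mismatch visible: strict mode raises, and the resampling mode changes the sample count so downstream feature extraction sees audio at the rate it assumes.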

kamran-nvidia · 2026-04-25T01:25:24Z

/ok to test 3a72a2b

kamran-nvidia · 2026-04-25T15:27:44Z

/ok to test ab9a70e

- Introduced `test_synthesize_audio_helpers.py` to validate helper functions in audio synthesis, including human text extraction and audio relative path generation.
- Created `test_training_llava_audio_helpers.py` for testing pure helpers in the training script, focusing on functions related to token span finding and audio configuration.
- Enhanced `test_whisper_conversion.py` with additional tests for handling unmapped encoder keys and verifying conversion shapes.
- Added `test_whisper_encoder_forward.py` to ensure correct forward-pass behavior in the Whisper encoder, including output shape contracts and position embedding checks.
- Implemented `test_whisper_layer_specs.py` to confirm that attention mask types remain consistent and validate linear layer implementations in Whisper's architecture.

Signed-off-by: Kamran Jafari <kjafarisadeg@nvidia.com>

kamran-nvidia · 2026-04-25T17:38:57Z

/ok to test 67a106e

kamran-nvidia · 2026-04-25T22:08:52Z

/ok to test cc9e68e

kamran-nvidia · 2026-04-25T23:14:45Z

/ok to test 4c8f254

kamran-nvidia · 2026-04-26T01:44:52Z

/ok to test 42fb555

cuichenx · 2026-04-30T22:55:08Z

should this modeling code live in the examples folder?

kamran-nvidia · 2026-05-08

@cuichenx Good question. I guess I can move this to the models folder, but with this PR we are not qualifying the Whisper model as a stand-alone model in MBridge. Wdyt?

liding-nv · 2026-05-11

@cuichenx what's the move for where modeling code lives? I've seen it in several places: CLIP is in MCore, QWen3VL is in MBridge. Any thoughts on where we should drop this audio encoder? As Kamran mentioned, it won't be a standalone model.

kamran-nvidia · 2026-05-11T17:12:19Z

/ok to test 1fc43e1

kamran-nvidia · 2026-05-11T18:48:59Z

/ok to test 43c65a7

…NVIDIA-NeMo#3520) Signed-off-by: Kamran Jafari <kjafarisadeg@nvidia.com> Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com>

linting

05f22f5

Signed-off-by: Kamran Jafari <kjafarisadeg@nvidia.com>

claude Bot reviewed Apr 24, 2026

Comment thread examples/models/megatron_mimo/whisper/__init__.py Outdated

Comment thread examples/models/megatron_mimo/whisper/whisper_model.py

Comment thread examples/models/megatron_mimo/megatron_mimo_training_llava_audio.py Outdated

Comment thread examples/models/megatron_mimo/whisper/convert_hf_whisper_to_megatron.py

kamran-nvidia added 2 commits April 24, 2026 18:09

Address comments

c4265ba

Signed-off-by: Kamran Jafari <kjafarisadeg@nvidia.com>

Add unit tests

3a72a2b

Signed-off-by: Kamran Jafari <kjafarisadeg@nvidia.com>

copy-pr-bot Bot temporarily deployed to test April 25, 2026 01:26 Inactive

Add more tests

ab9a70e

Signed-off-by: Kamran Jafari <kjafarisadeg@nvidia.com>

copy-pr-bot Bot temporarily deployed to test April 25, 2026 15:28 Inactive

copy-pr-bot Bot temporarily deployed to public April 25, 2026 15:34 Inactive

copy-pr-bot Bot temporarily deployed to test April 25, 2026 17:39 Inactive

Refactor audio configuration tests to simplify function calls and imp…

cc9e68e

…rove clarity Signed-off-by: Kamran Jafari <kjafarisadeg@nvidia.com>

copy-pr-bot Bot temporarily deployed to test April 25, 2026 22:09 Inactive

Merge branch 'main' into kamran/mimo_audio

4c8f254

copy-pr-bot Bot temporarily deployed to test April 25, 2026 23:15 Inactive

Add more tests

42fb555

Signed-off-by: Kamran Jafari <kjafarisadeg@nvidia.com>

fix(mimo): reduce global batch size from 128 to 96 in training scripts

834ae7d

Signed-off-by: Kamran Jafari <kjafarisadeg@nvidia.com>

cuichenx reviewed Apr 30, 2026

kamran-nvidia added 4 commits May 1, 2026 07:32

fix(mimo): add support for UNFREEZE_LLM and adjust learning rate para…

fe36878

…meters in training script Signed-off-by: Kamran Jafari <kjafarisadeg@nvidia.com>

feat(mimo): enhance run scripts with UNFREEZE_LLM support and dynamic…

72f8751

… learning rates Signed-off-by: Kamran Jafari <kjafarisadeg@nvidia.com>

Merge branch 'main' into kamran/mimo_audio

2b4c64e

fix(test): remove redundant microbatches configuration in pretrain test

38aaeb3

Signed-off-by: Kamran Jafari <kjafarisadeg@nvidia.com>

cuichenx mentioned this pull request May 8, 2026

[NeMo FW 26.06 Release] MBridge v0.5.0 Roadmap #3754

Open

Merge branch 'main' into kamran/mimo_audio

1fc43e1

copy-pr-bot Bot temporarily deployed to public May 11, 2026 17:12 Inactive

copy-pr-bot Bot temporarily deployed to test May 11, 2026 17:13 Inactive

copy-pr-bot Bot temporarily deployed to public May 11, 2026 17:20 Inactive

copy-pr-bot Bot temporarily deployed to public May 11, 2026 17:21 Inactive

copy-pr-bot Bot temporarily deployed to public May 11, 2026 17:35 Inactive

kamran-nvidia added 3 commits May 11, 2026 11:42

fix: normalize state_dict keys for compatibility with transformers >=…

32467c2

… 5.6 Signed-off-by: Kamran Jafari <kjafarisadeg@nvidia.com>

lint

a1d255c

Signed-off-by: Kamran Jafari <kjafarisadeg@nvidia.com>

Merge branch 'main' into kamran/mimo_audio

43c65a7

copy-pr-bot Bot temporarily deployed to public May 11, 2026 18:49 Inactive

copy-pr-bot Bot temporarily deployed to test May 11, 2026 18:51 Inactive

copy-pr-bot Bot temporarily deployed to public May 11, 2026 18:57 Inactive

copy-pr-bot Bot temporarily deployed to public May 11, 2026 18:58 Inactive

copy-pr-bot Bot temporarily deployed to public May 11, 2026 19:11 Inactive

yaoyu-33 added the needs-review label (PR is ready for code review and waiting on a reviewer) May 12, 2026

liding-nv approved these changes May 12, 2026

kamran-nvidia merged commit e215dc9 into main May 12, 2026
95 checks passed

kamran-nvidia deleted the kamran/mimo_audio branch May 12, 2026 20:16
