feat(mimo): add Whisper audio encoder to LLaVA MIMO training pipeline #3520
- Introduced `whisper` module with `WhisperEncoder` for Megatron.
- Implemented `convert_hf_whisper_to_megatron.py` to convert HuggingFace Whisper weights to Megatron format.
- Added verification script `verify_whisper_conversion.py` to compare outputs between HuggingFace and Megatron models.
- Created `whisper_layer_specs.py` to define layer specifications for Whisper encoder.
- Updated dataset processing in `dataset.py` to handle audio feature extractors with sampling rate.

Signed-off-by: Kamran Jafari <kjafarisadeg@nvidia.com>
/ok to test 9c558ff
/ok to test 05f22f5
/claude review
Light Review Summary
3 bugs, 1 typo, 1 coverage gap:
- Typo: `whisper/__init__.py` has copyright year 2025; all other new files use 2026.
- Bug (`whisper_model.py:21`): bare `except:` silently catches `SystemExit`/`KeyboardInterrupt`. Should be `except ImportError:`.
- Bug (`megatron_mimo_training_llava_audio.py:316-317`): audio sample rate from `sf.read()` is discarded. Files not at 16 kHz will silently produce wrong Whisper features.
- Minor (`convert_hf_whisper_to_megatron.py:167-170`): the silent zero-fill for missing K-proj bias is correct for current Whisper variants but fragile; worth an explicit assert or comment.
- Test coverage: This PR adds ~2600 lines of new model, conversion, and training code with no unit tests. The PR checklist boxes for tests and docs are unchecked. At minimum, the conversion round-trip (`convert_hf_whisper_to_megatron` → `load_megatron_whisper_weights` → shape/value check) and the QKV interleave logic should have unit tests.
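The two bugs and the K-proj note above could be addressed along these lines. This is only a sketch: the helper names `optional_import`, `check_sample_rate`, and `bias_or_zeros` are hypothetical and do not come from the PR, and plain Python lists stand in for tensors so the example runs without extra dependencies.

```python
import importlib

TARGET_SAMPLE_RATE = 16_000  # Whisper features are computed from 16 kHz audio


def optional_import(module_name):
    """Return the module if it is installed, else None (hypothetical helper).

    Only ImportError is caught, so SystemExit and KeyboardInterrupt propagate.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError:
        return None


def check_sample_rate(sample_rate, target=TARGET_SAMPLE_RATE):
    """Fail loudly instead of passing mis-sampled audio on (hypothetical helper)."""
    if sample_rate != target:
        raise ValueError(
            f"expected {target} Hz audio, got {sample_rate} Hz; "
            "resample before feature extraction"
        )
    return sample_rate


def bias_or_zeros(state_dict, key, size, allow_missing):
    """Return state_dict[key]; zero-fill only when the caller says the bias
    is known to be absent (hypothetical helper)."""
    if key in state_dict:
        return state_dict[key]
    if not allow_missing:
        raise KeyError(f"missing bias {key!r}; refusing to zero-fill silently")
    return [0.0] * size
```

In the training script this would mean keeping the second value returned by `sf.read()` and passing it to the check (or to a resampler) rather than discarding it. In the converter, `allow_missing=True` would be passed only for the K-projection bias key.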
Signed-off-by: Kamran Jafari <kjafarisadeg@nvidia.com>
/ok to test 3a72a2b
Signed-off-by: Kamran Jafari <kjafarisadeg@nvidia.com>
/ok to test ab9a70e
- Introduced `test_synthesize_audio_helpers.py` to validate helper functions in audio synthesis, including human text extraction and audio relative path generation.
- Created `test_training_llava_audio_helpers.py` for testing pure helpers in the training script, focusing on functions related to token span finding and audio configuration.
- Enhanced `test_whisper_conversion.py` with additional tests for handling unmapped encoder keys and verifying conversion shapes.
- Added `test_whisper_encoder_forward.py` to ensure correct forward-pass behavior in the Whisper encoder, including output shape contracts and position embedding checks.
- Implemented `test_whisper_layer_specs.py` to confirm that attention mask types remain consistent and validate linear layer implementations in Whisper's architecture.

Signed-off-by: Kamran Jafari <kjafarisadeg@nvidia.com>
/ok to test 67a106e
…rove clarity Signed-off-by: Kamran Jafari <kjafarisadeg@nvidia.com>
/ok to test cc9e68e
/ok to test 4c8f254
Signed-off-by: Kamran Jafari <kjafarisadeg@nvidia.com>
/ok to test 42fb555
Signed-off-by: Kamran Jafari <kjafarisadeg@nvidia.com>
should this modeling code live in the examples folder?
@cuichenx Good question. I guess I can move this to the models folder but we are not qualifying the Whisper model as a stand-alone model in the MBridge with this PR. Wdyt?
@cuichenx what's the move for where modeling code lives?
I've seen it in several places: CLIP is in MCore, Qwen3-VL is in MBridge. Any thoughts on where we should drop this audio encoder? As Kamran mentioned, it won't be a standalone model...
…meters in training script Signed-off-by: Kamran Jafari <kjafarisadeg@nvidia.com>
… learning rates Signed-off-by: Kamran Jafari <kjafarisadeg@nvidia.com>
Signed-off-by: Kamran Jafari <kjafarisadeg@nvidia.com>
/ok to test 1fc43e1
… 5.6 Signed-off-by: Kamran Jafari <kjafarisadeg@nvidia.com>
/ok to test 43c65a7
…NVIDIA-NeMo#3520) Signed-off-by: Kamran Jafari <kjafarisadeg@nvidia.com> Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com>
What does this PR do ?
Adds end-to-end audio modality support to the MIMO LLaVA example, alongside the existing CLIP vision branch. Specifically:
Changelog
`audio` field.
Also from this PR (#3531):
`--deterministic` flag to the MIMO LLaVA and LLaVA-audio training scripts (`megatron_mimo_training_llava.py`, `megatron_mimo_training_llava_audio.py`). When set, vision, language, audio, and projection configs run in FP32 with unfused attention, loss fusion disabled, and full activation recompute so runs are bitwise reproducible.
`deterministic` argument through the relevant config and model spec helpers so each sub-module is configured consistently.
`examples/models/megatron_mimo/` (`run_hetero_llava.sh`, `run_hetero_llava_audio.sh`, and the three `*_parallelism_tests*.sh` variants) to forward `--deterministic` and export the env vars required for deterministic execution (e.g. cuBLAS / NCCL settings).
untouched.
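As a rough illustration of how a `--deterministic` flag could be threaded through per-module configs, here is a sketch in which every config field name (`params_dtype`, `fused_attention`, `loss_fusion`, `recompute_granularity`) and both helper functions are hypothetical stand-ins, not the PR's actual API. The environment values shown are commonly used cuBLAS / NCCL determinism settings and are assumptions about what the launch scripts export.

```python
import argparse


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="MIMO training (sketch)")
    parser.add_argument(
        "--deterministic",
        action="store_true",
        help="Trade speed for reproducible runs.",
    )
    return parser.parse_args(argv)


def apply_deterministic(config, deterministic):
    """Return a copy of `config` with deterministic overrides applied.

    All keys below are hypothetical placeholders for the real config fields.
    """
    out = dict(config)
    if deterministic:
        out.update(
            params_dtype="fp32",           # run in FP32
            fused_attention=False,         # unfused attention
            loss_fusion=False,             # loss fusion disabled
            recompute_granularity="full",  # full activation recompute
        )
    return out


def deterministic_env(deterministic):
    """Environment variables a launch script might export (assumed values)."""
    if not deterministic:
        return {}
    return {"CUBLAS_WORKSPACE_CONFIG": ":4096:8", "NCCL_ALGO": "Ring"}


if __name__ == "__main__":
    args = parse_args()
    base = {"params_dtype": "bf16", "fused_attention": True}
    # Apply the same switch to every sub-module so they stay consistent.
    configs = {
        name: apply_deterministic(base, args.deterministic)
        for name in ("vision", "language", "audio", "projection")
    }
    print(configs, deterministic_env(args.deterministic))
```

Routing one boolean through a single helper keeps the four sub-module configs from drifting apart, which is the consistency property the changelog describes.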
GitHub Actions CI
See the CI section in the Contributing doc for how to trigger the CI. An NVIDIA developer will need to approve and trigger the CI for external contributors.
Before your PR is "Ready for review"
Pre checks:
If you haven't finished some of the above items, you can still open a "Draft" PR.
Additional Information