[Model] Add Voxtral (speech-to-text) model support#21635
mickqian merged 6 commits into sgl-project:main
Conversation
Known issue with TP>1 concurrent multimodal requests: when running Voxtral-Small-24B with TP=2, concurrent multimodal requests hit a pre-existing framework issue in sglang's TP broadcast path; it is not specific to this model.
Code Review
This pull request introduces support for the Voxtral (speech-to-text) model. The implementation includes the VoxtralForConditionalGeneration model architecture, which integrates a Whisper encoder with a Llama decoder via an MLP adapter, and a corresponding VoxtralMultimodalProcessor for handling audio inputs. Additionally, the PR includes necessary patches in the utility functions to ensure MistralCommon tokenizers are compatible with the Hugging Face API used throughout the codebase. Feedback was provided regarding the use of a broad exception handler in the multimodal processor's tokenization logic, which could lead to silent failures and incorrect model behavior.
```python
except Exception:
    input_ids = tokenizer.encode(input_text)
```
Using a broad except Exception here can mask underlying issues and lead to silent failures. If tokenizer.apply_chat_template fails for reasons other than not being implemented (e.g., a malformed messages object), the code will fall back to tokenizer.encode(input_text). This fallback is likely to produce incorrect tokenization as it loses the chat structure and special tokens, which can cause incorrect model behavior. It would be safer to catch more specific exceptions or to log a warning and re-raise the exception if it's unexpected, to avoid silent and hard-to-debug errors.
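One way to act on this feedback is to catch only the expected failure modes and let everything else propagate. A minimal sketch (the helper name and log message are hypothetical, not the PR's actual code):

```python
import logging

logger = logging.getLogger(__name__)

def tokenize_prompt(tokenizer, messages, input_text):
    # Hypothetical helper illustrating the narrower handler. Exceptions
    # other than the two caught below (e.g. a malformed `messages`
    # object) propagate, so bugs fail loudly instead of silently
    # producing wrong token ids.
    try:
        return tokenizer.apply_chat_template(messages, tokenize=True)
    except (AttributeError, NotImplementedError):
        # Tokenizer has no chat-template support: fall back, but warn,
        # because plain encode() loses chat structure and special tokens.
        logger.warning("apply_chat_template unavailable; using encode()")
        return tokenizer.encode(input_text)
```

The key design point is that the fallback only fires when the tokenizer genuinely lacks a chat template, never when the inputs are malformed.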
Hi @JustinTong0323 @yhyang201 @yuan-luo, would you have time to take a look at this PR? Thanks!
/tag-and-rerun-ci
Add inference support for Mistral AI's Voxtral speech-to-text models:
- Voxtral-Mini-3B-2507 (TP=1)
- Voxtral-Small-24B-2507 (TP=2)

Architecture: Whisper Encoder + MLP Projector + Llama Decoder

New files:
- srt/models/voxtral.py: model definition with standalone Whisper encoder, mel spectrogram computation, frame downsampling, and MLP projector
- srt/multimodal/processors/voxtral.py: audio loading, token count computation, and input_ids construction with [AUDIO] tokens

Modified files:
- srt/configs/model_config.py: register VoxtralForConditionalGeneration
- srt/utils/hf_transformers_utils.py: MistralCommonTokenizer compatibility patch (chat_template, apply_chat_template, decode, convert_tokens_to_ids)
Address review feedback: the fallback to tokenizer.encode() should only trigger on prompt parsing issues, not mask unexpected errors.
Address review: simpler and mutates kwargs in-place.
Address review: adopt base_processor's load_mm_data for standard audio loading (thread pool, format handling, resampling). process_and_combine_mm_data cannot be used because HF VoxtralProcessor.__call__ does not support audio.
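The resampling step that load_mm_data handles can be pictured with a minimal sketch (linear interpolation in NumPy; the function name is hypothetical, and real loaders use a proper band-limited resampler rather than this):

```python
import numpy as np

def resample_linear(audio: np.ndarray, sr_in: int, sr_out: int = 16000) -> np.ndarray:
    # Naive linear-interpolation resampler to the 16 kHz rate the
    # Whisper encoder expects. Illustrative only; production code
    # should use e.g. soxr or torchaudio for band-limited resampling.
    if sr_in == sr_out:
        return audio
    n_out = int(round(len(audio) * sr_out / sr_in))
    x_old = np.linspace(0.0, 1.0, num=len(audio), endpoint=False)
    x_new = np.linspace(0.0, 1.0, num=n_out, endpoint=False)
    return np.interp(x_new, x_old, audio)
```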
Co-authored-by: mengxiancheng03 <mengxiancheng03@kuaishou.com>
Sync 4 upstream commits that modified hf_transformers_utils.py:
- Voxtral MistralCommon tokenizer support (sgl-project#21635): retry logic for MistralCommon tokenizers that reject standard HF kwargs, plus _patch_mistral_common_tokenizer for API compatibility
- Mistral embedding fix (sgl-project#21913): skip restoring add_eos_token for fast tokenizers to avoid divergence from HF reference
- RunAI object storage (sgl-project#17948): add is_runai_obj_uri checks in get_config and get_tokenizer
- Fix is_base_mistral CI patch (sgl-project#21729): replace _patch_mistral_regex classmethod instead of module-level is_base_mistral
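The retry logic for tokenizers that reject standard HF kwargs can be sketched as follows (hypothetical helper name; the actual patch in hf_transformers_utils.py is more involved):

```python
def encode_with_fallback(tokenizer, text, **hf_kwargs):
    # MistralCommon tokenizers implement encode() but reject HF-style
    # keyword arguments such as add_special_tokens, raising TypeError.
    # Retry with a bare call in that case so both tokenizer families
    # work behind the same interface.
    try:
        return tokenizer.encode(text, **hf_kwargs)
    except TypeError:
        return tokenizer.encode(text)
```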
Summary
Add inference support for Mistral AI's Voxtral speech-to-text model family:
- mistralai/Voxtral-Mini-3B-2507 (TP=1, 8.9GB VRAM)
- mistralai/Voxtral-Small-24B-2507 (TP=2)

Architecture: Whisper Encoder → Frame Downsampling (4:1) → MLP Projector → Llama Decoder
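The 4:1 frame downsampling stage amounts to concatenating every four consecutive encoder frames along the feature axis before the MLP projector maps them into the decoder's hidden size. A minimal sketch with a hypothetical function name and made-up dimensions:

```python
import numpy as np

def downsample_frames(feats: np.ndarray, k: int = 4) -> np.ndarray:
    # feats: (T, d) Whisper-encoder output. Concatenate every k
    # consecutive frames along the feature axis, shrinking the
    # sequence length k-fold; the MLP projector then maps each
    # (d * k)-dim vector to the Llama hidden size.
    t, d = feats.shape
    t -= t % k                      # drop a partial trailing group
    return feats[:t].reshape(t // k, d * k)
```

Presumably one [AUDIO] placeholder in input_ids then corresponds to each downsampled frame, which is why the processor must compute the token count from the audio length.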
Voxtral is already supported in vLLM. This PR brings equivalent support to SGLang.
- srt/models/voxtral.py: standalone Whisper encoder, mel spectrogram, frame downsampling, MLP projector + Llama LLM
- srt/multimodal/processors/voxtral.py: audio loading, [AUDIO] token computation, input_ids construction
- srt/configs/model_config.py: register architecture
- srt/utils/hf_transformers_utils.py: MistralCommonTokenizer compatibility patch

Usage
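A usage sketch, under the assumption that sglang's OpenAI-compatible /v1/chat/completions endpoint accepts the standard input_audio content part for this model (server URL and request shape are illustrative, not verified against this PR):

```python
import base64
import json

# Placeholder bytes standing in for a real .wav file's contents.
audio_bytes = b"\x00\x01fake-wav-bytes"
audio_b64 = base64.b64encode(audio_bytes).decode()

# OpenAI-style chat request carrying base64 audio. Sending it assumes
# a running sglang server, e.g.:
#   python -m sglang.launch_server --model-path mistralai/Voxtral-Mini-3B-2507
payload = {
    "model": "mistralai/Voxtral-Mini-3B-2507",
    "messages": [{
        "role": "user",
        "content": [
            {"type": "input_audio",
             "input_audio": {"data": audio_b64, "format": "wav"}},
            {"type": "text", "text": "Transcribe this audio."},
        ],
    }],
}
# import requests
# r = requests.post("http://localhost:30000/v1/chat/completions", json=payload)
body = json.dumps(payload)
```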
Test results
Voxtral-Mini-3B (TP=1)
Voxtral-Small-24B (TP=2)
ASR Benchmark (earnings22 dataset, 50 samples)
Note: Voxtral is a multimodal LLM, not a pure ASR model like Whisper. It tends to clean up filler words and make semantic corrections, which inflates WER compared to verbatim transcription models.
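For reference, the WER reported above is word-level edit distance divided by reference length; a minimal implementation (not the benchmark's actual scorer) shows why semantic rewrites inflate it, since every cleaned-up filler word counts as a substitution:

```python
def wer(reference: str, hypothesis: str) -> float:
    # Word error rate: word-level Levenshtein distance over the number
    # of reference words (classic dynamic-programming formulation).
    r, h = reference.split(), hypothesis.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                               # deletions
    for j in range(len(h) + 1):
        d[0][j] = j                               # insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution / match
    return d[len(r)][len(h)] / max(len(r), 1)
```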
Test plan