
[Model] Add Voxtral (speech-to-text) model support #21635

Merged
mickqian merged 6 commits into sgl-project:main from LiYomi:feat/voxtral-support on Apr 5, 2026

Conversation

LiYomi (Contributor) commented Mar 29, 2026

Summary

Add inference support for Mistral AI's Voxtral speech-to-text model family:

  • mistralai/Voxtral-Mini-3B-2507 (TP=1, 8.9GB VRAM)
  • mistralai/Voxtral-Small-24B-2507 (TP=2)

Architecture: Whisper Encoder → Frame Downsampling (4:1) → MLP Projector → Llama Decoder
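For illustration, a minimal PyTorch sketch of the downsampling-plus-projector stage (module names and dimensions here are illustrative assumptions, not the PR's actual code):

import torch
import torch.nn as nn

class AudioAdapterSketch(nn.Module):
    """Downsample encoder frames 4:1 by stacking, then project to LLM width."""

    def __init__(self, enc_dim: int = 1280, llm_dim: int = 3072):
        super().__init__()
        # Concatenating 4 consecutive frames realizes the 4:1 downsampling.
        self.proj = nn.Sequential(
            nn.Linear(enc_dim * 4, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, enc_out: torch.Tensor) -> torch.Tensor:
        # enc_out: (batch, frames, enc_dim); frames assumed divisible by 4
        b, t, d = enc_out.shape
        stacked = enc_out.reshape(b, t // 4, d * 4)
        return self.proj(stacked)  # (batch, frames // 4, llm_dim)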

Voxtral is already supported in vLLM. This PR brings equivalent support to SGLang.

  • New srt/models/voxtral.py: standalone Whisper encoder, mel spectrogram, frame downsampling, MLP projector + Llama LLM
  • New srt/multimodal/processors/voxtral.py: audio loading, [AUDIO] token computation (see the token-count sketch after this list), input_ids construction
  • Modified srt/configs/model_config.py: register architecture
  • Modified srt/utils/hf_transformers_utils.py: MistralCommonTokenizer compatibility patch
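The [AUDIO] token count follows directly from the pipeline above. A back-of-envelope sketch (the function name is hypothetical; the constants assume Whisper's 16 kHz / hop-160 mel convention and the 4:1 downsampling):

import math

def num_audio_tokens(num_samples: int) -> int:
    # Whisper-style mel spectrogram: hop length 160 at 16 kHz
    # -> 100 mel frames per second of audio.
    mel_frames = num_samples // 160
    # The Whisper encoder's conv stack halves the frame count.
    enc_frames = math.ceil(mel_frames / 2)
    # 4:1 frame downsampling before the MLP projector.
    return math.ceil(enc_frames / 4)

# 30 s of 16 kHz audio -> 3000 mel frames -> 1500 encoder frames -> 375 tokens
print(num_audio_tokens(30 * 16000))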

Usage

# 3B model
sglang serve --model-path mistralai/Voxtral-Mini-3B-2507

# 24B model
sglang serve --model-path mistralai/Voxtral-Small-24B-2507 --tp 2

from openai import OpenAI
client = OpenAI(base_url="http://localhost:30000/v1", api_key="None")

response = client.chat.completions.create(
    model="mistralai/Voxtral-Mini-3B-2507",
    messages=[{
        "role": "user",
        "content": [
            {"type": "audio_url", "audio_url": {"url": "data:audio/wav;base64,<base64>"}},
            {"type": "text", "text": "Transcribe this audio."},
        ],
    }],
)
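For reference, one way to build the base64 data URL used above from a local file (a plain-Python sketch; "sample.wav" is a placeholder):

import base64

with open("sample.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode("utf-8")
audio_url = f"data:audio/wav;base64,{audio_b64}"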

Test results

Voxtral-Mini-3B (TP=1)

Transcription: He hoped there would be stew for dinner, turnips and carrots
and bruised potatoes and fat mutton pieces to be ladled out in thick
peppered flour fattened sauce.
Tokens: prompt=384, completion=39

Voxtral-Small-24B (TP=2)

Transcription: He hoped there would be stew for dinner, turnips and carrots
and bruised potatoes and fat mutton pieces to be ladled out in thick,
peppered, flour-fatened sauce.
Tokens: prompt=384, completion=40

ASR Benchmark (earnings22 dataset, 50 samples)

Model                    WER      Throughput
Voxtral-Mini-3B (TP=1)   20.72%   15.4 req/s

Note: Voxtral is a multimodal LLM, not a pure ASR model like Whisper. It tends to clean up filler words and make semantic corrections, which inflates WER compared to verbatim transcription models.
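For context, WER is the word-level edit distance divided by the reference word count. A minimal sketch of computing it with the jiwer library (jiwer is an assumption here; the actual benchmark harness is not part of this PR):

import jiwer

reference = "he hoped there would be stew for dinner"
hypothesis = "he hoped there'd be stew for dinner"

# WER = (substitutions + deletions + insertions) / reference word count,
# so semantic "clean-ups" count as errors against a verbatim reference.
print(f"WER: {jiwer.wer(reference, hypothesis):.2%}")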

Test plan

  • Voxtral-Mini-3B-2507: base64 audio transcription
  • Voxtral-Mini-3B-2507: URL audio transcription
  • Voxtral-Mini-3B-2507: text-only chat (no audio)
  • Voxtral-Small-24B-2507: base64 audio transcription
  • Voxtral-Small-24B-2507: URL audio transcription
  • Voxtral-Small-24B-2507: text-only chat (no audio)
  • ASR WER benchmark on earnings22 (3B)

LiYomi (Contributor, Author) commented Mar 29, 2026

Known issue with TP>1 concurrent multimodal requests:

When running Voxtral-Small-24B with --tp 2, concurrent audio requests can trigger a FileNotFoundError in SharedMemory:

File "sglang/srt/managers/scheduler.py", in recv_requests
    data = pickle.loads(serialized_data)
File "sglang/srt/managers/mm_utils.py", in __setstate__
    self._shm_handle = shared_memory.SharedMemory(name=self.shm_name)
FileNotFoundError: [Errno 2] No such file or directory: '/psm_xxx'

This is a pre-existing framework issue in SGLang's TP broadcast path (broadcast_pyobj via SharedMemory), not introduced by this PR. Sequential requests work fine, and TP=1 (the 3B model) shows no issues at any concurrency level.
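For illustration, the failure mode matches a producer unlinking a SharedMemory segment before the consumer re-attaches to it by name during unpickling. A standalone sketch of that race (hypothetical repro, not SGLang code):

from multiprocessing import shared_memory

# Producer: create a segment and keep only its name, as pickling a
# handle effectively does, then clean up too early.
shm = shared_memory.SharedMemory(create=True, size=1024)
name = shm.name
shm.close()
shm.unlink()  # segment is gone before the consumer attaches

# Consumer: re-attach by name, as __setstate__ does in the traceback.
shared_memory.SharedMemory(name=name)  # FileNotFoundError: [Errno 2] ...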

gemini-code-assist (Bot) left a comment

Code Review

This pull request introduces support for the Voxtral (speech-to-text) model. The implementation includes the VoxtralForConditionalGeneration model architecture, which integrates a Whisper encoder with a Llama decoder via an MLP adapter, and a corresponding VoxtralMultimodalProcessor for handling audio inputs. Additionally, the PR includes necessary patches in the utility functions to ensure MistralCommon tokenizers are compatible with the Hugging Face API used throughout the codebase. Feedback was provided regarding the use of a broad exception handler in the multimodal processor's tokenization logic, which could lead to silent failures and incorrect model behavior.

Comment on lines +125 to +126
except Exception:
    input_ids = tokenizer.encode(input_text)

Severity: high

Using a broad except Exception here can mask underlying issues and lead to silent failures. If tokenizer.apply_chat_template fails for reasons other than not being implemented (e.g., a malformed messages object), the code will fall back to tokenizer.encode(input_text). This fallback is likely to produce incorrect tokenization as it loses the chat structure and special tokens, which can cause incorrect model behavior. It would be safer to catch more specific exceptions or to log a warning and re-raise the exception if it's unexpected, to avoid silent and hard-to-debug errors.
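A sketch of the narrower pattern suggested here (tokenizer, messages, and input_text come from the surrounding processor code; the exception types to catch are an assumption and depend on the tokenizer implementation):

import logging

logger = logging.getLogger(__name__)

try:
    input_ids = tokenizer.apply_chat_template(messages, tokenize=True)
except (NotImplementedError, AttributeError):
    # The tokenizer genuinely lacks chat-template support: fall back.
    input_ids = tokenizer.encode(input_text)
except Exception:
    # Anything else is unexpected: surface it rather than silently
    # producing mis-tokenized input.
    logger.warning("apply_chat_template failed unexpectedly")
    raise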

LiYomi force-pushed the feat/voxtral-support branch from 5c813cc to 8737fba on March 29, 2026 at 17:40
Comment thread python/sglang/srt/multimodal/processors/voxtral.py
Comment thread python/sglang/srt/utils/hf_transformers_utils.py Outdated
LiYomi requested a review from mickqian on March 31, 2026 at 04:20
LiYomi (Contributor, Author) commented Apr 3, 2026

Hi @JustinTong0323 @yhyang201 @yuan-luo, would you have time to take a look at this PR? Thanks!

mickqian (Collaborator) commented Apr 3, 2026

/tag-and-rerun-ci

github-actions (Bot) added the run-ci label on Apr 3, 2026
Comment thread python/sglang/srt/multimodal/processors/voxtral.py Outdated
LiYomi and others added 6 commits April 4, 2026 15:36
Add inference support for Mistral AI's Voxtral speech-to-text models:
- Voxtral-Mini-3B-2507 (TP=1)
- Voxtral-Small-24B-2507 (TP=2)

Architecture: Whisper Encoder + MLP Projector + Llama Decoder

New files:
- srt/models/voxtral.py: model definition with standalone Whisper
  encoder, mel spectrogram computation, frame downsampling, and
  MLP projector
- srt/multimodal/processors/voxtral.py: audio loading, token count
  computation, and input_ids construction with [AUDIO] tokens

Modified files:
- srt/configs/model_config.py: register VoxtralForConditionalGeneration
- srt/utils/hf_transformers_utils.py: MistralCommonTokenizer
  compatibility patch (chat_template, apply_chat_template, decode,
  convert_tokens_to_ids)
Address review feedback: the fallback to tokenizer.encode() should
only trigger on prompt parsing issues, not mask unexpected errors.
Address review: simpler and mutates kwargs in-place.
Address review: adopt base_processor's load_mm_data for standard
audio loading (thread pool, format handling, resampling).
process_and_combine_mm_data cannot be used because HF
VoxtralProcessor.__call__ does not support audio.
LiYomi force-pushed the feat/voxtral-support branch from 120d91d to dd8c073 on April 4, 2026 at 15:40
@mickqian mickqian merged commit 71544f0 into sgl-project:main Apr 5, 2026
231 of 282 checks passed
JustinTong0323 pushed a commit to JustinTong0323/sglang that referenced this pull request Apr 7, 2026
Co-authored-by: mengxiancheng03 <mengxiancheng03@kuaishou.com>
Fridge003 pushed a commit that referenced this pull request Apr 7, 2026
Co-authored-by: mengxiancheng03 <mengxiancheng03@kuaishou.com>
xiezhq-hermann pushed a commit to antgroup/sglang that referenced this pull request Apr 7, 2026
Co-authored-by: mengxiancheng03 <mengxiancheng03@kuaishou.com>
JustinTong0323 added a commit to JustinTong0323/sglang that referenced this pull request Apr 8, 2026
Sync 4 upstream commits that modified hf_transformers_utils.py:
- Voxtral MistralCommon tokenizer support (sgl-project#21635): retry logic for
  MistralCommon tokenizers that reject standard HF kwargs, plus
  _patch_mistral_common_tokenizer for API compatibility
- Mistral embedding fix (sgl-project#21913): skip restoring add_eos_token for
  fast tokenizers to avoid divergence from HF reference
- RunAI object storage (sgl-project#17948): add is_runai_obj_uri checks in
  get_config and get_tokenizer
- Fix is_base_mistral CI patch (sgl-project#21729): replace _patch_mistral_regex
  classmethod instead of module-level is_base_mistral
yhyang201 pushed a commit to yhyang201/sglang that referenced this pull request Apr 22, 2026
Co-authored-by: mengxiancheng03 <mengxiancheng03@kuaishou.com>