[Model] Add Voxtral (speech-to-text) model support#21635
mickqian merged 6 commits into sgl-project:main
Conversation
Known issue with TP>1 concurrent multimodal requests: when running Voxtral-Small-24B with TP=2, concurrent multimodal requests hit a pre-existing framework issue in sglang's TP broadcast path; it is not specific to this model.
Code Review
This pull request introduces support for the Voxtral (speech-to-text) model. The implementation includes the VoxtralForConditionalGeneration model architecture, which integrates a Whisper encoder with a Llama decoder via an MLP adapter, and a corresponding VoxtralMultimodalProcessor for handling audio inputs. Additionally, the PR includes necessary patches in the utility functions to ensure MistralCommon tokenizers are compatible with the Hugging Face API used throughout the codebase. Feedback was provided regarding the use of a broad exception handler in the multimodal processor's tokenization logic, which could lead to silent failures and incorrect model behavior.
```python
except Exception:
    input_ids = tokenizer.encode(input_text)
```
Using a broad except Exception here can mask underlying issues and lead to silent failures. If tokenizer.apply_chat_template fails for reasons other than not being implemented (e.g., a malformed messages object), the code will fall back to tokenizer.encode(input_text). This fallback is likely to produce incorrect tokenization as it loses the chat structure and special tokens, which can cause incorrect model behavior. It would be safer to catch more specific exceptions or to log a warning and re-raise the exception if it's unexpected, to avoid silent and hard-to-debug errors.
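One way to act on this feedback is to catch only the expected failure modes and let everything else propagate. A minimal sketch (the helper name and log message are hypothetical, not the PR's actual code):

```python
import logging

logger = logging.getLogger(__name__)

def tokenize_prompt(tokenizer, messages, input_text):
    # Hypothetical helper illustrating the narrower handler. Exceptions
    # other than the two caught below (e.g. a malformed `messages`
    # object) propagate, so bugs fail loudly instead of silently
    # producing wrong token ids.
    try:
        return tokenizer.apply_chat_template(messages, tokenize=True)
    except (AttributeError, NotImplementedError):
        # Tokenizer has no chat-template support: fall back, but warn,
        # because plain encode() loses chat structure and special tokens.
        logger.warning("apply_chat_template unavailable; using encode()")
        return tokenizer.encode(input_text)
```

The key design point is that the fallback only fires when the tokenizer genuinely lacks a chat template, never when the inputs are malformed.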
Hi @JustinTong0323 @yhyang201 @yuan-luo, would you have time to take a look at this PR? Thanks!
/tag-and-rerun-ci
Add inference support for Mistral AI's Voxtral speech-to-text models:
- Voxtral-Mini-3B-2507 (TP=1)
- Voxtral-Small-24B-2507 (TP=2)

Architecture: Whisper Encoder + MLP Projector + Llama Decoder

New files:
- srt/models/voxtral.py: model definition with standalone Whisper encoder, mel spectrogram computation, frame downsampling, and MLP projector
- srt/multimodal/processors/voxtral.py: audio loading, token count computation, and input_ids construction with [AUDIO] tokens

Modified files:
- srt/configs/model_config.py: register VoxtralForConditionalGeneration
- srt/utils/hf_transformers_utils.py: MistralCommonTokenizer compatibility patch (chat_template, apply_chat_template, decode, convert_tokens_to_ids)
Address review feedback: the fallback to tokenizer.encode() should only trigger on prompt parsing issues, not mask unexpected errors.
Address review: simpler and mutates kwargs in-place.
Address review: adopt base_processor's load_mm_data for standard audio loading (thread pool, format handling, resampling). process_and_combine_mm_data cannot be used because HF VoxtralProcessor.__call__ does not support audio.
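The resampling step that load_mm_data handles can be pictured with a minimal sketch (linear interpolation in NumPy; the function name is hypothetical, and real loaders use a proper band-limited resampler rather than this):

```python
import numpy as np

def resample_linear(audio: np.ndarray, sr_in: int, sr_out: int = 16000) -> np.ndarray:
    # Naive linear-interpolation resampler to the 16 kHz rate the
    # Whisper encoder expects. Illustrative only; production code
    # should use e.g. soxr or torchaudio for band-limited resampling.
    if sr_in == sr_out:
        return audio
    n_out = int(round(len(audio) * sr_out / sr_in))
    x_old = np.linspace(0.0, 1.0, num=len(audio), endpoint=False)
    x_new = np.linspace(0.0, 1.0, num=n_out, endpoint=False)
    return np.interp(x_new, x_old, audio)
```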
Co-authored-by: mengxiancheng03 <mengxiancheng03@kuaishou.com>
Sync 4 upstream commits that modified hf_transformers_utils.py:
- Voxtral MistralCommon tokenizer support (sgl-project#21635): retry logic for MistralCommon tokenizers that reject standard HF kwargs, plus _patch_mistral_common_tokenizer for API compatibility
- Mistral embedding fix (sgl-project#21913): skip restoring add_eos_token for fast tokenizers to avoid divergence from HF reference
- RunAI object storage (sgl-project#17948): add is_runai_obj_uri checks in get_config and get_tokenizer
- Fix is_base_mistral CI patch (sgl-project#21729): replace _patch_mistral_regex classmethod instead of module-level is_base_mistral
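The retry logic for tokenizers that reject standard HF kwargs can be sketched as follows (hypothetical helper name; the actual patch in hf_transformers_utils.py is more involved):

```python
def encode_with_fallback(tokenizer, text, **hf_kwargs):
    # MistralCommon tokenizers implement encode() but reject HF-style
    # keyword arguments such as add_special_tokens, raising TypeError.
    # Retry with a bare call in that case so both tokenizer families
    # work behind the same interface.
    try:
        return tokenizer.encode(text, **hf_kwargs)
    except TypeError:
        return tokenizer.encode(text)
```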
Summary
Add inference support for Mistral AI's Voxtral speech-to-text model family:
- mistralai/Voxtral-Mini-3B-2507 (TP=1, 8.9GB VRAM)
- mistralai/Voxtral-Small-24B-2507 (TP=2)

Architecture: Whisper Encoder → Frame Downsampling (4:1) → MLP Projector → Llama Decoder
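The 4:1 frame downsampling stage amounts to concatenating every four consecutive encoder frames along the feature axis before the MLP projector maps them into the decoder's hidden size. A minimal sketch with a hypothetical function name and made-up dimensions:

```python
import numpy as np

def downsample_frames(feats: np.ndarray, k: int = 4) -> np.ndarray:
    # feats: (T, d) Whisper-encoder output. Concatenate every k
    # consecutive frames along the feature axis, shrinking the
    # sequence length k-fold; the MLP projector then maps each
    # (d * k)-dim vector to the Llama hidden size.
    t, d = feats.shape
    t -= t % k                      # drop a partial trailing group
    return feats[:t].reshape(t // k, d * k)
```

Presumably one [AUDIO] placeholder in input_ids then corresponds to each downsampled frame, which is why the processor must compute the token count from the audio length.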
Voxtral is already supported in vLLM. This PR brings equivalent support to SGLang.
- srt/models/voxtral.py: standalone Whisper encoder, mel spectrogram, frame downsampling, MLP projector + Llama LLM
- srt/multimodal/processors/voxtral.py: audio loading, [AUDIO] token computation, input_ids construction
- srt/configs/model_config.py: register architecture
- srt/utils/hf_transformers_utils.py: MistralCommonTokenizer compatibility patch

Usage
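A usage sketch, under the assumption that sglang's OpenAI-compatible /v1/chat/completions endpoint accepts the standard input_audio content part for this model (server URL and request shape are illustrative, not verified against this PR):

```python
import base64
import json

# Placeholder bytes standing in for a real .wav file's contents.
audio_bytes = b"\x00\x01fake-wav-bytes"
audio_b64 = base64.b64encode(audio_bytes).decode()

# OpenAI-style chat request carrying base64 audio. Sending it assumes
# a running sglang server, e.g.:
#   python -m sglang.launch_server --model-path mistralai/Voxtral-Mini-3B-2507
payload = {
    "model": "mistralai/Voxtral-Mini-3B-2507",
    "messages": [{
        "role": "user",
        "content": [
            {"type": "input_audio",
             "input_audio": {"data": audio_b64, "format": "wav"}},
            {"type": "text", "text": "Transcribe this audio."},
        ],
    }],
}
# import requests
# r = requests.post("http://localhost:30000/v1/chat/completions", json=payload)
body = json.dumps(payload)
```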
Test results
Voxtral-Mini-3B (TP=1)
Voxtral-Small-24B (TP=2)
ASR Benchmark (earnings22 dataset, 50 samples)
Note: Voxtral is a multimodal LLM, not a pure ASR model like Whisper. It tends to clean up filler words and make semantic corrections, which inflates WER compared to verbatim transcription models.
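For reference, the WER reported above is word-level edit distance divided by reference length; a minimal implementation (not the benchmark's actual scorer) shows why semantic rewrites inflate it, since every cleaned-up filler word counts as a substitution:

```python
def wer(reference: str, hypothesis: str) -> float:
    # Word error rate: word-level Levenshtein distance over the number
    # of reference words (classic dynamic-programming formulation).
    r, h = reference.split(), hypothesis.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                               # deletions
    for j in range(len(h) + 1):
        d[0][j] = j                               # insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution / match
    return d[len(r)][len(h)] / max(len(r), 1)
```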
Test plan