Feature Request: Audio input support in /v1/chat/completions for multimodal models (e.g. Gemma-4)
Is your feature request related to a problem? Please describe.
When using multimodal models like gemma-4-e2b-it-4bit or gemma-4-e4b-it-4bit that natively support audio input, it is currently not possible to pass audio data through the /v1/chat/completions endpoint. Sending audio via the input_audio content type (OpenAI-compatible format) results in the model receiving only the text portion of the message — the audio is silently ignored and never processed.
For example, the following request results in the model responding as if no audio was provided (only 17 prompt tokens, no audio tokens):
curl -X POST http://localhost:8005/v1/chat/completions \
  -H "Authorization: Bearer <key>" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-4-e2b-it-4bit",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "Please describe this audio"},
          {
            "type": "input_audio",
            "input_audio": {
              "data": "<base64-encoded-wav>",
              "format": "wav"
            }
          }
        ]
      }
    ]
  }'
The model replies asking the user to provide audio, meaning the audio content block was not forwarded to the model at all.
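For anyone who wants to reproduce this without hand-encoding a file, the request body above can be built in Python along these lines. This is only a sketch: `make_silent_wav` and `build_audio_chat_request` are hypothetical helper names, and the generated clip is silence standing in for a real recording.

```python
import base64
import io
import json
import wave


def make_silent_wav(duration_s: float = 0.1, sample_rate: int = 16000) -> bytes:
    """Return a short mono 16-bit PCM WAV of silence (placeholder for real audio)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)   # mono
        w.setsampwidth(2)   # 16-bit samples
        w.setframerate(sample_rate)
        w.writeframes(b"\x00\x00" * int(duration_s * sample_rate))
    return buf.getvalue()


def build_audio_chat_request(model: str, prompt: str, audio_bytes: bytes,
                             audio_format: str = "wav") -> dict:
    """Build a chat completions body with one text part and one input_audio part."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {
                        "type": "input_audio",
                        "input_audio": {
                            # The audio travels as a base64 string inside the JSON body.
                            "data": base64.b64encode(audio_bytes).decode("ascii"),
                            "format": audio_format,
                        },
                    },
                ],
            }
        ],
    }


if __name__ == "__main__":
    body = build_audio_chat_request(
        "gemma-4-e2b-it-4bit", "Please describe this audio", make_silent_wav()
    )
    # Pipe this JSON to curl with `-d @-`, or send it with any HTTP client.
    print(json.dumps(body)[:200])
```

Replacing `make_silent_wav()` with the bytes of a real file (for example `open("test.wav", "rb").read()`) gives the same request as the curl command.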
Describe the solution you'd like
Support the input_audio content type in /v1/chat/completions for models that have audio input capability (e.g. Gemma-4 series). The audio should be decoded from base64 and passed to the model's processor alongside the text tokens, similar to how image input is already handled for vision models.
Ideally, this would follow the OpenAI audio input format:
{
  "type": "input_audio",
  "input_audio": {
    "data": "<base64-encoded-audio>",
    "format": "wav"
  }
}
Supported formats should include at minimum wav and mp3.
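To make the request concrete, here is a rough sketch of the parsing step the server would need: walk a message's content parts, keep the text, and decode each input_audio part from base64 after checking its format. The names `SUPPORTED_AUDIO_FORMATS` and `split_text_and_audio` are made up for illustration and are not taken from the existing server code. How the decoded bytes are then handed to the model's processor is left out, since that depends on the server internals.

```python
import base64
import binascii

# Assumption: the minimum format set proposed in this request.
SUPPORTED_AUDIO_FORMATS = {"wav", "mp3"}


def split_text_and_audio(content):
    """Split one message's `content` into (text, [(format, raw_bytes), ...])."""
    if isinstance(content, str):
        # Plain string content carries no audio parts.
        return content, []

    texts, clips = [], []
    for part in content:
        kind = part.get("type")
        if kind == "text":
            texts.append(part.get("text", ""))
        elif kind == "input_audio":
            audio = part.get("input_audio") or {}
            fmt = str(audio.get("format", "")).lower()
            if fmt not in SUPPORTED_AUDIO_FORMATS:
                raise ValueError(f"unsupported audio format: {fmt!r}")
            try:
                raw = base64.b64decode(audio.get("data", ""), validate=True)
            except (binascii.Error, ValueError) as exc:
                raise ValueError("input_audio.data is not valid base64") from exc
            if not raw:
                raise ValueError("input_audio.data is empty")
            clips.append((fmt, raw))
        # Other part types (e.g. images) are left to the existing handling.
    return "\n".join(texts), clips


if __name__ == "__main__":
    demo = [
        {"type": "text", "text": "Please describe this audio"},
        {"type": "input_audio",
         "input_audio": {"data": base64.b64encode(b"RIFFdemo").decode("ascii"),
                         "format": "wav"}},
    ]
    text, clips = split_text_and_audio(demo)
    print(text, [(fmt, len(raw)) for fmt, raw in clips])
```

Rejecting unknown formats and bad base64 with an explicit error, rather than dropping the part, would also avoid the silent-ignore behavior described above.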
Describe alternatives you've considered
- Using /v1/audio/transcriptions first, then passing the text to chat completions: this works as a workaround (e.g. using Qwen3-ASR-0.6B for transcription), but it requires two separate API calls, increases latency, and loses any non-verbal audio information that the multimodal model could otherwise interpret directly.
- Using the mlx_vlm.generate --audio CLI directly: this works (verified with gemma-4-e4b-it-4bit), confirming that the underlying model supports audio input. The gap is only at the HTTP server layer.
Additional context
- Verified that mlx_vlm.generate --audio test.wav --model gemma-4-e4b-it-4bit correctly processes audio, proving the model itself supports audio input natively.
- The /v1/audio/transcriptions endpoint works correctly for dedicated ASR models (e.g. Qwen3-ASR-0.6B).
- This feature would unlock the full multimodal capability of Gemma-4 and similar models through the HTTP API, making it consistent with the CLI behavior already supported by mlx_vlm.