feat: add Gemma 4 unified audio input support by sje397 · Pull Request #1671 · jundot/omlx

sje397 · 2026-06-05T05:11:23Z

Support audio input alongside images in VLM engine (BytesIO → numpy, has_multimodal gate now includes audio, image code guarded with num_images > 0)
Add gemma4_unified model type throughout adapter, API, and server layers
Add suppress_tokens mechanism in scheduler to prevent emission of multimodal placeholder tokens (<image|>, <audio|>)
Add end-of-turn token detection for Gemma 4 (<turn|>, ID 106) to prevent hallucinating past conversational turns
Add PIL feature extractor fallback for audio processor compatibility
Extract audio data from input_audio content blocks in messages
Bump mlx-vlm pin to 526c210 (includes Gemma4 unified prefill fixes)
Add tests for audio input, suppress tokens, scheduler, and VLM engine

Add Gemma 4 unified audio input support to the Anthropic and OpenAI endpoints, enabling audio alongside images and text in multimodal requests. Extends the VLM engine to process audio data alongside images, adds suppress-token logic for multimodal placeholder tokens, and adds end-of-turn detection for Gemma 4.

Dependency Chain

Upstream mlx-vlm pin: 041f889 → 526c210
Picks up mlx-vlm#1292 — video input support for Gemma 4 12B, which includes the Gemma4 unified prefill fixes needed for audio. Also includes #1291 (APC fix for single requests) and unrelated commits for PaddleOCR/Nemotron/Ideogram.

Changes

Anthropic Endpoint (`/v1/messages`)

ContentBlockInputAudio model: data (base64) + format (default "wav")
convert_anthropic_to_internal() extended to preserve input_audio blocks when preserve_images=True, extracting audio data for VLM processing
_build_message_from_parts() handles input_audio alongside images

OpenAI Endpoint (`/v1/chat/completions`)

InputAudio model: data (base64 or data URI) + format (default "wav")
ContentPart.type extended to accept "input_audio" with InputAudio field
Audio content parts parsed through the same internal pipeline as Anthropic

VLM Engine

has_multimodal gate now includes audio (not just images)
Image processing guarded with num_images > 0 to avoid empty-image errors with audio-only requests
Audio data extracted from input_audio content blocks in messages → BytesIO → numpy
PIL feature extractor fallback for audio processor compatibility

Scheduler

Suppress-tokens mechanism prevents emission of multimodal placeholder tokens (<image|>, <audio|>) that would otherwise leak through logits post-processing
End-of-turn detection for Gemma 4 (<turn|>, ID 106) prevents hallucinating past conversational turns

Model Adapter

gemma4_unified model type added throughout adapter, API, and server layers

Test Coverage

Layer	File	Tests	Coverage
Anthropic conversion	`test_anthropic_adapter.py`	3	`ContentBlockInputAudio` preserve/drop/audio-only
API utils (shared)	`test_api_utils.py`	4	Audio content block extraction from messages
Image/audio utils	`test_image_utils.py`	6	Base64, raw bytes, paths, data URIs, mixed
OpenAI models	`test_openai_models.py`	9	`InputAudio` creation/validation/serialization; `ContentPart` with `type="input_audio"`; `Message` with mixed audio+text
OpenAI adapter	`test_openai_adapter.py`	2	`parse_request` with audio content (mixed + audio-only), verifies no crash
VLM engine	`test_vlm_engine.py`	~177 lines	Audio pipeline, multimodal gate, BytesIO→numpy
Scheduler	`test_scheduler.py`	~36 lines	Suppress tokens, end-of-turn detection
Gemma4 messages	`test_gemma4_messages.py`	~16 lines	`<turn

File Stats

18 files changed, +1054, -66

Production code (8 files): anthropic_models.py (+13), anthropic_utils.py (+59/-?), openai_models.py (+12), api/utils.py (+24/-?), engine/vlm.py (+152/-?), scheduler.py (+90), utils/image.py (+57/-?), adapter/gemma4.py (+9/-?)

Tests (8 files): test_anthropic_adapter.py (+123), test_api_utils.py (+43), test_image_utils.py (+140/-?), test_vlm_engine.py (+177/-?), test_scheduler.py (+36), test_gemma4_messages.py (+16), test_openai_models.py (+113), test_openai_adapter.py (+45)

Config (1 file): pyproject.toml (mlx-vlm pin bump)

Fixes #591

- Support audio input alongside images in VLM engine (BytesIO → numpy, has_multimodal gate now includes audio, image code guarded with num_images > 0) - Add gemma4_unified model type throughout adapter, API, and server layers - Add suppress_tokens mechanism in scheduler to prevent emission of multimodal placeholder tokens (<image|>, <audio|>) - Add end-of-turn token detection for Gemma 4 (<turn|>, ID 106) to prevent hallucinating past conversational turns - Add PIL feature extractor fallback for audio processor compatibility - Extract audio data from input_audio content blocks in messages - Bump mlx-vlm pin to 526c210 (includes Gemma4 unified prefill fixes) - Add tests for audio input, suppress tokens, scheduler, and VLM engine

jundot · 2026-06-05T06:41:28Z

Thanks for the PR. I reviewed the audio input path for OpenAI/Anthropic, the VLM preprocessing changes, suppress-token handling, and Gemma 4 turn stop behavior; the focused tests and a local Gemma 4 12B audio smoke test passed for me.

The CI failure is just a small test fixture gap in a partial Scheduler mock, not a runtime issue with this change. This looks good to me, and I'm going to merge it; I'll fold that test fix into a follow-up.

sje397 · 2026-06-05T06:47:37Z

You're very welcome. Thanks for merging, and thanks for oMLX.

monroewilliams · 2026-06-05T18:57:10Z

Thanks for getting this in! I was working on similar changes for a local experiment, but I wasn't handling the end-of-turn token proerly and was getting some misbehavior from the model. Your changes cleaned all that up. :)

I did run into one issue with dependency imports, which I've filed as #1688

jundot merged commit 0d4197f into jundot:main Jun 5, 2026
0 of 4 checks passed

monroewilliams mentioned this pull request Jun 5, 2026

Audio input causes HTTP 500 errors: mlx-vlm pin imports resample_audio from mlx_audio.utils, but the pinned mlx-audio moved it to mlx_audio.stt.utils #1688

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: add Gemma 4 unified audio input support#1671

feat: add Gemma 4 unified audio input support#1671
jundot merged 1 commit into
jundot:mainfrom
sje397:feat/audio-input-support

sje397 commented Jun 5, 2026

Uh oh!

jundot commented Jun 5, 2026

Uh oh!

Uh oh!

sje397 commented Jun 5, 2026

Uh oh!

monroewilliams commented Jun 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

sje397 commented Jun 5, 2026

Dependency Chain

Changes

Anthropic Endpoint (/v1/messages)

OpenAI Endpoint (/v1/chat/completions)

VLM Engine

Scheduler

Model Adapter

Test Coverage

File Stats

Uh oh!

jundot commented Jun 5, 2026

Uh oh!

Uh oh!

sje397 commented Jun 5, 2026

Uh oh!

monroewilliams commented Jun 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Anthropic Endpoint (`/v1/messages`)

OpenAI Endpoint (`/v1/chat/completions`)