Skip to content

feat: add Gemma 4 unified audio input support#1671

Merged
jundot merged 1 commit into
jundot:mainfrom
sje397:feat/audio-input-support
Jun 5, 2026
Merged

feat: add Gemma 4 unified audio input support#1671
jundot merged 1 commit into
jundot:mainfrom
sje397:feat/audio-input-support

Conversation

@sje397

@sje397 sje397 commented Jun 5, 2026

Copy link
Copy Markdown
Contributor
  • Support audio input alongside images in VLM engine (BytesIO → numpy, has_multimodal gate now includes audio, image code guarded with num_images > 0)
  • Add gemma4_unified model type throughout adapter, API, and server layers
  • Add suppress_tokens mechanism in scheduler to prevent emission of multimodal placeholder tokens (<image|>, <audio|>)
  • Add end-of-turn token detection for Gemma 4 (<turn|>, ID 106) to prevent hallucinating past conversational turns
  • Add PIL feature extractor fallback for audio processor compatibility
  • Extract audio data from input_audio content blocks in messages
  • Bump mlx-vlm pin to 526c210 (includes Gemma4 unified prefill fixes)
  • Add tests for audio input, suppress tokens, scheduler, and VLM engine

Add Gemma 4 unified audio input support to the Anthropic and OpenAI endpoints, enabling audio alongside images and text in multimodal requests. Extends the VLM engine to process audio data alongside images, adds suppress-token logic for multimodal placeholder tokens, and adds end-of-turn detection for Gemma 4.

Dependency Chain

Upstream mlx-vlm pin: 041f889526c210
Picks up mlx-vlm#1292 — video input support for Gemma 4 12B, which includes the Gemma4 unified prefill fixes needed for audio. Also includes #1291 (APC fix for single requests) and unrelated commits for PaddleOCR/Nemotron/Ideogram.

Changes

Anthropic Endpoint (/v1/messages)

  • ContentBlockInputAudio model: data (base64) + format (default "wav")
  • convert_anthropic_to_internal() extended to preserve input_audio blocks when preserve_images=True, extracting audio data for VLM processing
  • _build_message_from_parts() handles input_audio alongside images

OpenAI Endpoint (/v1/chat/completions)

  • InputAudio model: data (base64 or data URI) + format (default "wav")
  • ContentPart.type extended to accept "input_audio" with InputAudio field
  • Audio content parts parsed through the same internal pipeline as Anthropic

VLM Engine

  • has_multimodal gate now includes audio (not just images)
  • Image processing guarded with num_images > 0 to avoid empty-image errors with audio-only requests
  • Audio data extracted from input_audio content blocks in messages → BytesIO → numpy
  • PIL feature extractor fallback for audio processor compatibility

Scheduler

  • Suppress-tokens mechanism prevents emission of multimodal placeholder tokens (<image|>, <audio|>) that would otherwise leak through logits post-processing
  • End-of-turn detection for Gemma 4 (<turn|>, ID 106) prevents hallucinating past conversational turns

Model Adapter

  • gemma4_unified model type added throughout adapter, API, and server layers

Test Coverage

Layer File Tests Coverage
Anthropic conversion test_anthropic_adapter.py 3 ContentBlockInputAudio preserve/drop/audio-only
API utils (shared) test_api_utils.py 4 Audio content block extraction from messages
Image/audio utils test_image_utils.py 6 Base64, raw bytes, paths, data URIs, mixed
OpenAI models test_openai_models.py 9 InputAudio creation/validation/serialization; ContentPart with type="input_audio"; Message with mixed audio+text
OpenAI adapter test_openai_adapter.py 2 parse_request with audio content (mixed + audio-only), verifies no crash
VLM engine test_vlm_engine.py ~177 lines Audio pipeline, multimodal gate, BytesIO→numpy
Scheduler test_scheduler.py ~36 lines Suppress tokens, end-of-turn detection
Gemma4 messages test_gemma4_messages.py ~16 lines `<turn

File Stats

18 files changed, +1054, -66

Production code (8 files): anthropic_models.py (+13), anthropic_utils.py (+59/-?), openai_models.py (+12), api/utils.py (+24/-?), engine/vlm.py (+152/-?), scheduler.py (+90), utils/image.py (+57/-?), adapter/gemma4.py (+9/-?)

Tests (8 files): test_anthropic_adapter.py (+123), test_api_utils.py (+43), test_image_utils.py (+140/-?), test_vlm_engine.py (+177/-?), test_scheduler.py (+36), test_gemma4_messages.py (+16), test_openai_models.py (+113), test_openai_adapter.py (+45)

Config (1 file): pyproject.toml (mlx-vlm pin bump)

Fixes #591

- Support audio input alongside images in VLM engine (BytesIO → numpy,
  has_multimodal gate now includes audio, image code guarded with
  num_images > 0)
- Add gemma4_unified model type throughout adapter, API, and server layers
- Add suppress_tokens mechanism in scheduler to prevent emission of
  multimodal placeholder tokens (<image|>, <audio|>)
- Add end-of-turn token detection for Gemma 4 (<turn|>, ID 106) to
  prevent hallucinating past conversational turns
- Add PIL feature extractor fallback for audio processor compatibility
- Extract audio data from input_audio content blocks in messages
- Bump mlx-vlm pin to 526c210 (includes Gemma4 unified prefill fixes)
- Add tests for audio input, suppress tokens, scheduler, and VLM engine
@jundot

jundot commented Jun 5, 2026

Copy link
Copy Markdown
Owner

Thanks for the PR. I reviewed the audio input path for OpenAI/Anthropic, the VLM preprocessing changes, suppress-token handling, and Gemma 4 turn stop behavior; the focused tests and a local Gemma 4 12B audio smoke test passed for me.

The CI failure is just a small test fixture gap in a partial Scheduler mock, not a runtime issue with this change. This looks good to me, and I'm going to merge it; I'll fold that test fix into a follow-up.

@jundot jundot merged commit 0d4197f into jundot:main Jun 5, 2026
0 of 4 checks passed
@sje397

sje397 commented Jun 5, 2026

Copy link
Copy Markdown
Contributor Author

You're very welcome. Thanks for merging, and thanks for oMLX.

@monroewilliams

Copy link
Copy Markdown
Contributor

Thanks for getting this in! I was working on similar changes for a local experiment, but I wasn't handling the end-of-turn token proerly and was getting some misbehavior from the model. Your changes cleaned all that up. :)

I did run into one issue with dependency imports, which I've filed as #1688

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Feature Request: Audio input support in /v1/chat/completions for multimodal models (e.g. Gemma-4)

4 participants