feat: add Gemma 4 unified audio input support#1671
Conversation
- Support audio input alongside images in VLM engine (BytesIO → numpy, has_multimodal gate now includes audio, image code guarded with num_images > 0) - Add gemma4_unified model type throughout adapter, API, and server layers - Add suppress_tokens mechanism in scheduler to prevent emission of multimodal placeholder tokens (<image|>, <audio|>) - Add end-of-turn token detection for Gemma 4 (<turn|>, ID 106) to prevent hallucinating past conversational turns - Add PIL feature extractor fallback for audio processor compatibility - Extract audio data from input_audio content blocks in messages - Bump mlx-vlm pin to 526c210 (includes Gemma4 unified prefill fixes) - Add tests for audio input, suppress tokens, scheduler, and VLM engine
|
Thanks for the PR. I reviewed the audio input path for OpenAI/Anthropic, the VLM preprocessing changes, suppress-token handling, and Gemma 4 turn stop behavior; the focused tests and a local Gemma 4 12B audio smoke test passed for me. The CI failure is just a small test fixture gap in a partial Scheduler mock, not a runtime issue with this change. This looks good to me, and I'm going to merge it; I'll fold that test fix into a follow-up. |
|
You're very welcome. Thanks for merging, and thanks for oMLX. |
|
Thanks for getting this in! I was working on similar changes for a local experiment, but I wasn't handling the end-of-turn token proerly and was getting some misbehavior from the model. Your changes cleaned all that up. :) I did run into one issue with dependency imports, which I've filed as #1688 |
Add Gemma 4 unified audio input support to the Anthropic and OpenAI endpoints, enabling audio alongside images and text in multimodal requests. Extends the VLM engine to process audio data alongside images, adds suppress-token logic for multimodal placeholder tokens, and adds end-of-turn detection for Gemma 4.
Dependency Chain
Upstream mlx-vlm pin:
041f889→526c210Picks up mlx-vlm#1292 — video input support for Gemma 4 12B, which includes the Gemma4 unified prefill fixes needed for audio. Also includes #1291 (APC fix for single requests) and unrelated commits for PaddleOCR/Nemotron/Ideogram.
Changes
Anthropic Endpoint (
/v1/messages)ContentBlockInputAudiomodel:data(base64) +format(default"wav")convert_anthropic_to_internal()extended to preserveinput_audioblocks whenpreserve_images=True, extracting audio data for VLM processing_build_message_from_parts()handlesinput_audioalongside imagesOpenAI Endpoint (
/v1/chat/completions)InputAudiomodel:data(base64 or data URI) +format(default"wav")ContentPart.typeextended to accept"input_audio"withInputAudiofieldVLM Engine
has_multimodalgate now includes audio (not just images)num_images > 0to avoid empty-image errors with audio-only requestsinput_audiocontent blocks in messages → BytesIO → numpyScheduler
<image|>,<audio|>) that would otherwise leak through logits post-processing<turn|>, ID 106) prevents hallucinating past conversational turnsModel Adapter
gemma4_unifiedmodel type added throughout adapter, API, and server layersTest Coverage
test_anthropic_adapter.pyContentBlockInputAudiopreserve/drop/audio-onlytest_api_utils.pytest_image_utils.pytest_openai_models.pyInputAudiocreation/validation/serialization;ContentPartwithtype="input_audio";Messagewith mixed audio+texttest_openai_adapter.pyparse_requestwith audio content (mixed + audio-only), verifies no crashtest_vlm_engine.pytest_scheduler.pytest_gemma4_messages.pyFile Stats
Production code (8 files):
anthropic_models.py(+13),anthropic_utils.py(+59/-?),openai_models.py(+12),api/utils.py(+24/-?),engine/vlm.py(+152/-?),scheduler.py(+90),utils/image.py(+57/-?),adapter/gemma4.py(+9/-?)Tests (8 files):
test_anthropic_adapter.py(+123),test_api_utils.py(+43),test_image_utils.py(+140/-?),test_vlm_engine.py(+177/-?),test_scheduler.py(+36),test_gemma4_messages.py(+16),test_openai_models.py(+113),test_openai_adapter.py(+45)Config (1 file):
pyproject.toml(mlx-vlm pin bump)Fixes #591