fix: force gemma4_unified detection as VLM #1744
|
Thanks for the PR. The model discovery change makes sense to me, but I do not think the feature extractor mapping should change.
It is not equivalent to I can merge the narrower discovery fix once that mapping change is removed or covered by a runtime-compatible test. |
001769d to 5c9d897
5c9d897 to 389ffff
|
Thank you @jundot I have updated the PR. |
Gemma4 unified models always include vision+audio capabilities, so they should never fall back to LLM detection even when vision_config is absent from config.json (e.g. some quantized variants).
- Add special case in detect_model_type for gemma4_unified → VLM
- Update test to reflect that gemma4_unified is always VLM
389ffff to 396c752
|
Thanks for updating this. The narrowed discovery-only change looks right: gemma4_unified should stay on the mlx-vlm path even when quantized configs omit vision_config, and the unified audio feature extractor mapping is preserved. CI is green, so this looks good to me, and I am going to merge it. |
Summary
Forces gemma4_unified detection as VLM in model discovery.
Problem
gemma4_unified models were detected as LLM when their config.json lacked vision_config (e.g., some quantized variants). Since unified models always include vision+audio capabilities by definition, they should never fall back to LLM.
Changes
- omlx/model_discovery.py: Force gemma4_unified → VLM detection regardless of vision_config presence.
- tests/test_model_discovery.py: Updated test_detect_text_only_gemma4_unified_as_llm → test_detect_gemma4_unified_without_vision_config_as_vlm since gemma4_unified is always VLM.
Notes
- vlm.py is not changed: Gemma4UnifiedAudioFeatureExtractor correctly resolves to mlx_vlm.models.gemma4_unified.processing_gemma4_unified.Gemma4UnifiedAudioFeatureExtractor in the pinned mlx-vlm (commit 54c9a11). That class differs from Gemma4AudioFeatureExtractor (which produces mel spectrograms); the unified extractor outputs raw waveform chunks with audio_samples_per_token=640.
- gemma4_unified is an architecture type (not a size label). Currently only the 12B uses it; future models with the same unified audio+vision architecture would also use this type.
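For readers who want to see the shape of the discovery change, here is a minimal sketch of the special case described above. It is not the actual code in omlx/model_discovery.py: the function signature, the "vlm"/"llm" return values, the ALWAYS_VLM_MODEL_TYPES name, and the generic vision_config heuristic are all assumptions made for illustration.

```python
from typing import Any, Mapping

# Hypothetical name: model_type values treated as multimodal by definition.
ALWAYS_VLM_MODEL_TYPES = frozenset({"gemma4_unified"})


def detect_model_type(config: Mapping[str, Any]) -> str:
    """Classify a parsed config.json as "vlm" or "llm" (illustrative only)."""
    model_type = config.get("model_type", "")

    # Special case: unified models always carry vision+audio capabilities,
    # so they are routed to the VLM path even without a vision_config block
    # (as can happen with some quantized variants).
    if model_type in ALWAYS_VLM_MODEL_TYPES:
        return "vlm"

    # Assumed generic heuristic: a vision_config block signals a VLM.
    if "vision_config" in config:
        return "vlm"

    return "llm"
```

Under these assumptions, a config containing only {"model_type": "gemma4_unified"} is classified as VLM, while any other model type without vision_config still falls back to LLM. Checking the model type before the vision_config heuristic keeps the override independent of whatever a quantization tool happens to strip from the config.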