Update

Currently we support text, vision, and audio.

Repeated MMMU benchmark runs score between 53.6 and 55.5, consistent with the result reported in the original paper (55).

Known limitations (see Execution Plan below for the full list):

- Token conventions: Phi4MM supports two image token conventions (<|image1|> and <|endoftext10|>); currently we only support the latter. If you use the default chat template, it will automatically pick the supported one.
- Audio capabilities: initially we did not support audio at all. Fixed with "Feat: Support audio in Phi4-mm model" (#8048).
- LoRA / image quality: Phi4MM depends on LoRA for full image capability, but there are compatibility issues with the native SGL LoRA solution. We are working on solving this by refactoring / generalizing SGL's LoRA capabilities. Fixed with "Refactor LoRA handling to support adapter tensors in fused format" (#6585), "Fix incorrect LoRA weight loading for fused gate_up_proj" (#6734), and "Support LoRA in TestOpenAIVisionServer and fix fused kv_proj loading bug" (#6861).

Motivation
Supporting the Phi4 Multimodal model (https://huggingface.co/microsoft/Phi-4-multimodal-instruct) in SGL.
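To illustrate the image-token convention mentioned in the known limitations, here is a minimal Python sketch of building a Phi4MM prompt that uses the supported <|endoftext10|> placeholder. The helper name and chat markup below are illustrative assumptions for demonstration, not SGL's actual API.

```python
# Illustrative sketch only: build_prompt and the chat markup are
# assumptions for demonstration, not SGL's actual API.

IMAGE_TOKEN = "<|endoftext10|>"    # convention currently supported by SGL
LEGACY_IMAGE_TOKEN = "<|image1|>"  # convention not yet supported

def build_prompt(question: str, num_images: int = 1) -> str:
    """Prefix the question with one supported image placeholder per image."""
    placeholders = IMAGE_TOKEN * num_images
    return f"<|user|>{placeholders}{question}<|end|><|assistant|>"

prompt = build_prompt("Describe the image.", num_images=2)
```

In practice the default chat template handles this substitution automatically, so manual construction like the above is only needed when bypassing the template.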
Execution Plan:
Related resources
No response