
[Feature] Phi-4-MM support #6544

@lifuhuang

Description


Update

We now support text, vision, and audio.

Repeated MMMU benchmark runs score between 53.6 and 55.5, consistent with the benchmark reported in the original paper (55).

Known limitations (see the Execution Plan below for the full list):

  1. Image tokens: Phi4MM supports two image token conventions (<|image1|> and <|endoftext10|>); currently we only support the latter. If you use the default chat template, it will automatically pick the supported one.
  2. Audio capabilities: audio was initially not supported at all. Fixed with Feat: Support audio in Phi4-mm model #8048.
  3. LoRA / Image quality: Phi4MM depends on LoRA for full image capability, but there are some compatibility issues with the native SGL LoRA solution. We are working on solving this by refactoring and generalizing SGL's LoRA capabilities. Fixed with Refactor LoRA handling to support adapter tensors in fused format #6585, Fix incorrect LoRA weight loading for fused gate_up_proj #6734, and Support LoRA in TestOpenAIVisionServer and fix fused kv_proj loading bug. #6861.
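To illustrate limitation 1, the two conventions differ only in which placeholder string marks the image position in the prompt. A minimal sketch (the `build_prompt` helper and the prompt text are hypothetical; only the `<|endoftext10|>` convention is assumed to be accepted by this integration):

```python
# Supported placeholder in this integration (assumption based on limitation 1).
SUPPORTED_IMAGE_TOKEN = "<|endoftext10|>"
# Alternative convention from the model card, not supported here.
UNSUPPORTED_IMAGE_TOKEN = "<|image1|>"


def build_prompt(question: str) -> str:
    """Prepend the supported image placeholder to a user question."""
    return f"{SUPPORTED_IMAGE_TOKEN}\n{question}"


prompt = build_prompt("What is shown in this image?")
print(prompt)
```

The default chat template handles this substitution automatically; constructing the placeholder by hand is only needed when bypassing the template.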

Motivation

Supporting the Phi4 Multimodal model (https://huggingface.co/microsoft/Phi-4-multimodal-instruct) in SGL.
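Once supported, serving the model would follow SGLang's standard server launch; a sketch of the expected invocation (flags may vary by SGLang version, so treat this as an assumption rather than a confirmed command):

```shell
# Hypothetical launch sketch using SGLang's standard launch_server CLI.
python -m sglang.launch_server \
  --model-path microsoft/Phi-4-multimodal-instruct \
  --trust-remote-code
```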

Execution Plan:

