
model : add support for Phi4ForCausalLMV #20168

Merged
CISC merged 12 commits into ggml-org:master from dranger003:phi4-siglip
Mar 11, 2026

Conversation

@dranger003
Contributor

@dranger003 dranger003 commented Mar 6, 2026

Add support for microsoft/Phi-4-reasoning-vision-15B.
It reuses the existing Phi-3 text path for the decoder and adds an mmproj/mtmd path for the Phi SigLIP2 vision encoder.

I uploaded some converted weights for testing: https://huggingface.co/dranger003/Phi-4-reasoning-vision-15B-GGUF.
The model generates coherent text and proper image descriptions, including OCR (see below).
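Converted weights like these can typically be tried with llama.cpp's multimodal CLI. A minimal sketch, assuming the standard `llama-mtmd-cli` flags; the GGUF file names below are placeholders, not the actual file names in the linked repository:

```shell
# Placeholder file names: substitute the actual GGUF files from the repo.
./llama-mtmd-cli \
    -m phi-4-reasoning-vision-15b.gguf \
    --mmproj mmproj-phi-4-reasoning-vision.gguf \
    --image input.png \
    -p "Describe this image."
```

The `--mmproj` file carries the vision encoder and projector; the `-m` file is the Phi-3-style text decoder.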

DISCLAIMER: GPT-5.4 was used to help with this PR.

[Screenshots: sample image-description and OCR output]

EDIT: More information about this model is available here:
https://www.microsoft.com/en-us/research/blog/phi-4-reasoning-vision-and-the-lessons-of-training-a-multimodal-reasoning-model/

Several open-source multimodal language models have adapted their methodologies accordingly, e.g., Gemma 3 uses pan-and-scan and NVILA uses Dynamic S2. However, their trade-offs are difficult to understand across different datasets and hyperparameters. To this end, we conducted an ablation study of several techniques. We trained a smaller 5-billion-parameter Phi-4-based proxy model on a dataset of 10 million image-text pairs, primarily composed of computer-use and GUI grounding data. We compared with Dynamic S2, which resizes images to a rectangular resolution that minimizes distortion while admitting a tiling by 384×384 squares; Multi-crop, which splits the image into potentially overlapping 384×384 squares and concatenates their encoded features on the token dimension; Multi-crop with S2, which broadens the receptive field by cropping into 1536×1536 squares before applying S2; and Dynamic resolution using the NaFlex variant of SigLIP-2, a natively dynamic-resolution encoder with adjustable patch counts.
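The Dynamic S2 resize described above can be sketched roughly as follows. This is a minimal illustration, not Microsoft's implementation: the tile cap and the rounding heuristic are assumptions.

```python
TILE = 384  # S2 tile edge, per the description above

def s2_resize_target(width, height, max_tiles=16):
    """Pick a tile grid close to the image's aspect ratio, capped in size.

    Returns the resize target (a rectangle tiled exactly by 384x384
    squares) and the resulting tile count. max_tiles is an assumption.
    """
    cols = max(1, round(width / TILE))
    rows = max(1, round(height / TILE))
    # Shrink the larger side until the grid fits the tile budget.
    while cols * rows > max_tiles:
        if cols >= rows:
            cols -= 1
        else:
            rows -= 1
    return cols * TILE, rows * TILE, cols * rows
```

For a 1920×1080 frame this yields a 1920×1152 target tiled by 5×3 = 15 squares, so the image is slightly stretched vertically — illustrating the "minimal distortion" trade-off the ablation refers to.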

Our primary finding is that dynamic resolution vision encoders perform the best and especially well on high-resolution data. It is particularly interesting to compare dynamic resolution with 2048 vs 3600 maximum tokens: the latter roughly corresponds to native HD 720p resolution and enjoys a substantial boost on high-resolution benchmarks, particularly ScreenSpot-Pro. Reinforcing the high-resolution trend, we find that multi-crop with S2 outperforms standard multi-crop despite using fewer visual tokens (i.e., fewer crops overall). The dynamic resolution technique produces the most tokens on average; due to their tiling subroutine, S2-based methods are constrained by the original image resolution and often only use about half the maximum tokens. From these experiments we choose the SigLIP-2 Naflex variant as our vision encoder.
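The 3600-token figure above can be sanity-checked with a sketch of NaFlex-style dynamic resolution. Assuming a 16-pixel patch size (an assumption about SigLIP-2 NaFlex, not stated in the post), a 1280×720 image yields an 80×45 patch grid, exactly 3600 tokens, which is why that budget "roughly corresponds to native HD 720p":

```python
import math

PATCH = 16  # assumed SigLIP-2 NaFlex patch edge in pixels

def naflex_token_count(width, height, max_tokens):
    """Scale an image down (never up) so its patch grid fits the budget."""
    scale = min(1.0, math.sqrt(max_tokens * PATCH * PATCH / (width * height)))
    cols = max(1, int(width * scale) // PATCH)
    rows = max(1, int(height * scale) // PATCH)
    return cols * rows
```

Under these assumptions a 2048-token cap forces roughly a 0.75× downscale of the same 720p frame, which is the gap the high-resolution benchmarks are sensitive to.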

@dranger003 dranger003 requested review from CISC and ngxson as code owners March 6, 2026 15:58
@github-actions github-actions bot added examples python python script changes labels Mar 6, 2026
Contributor

@ngxson ngxson left a comment


I believe this can be simplified further. Phi-4 SigLIP is not much of an architectural breakthrough; it reuses a lot of what already exists in the codebase, so there is no need to add many new code paths for it.

Contributor

@ngxson ngxson left a comment


I'm approving this PR because this model is too trivial to support.

I can be a bit harsh here, but I want to make it clear: I don't recommend contributions where the author cannot properly respond to trivial questions (proof: #20168 (comment) and #20168 (comment)). This shows the author put too little effort into their own work.

Many other contributors are willing to spend time understanding the code, even when it's AI-generated, and we welcome that kind of contribution. What we don't encourage is the type of PR where more than half of the work ends up being done by the reviewers.

@dranger003
Contributor Author

> I'm approving this PR because this model is too trivial to support.
>
> I can be a bit harsh here, but I want to make it clear: I don't recommend contributions where the author cannot properly respond to trivial questions (proof: #20168 (comment) and #20168 (comment)). This shows the author put too little effort into their own work.
>
> Many other contributors are willing to spend time understanding the code, even when it's AI-generated, and we welcome that kind of contribution. What we don't encourage is the type of PR where more than half of the work ends up being done by the reviewers.

Thanks for your honesty @ngxson. I was genuinely trying to help and I understand this is likely wasting your time more than if I didn't contribute to the project using AI to help me.

@xyehya

xyehya commented Mar 11, 2026

Working now?

@dranger003
Contributor Author

> Working now?

Yes, still working here.

@CISC CISC merged commit fdb1764 into ggml-org:master Mar 11, 2026
81 of 82 checks passed
ProgenyAlpha pushed a commit to ProgenyAlpha/llama.cpp that referenced this pull request Mar 12, 2026
* Add support for Phi4ForCausalLMV.

* Fix Phi-4 vision parity (correcting SigLIP2 patch-kernel export layout) and matching HF NaFlex resize behavior in mtmd.

* Rename constants + fix tokenizer label

* Clean-ups.

* Fix GGUF export.

* Set tokenizer.ggml.pre explicitly.

* Default vocab name rather than forcing it.

* Clean-ups.

* Fix indent.

* Fix subscriptable error.

* Remove overcomplicated code path

* Clean-ups.

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
tekintian added a commit to tekintian/llama.cpp that referenced this pull request Mar 12, 2026
* 'master' of github.com:ggml-org/llama.cpp: (33 commits)
  convert : better mtp check and fix return [no ci] (ggml-org#20419)
  vulkan: fix SSM_CONV PP scaling with large ubatch sizes (ggml-org#20379)
  New conversations now auto-select the first loaded model (ggml-org#20403)
  ggml-virtgpu: Fix some build commands (ggml-org#20341)
  metal : avoid divisions in bin kernel (ggml-org#20426)
  ci: Setup self-hosted CI for Intel Linux Vulkan backend (ggml-org#20154)
  vulkan: fix l2_norm epsilon handling (ggml-org#20350)
  vulkan: fix OOB check in flash_attn_mask_opt (ggml-org#20296)
  vulkan: Fix ErrorOutOfHostMemory on Intel GPU when loading large models with --no-mmap (ggml-org#20059)
  opencl: use larger workgroup size for get_rows (ggml-org#20316)
  opencl: add cumsum op (ggml-org#18981)
  hip: compile debug builds with -O2 on hip to avoid a compiler bug (ggml-org#20392)
  common/parser: add GigaChatV3/3.1 models support (ggml-org#19931)
  model : add support for Phi4ForCausalLMV (ggml-org#20168)
  graph : add optional scale parameter to build_lora_mm [no ci] (ggml-org#20427)
  common : fix --n-cpu-moe, --cpu-moe for models with fused gate + up (ggml-org#20416)
  ggml-webgpu: Add supports for `GGML_OP_REPEAT` (ggml-org#20230)
  llama : enable chunked fused GDN path (ggml-org#20340)
  llama : whitespace cleanup (ggml-org#20422)
  ggml : add NVFP4 quantization type support (ggml-org#19769)
  ...