
model : add support for Phi4ForCausalLMV #20168

Merged
CISC merged 12 commits into ggml-org:master from dranger003:phi4-siglip
Mar 11, 2026

Conversation

@dranger003
Contributor

@dranger003 dranger003 commented Mar 6, 2026

Add support for microsoft/Phi-4-reasoning-vision-15B.
It reuses the existing Phi-3 text path for the decoder and adds an mmproj/mtmd path for the Phi SigLIP2 vision encoder.

I uploaded some converted weights for testing: https://huggingface.co/dranger003/Phi-4-reasoning-vision-15B-GGUF.
The model generates coherent text and proper image descriptions, including OCR (see below).
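Converted weights like these can typically be tried with llama.cpp's multimodal CLI. A minimal sketch, assuming the standard `llama-mtmd-cli` flags; the GGUF file names below are placeholders, not the actual file names in the linked repository:

```shell
# Placeholder file names: substitute the actual GGUF files from the repo.
./llama-mtmd-cli \
    -m phi-4-reasoning-vision-15b.gguf \
    --mmproj mmproj-phi-4-reasoning-vision.gguf \
    --image input.png \
    -p "Describe this image."
```

The `--mmproj` file carries the vision encoder and projector; the `-m` file is the Phi-3-style text decoder.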

DISCLAIMER: GPT-5.4 was used to help with this PR.

[Screenshots: sample image-description and OCR output]

EDIT: More information about this model is available here:
https://www.microsoft.com/en-us/research/blog/phi-4-reasoning-vision-and-the-lessons-of-training-a-multimodal-reasoning-model/

Several open-source multimodal language models have adapted their methodologies accordingly, e.g., Gemma 3 uses pan-and-scan and NVILA uses Dynamic S2. However, their trade-offs are difficult to understand across different datasets and hyperparameters. To this end, we conducted an ablation study of several techniques. We trained a smaller 5-billion-parameter Phi-4-based proxy model on a dataset of 10 million image-text pairs, primarily composed of computer-use and GUI grounding data. We compared with Dynamic S2, which resizes images to a rectangular resolution that minimizes distortion while admitting a tiling by 384×384 squares; Multi-crop, which splits the image into potentially overlapping 384×384 squares and concatenates their encoded features on the token dimension; Multi-crop with S2, which broadens the receptive field by cropping into 1536×1536 squares before applying S2; and Dynamic resolution using the NaFlex variant of SigLIP-2, a natively dynamic-resolution encoder with adjustable patch counts.
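The Dynamic S2 resize described above can be sketched roughly as follows. This is a minimal illustration, not Microsoft's implementation: the tile cap and the rounding heuristic are assumptions.

```python
TILE = 384  # S2 tile edge, per the description above

def s2_resize_target(width, height, max_tiles=16):
    """Pick a tile grid close to the image's aspect ratio, capped in size.

    Returns the resize target (a rectangle tiled exactly by 384x384
    squares) and the resulting tile count. max_tiles is an assumption.
    """
    cols = max(1, round(width / TILE))
    rows = max(1, round(height / TILE))
    # Shrink the larger side until the grid fits the tile budget.
    while cols * rows > max_tiles:
        if cols >= rows:
            cols -= 1
        else:
            rows -= 1
    return cols * TILE, rows * TILE, cols * rows
```

For a 1920×1080 frame this yields a 1920×1152 target tiled by 5×3 = 15 squares, so the image is slightly stretched vertically — illustrating the "minimal distortion" trade-off the ablation refers to.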

Our primary finding is that dynamic resolution vision encoders perform the best and especially well on high-resolution data. It is particularly interesting to compare dynamic resolution with 2048 vs 3600 maximum tokens: the latter roughly corresponds to native HD 720p resolution and enjoys a substantial boost on high-resolution benchmarks, particularly ScreenSpot-Pro. Reinforcing the high-resolution trend, we find that multi-crop with S2 outperforms standard multi-crop despite using fewer visual tokens (i.e., fewer crops overall). The dynamic resolution technique produces the most tokens on average; due to their tiling subroutine, S2-based methods are constrained by the original image resolution and often only use about half the maximum tokens. From these experiments we choose the SigLIP-2 Naflex variant as our vision encoder.
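The 3600-token figure above can be sanity-checked with a sketch of NaFlex-style dynamic resolution. Assuming a 16-pixel patch size (an assumption about SigLIP-2 NaFlex, not stated in the post), a 1280×720 image yields an 80×45 patch grid, exactly 3600 tokens, which is why that budget "roughly corresponds to native HD 720p":

```python
import math

PATCH = 16  # assumed SigLIP-2 NaFlex patch edge in pixels

def naflex_token_count(width, height, max_tokens):
    """Scale an image down (never up) so its patch grid fits the budget."""
    scale = min(1.0, math.sqrt(max_tokens * PATCH * PATCH / (width * height)))
    cols = max(1, int(width * scale) // PATCH)
    rows = max(1, int(height * scale) // PATCH)
    return cols * rows
```

Under these assumptions a 2048-token cap forces roughly a 0.75× downscale of the same 720p frame, which is the gap the high-resolution benchmarks are sensitive to.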

@dranger003 dranger003 requested review from CISC and ngxson as code owners March 6, 2026 15:58
@github-actions github-actions bot added examples python python script changes labels Mar 6, 2026
Contributor

@ngxson ngxson left a comment


I believe this can be simplified further. Phi-4 SigLIP is not much of an architectural breakthrough; it reuses a lot of what already exists in the codebase, so there is no need to add many new code paths for it.

Contributor

@ngxson ngxson left a comment


I'm approving this PR because this model is too trivial to support.

I can be a bit harsh here, but I want to make it clear: I don't recommend contributions where the author cannot properly respond to trivial questions (proof: #20168 (comment) and #20168 (comment)). This shows the author put too little effort into their own work.

Many other contributors are willing to spend time understanding the code, even when it's AI-generated, and we welcome that kind of contribution. What we don't encourage is the type of PR where more than half of the work ends up being done by the reviewers.

@dranger003
Contributor Author

> I'm approving this PR because this model is too trivial to support.
>
> I can be a bit harsh here, but I want to make it clear: I don't recommend contributions where the author cannot properly respond to trivial questions (proof: #20168 (comment) and #20168 (comment)). This shows the author put too little effort into their own work.
>
> Many other contributors are willing to spend time understanding the code, even when it's AI-generated, and we welcome that kind of contribution. What we don't encourage is the type of PR where more than half of the work ends up being done by the reviewers.

Thanks for your honesty @ngxson. I was genuinely trying to help and I understand this is likely wasting your time more than if I didn't contribute to the project using AI to help me.

@xyehya

xyehya commented Mar 11, 2026

Working now?

@dranger003
Contributor Author

> Working now?

Yes, still working here.

@CISC CISC merged commit fdb1764 into ggml-org:master Mar 11, 2026
81 of 82 checks passed
ProgenyAlpha pushed a commit to ProgenyAlpha/llama.cpp that referenced this pull request Mar 12, 2026
* Add support for Phi4ForCausalLMV.

* Fix Phi-4 vision parity (correcting SigLIP2 patch-kernel export layout) and matching HF NaFlex resize behavior in mtmd.

* Rename constants + fix tokenizer label

* Clean-ups.

* Fix GGUF export.

* Set tokenizer.ggml.pre explicitly.

* Default vocab name rather than forcing it.

* Clean-ups.

* Fix indent.

* Fix subscriptable error.

* Remove overcomplicated code path

* Clean-ups.

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
tekintian added a commit to tekintian/llama.cpp that referenced this pull request Mar 12, 2026
* 'master' of github.com:ggml-org/llama.cpp: (33 commits)
  convert : better mtp check and fix return [no ci] (ggml-org#20419)
  vulkan: fix SSM_CONV PP scaling with large ubatch sizes (ggml-org#20379)
  New conversations now auto-select the first loaded model (ggml-org#20403)
  ggml-virtgpu: Fix some build commands (ggml-org#20341)
  metal : avoid divisions in bin kernel (ggml-org#20426)
  ci: Setup self-hosted CI for Intel Linux Vulkan backend (ggml-org#20154)
  vulkan: fix l2_norm epsilon handling (ggml-org#20350)
  vulkan: fix OOB check in flash_attn_mask_opt (ggml-org#20296)
  vulkan: Fix ErrorOutOfHostMemory on Intel GPU when loading large models with --no-mmap (ggml-org#20059)
  opencl: use larger workgroup size for get_rows (ggml-org#20316)
  opencl: add cumsum op (ggml-org#18981)
  hip: compile debug builds with -O2 on hip to avoid a compiler bug (ggml-org#20392)
  common/parser: add GigaChatV3/3.1 models support (ggml-org#19931)
  model : add support for Phi4ForCausalLMV (ggml-org#20168)
  graph : add optional scale parameter to build_lora_mm [no ci] (ggml-org#20427)
  common : fix --n-cpu-moe, --cpu-moe for models with fused gate + up (ggml-org#20416)
  ggml-webgpu: Add supports for `GGML_OP_REPEAT` (ggml-org#20230)
  llama : enable chunked fused GDN path (ggml-org#20340)
  llama : whitespace cleanup (ggml-org#20422)
  ggml : add NVFP4 quantization type support (ggml-org#19769)
  ...