Skip to content

mtmd: allow QwenVL to process larger image by default#17020

Merged
ngxson merged 1 commit intoggml-org:masterfrom
ngxson:xsn/qwen_allow_large_img_default
Nov 5, 2025
Merged

mtmd: allow QwenVL to process larger image by default#17020
ngxson merged 1 commit intoggml-org:masterfrom
ngxson:xsn/qwen_allow_large_img_default

Conversation

@ngxson
Copy link
Collaborator

@ngxson ngxson commented Nov 5, 2025

Since we have flash attn enabled by default now, QwenVL can use max=4096 tokens with less memory

Also removed some obsoleted comments

@ngxson ngxson requested a review from ggerganov November 5, 2025 10:26
@ngxson ngxson merged commit 92bb84f into ggml-org:master Nov 5, 2025
67 of 68 checks passed
gabe-l-hart added a commit to gabe-l-hart/llama.cpp that referenced this pull request Nov 5, 2025
* origin/master: (21 commits)
vulkan: Fix GGML_VULKAN_CHECK_RESULTS to better handle fusion (ggml-org#16919)
examples(gguf): GGUF example outputs (ggml-org#17025)
mtmd: allow QwenVL to process larger image by default (ggml-org#17020)
server : do not default to multiple slots with speculative decoding (ggml-org#17017)
mtmd: improve struct initialization (ggml-org#16981)
docs: Clarify the endpoint that webui uses (ggml-org#17001)
model : add openPangu-Embedded (ggml-org#16941)
ggml webgpu: minor set rows optimization (ggml-org#16810)
sync : ggml
ggml : fix conv2d_dw SVE path (ggml/1380)
CUDA: update ops.md (ggml-org#17005)
opencl: update doc (ggml-org#17011)
refactor: replace sprintf with snprintf for safer string handling in dump functions (ggml-org#16913)
vulkan: remove the need for the dryrun (ggml-org#16826)
server : do context shift only while generating (ggml-org#17000)
readme : update hot topics (ggml-org#17002)
ggml-cpu : bicubic interpolation (ggml-org#16891)
ci : apply model label to models (ggml-org#16994)
chore : fix models indent after refactor (ggml-org#16992)
Fix garbled output with REPACK at high thread counts (ggml-org#16956)
...
SamuelOliveirads pushed a commit to SamuelOliveirads/llama.cpp that referenced this pull request Dec 29, 2025
* model : Granite docling + Idefics3 preprocessing (SmolVLM) (ggml-org#16206)

* feat: Add granite-docling conversion using trillion pretokenizer

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Add granite-docling vocab pre enum

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Use granite-docling pre

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Add clip_is_idefics3

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Allow multi-token boundary sequences for image templating

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Add tiling support for idefices3 in clip.cpp

This should likely be moved into llava_uhd::get_slice_instructions, but for
now this avoids disrupting the logic there.

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Partial support for full templating for idefics3 in mtmd

There are still errors encoding some of the image chunks, but the token
sequence now matches transformers _almost_ perfectly, except for the double
newline before the global image which shows up as two consecutive newline
tokens instead of a single double-newline token. I think this is happening
because the blocks are tokenized separately then concatenated.

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Fully working image preprocessing for idefics3 w/ resize and slicing

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Parse the preprocessor config's longest side and add it to the mmproj hparams

Branch: GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Use the longest side instead of size * scale_factor

For Granite Docling, these come out to the same value, but that was just a
conicidence.

Branch: GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Allow batch encoding and remove clip_is_idefics3

Branch: GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor: Remove unnecessary conditionals for empty token vectors

Branch: GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor: Use image_manipulation util

Branch: GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* add test model

---------

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
# Conflicts:
#	convert_hf_to_gguf.py
#	convert_hf_to_gguf_update.py
#	gguf-py/gguf/constants.py
#	gguf-py/gguf/gguf_writer.py
#	src/llama-vocab.cpp
#	src/llama-vocab.h

* mtmd : support home-cooked Mistral Small Omni (ggml-org#14928)

* model : add LightOnOCR-1B model (ggml-org#16764)

* model : add LightOnOCR-1B model

* add test
# Conflicts:
#	convert_hf_to_gguf.py
#	gguf-py/gguf/constants.py

* mtmd : fix idefics3 preprocessing (ggml-org#16806)

* mtmd : fix idefics3 preprocessing

* disable granite test

* fix test for granite

* model: Add support for CogVLM model (ggml-org#15002)

* Added GGUF mappings for CogVLM model

* Add tensor mapping for CogVLM visual encoder

* Add CogVLM to conversion script, no vision part yet

* Added CogVLM vision model to conversion script

* Add graph for CogVLM CLIP model

* Add graph for CogVLM

* Fixes for CogVLM. Now compiles.

* Model now runs

* Fixes for cogvlm graph

* Account for graph context change after rebase

* Changes for whitespace

* Changes in convert script according to comments

* Switch CogVLM LLM graph to merged QKV tensor

* Use rope_type variable instead of direct definition

* Change CogVLM CLIP encoder to use SWIGLU

* Switch CogVLM CLIP to use merged QKV

* Apply rebase edits and remove ggml_cont call that is now unnecessary

* clean up

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
# Conflicts:
#	convert_hf_to_gguf.py
#	examples/mtmd/clip.cpp
#	gguf-py/gguf/constants.py
#	gguf-py/gguf/tensor_mapping.py
#	src/llama-arch.cpp
#	src/llama-arch.h
#	src/llama-model.cpp
#	src/llama-model.h

* mtmd: refactor preprocessing + support max/min pixels (ggml-org#16878)

* mtmd: refactor preprocessing + support max/min pixels

* fix mlp type

* implement mix/max pixels

* improve hparams

* better image preproc for qwen

* fix

* fix out of bound composite

* fix (2)

* fix token calculation

* get_merge_kernel_size()

* fix llama4 and lfm2

* gonna fix them all

* use simple resize for qwen

* qwen: increase min tokens

* no resize if dst size == src size

* restore to initial min/max tokens value for qwen
# Conflicts:
#	examples/mtmd/clip.cpp

* clip : use FA (ggml-org#16837)

* clip : use FA

* cont : add warning about unsupported ops

* implement "auto" mode for clip flash attn

* clip : print more detailed op support info during warmup

* cont : remove obsolete comment [no ci]

* improve debugging message

* trailing space

* metal : remove stray return

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>

* model: add Janus Pro for image understanding (ggml-org#16906)

* Add support for Janus Pro

* Update gguf-py/gguf/tensor_mapping.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update gguf-py/gguf/tensor_mapping.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Address reviewer suggestions

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Add JANUS_PRO constant

* Update clip model handling

Co-authored-by: Xuan-Son Nguyen <son@huggingface.co>

* Update tools/mtmd/clip.cpp

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>

* Refactor JANUS_PRO handling in clip.cpp

Co-authored-by: Xuan-Son Nguyen <son@huggingface.co>

* Update tools/mtmd/clip.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* em whitespace

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Xuan-Son Nguyen <son@huggingface.co>
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
# Conflicts:
#	convert_hf_to_gguf.py
#	gguf-py/gguf/constants.py
#	gguf-py/gguf/tensor_mapping.py

* mtmd: pad mask for qwen2.5vl (ggml-org#16954)

* mtmd: pad mask for qwen2.5vl

* improve

* mtmd: add --image-min/max-tokens (ggml-org#16921)

* mtmd: improve struct initialization (ggml-org#16981)

* mtmd: allow QwenVL to process larger image by default (ggml-org#17020)

* Disable flash attention

* mtmd : fix embedding size for image input (ggml-org#17123)

* mtmd: fix patch_size initialized to random value in audio models (ggml-org#17128)

* mtmd: fix patch_size initialized to random value in audio models

* add default hparams

* add llama_model_n_embd_inp

* Fix load qwen3 vl

Change batch size

* Add description

* Fix cli build error

---------

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
Co-authored-by: Tianyue-Zhao <zhaotianyue@outlook.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Zhiyong Wang <85110830+ravenouse@users.noreply.github.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
Co-authored-by: firecoperana <firecoperana>
Anico2 added a commit to Anico2/llama.cpp that referenced this pull request Jan 15, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants