
clip : use FA #16837

Merged
ngxson merged 9 commits into master from gg/clip-fa on Nov 2, 2025

Conversation

@ggerganov
Member

ref #13231 (comment)

Sample implementation for using FA in the CLIP. Reduces memory usage and improves performance.
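
Conceptually, the change swaps the explicit mul_mat + soft_max + mul_mat attention block in clip.cpp for the fused GGML_OP_FLASH_ATTN_EXT op. A rough sketch (the exact permutes, f16 casts and masking in the actual diff are omitted here):

// before (simplified): explicit attention over the permuted [d_head, n_pos, n_head, 1] views
//   ggml_tensor * kq  = ggml_mul_mat(ctx0, k, q);
//   kq                = ggml_soft_max_ext(ctx0, kq, nullptr, kq_scale, 0.0f);
//   ggml_tensor * kqv = ggml_mul_mat(ctx0, v, kq);

// after (simplified): one fused op, which also avoids materializing the n_pos x n_pos KQ matrix
ggml_tensor * cur = ggml_flash_attn_ext(ctx0, q, k, v, nullptr, kq_scale, 0.0f, 0.0f);
ggml_flash_attn_ext_set_prec(cur, GGML_PREC_F32);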

Testing with Gemma 12B, using llama-server and 2 images:

# before
alloc_compute_meta:      Metal compute buffer size =  1132.00 MiB
alloc_compute_meta:        CPU compute buffer size =     9.19 MiB

srv  process_chun: image processed in 1653 ms
srv  process_chun: image processed in 1093 ms

# after
alloc_compute_meta:      Metal compute buffer size =   121.25 MiB
alloc_compute_meta:        CPU compute buffer size =     9.19 MiB

srv  process_chun: image processed in 1386 ms
srv  process_chun: image processed in 810 ms

TODO:

  • Add FA sizes to other backends (e.g. Gemma uses a non-standard head size of 72)

@ggerganov
Member Author

ggerganov commented Oct 29, 2025

@ngxson Feel free to use this PR as a starting point for enabling FA generally in the CLIP. I think the main thing that we are missing is backend support for any unusual head sizes that might occur with vision models. I added Metal support for HS=72 as an example.

@github-actions github-actions bot added the testing, examples, ggml, and Apple Metal labels Oct 29, 2025
@ngxson
Collaborator

ngxson commented Oct 29, 2025

Thanks for the initial work. I actually planned to work on flash attn this week/next week, so this will help me a lot.

Btw, how do we know if we can use flash attn for a given cgraph? I don't quite remember how llama.cpp checks whether the current model can use flash attn or not.

@ggerganov
Member Author

Btw, how do we know if we can use flash attn for a given cgraph?

Maybe you are thinking about the logic in libllama for checking when to enable FA?

    // resolve automatic Flash Attention use
    if (params.flash_attn_type == LLAMA_FLASH_ATTN_TYPE_AUTO) {
        auto * gf = graph_reserve(1, n_seqs, n_outputs, mctx.get(), true);
        if (!gf) {
            throw std::runtime_error("failed to split graph for Flash Attention check");
        }

        const size_t prefix_len = strlen(LLAMA_TENSOR_NAME_FATTN) + 1;

        bool fa_device_mismatch = false;
        for (int i = 0; i < ggml_graph_n_nodes(gf); i++) {
            ggml_tensor * n = ggml_graph_node(gf, i);
            if (n->op != GGML_OP_FLASH_ATTN_EXT) {
                continue;
            }
            ggml_backend_dev_t device_fa = ggml_backend_get_device(
                ggml_backend_sched_get_tensor_backend(sched.get(), n));

            // TODO: instead of the tensor names, use a map to keep track of which (FA) tensors belong to which layer
            GGML_ASSERT(strncmp(n->name, LLAMA_TENSOR_NAME_FATTN "-", prefix_len) == 0);
            const int il = std::stoi(n->name + prefix_len);

            ggml_backend_dev_t device_kv = model.dev_layer(il);
            if (device_fa != device_kv) {
                LLAMA_LOG_WARN("%s: layer %d is assigned to device %s but the Flash Attention tensor "
                               "is assigned to device %s (usually due to missing support)\n",
                               __func__, il, ggml_backend_dev_name(device_kv), ggml_backend_dev_name(device_fa));
                // FIXME: fa_device_mismatch logic is wrong for --no-kv-offload, but this is broken anyways
                fa_device_mismatch = true;
                break;
            }
        }

        if (fa_device_mismatch) {
            cparams.flash_attn = false;
            LLAMA_LOG_WARN("%s: Flash Attention was auto, set to disabled\n", __func__);
            if (ggml_is_quantized(params.type_v)) {
                throw std::runtime_error("quantized V cache was requested, but this requires Flash Attention");
            }
        } else {
            cparams.flash_attn = true;
            LLAMA_LOG_INFO("%s: Flash Attention was auto, set to enabled\n", __func__);
        }
    }

I think for CLIP it can safely be always enabled.

@ngxson
Collaborator

ngxson commented Oct 29, 2025

Hmm, so for example, in case I completely replace the clip's build_attn with flash attn, if a certain head size is not supported by the backend, it will fall back to CPU, right?

@ggerganov
Member Author

Yes. CPU FA is always supported. It could be a problem because we might not notice that the CPU fallback is being triggered in some cases. But still, it seems like the better default IMO.

@ngxson
Collaborator

ngxson commented Oct 29, 2025

In that case, I think we still need the logic to check whether the GPU backend supports flash attn or not.

I agree that flash attn should be the default. I think I can safely reuse the same llama_flash_attn_type enum (with LLAMA_FLASH_ATTN_TYPE_AUTO as the default value) and also reuse the detection logic from llama.cpp.
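
Roughly, the detection I have in mind on the clip side would look like this (just a sketch; the helper name is made up and I assume a single accelerated backend per clip context):

// build the warmup graph once, then enable FA only if the accelerated backend
// supports every flash-attention node; otherwise the scheduler would silently
// fall back to the CPU backend for those nodes
static bool clip_can_use_flash_attn(ggml_backend_t backend, ggml_cgraph * gf) {
    for (int i = 0; i < ggml_graph_n_nodes(gf); i++) {
        ggml_tensor * node = ggml_graph_node(gf, i);
        if (node->op == GGML_OP_FLASH_ATTN_EXT && !ggml_backend_supports_op(backend, node)) {
            return false;
        }
    }
    return true;
}

With the AUTO default, the result of this check would decide whether the FA variant of the graph gets built.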

@ggerganov
Member Author

We can also print a big warning each time the CLIP scheduler runs with more than 1 graph split. This way we will immediately spot cases where the implementation uses an unsupported operator.

Demonstrated in a4b54f2
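
The check itself is small - roughly something like this (illustrative sketch, assuming the clip context's ggml_backend_sched handle and the LOG_WRN macro that clip.cpp already uses):

// after the graph has been scheduled: more than one split means at least one op
// could not run on the accelerated backend and was routed to another backend (usually CPU)
static void clip_warn_on_graph_splits(ggml_backend_sched_t sched) {
    const int n_splits = ggml_backend_sched_get_n_splits(sched);
    if (n_splits > 1) {
        LOG_WRN("%s: the CLIP graph was split into %d parts - some ops are not supported by the backend\n",
                __func__, n_splits);
    }
}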

Sample output from llama-server:

0.02.977.454 I main: server is listening on http://127.0.0.1:8013 - starting the main loop
0.02.977.454 I srv  update_slots: all slots are idle
0.10.164.226 I srv  params_from_: Chat format: Content-only
0.10.164.254 I slot get_availabl: id  0 | task -1 | selected slot by LRU, t_last = -1
0.10.164.280 I slot launch_slot_: id  0 | task 0 | processing task
0.10.164.285 I slot update_slots: id  0 | task 0 | new prompt, n_ctx_slot = 131072, n_keep = 0, n_prompt_tokens = 289
0.10.164.372 I slot update_slots: id  0 | task 0 | n_past = 0, memory_seq_rm [0, end)
0.10.164.377 I slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 27, n_tokens = 27, progress = 0.093426
0.10.168.064 I slot update_slots: id  0 | task 0 | n_past = 27, memory_seq_rm [27, end)
0.10.168.067 I srv  process_chun: processing image...
encoding image slice...
clip_image_batch_encode: *****************************************************************
clip_image_batch_encode: WARNING: the CLIP graph uses unsupported operators by the backend
clip_image_batch_encode:          the performance will be suboptimal                      
clip_image_batch_encode:                                                                  
clip_image_batch_encode: ref: https://github.com/ggml-org/llama.cpp/pull/16837#issuecomment-3461676118
clip_image_batch_encode: *****************************************************************
image slice encoded in 13058 ms
decoding image batch 1/1, n_tokens_batch = 256
image decoded (batch 1/1) in 4 ms
0.23.229.992 I srv  process_chun: image processed in 13062 ms
0.23.230.214 I slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 289, n_tokens = 6, progress = 1.000000
0.23.230.240 I slot update_slots: id  0 | task 0 | prompt done, n_past = 289, n_tokens = 6

Comment on lines +3213 to +3214
LOG_WRN("%s: flash attention not supported, memory usage will increase\n", __func__);
// TODO: maybe log more details about why flash attention is not supported
Collaborator

@ggerganov I implemented a simple solution to auto-enable flash attn only when the backend supports it. Probably we should make this LOG_WRN more prominent. Also, what kind of info do you think should be displayed here?

Some users are potentially already using models with shapes not supported by GPU flash attn. Falling back to CPU would suddenly make it very slow, which is not good UX overall. The auto mode + a prominent warning is a better solution, as it also encourages users to "voluntarily" report certain info back to us - less forceful for them.

Member Author

Also, what kind of info do you think should be displayed here?

We can print the actual tensor (shape, strides, types) for which FA is not supported.

Comment on lines +3268 to +3275
    for (int i = 0; i < ggml_graph_n_nodes(gf); i++) {
        ggml_tensor * node = ggml_graph_node(gf, i);
        if (node->op == GGML_OP_FLASH_ATTN_EXT) {
            if (!ggml_backend_supports_op(ctx_clip.backend, node)) {
                return false;
            }
        }
    }
Collaborator

I tested this by temporarily modifying the code to always return false in ggml_metal_device_supports_op. It works well for now, but I'm not sure if there are any edge cases.

Currently, mtmd only supports 2 backends at the same time: CPU and one GPU backend.

@github-actions github-actions bot added the server label Nov 1, 2025
@ggerganov ggerganov marked this pull request as ready for review November 2, 2025 09:17
@ggerganov ggerganov requested a review from slaren as a code owner November 2, 2025 09:17
@ggerganov
Member Author

I extended the warmup logic to print all ops that are not supported by the accelerated backend of the CLIP context. For example, we are now informed that the Metal backend does not support the UPSCALE op for Qwen3 VL:

alloc_compute_meta: warmup with image size = 512 x 512
alloc_compute_meta:      Metal compute buffer size =    38.02 MiB
alloc_compute_meta:        CPU compute buffer size =    16.02 MiB
alloc_compute_meta: graph splits = 3, nodes = 766
warmup: flash attention is enabled
warmup: op          UPSCALE is not supported by the CLIP backend: type = f32, ne = [32 32 1024 1]

Comment on lines +3262 to +3278
    if (!unsupported_ops.empty()) {
        LOG_WRN("%s: *****************************************************************\n", __func__);
        LOG_WRN("%s: WARNING: the CLIP graph uses unsupported operators by the backend\n", __func__);
        LOG_WRN("%s:          the performance will be suboptimal \n", __func__);
        LOG_WRN("%s:          list of unsupported ops (backend=%s):\n", __func__, ggml_backend_name(ctx_clip.backend));
        for (const auto & op : unsupported_ops) {
            LOG_WRN("%s:          %16s: type = %s, ne = [%d %d %d %d]\n", __func__,
                    ggml_op_name(op.op->op),
                    ggml_type_name(op.op->type),
                    op.op->ne[0], op.op->ne[1], op.op->ne[2], op.op->ne[3]);
        }
        LOG_WRN("%s: flash attention is %s\n", __func__,
                (ctx_clip.flash_attn_type == CLIP_FLASH_ATTN_TYPE_ENABLED) ? "enabled" : "disabled");
        LOG_WRN("%s: please report this on github as an issue\n", __func__);
        LOG_WRN("%s: ref: https://github.com/ggml-org/llama.cpp/pull/16837#issuecomment-3461676118\n", __func__);
        LOG_WRN("%s: *****************************************************************\n", __func__);
    }
Collaborator

I improved these messages to make them more prominent while giving the user instructions on what to do. Lmk if this looks good to you.

Comment on lines +3229 to +3238
    static auto print_shape = [](const char * fn, const char * name, ggml_tensor * t) {
        LOG_WRN("%s: %s: type = %s, ne = [%d %d %d %d], nb = [%d %d %d %d]\n", fn,
                name, ggml_type_name(t->type),
                t->ne[0], t->ne[1], t->ne[2], t->ne[3],
                t->nb[0], t->nb[1], t->nb[2], t->nb[3]);
    };

    print_shape(__func__, " dst", op);
    print_shape(__func__, "src0", op->src[0]);
    print_shape(__func__, "src1", op->src[1]);
    print_shape(__func__, "src2", op->src[2]);
Collaborator

Re. what to print when flash attn is not supported: I'm printing tensor shapes, types, and strides.

Member Author

This should be good - usually the head size (i.e. src[0]->ne[0]) is the thing that is not supported.
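
For context, a backend's supports_op handler for FLASH_ATTN_EXT typically gates on exactly that value - roughly like this (illustrative only, not the actual Metal/CUDA code; the set of supported sizes differs per backend):

// only head sizes with dedicated kernels are accepted; anything else makes
// ggml_backend_supports_op() return false and the op falls back to the CPU backend
static bool fattn_head_size_supported(const ggml_tensor * op) {
    switch (op->src[0]->ne[0]) {   // head size of Q
        case 64: case 80: case 96: case 112: case 128: case 256:
        case 72:                   // non-standard size used by e.g. Gemma's vision encoder
            return true;
        default:
            return false;
    }
}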

@ggerganov ggerganov requested a review from ngxson November 2, 2025 16:24
@ngxson ngxson merged commit 2f966b8 into master Nov 2, 2025
66 of 72 checks passed
gabe-l-hart added a commit to gabe-l-hart/llama.cpp that referenced this pull request Nov 3, 2025
* origin/master: (169 commits)
opencl: support imrope (ggml-org#16914)
fix: Viewing multiple PDF attachments (ggml-org#16974)
model-conversion : pass config to from_pretrained (ggml-org#16963)
server : add props.model_alias (ggml-org#16943)
ggml: CUDA: add head size 72 for flash-attn (ggml-org#16962)
mtmd: add --image-min/max-tokens (ggml-org#16921)
mtmd: pad mask for qwen2.5vl (ggml-org#16954)
ggml : LoongArch fixes (ggml-org#16958)
sync: minja (glm 4.6 & minmax m2 templates) (ggml-org#16949)
SYCL: optimized repeat_back kernel (3× fewer asm instructions, 2× faster)Feature/sycl repeat back opt (ggml-org#16869)
feat(webui): improve LaTeX rendering with currency detection (ggml-org#16508)
test-backend-ops : fix segfault in moe-expert-reduce test in support mode and coverage (ggml-org#16936)
ci : disable failing riscv cross build (ggml-org#16952)
model: add Janus Pro for image understanding (ggml-org#16906)
clip : use FA (ggml-org#16837)
server : support unified cache across slots (ggml-org#16736)
common : move gpt-oss reasoning processing to init params (ggml-org#16937)
docs: remove llama_sampler_accept reference in sampling sample usage (ggml-org#16920)
CUDA: add FLOOR, CEIL, ROUND, TRUNC unary ops (ggml-org#16917)
devops: fix failing s390x docker build (ggml-org#16918)
...
@adhusch

adhusch commented Dec 12, 2025

For Voxtral Mini:

warmup: ***************************************************************** 
warmup: WARNING: the CLIP graph uses unsupported operators by the backend
warmup:          the performance will be suboptimal
warmup:          list of unsupported ops (backend=CUDA0):
warmup:          POOL_1D: type = f32, ne = [750 1280 1 1]
warmup: flash attention is enabled
warmup: please report this on github as an issue
warmup: ref: https://github.com/ggml-org/llama.cpp/pull/16837#issuecomment-3461676118
warmup: *****************************************************************

SamuelOliveirads pushed a commit to SamuelOliveirads/llama.cpp that referenced this pull request Dec 29, 2025
* model : Granite docling + Idefics3 preprocessing (SmolVLM) (ggml-org#16206)

* feat: Add granite-docling conversion using trillion pretokenizer

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Add granite-docling vocab pre enum

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Use granite-docling pre

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Add clip_is_idefics3

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Allow multi-token boundary sequences for image templating

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Add tiling support for idefices3 in clip.cpp

This should likely be moved into llava_uhd::get_slice_instructions, but for
now this avoids disrupting the logic there.

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Partial support for full templating for idefics3 in mtmd

There are still errors encoding some of the image chunks, but the token
sequence now matches transformers _almost_ perfectly, except for the double
newline before the global image which shows up as two consecutive newline
tokens instead of a single double-newline token. I think this is happening
because the blocks are tokenized separately then concatenated.

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Fully working image preprocessing for idefics3 w/ resize and slicing

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Parse the preprocessor config's longest side and add it to the mmproj hparams

Branch: GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Use the longest side instead of size * scale_factor

For Granite Docling, these come out to the same value, but that was just a
conicidence.

Branch: GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Allow batch encoding and remove clip_is_idefics3

Branch: GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor: Remove unnecessary conditionals for empty token vectors

Branch: GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor: Use image_manipulation util

Branch: GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* add test model

---------

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
# Conflicts:
#	convert_hf_to_gguf.py
#	convert_hf_to_gguf_update.py
#	gguf-py/gguf/constants.py
#	gguf-py/gguf/gguf_writer.py
#	src/llama-vocab.cpp
#	src/llama-vocab.h

* mtmd : support home-cooked Mistral Small Omni (ggml-org#14928)

* model : add LightOnOCR-1B model (ggml-org#16764)

* model : add LightOnOCR-1B model

* add test
# Conflicts:
#	convert_hf_to_gguf.py
#	gguf-py/gguf/constants.py

* mtmd : fix idefics3 preprocessing (ggml-org#16806)

* mtmd : fix idefics3 preprocessing

* disable granite test

* fix test for granite

* model: Add support for CogVLM model (ggml-org#15002)

* Added GGUF mappings for CogVLM model

* Add tensor mapping for CogVLM visual encoder

* Add CogVLM to conversion script, no vision part yet

* Added CogVLM vision model to conversion script

* Add graph for CogVLM CLIP model

* Add graph for CogVLM

* Fixes for CogVLM. Now compiles.

* Model now runs

* Fixes for cogvlm graph

* Account for graph context change after rebase

* Changes for whitespace

* Changes in convert script according to comments

* Switch CogVLM LLM graph to merged QKV tensor

* Use rope_type variable instead of direct definition

* Change CogVLM CLIP encoder to use SWIGLU

* Switch CogVLM CLIP to use merged QKV

* Apply rebase edits and remove ggml_cont call that is now unnecessary

* clean up

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
# Conflicts:
#	convert_hf_to_gguf.py
#	examples/mtmd/clip.cpp
#	gguf-py/gguf/constants.py
#	gguf-py/gguf/tensor_mapping.py
#	src/llama-arch.cpp
#	src/llama-arch.h
#	src/llama-model.cpp
#	src/llama-model.h

* mtmd: refactor preprocessing + support max/min pixels (ggml-org#16878)

* mtmd: refactor preprocessing + support max/min pixels

* fix mlp type

* implement mix/max pixels

* improve hparams

* better image preproc for qwen

* fix

* fix out of bound composite

* fix (2)

* fix token calculation

* get_merge_kernel_size()

* fix llama4 and lfm2

* gonna fix them all

* use simple resize for qwen

* qwen: increase min tokens

* no resize if dst size == src size

* restore to initial min/max tokens value for qwen
# Conflicts:
#	examples/mtmd/clip.cpp

* clip : use FA (ggml-org#16837)

* clip : use FA

* cont : add warning about unsupported ops

* implement "auto" mode for clip flash attn

* clip : print more detailed op support info during warmup

* cont : remove obsolete comment [no ci]

* improve debugging message

* trailing space

* metal : remove stray return

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>

* model: add Janus Pro for image understanding (ggml-org#16906)

* Add support for Janus Pro

* Update gguf-py/gguf/tensor_mapping.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update gguf-py/gguf/tensor_mapping.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Address reviewer suggestions

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Add JANUS_PRO constant

* Update clip model handling

Co-authored-by: Xuan-Son Nguyen <son@huggingface.co>

* Update tools/mtmd/clip.cpp

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>

* Refactor JANUS_PRO handling in clip.cpp

Co-authored-by: Xuan-Son Nguyen <son@huggingface.co>

* Update tools/mtmd/clip.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* em whitespace

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Xuan-Son Nguyen <son@huggingface.co>
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
# Conflicts:
#	convert_hf_to_gguf.py
#	gguf-py/gguf/constants.py
#	gguf-py/gguf/tensor_mapping.py

* mtmd: pad mask for qwen2.5vl (ggml-org#16954)

* mtmd: pad mask for qwen2.5vl

* improve

* mtmd: add --image-min/max-tokens (ggml-org#16921)

* mtmd: improve struct initialization (ggml-org#16981)

* mtmd: allow QwenVL to process larger image by default (ggml-org#17020)

* Disable flash attention

* mtmd : fix embedding size for image input (ggml-org#17123)

* mtmd: fix patch_size initialized to random value in audio models (ggml-org#17128)

* mtmd: fix patch_size initialized to random value in audio models

* add default hparams

* add llama_model_n_embd_inp

* Fix load qwen3 vl

Change batch size

* Add description

* Fix cli build error

---------

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
Co-authored-by: Tianyue-Zhao <zhaotianyue@outlook.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Zhiyong Wang <85110830+ravenouse@users.noreply.github.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
Co-authored-by: firecoperana <firecoperana>
@elfarolab

elfarolab commented Dec 30, 2025

@ggerganov @ngxson

Hello,
Sorry for the interruption, I know you are very busy.

On Jetson Orin, CUDA enabled.

model: LFM2 Audio 1.5B.

Reporting because I spotted warnings in the debug output of llama-server.
I hope it will be helpful; if needed, I can run any further tests or patches.


Dec 30 14:27:26 xyz llama-server[7841]: [33405] warmup: flash attention is enabled
Dec 30 14:27:26 xyz llama-server[7841]: [33405] warmup: *****************************************************************
Dec 30 14:27:26 xyz llama-server[7841]: [33405] warmup: WARNING: the CLIP graph uses unsupported operators by the backend
Dec 30 14:27:26 xyz llama-server[7841]: [33405] warmup: the performance will be suboptimal
Dec 30 14:27:26 xyz llama-server[7841]: [33405] warmup: list of unsupported ops (backend=CUDA0):
Dec 30 14:27:26 xyz llama-server[7841]: [33405] warmup: UNARY: type = f32, ne = [512 375 1 1]
Dec 30 14:27:26 xyz llama-server[7841]: [33405] warmup: UNARY: type = f32, ne = [512 375 1 1]
Dec 30 14:27:26 xyz llama-server[7841]: [33405] warmup: UNARY: type = f32, ne = [512 375 1 1]
Dec 30 14:27:26 xyz llama-server[7841]: [33405] warmup: UNARY: type = f32, ne = [512 375 1 1]
Dec 30 14:27:26 xyz llama-server[7841]: [33405] warmup: UNARY: type = f32, ne = [512 375 1 1]
Dec 30 14:27:26 xyz llama-server[7841]: [33405] warmup: UNARY: type = f32, ne = [512 375 1 1]
Dec 30 14:27:26 xyz llama-server[7841]: [33405] warmup: UNARY: type = f32, ne = [512 375 1 1]
Dec 30 14:27:26 xyz llama-server[7841]: [33405] warmup: UNARY: type = f32, ne = [512 375 1 1]
Dec 30 14:27:26 xyz llama-server[7841]: [33405] warmup: UNARY: type = f32, ne = [512 375 1 1]
Dec 30 14:27:26 xyz llama-server[7841]: [33405] warmup: UNARY: type = f32, ne = [512 375 1 1]
Dec 30 14:27:26 xyz llama-server[7841]: [33405] warmup: UNARY: type = f32, ne = [512 375 1 1]
Dec 30 14:27:26 xyz llama-server[7841]: [33405] warmup: UNARY: type = f32, ne = [512 375 1 1]
Dec 30 14:27:26 xyz llama-server[7841]: [33405] warmup: UNARY: type = f32, ne = [512 375 1 1]
Dec 30 14:27:26 xyz llama-server[7841]: [33405] warmup: UNARY: type = f32, ne = [512 375 1 1]
Dec 30 14:27:26 xyz llama-server[7841]: [33405] warmup: UNARY: type = f32, ne = [512 375 1 1]
Dec 30 14:27:26 xyz llama-server[7841]: [33405] warmup: UNARY: type = f32, ne = [512 375 1 1]
Dec 30 14:27:26 xyz llama-server[7841]: [33405] warmup: UNARY: type = f32, ne = [512 375 1 1]
Dec 30 14:27:26 xyz llama-server[7841]: [33405] warmup: flash attention is enabled
Dec 30 14:27:26 xyz llama-server[7841]: [33405] warmup: please report this on github as an issue
Dec 30 14:27:26 xyz llama-server[7841]: [33405] warmup: ref: #16837 (comment)
Dec 30 14:27:26 xyz llama-server[7841]: [33405] warmup: *****************************************************************

Thank you so much to everybody

Anico2 added a commit to Anico2/llama.cpp that referenced this pull request Jan 15, 2026
* clip : use FA

* cont : add warning about unsupported ops

* implement "auto" mode for clip flash attn

* clip : print more detailed op support info during warmup

* cont : remove obsolete comment [no ci]

* improve debugging message

* trailing space

* metal : remove stray return

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
@uzvisa-crypto

Hello, I got this message in the LM Studio log:

2026-01-23 21:36:14 [DEBUG]
 Metal : EMBED_LIBRARY = 1 | CPU : NEON = 1 | ARM_FMA = 1 | FP16_VA = 1 | DOTPROD = 1 | LLAMAFILE = 1 | ACCELERATE = 1 | REPACK = 1 |
2026-01-23 21:36:14 [DEBUG]
 llama_model_load_from_file_impl: using device Metal (Apple M1 Pro) (unknown id) - 12123 MiB free
2026-01-23 21:36:14 [DEBUG]
 llama_model_loader: loaded meta data with 32 key-value pairs and 523 tensors from /Users/ruslanag/.lmstudio/models/lmstudio-community/GLM-4.6V-Flash-GGUF/GLM-4.6V-Flash-Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = glm4
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                     general.sampling.top_k i32              = 2
llama_model_loader: - kv   3:                     general.sampling.top_p f32              = 0.600000
llama_model_loader: - kv   4:                      general.sampling.temp f32              = 0.800000
llama_model_loader: - kv   5:                               general.name str              = Zai org_GLM 4.6V Flash
llama_model_loader: - kv   6:                           general.finetune str              = 4.6V-Flash
llama_model_loader: - kv   7:                           general.basename str              = zai-org_GLM
llama_model_loader: - kv   8:                         general.size_label str              = 9.4B
llama_model_loader: - kv   9:                           glm4.block_count u32              = 40
llama_model_loader: - kv  10:                        glm4.context_length u32              = 131072
llama_model_loader: - kv  11:                      glm4.embedding_length u32              = 4096
llama_model_loader: - kv  12:                   glm4.feed_forward_length u32              = 13696
llama_model_loader: - kv  13:                  glm4.attention.head_count u32              = 32
llama_model_loader: - kv  14:               glm4.attention.head_count_kv u32              = 2
llama_model_loader: - kv  15:               glm4.rope.dimension_sections arr[i32,4]       = [8, 12, 12, 0]
llama_model_loader: - kv  16:                        glm4.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  17:      glm4.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  18:                  glm4.rope.dimension_count u32              = 64
llama_model_loader: - kv  19:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  20:                         tokenizer.ggml.pre str              = glm4
2026-01-23 21:36:14 [DEBUG]
 llama_model_loader: - kv  21:                      tokenizer.ggml.tokens arr[str,151552]  = ["!", "\"", "#", "$", "%", "&", "'", ...
2026-01-23 21:36:14 [DEBUG]
 llama_model_loader: - kv  22:                  tokenizer.ggml.token_type arr[i32,151552]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
2026-01-23 21:36:14 [DEBUG]
 llama_model_loader: - kv  23:                      tokenizer.ggml.merges arr[str,318088]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  24:                tokenizer.ggml.eos_token_id u32              = 151329
llama_model_loader: - kv  25:            tokenizer.ggml.padding_token_id u32              = 151329
llama_model_loader: - kv  26:                tokenizer.ggml.eot_token_id u32              = 151336
llama_model_loader: - kv  27:            tokenizer.ggml.unknown_token_id u32              = 151329
llama_model_loader: - kv  28:                tokenizer.ggml.bos_token_id u32              = 151329
llama_model_loader: - kv  29:                    tokenizer.chat_template str              = [gMASK]<sop>\n{%- if tools -%}\n<|syste...
llama_model_loader: - kv  30:               general.quantization_version u32              = 2
llama_model_loader: - kv  31:                          general.file_type u32              = 15
llama_model_loader: - type  f32:  281 tensors
llama_model_loader: - type q5_0:   20 tensors
llama_model_loader: - type q8_0:   20 tensors
llama_model_loader: - type q4_K:  181 tensors
llama_model_loader: - type q6_K:   21 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 5.73 GiB (5.24 BPW)
2026-01-23 21:36:14 [DEBUG]
 load: 0 unused tokens
2026-01-23 21:36:14 [DEBUG]
 load: special_eot_id is not in special_eog_ids - the tokenizer config may be incorrect
load: printing all EOG tokens:
load:   - 151329 ('<|endoftext|>')
load:   - 151336 ('<|user|>')
2026-01-23 21:36:14 [DEBUG]
 load: special tokens cache size = 36
2026-01-23 21:36:14 [DEBUG]
 load: token to piece cache size = 0.9713 MB
print_info: arch                  = glm4
print_info: vocab_only            = 0
print_info: no_alloc              = 0
print_info: n_ctx_train           = 131072
print_info: n_embd                = 4096
print_info: n_embd_inp            = 4096
print_info: n_layer               = 40
print_info: n_head                = 32
print_info: n_head_kv             = 2
print_info: n_rot                 = 64
print_info: n_swa                 = 0
print_info: is_swa_any            = 0
print_info: n_embd_head_k         = 128
print_info: n_embd_head_v         = 128
print_info: n_gqa                 = 16
print_info: n_embd_k_gqa          = 256
print_info: n_embd_v_gqa          = 256
print_info: f_norm_eps            = 0.0e+00
print_info: f_norm_rms_eps        = 1.0e-05
print_info: f_clamp_kqv           = 0.0e+00
print_info: f_max_alibi_bias      = 0.0e+00
print_info: f_logit_scale         = 0.0e+00
print_info: f_attn_scale          = 0.0e+00
print_info: n_ff                  = 13696
print_info: n_expert              = 0
print_info: n_expert_used         = 0
print_info: n_expert_groups       = 0
print_info: n_group_used          = 0
print_info: causal attn           = 1
print_info: pooling type          = 0
print_info: rope type             = 8
print_info: rope scaling          = linear
print_info: freq_base_train       = 500000.0
print_info: freq_scale_train      = 1
print_info: n_ctx_orig_yarn       = 131072
print_info: rope_yarn_log_mul     = 0.0000
print_info: rope_finetuned        = unknown
print_info: mrope sections        = [8, 12, 12, 0]
print_info: model type            = 9B
print_info: model params          = 9.40 B
print_info: general.name          = Zai org_GLM 4.6V Flash
print_info: vocab type            = BPE
print_info: n_vocab               = 151552
print_info: n_merges              = 318088
print_info: BOS token             = 151329 '<|endoftext|>'
2026-01-23 21:36:14 [DEBUG]
 print_info: EOS token             = 151329 '<|endoftext|>'
print_info: EOT token             = 151336 '<|user|>'
print_info: UNK token             = 151329 '<|endoftext|>'
print_info: PAD token             = 151329 '<|endoftext|>'
print_info: LF token              = 198 'Ċ'
print_info: FIM PRE token         = 151347 '<|code_prefix|>'
print_info: FIM SUF token         = 151349 '<|code_suffix|>'
print_info: FIM MID token         = 151348 '<|code_middle|>'
print_info: EOG token             = 151329 '<|endoftext|>'
print_info: EOG token             = 151336 '<|user|>'
print_info: max token length      = 1024
load_tensors: loading model tensors, this can take a while... (mmap = true, direct_io = false)
2026-01-23 21:36:17 [DEBUG]
 load_tensors: offloading output layer to GPU
load_tensors: offloading 39 repeating layers to GPU
load_tensors: offloaded 41/41 layers to GPU
load_tensors:   CPU_Mapped model buffer size =   333.00 MiB
load_tensors: Metal_Mapped model buffer size =  5872.00 MiB
2026-01-23 21:36:17 [DEBUG]
 common_init_result: added <|endoftext|> logit bias = -inf
common_init_result: added <|user|> logit bias = -inf
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 8192
llama_context: n_ctx_seq     = 8192
llama_context: n_batch       = 512
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = enabled
llama_context: kv_unified    = false
llama_context: freq_base     = 500000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_seq (8192) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
ggml_metal_init: allocating
2026-01-23 21:36:17 [DEBUG]
 ggml_metal_init: found device: Apple M1 Pro
ggml_metal_init: picking default device: Apple M1 Pro
ggml_metal_init: use fusion         = true
ggml_metal_init: use concurrency    = true
ggml_metal_init: use graph optimize = true
llama_context:        CPU  output buffer size =     0.58 MiB
2026-01-23 21:36:17 [DEBUG]
 llama_kv_cache:      Metal KV buffer size =   320.00 MiB
2026-01-23 21:36:17 [DEBUG]
 llama_kv_cache: size =  320.00 MiB (  8192 cells,  40 layers,  1/1 seqs), K (f16):  160.00 MiB, V (f16):  160.00 MiB
sched_reserve: reserving ...
2026-01-23 21:36:17 [DEBUG]
 sched_reserve:      Metal compute buffer size =   304.00 MiB
sched_reserve:        CPU compute buffer size =    24.02 MiB
sched_reserve: graph nodes  = 1487
sched_reserve: graph splits = 2
sched_reserve: reserve took 4.13 ms, sched copies = 1
2026-01-23 21:36:17 [DEBUG]
 common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
2026-01-23 21:36:17 [DEBUG]
 GgmlThreadpools: llama threadpool init = n_threads = 7
2026-01-23 21:36:17 [DEBUG]
 clip_model_loader: model name:   Zai org_GLM 4.6V Flash
clip_model_loader: description:  
clip_model_loader: GGUF version: 3
clip_model_loader: alignment:    32
clip_model_loader: n_tensors:    182
clip_model_loader: n_kv:         24

clip_model_loader: has vision encoder
2026-01-23 21:36:17 [DEBUG]
 ggml_metal_init: allocating
2026-01-23 21:36:17 [DEBUG]
 ggml_metal_init: found device: Apple M1 Pro
ggml_metal_init: picking default device: Apple M1 Pro
ggml_metal_init: use fusion         = true
ggml_metal_init: use concurrency    = true
ggml_metal_init: use graph optimize = true
clip_ctx: CLIP using Metal backend
load_hparams: projector:          glm4v
load_hparams: n_embd:             1536
load_hparams: n_head:             12
load_hparams: n_ff:               13696
load_hparams: n_layer:            24
load_hparams: ffn_op:             silu
load_hparams: projection_dim:     4096

--- vision hparams ---
load_hparams: image_size:         336
load_hparams: patch_size:         14
load_hparams: has_llava_proj:     0
load_hparams: minicpmv_version:   0
load_hparams: n_merge:            2
load_hparams: n_wa_pattern: 0
load_hparams: image_min_pixels:   6272
load_hparams: image_max_pixels:   3211264

load_hparams: model size:         1704.17 MiB
load_hparams: metadata size:      0.06 MiB
2026-01-23 21:36:19 [DEBUG]
 warmup: warmup with image size = 1288 x 1288
2026-01-23 21:36:19 [DEBUG]
 alloc_compute_meta:      Metal compute buffer size =   496.07 MiB
alloc_compute_meta:        CPU compute buffer size =    68.71 MiB
alloc_compute_meta: graph splits = 3, nodes = 632
warmup: flash attention is enabled
warmup: *****************************************************************
warmup: WARNING: the CLIP graph uses unsupported operators by the backend
warmup:          the performance will be suboptimal                      
warmup:          list of unsupported ops (backend=Metal):
warmup:          UPSCALE: type = f32, ne = [92 92 1536 1]
warmup: flash attention is enabled
warmup: please report this on github as an issue
warmup: ref: https://github.com/ggml-org/llama.cpp/pull/16837#issuecomment-3461676118
warmup: *****************************************************************

Thank you.

blime4 referenced this pull request in blime4/llama.cpp Feb 5, 2026
* clip : use FA

* cont : add warning about unsupported ops

* implement "auto" mode for clip flash attn

* clip : print more detailed op support info during warmup

* cont : remove obsolete comment [no ci]

* improve debugging message

* trailing space

* metal : remove stray return

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>

Labels

Apple Metal, examples, ggml, server, testing
