Conversation
@ngxson Feel free to use this PR as a starting point for enabling FA generally in the CLIP. I think the main thing that we are missing is backend support for any unusual head sizes that might occur with vision models. I added Metal support for HS=72 as an example.
Thanks for the initial work. I actually planned to work on flash attn this week/next week, so this will help me a lot. Btw, how do we know if we can use flash attn for a given cgraph? I don't quite remember how llama.cpp checks whether the current model can use flash attn or not.
Maybe you are thinking about the logic in llama.cpp/src/llama-context.cpp, lines 291 to 331 at 41ebbfd. I think for CLIP it can safely always be enabled.
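As a rough illustration of that kind of check (an assumed sketch, not the referenced llama-context.cpp code): one way to probe a backend is to build a dummy GGML_OP_FLASH_ATTN_EXT node with the head size of interest and ask `ggml_backend_supports_op` about it. Shapes, types and the helper name below are illustrative.

```cpp
#include <cmath>

#include "ggml.h"
#include "ggml-backend.h"

// Hedged sketch: probe a backend for flash-attention support by creating a
// minimal GGML_OP_FLASH_ATTN_EXT node (metadata only, no allocation) and
// querying ggml_backend_supports_op. Not the actual llama.cpp logic.
static bool backend_supports_fa(ggml_backend_t backend, int64_t head_size) {
    ggml_init_params params = {
        /*.mem_size   =*/ 16*ggml_tensor_overhead(),
        /*.mem_buffer =*/ nullptr,
        /*.no_alloc   =*/ true, // we only need tensor metadata
    };
    ggml_context * ctx = ggml_init(params);

    // minimal Q/K/V with the head size of interest
    ggml_tensor * q = ggml_new_tensor_4d(ctx, GGML_TYPE_F32, head_size, 1, 1, 1);
    ggml_tensor * k = ggml_new_tensor_4d(ctx, GGML_TYPE_F16, head_size, 1, 1, 1);
    ggml_tensor * v = ggml_new_tensor_4d(ctx, GGML_TYPE_F16, head_size, 1, 1, 1);

    ggml_tensor * fa = ggml_flash_attn_ext(ctx, q, k, v, /*mask*/ nullptr,
                                           1.0f/std::sqrt((float) head_size), 0.0f, 0.0f);

    const bool ok = ggml_backend_supports_op(backend, fa);
    ggml_free(ctx);
    return ok;
}
```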
Hmm so for example, in case I completely replace the clip's
Yes. CPU FA is always supported. It could be a problem because we might not notice that the CPU fallback is being triggered in some cases. But still, it seems like the better default IMO.
In that case, I think we still need the logic to check whether the GPU backend supports flash attn or not. I agree that flash attn should be the default; I think I can safely reuse the same
We can also print a big warning each time the CLIP scheduler runs with more than 1 graph split. This way we will immediately spot cases where the implementation uses an unsupported operator. Demonstrated in a4b54f2. Sample output from
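A minimal sketch of what such a check could look like (assuming the CLIP context holds the `ggml_backend_sched_t` used for the graph, and using the `LOG_WRN` macro as elsewhere in clip.cpp; this is illustrative, not necessarily the code from a4b54f2):

```cpp
// Hedged sketch: after scheduling the CLIP graph, warn if the scheduler had to
// split it across backends, i.e. at least one op fell back to another backend.
static void clip_warn_on_splits(ggml_backend_sched_t sched) {
    const int n_splits = ggml_backend_sched_get_n_splits(sched);
    if (n_splits > 1) {
        LOG_WRN("%s: CLIP graph was split into %d parts - at least one op is not "
                "supported by the main backend and runs on a fallback backend\n",
                __func__, n_splits);
    }
}
```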
tools/mtmd/clip.cpp
| LOG_WRN("%s: flash attention not supported, memory usage will increase\n", __func__); | ||
| // TODO: maybe log more details about why flash attention is not supported |
@ggerganov I implemented a simple solution to auto-enable flash attn only when the backend supports it. Probably we should make this LOG_WRN more prominent. Also, what kind of info do you think should be displayed here?
Some users are potentially already using models with shapes not supported by GPU flash attn. Falling back to CPU would suddenly make it very slow, which is not a good UX overall. The auto mode + a prominent warning is a better solution, as it also encourages users to "voluntarily" report certain info back to us - less forcefully for them.
> Also, what kind of info do you think should be displayed here?
We can print the actual tensor (shape, strides, types) for which FA is not supported.
```cpp
for (int i = 0; i < ggml_graph_n_nodes(gf); i++) {
    ggml_tensor * node = ggml_graph_node(gf, i);
    if (node->op == GGML_OP_FLASH_ATTN_EXT) {
        if (!ggml_backend_supports_op(ctx_clip.backend, node)) {
            return false;
        }
    }
}
```
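For context, a sketch of how the result of such a check could drive the auto mode. Only CLIP_FLASH_ATTN_TYPE_ENABLED appears verbatim later in this thread; the AUTO/DISABLED values and the helper name are assumed here for illustration.

```cpp
// Hedged sketch: resolve the "auto" flash-attn setting once a graph is available.
// clip_graph_supports_fa() stands for the loop above wrapped into a helper;
// the enum values other than CLIP_FLASH_ATTN_TYPE_ENABLED are assumed.
if (ctx_clip.flash_attn_type == CLIP_FLASH_ATTN_TYPE_AUTO) {
    ctx_clip.flash_attn_type = clip_graph_supports_fa(ctx_clip, gf)
        ? CLIP_FLASH_ATTN_TYPE_ENABLED
        : CLIP_FLASH_ATTN_TYPE_DISABLED;
    if (ctx_clip.flash_attn_type == CLIP_FLASH_ATTN_TYPE_DISABLED) {
        LOG_WRN("%s: flash attention not supported, memory usage will increase\n", __func__);
    }
}
```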
I tested this by temporarily modifying the code to always return false in ggml_metal_device_supports_op; it works well for now, but I'm not sure if there are any edge cases.
Currently, mtmd only supports 2 backends at the same time: CPU and one GPU backend.
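For readers unfamiliar with that setup, a rough sketch of the two-backend arrangement (assumed and simplified, not the actual clip.cpp initialization):

```cpp
// Hedged sketch: one GPU backend (if available) plus the CPU backend.
// A ggml_backend_sched over these two then assigns each graph op to the
// first backend that supports it, falling back to CPU for the rest.
ggml_backend_t backend     = ggml_backend_init_by_type(GGML_BACKEND_DEVICE_TYPE_GPU, nullptr);
ggml_backend_t backend_cpu = ggml_backend_init_by_type(GGML_BACKEND_DEVICE_TYPE_CPU, nullptr);
if (backend == nullptr) {
    backend = backend_cpu; // no GPU backend available -> everything runs on CPU
}
```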
I extended the warmup logic to print all ops that are not supported by the accelerated backend of the CLIP context. For example, we are now informed that the Metal backend does not support the UPSCALE op for Qwen3 VL:
```cpp
if (!unsupported_ops.empty()) {
    LOG_WRN("%s: *****************************************************************\n", __func__);
    LOG_WRN("%s: WARNING: the CLIP graph uses unsupported operators by the backend\n", __func__);
    LOG_WRN("%s: the performance will be suboptimal \n", __func__);
    LOG_WRN("%s: list of unsupported ops (backend=%s):\n", __func__, ggml_backend_name(ctx_clip.backend));
    for (const auto & op : unsupported_ops) {
        LOG_WRN("%s: %16s: type = %s, ne = [%d %d %d %d]\n", __func__,
                ggml_op_name(op.op->op),
                ggml_type_name(op.op->type),
                op.op->ne[0], op.op->ne[1], op.op->ne[2], op.op->ne[3]);
    }
    LOG_WRN("%s: flash attention is %s\n", __func__,
            (ctx_clip.flash_attn_type == CLIP_FLASH_ATTN_TYPE_ENABLED) ? "enabled" : "disabled");
    LOG_WRN("%s: please report this on github as an issue\n", __func__);
    LOG_WRN("%s: ref: https://github.com/ggml-org/llama.cpp/pull/16837#issuecomment-3461676118\n", __func__);
    LOG_WRN("%s: *****************************************************************\n", __func__);
}
```
I improved these messages to make them more prominent while giving the user instructions on what to do. Lmk if this looks good to you.
```cpp
static auto print_shape = [](const char * fn, const char * name, ggml_tensor * t) {
    LOG_WRN("%s: %s: type = %s, ne = [%d %d %d %d], nb = [%d %d %d %d]\n", fn,
            name, ggml_type_name(t->type),
            t->ne[0], t->ne[1], t->ne[2], t->ne[3],
            t->nb[0], t->nb[1], t->nb[2], t->nb[3]);
};
print_shape(__func__, " dst", op);
print_shape(__func__, "src0", op->src[0]);
print_shape(__func__, "src1", op->src[1]);
print_shape(__func__, "src2", op->src[2]);
```
Re. what to print when flash attn is not supported: I'm printing the tensor shapes, types, and strides.
This should be good - usually the head size (i.e. src[0]->ne[0]) is the thing that is not supported.
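To illustrate why the head size matters, here is a rough, assumed sketch of how a backend's supports_op check typically gates GGML_OP_FLASH_ATTN_EXT on the query head size. The list of sizes and the structure are illustrative, not the actual Metal or CUDA code.

```cpp
#include <cstdint>

#include "ggml.h"

// Hedged sketch: flash-attention kernels are compiled for a fixed set of head
// sizes, so a backend's supports_op check usually reduces to a lookup like this.
static bool supports_flash_attn_ext(const ggml_tensor * op) {
    const int64_t head_size = op->src[0]->ne[0]; // query head size
    switch (head_size) {
        case 64: case 72: case 80: case 96: case 112: case 128: case 256:
            return true;   // a dedicated kernel exists for this head size
        default:
            return false;  // unsupported -> the op is scheduled on the CPU backend
    }
}
```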
For Voxtral Mini:
Hello, on Jetson Orin with CUDA enabled, model: LFM2 Audio 1.5B. Reporting because I spotted warnings in the debug output of llama-server: Dec 30 14:27:26 xyz llama-server[7841]: [33405] warmup: flash attention is enabled. Thank you so much to everybody.
Hello, I got this message in the LM Studio log: Thank you.
ref #13231 (comment)
Sample implementation for using FA in the CLIP. Reduces memory usage and improves performance.
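As a rough illustration of the core idea (not the code in tools/mtmd/clip.cpp; tensor layouts, names and parameters below are assumed), the vision encoder's explicit softmax(Q·K^T)·V chain is replaced by a single ggml_flash_attn_ext node:

```cpp
// Hedged sketch: build the CLIP attention block with ggml_flash_attn_ext.
// Assumes Q, K, V were built as [d_head, n_head, n_tokens] F32 tensors;
// the real implementation differs in details (layouts, casts, etc.).
static ggml_tensor * build_attn_fa(
        ggml_context * ctx0,
        ggml_tensor  * Q,
        ggml_tensor  * K,
        ggml_tensor  * V,
        int64_t        d_head,
        int64_t        n_head,
        int64_t        n_tokens,
        float          kq_scale) {
    // FA expects [d_head, n_tokens, n_head, ...] for Q and the same
    // (non-transposed) layout for K and V
    Q = ggml_permute(ctx0, Q, 0, 2, 1, 3);
    K = ggml_permute(ctx0, K, 0, 2, 1, 3);
    V = ggml_permute(ctx0, V, 0, 2, 1, 3);

    // vision encoder: full bidirectional attention -> no mask, no ALiBi, no softcap
    ggml_tensor * cur = ggml_flash_attn_ext(ctx0, Q, K, V, nullptr, kq_scale, 0.0f, 0.0f);

    // result is [d_head, n_head, n_tokens] -> flatten back to [n_embd, n_tokens]
    return ggml_reshape_2d(ctx0, cur, d_head*n_head, n_tokens);
}
```

The memory saving comes from never materializing the full [n_tokens, n_tokens, n_head] attention-score tensor, which is what the non-FA path allocates.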
Testing with Gemma 12B, using `llama-server` and 2 images:
TODO: