Name and Version
Latest master branch (recent pull as of early 8:30AM PST 1, April 2026 for the new turboquant rotation feature).
(Note: The issue is NOT present in older build: 8263 (59db9a3) with GNU 13.3.0 for Linux x86_64)
Operating systems
Linux
GGML backends
CUDA
Hardware
12th Gen Intel® Core™ i5-1235U
NVIDIA GeForce RTX 5060 Ti
Models
Qwen3.5-9b
Problem description & steps to reproduce
When running a Qwen3.5-9b with Flash Attention enabled (built with -DGGML_CUDA_FA_ALL_QUANTS=ON and runtime FA enabled), the vision projector operations (FLASH_ATTN_EXT: type = f32) are unsupported by the CUDA FA backend.
In recent builds with the unified graph and stricter routing, this unsupported f32 operation triggers a complete fallback to the CPU for image processing. This causes a massive performance regression for Vision-Language tasks (e.g., image slice encoding jumps to ~10-13 seconds on CPU).
Steps to reproduce:
- Build
llama.cpp with CUDA and -DGGML_CUDA_FA_ALL_QUANTS=ON. See below for full build command
- Start
llama-server loading both a Qwen3.5 text model and its mmproj vision projector.
- Submit an image for processing.
- Observe the warmup warnings about unsupported operators and the massive spike in image encoding time.
Feature/Fix Request:
In older builds (e.g., 59db9a357), the backend gracefully handled this without punting the entire operation to the CPU (allocating only ~25MB of CPU compute buffer instead of ~111MB), processing the image on the GPU in <1 second.
Could we either:
- Add
f32 support to CUDA Flash Attention for these specific tensor shapes [72 16 8464 1].
- Reinstate the graceful fallback to standard CUDA matrix math for the vision projector when FA is globally enabled, rather than falling completely back to the CPU.
- Add a command-line flag to explicitly disable Flash Attention only for the
mmproj/CLIP graph while keeping it enabled for text generation.
Llama.cpp build command:
docker build -t llama-server:cuda13.1-custom \
--build-arg UBUNTU_VERSION=24.04 \
--build-arg CUDA_VERSION=13.1.0 \
--build-arg CUDA_DOCKER_ARCH=120a-real \
--build-arg CMAKE_ARGS="-DGGML_CUDA_FA_ALL_QUANTS=ON" \
--target server \
-f .devops/cuda-new.Dockerfile .
First Bad Commit
I have not done a full git bisect, but I can confirm the last known GOOD commit I used where this fallback did not happen is 59db9a357 (Build 8263). The issue appears on recent master pulls (late March / early April).
Relevant log output
Startup log in latest build:
<details>
<summary>Logs</summary>
```console
warmup: flash attention is enabled
warmup: *****************************************************************
warmup: WARNING: the CLIP graph uses unsupported operators by the backend
warmup: the performance will be suboptimal
warmup: list of unsupported ops (backend=CUDA0):
warmup: FLASH_ATTN_EXT: type = f32, ne = [72 16 8464 1]
warmup: FLASH_ATTN_EXT: type = f32, ne = [72 16 8464 1]
warmup: FLASH_ATTN_EXT: type = f32, ne = [72 16 8464 1]
... [Repeated for all layers] ...
warmup: flash attention is enabled
warmup: please report this on github as an issue
warmup: ref: [https://github.com/ggml-org/llama.cpp/pull/16837#issuecomment-3461676118](https://github.com/ggml-org/llama.cpp/pull/16837#issuecomment-3461676118)
warmup: *****************************************************************
srv process_chun: processing image...
encoding image slice...
image slice encoded in 13495 ms
Performance Regression Metrics (Vision Processing)
Below is a direct comparison of the exact same prompt and image being processed on the same hardware (RTX 5060 Ti).
[BAD] Recent Master Build (FA enabled - forces f32 mmproj to CPU)
Because the strict routing throws the f32 image processing to the CPU, prompt evaluation speed tanks by over 90%, making real-time vision tasks impossible.
- Image Encoding Time: 13.4 seconds
- Prompt Eval Speed: 83.12 tokens per second
srv process_chun: processing image...
encoding image slice...
image slice encoded in 13495 ms
...
prompt eval time = 13894.91 ms / 1155 tokens ( 12.03 ms per token, 83.12 tokens per second)
eval time = 1557.47 ms / 99 tokens ( 15.73 ms per token, 63.56 tokens per second)
Old build Performance:
- Image Encoding Time: 0.5 seconds
- Prompt Eval Speed: 1138.10 tokens per second
srv process_chun: processing image...
encoding image slice...
image slice encoded in 504 ms
...
prompt eval time = 988.49 ms / 1125 tokens ( 0.88 ms per token, 1138.10 tokens per second)
eval time = 1316.86 ms / 86 tokens ( 15.31 ms per token, 65.31 tokens per second)
Name and Version
Latest
masterbranch (recent pull as of early 8:30AM PST 1, April 2026 for the new turboquant rotation feature).(Note: The issue is NOT present in older build: 8263 (59db9a3) with GNU 13.3.0 for Linux x86_64)
Operating systems
Linux
GGML backends
CUDA
Hardware
12th Gen Intel® Core™ i5-1235U
NVIDIA GeForce RTX 5060 Ti
Models
Qwen3.5-9b
Problem description & steps to reproduce
When running a Qwen3.5-9b with Flash Attention enabled (built with
-DGGML_CUDA_FA_ALL_QUANTS=ONand runtime FA enabled), the vision projector operations (FLASH_ATTN_EXT: type = f32) are unsupported by the CUDA FA backend.In recent builds with the unified graph and stricter routing, this unsupported
f32operation triggers a complete fallback to the CPU for image processing. This causes a massive performance regression for Vision-Language tasks (e.g., image slice encoding jumps to ~10-13 seconds on CPU).Steps to reproduce:
llama.cppwith CUDA and-DGGML_CUDA_FA_ALL_QUANTS=ON. See below for full build commandllama-serverloading both a Qwen3.5 text model and itsmmprojvision projector.Feature/Fix Request:
In older builds (e.g.,
59db9a357), the backend gracefully handled this without punting the entire operation to the CPU (allocating only ~25MB of CPU compute buffer instead of ~111MB), processing the image on the GPU in <1 second.Could we either:
f32support to CUDA Flash Attention for these specific tensor shapes[72 16 8464 1].mmproj/CLIP graph while keeping it enabled for text generation.Llama.cpp build command:
First Bad Commit
I have not done a full git bisect, but I can confirm the last known GOOD commit I used where this fallback did not happen is
59db9a357(Build 8263). The issue appears on recent master pulls (late March / early April).Relevant log output
Startup log in latest build:
Performance Regression Metrics (Vision Processing)
Below is a direct comparison of the exact same prompt and image being processed on the same hardware (RTX 5060 Ti).
[BAD] Recent Master Build (FA enabled - forces f32 mmproj to CPU)
Because the strict routing throws the
f32image processing to the CPU, prompt evaluation speed tanks by over 90%, making real-time vision tasks impossible.Old build Performance: