Skip to content

Bug/Feature Request: Flash Attention causes CPU fallback for Qwen3.5 vision projector (f32 ops) in recent builds #21272

@ampersandru

Description

@ampersandru

Name and Version

Latest master branch (recent pull as of early 8:30AM PST 1, April 2026 for the new turboquant rotation feature).

(Note: The issue is NOT present in older build: 8263 (59db9a3) with GNU 13.3.0 for Linux x86_64)

Operating systems

Linux

GGML backends

CUDA

Hardware

12th Gen Intel® Core™ i5-1235U
NVIDIA GeForce RTX 5060 Ti

Models

Qwen3.5-9b

Problem description & steps to reproduce

When running a Qwen3.5-9b with Flash Attention enabled (built with -DGGML_CUDA_FA_ALL_QUANTS=ON and runtime FA enabled), the vision projector operations (FLASH_ATTN_EXT: type = f32) are unsupported by the CUDA FA backend.

In recent builds with the unified graph and stricter routing, this unsupported f32 operation triggers a complete fallback to the CPU for image processing. This causes a massive performance regression for Vision-Language tasks (e.g., image slice encoding jumps to ~10-13 seconds on CPU).

Steps to reproduce:

  1. Build llama.cpp with CUDA and -DGGML_CUDA_FA_ALL_QUANTS=ON. See below for full build command
  2. Start llama-server loading both a Qwen3.5 text model and its mmproj vision projector.
  3. Submit an image for processing.
  4. Observe the warmup warnings about unsupported operators and the massive spike in image encoding time.

Feature/Fix Request:
In older builds (e.g., 59db9a357), the backend gracefully handled this without punting the entire operation to the CPU (allocating only ~25MB of CPU compute buffer instead of ~111MB), processing the image on the GPU in <1 second.

Could we either:

  1. Add f32 support to CUDA Flash Attention for these specific tensor shapes [72 16 8464 1].
  2. Reinstate the graceful fallback to standard CUDA matrix math for the vision projector when FA is globally enabled, rather than falling completely back to the CPU.
  3. Add a command-line flag to explicitly disable Flash Attention only for the mmproj/CLIP graph while keeping it enabled for text generation.

Llama.cpp build command:

docker build -t llama-server:cuda13.1-custom \
  --build-arg UBUNTU_VERSION=24.04 \
  --build-arg CUDA_VERSION=13.1.0 \
  --build-arg CUDA_DOCKER_ARCH=120a-real \
  --build-arg CMAKE_ARGS="-DGGML_CUDA_FA_ALL_QUANTS=ON" \
  --target server \
  -f .devops/cuda-new.Dockerfile .

First Bad Commit

I have not done a full git bisect, but I can confirm the last known GOOD commit I used where this fallback did not happen is 59db9a357 (Build 8263). The issue appears on recent master pulls (late March / early April).

Relevant log output

Startup log in latest build:

<details>
<summary>Logs</summary>
```console
warmup: flash attention is enabled
warmup: *****************************************************************
warmup: WARNING: the CLIP graph uses unsupported operators by the backend
warmup:          the performance will be suboptimal                      
warmup:          list of unsupported ops (backend=CUDA0):
warmup:   FLASH_ATTN_EXT: type = f32, ne = [72 16 8464 1]
warmup:   FLASH_ATTN_EXT: type = f32, ne = [72 16 8464 1]
warmup:   FLASH_ATTN_EXT: type = f32, ne = [72 16 8464 1]
... [Repeated for all layers] ...
warmup: flash attention is enabled
warmup: please report this on github as an issue
warmup: ref: [https://github.com/ggml-org/llama.cpp/pull/16837#issuecomment-3461676118](https://github.com/ggml-org/llama.cpp/pull/16837#issuecomment-3461676118)
warmup: *****************************************************************

srv  process_chun: processing image...
encoding image slice...
image slice encoded in 13495 ms

Performance Regression Metrics (Vision Processing)

Below is a direct comparison of the exact same prompt and image being processed on the same hardware (RTX 5060 Ti).

[BAD] Recent Master Build (FA enabled - forces f32 mmproj to CPU)
Because the strict routing throws the f32 image processing to the CPU, prompt evaluation speed tanks by over 90%, making real-time vision tasks impossible.

  • Image Encoding Time: 13.4 seconds
  • Prompt Eval Speed: 83.12 tokens per second
srv  process_chun: processing image...
encoding image slice...
image slice encoded in 13495 ms
...
prompt eval time =   13894.91 ms /  1155 tokens (   12.03 ms per token,    83.12 tokens per second)
       eval time =    1557.47 ms /    99 tokens (   15.73 ms per token,    63.56 tokens per second)

Old build Performance:

  • Image Encoding Time: 0.5 seconds
  • Prompt Eval Speed: 1138.10 tokens per second
srv  process_chun: processing image...
encoding image slice...
image slice encoded in 504 ms
...
prompt eval time =     988.49 ms /  1125 tokens (    0.88 ms per token,  1138.10 tokens per second)
       eval time =    1316.86 ms /    86 tokens (   15.31 ms per token,    65.31 tokens per second)

Metadata

Metadata

Assignees

No one assigned

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions