Bug/Feature Request: Flash Attention causes CPU fallback for Qwen3.5 vision projector (f32 ops) in recent builds

### Name and Version

Latest `master` branch (recent pull as of early 8:30AM PST 1, April 2026 for the new turboquant rotation feature).

(Note: The issue is NOT present in older build: 8263 (59db9a357) with GNU 13.3.0 for Linux x86_64)

### Operating systems

Linux

### GGML backends

CUDA

### Hardware

12th Gen Intel® Core™ i5-1235U 
NVIDIA GeForce RTX 5060 Ti

### Models

Qwen3.5-9b

### Problem description & steps to reproduce

When running a Qwen3.5-9b with Flash Attention enabled (built with `-DGGML_CUDA_FA_ALL_QUANTS=ON` and runtime FA enabled), the vision projector operations (`FLASH_ATTN_EXT: type = f32`) are unsupported by the CUDA FA backend.

In recent builds with the unified graph and stricter routing, this unsupported `f32` operation triggers a complete fallback to the CPU for image processing. This causes a massive performance regression for Vision-Language tasks (e.g., image slice encoding jumps to ~10-13 seconds on CPU).

**Steps to reproduce:**
1. Build `llama.cpp` with CUDA and `-DGGML_CUDA_FA_ALL_QUANTS=ON`. See below for full build command
2. Start `llama-server` loading both a Qwen3.5 text model and its `mmproj` vision projector.
3. Submit an image for processing.
4. Observe the warmup warnings about unsupported operators and the massive spike in image encoding time.

**Feature/Fix Request:**
In older builds (e.g., `59db9a357`), the backend gracefully handled this without punting the entire operation to the CPU (allocating only ~25MB of CPU compute buffer instead of ~111MB), processing the image on the GPU in <1 second. 

Could we either:
1. Add `f32` support to CUDA Flash Attention for these specific tensor shapes `[72 16 8464 1]`.
2. Reinstate the graceful fallback to standard CUDA matrix math for the vision projector when FA is globally enabled, rather than falling completely back to the CPU.
3. Add a command-line flag to explicitly disable Flash Attention *only* for the `mmproj`/CLIP graph while keeping it enabled for text generation.

Llama.cpp build command:
```
docker build -t llama-server:cuda13.1-custom \
  --build-arg UBUNTU_VERSION=24.04 \
  --build-arg CUDA_VERSION=13.1.0 \
  --build-arg CUDA_DOCKER_ARCH=120a-real \
  --build-arg CMAKE_ARGS="-DGGML_CUDA_FA_ALL_QUANTS=ON" \
  --target server \
  -f .devops/cuda-new.Dockerfile .
```

### First Bad Commit

I have not done a full git bisect, but I can confirm the last known GOOD commit I used where this fallback did not happen is `59db9a357` (Build 8263). The issue appears on recent master pulls (late March / early April).

### Relevant log output
Startup log in latest build:
```
<details>
<summary>Logs</summary>
```console
warmup: flash attention is enabled
warmup: *****************************************************************
warmup: WARNING: the CLIP graph uses unsupported operators by the backend
warmup:          the performance will be suboptimal                      
warmup:          list of unsupported ops (backend=CUDA0):
warmup:   FLASH_ATTN_EXT: type = f32, ne = [72 16 8464 1]
warmup:   FLASH_ATTN_EXT: type = f32, ne = [72 16 8464 1]
warmup:   FLASH_ATTN_EXT: type = f32, ne = [72 16 8464 1]
... [Repeated for all layers] ...
warmup: flash attention is enabled
warmup: please report this on github as an issue
warmup: ref: [https://github.com/ggml-org/llama.cpp/pull/16837#issuecomment-3461676118](https://github.com/ggml-org/llama.cpp/pull/16837#issuecomment-3461676118)
warmup: *****************************************************************

srv  process_chun: processing image...
encoding image slice...
image slice encoded in 13495 ms
```

### Performance Regression Metrics (Vision Processing)

Below is a direct comparison of the exact same prompt and image being processed on the same hardware (RTX 5060 Ti). 

**[BAD] Recent Master Build (FA enabled - forces f32 mmproj to CPU)**
Because the strict routing throws the `f32` image processing to the CPU, prompt evaluation speed tanks by over 90%, making real-time vision tasks impossible.
* **Image Encoding Time:** 13.4 seconds
* **Prompt Eval Speed:** 83.12 tokens per second

```console
srv  process_chun: processing image...
encoding image slice...
image slice encoded in 13495 ms
...
prompt eval time =   13894.91 ms /  1155 tokens (   12.03 ms per token,    83.12 tokens per second)
       eval time =    1557.47 ms /    99 tokens (   15.73 ms per token,    63.56 tokens per second)
```

## Old build Performance:
* **Image Encoding Time:** 0.5 seconds
* **Prompt Eval Speed:** 1138.10 tokens per second
```
srv  process_chun: processing image...
encoding image slice...
image slice encoded in 504 ms
...
prompt eval time =     988.49 ms /  1125 tokens (    0.88 ms per token,  1138.10 tokens per second)
       eval time =    1316.86 ms /    86 tokens (   15.31 ms per token,    65.31 tokens per second)
```

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bug/Feature Request: Flash Attention causes CPU fallback for Qwen3.5 vision projector (f32 ops) in recent builds #21272

Name and Version

Operating systems

GGML backends

Hardware

Models

Problem description & steps to reproduce

First Bad Commit

Relevant log output

Performance Regression Metrics (Vision Processing)

Old build Performance:

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Bug/Feature Request: Flash Attention causes CPU fallback for Qwen3.5 vision projector (f32 ops) in recent builds #21272

Description

Name and Version

Operating systems

GGML backends

Hardware

Models

Problem description & steps to reproduce

First Bad Commit

Relevant log output

Performance Regression Metrics (Vision Processing)

Old build Performance:

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions