
ggml-cpu: Use tiled FA for prompt-processing #19012

Merged
am17an merged 6 commits into ggml-org:master from am17an:tile-fa-cpu
Jan 25, 2026

Conversation

am17an (Contributor) commented Jan 22, 2026

The FA performance is gimped on CPU for long contexts because it essentially uses a vector kernel. This PR adds a tiled FA kernel for prompt processing (PP). Perf tuning for tile sizes was done on an AMD EPYC single-socket 64-core machine. The code is kept fairly simple to leave room for incremental optimizations. According to perf, most of the time is spent in ggml_vec_dot_f16 and ggml_vec_mad_f32, and about 10% in ggml_lookup_f16_to_f32.

| Model | Test | t/s a14b960 | t/s tile-fa-cpu | Speedup |
|---|---|---|---|---|
| gpt-oss 20B MXFP4 MoE | pp512 | 237.25 | 244.39 | 1.03 |
| gpt-oss 20B MXFP4 MoE | pp512@d1024 | 205.31 | 224.61 | 1.09 |
| gpt-oss 20B MXFP4 MoE | pp512@d2048 | 171.80 | 209.63 | 1.22 |
| gpt-oss 20B MXFP4 MoE | pp512@d4096 | 134.60 | 185.10 | 1.38 |
| gpt-oss 20B MXFP4 MoE | pp512@d8192 | 59.69 | 143.43 | 2.40 |
| gpt-oss 20B MXFP4 MoE | pp512@d16384 | 25.12 | 97.29 | 3.87 |
| llama 8B Q4_K_M | pp512 | 201.83 | 199.66 | 0.99 |
| llama 8B Q4_K_M | pp512@d1024 | 160.35 | 175.60 | 1.10 |
| llama 8B Q4_K_M | pp512@d2048 | 134.40 | 158.26 | 1.18 |
| llama 8B Q4_K_M | pp512@d4096 | 57.76 | 130.85 | 2.27 |
| llama 8B Q4_K_M | pp512@d8192 | 28.01 | 95.79 | 3.42 |
| llama 8B Q4_K_M | pp512@d16384 | 14.24 | 62.50 | 4.39 |

TODO:

  • perf tuning on ARM
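The tiling scheme behind the PR can be sketched numerically: instead of streaming each query against the whole KV cache with a vector kernel, queries and KV are walked in tiles while keeping an online softmax (a running max and denominator per query row). The sketch below is a minimal pure-Python illustration of that numerics, not the actual ggml C++ kernel; the function names are invented for illustration.

```python
import math

def attention_ref(Q, K, V):
    # Reference: full softmax(Q K^T) V, one query row at a time.
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in K]
        m = max(scores)
        e = [math.exp(s - m) for s in scores]
        Z = sum(e)
        out.append([sum(e[j] * V[j][d] for j in range(len(V))) / Z
                    for d in range(len(V[0]))])
    return out

def attention_tiled(Q, K, V, tile_cols=2):
    # Tiled variant: visit KV in column tiles, rescaling the running
    # accumulator whenever a new tile raises the row maximum.
    out = []
    for q in Q:
        m = -math.inf            # running max of scores seen so far
        Z = 0.0                  # running softmax denominator
        acc = [0.0] * len(V[0])  # running weighted sum of V rows
        for j0 in range(0, len(K), tile_cols):
            Kt = K[j0:j0 + tile_cols]
            Vt = V[j0:j0 + tile_cols]
            scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in Kt]
            m_new = max(m, max(scores))
            scale = math.exp(m - m_new)  # rescale previous partial sums
            Z *= scale
            acc = [a * scale for a in acc]
            for s, v in zip(scores, Vt):
                w = math.exp(s - m_new)
                Z += w
                acc = [a + w * vd for a, vd in zip(acc, v)]
            m = m_new
        out.append([a / Z for a in acc])
    return out
```

Both functions produce the same result up to floating-point rounding; the tiled form is what makes cache-friendly blocking (and per-tile mask skipping) possible.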

@am17an am17an requested a review from ggerganov as a code owner January 22, 2026 09:10
@github-actions github-actions bot added the ggml changes relating to the ggml tensor library for machine learning label Jan 22, 2026
@am17an am17an force-pushed the tile-fa-cpu branch 2 times, most recently from 97afbcb to 41a0718 on January 22, 2026 09:14
am17an (Contributor, Author) commented Jan 23, 2026

Not sure why the server test case is crashing with backend sampling on.

ggerganov (Member) commented

I can reproduce locally with a CPU-only build. I think there is a bug in the FA kernel that produces NaN in some cases.

```diff
diff --git a/ggml/src/ggml-cpu/ops.cpp b/ggml/src/ggml-cpu/ops.cpp
index e4228f232..3f5f97945 100644
--- a/ggml/src/ggml-cpu/ops.cpp
+++ b/ggml/src/ggml-cpu/ops.cpp
@@ -8511,6 +8511,10 @@ static void ggml_compute_forward_flash_attn_ext_tiled(
 
             // permute(0, 2, 1, 3)
             memcpy((char *) dst->data + (i3*ne2*ne1 + i2 + i1*ne1)*nb1, VKQ32 + tq * DV, nb1);
+
+            for (int k = 0; k < DV; k++) {
+                GGML_ASSERT(!isnan(VKQ32[tq * DV + k]));
+            }
         }
 
         ir += tile_rows;
```

Server test that fails:

```shell
LLAMA_ARG_BACKEND_SAMPLING=1 LLAMA_LOG_VERBOSITY=10 LLAMA_SERVER_BIN_PATH=/path/to/build-cpu/bin/llama-server ./tests.sh -s unit/test_completion.py::test_completion_prompt_cache
```

Haven't traced the cause yet.

am17an (Contributor, Author) commented Jan 23, 2026

@ggerganov the failure was actually related to the optimization you mentioned. When a row is fully masked, we were still accumulating V, which leads to NaNs. I added a change to skip those rows and confirmed locally that the test passes.
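The failure mode is easy to reproduce in isolation: if every score in a row is masked to -inf, the online-softmax max is -inf, so `score - max` is `-inf - (-inf) = nan`, and the NaN propagates through the weights and the final division. A minimal illustration of the row-skip fix, in pure Python rather than the actual ggml code (the function name is invented for illustration):

```python
import math

def row_attention(scores, V):
    # One query row of masked attention. Without the guard, a fully
    # masked row (all scores == -inf) yields m == -inf, then
    # s - m == nan, so weights, Z, and the output all become NaN.
    m = max(scores)
    if m == -math.inf:
        # Fully masked row: skip the accumulation and emit zeros.
        return [0.0] * len(V[0])
    w = [math.exp(s - m) for s in scores]
    Z = sum(w)
    return [sum(w[j] * V[j][d] for j in range(len(V))) / Z
            for d in range(len(V[0]))]
```

The guard costs one comparison per row and removes the 0/0 (and nan-exponent) path entirely.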

@am17an am17an merged commit bcb4316 into ggml-org:master Jan 25, 2026
143 of 149 checks passed
@am17an am17an deleted the tile-fa-cpu branch January 25, 2026 15:26
shaofeiqi pushed a commit to qualcomm/llama.cpp that referenced this pull request Feb 6, 2026
* ggml-cpu: Use tiled FA for prompt-processing

the FA performance is gimped on CPU for long contexts because it essentially uses a vector kernel. This PR adds a tiled FA for PP. Perf tuning for tile sizes done on an AMD EPYC single-socket 64-core machine.

* fix out of bounds for mask

* skip rows where there are all masks

* skip tile if mask is inf

* store mask in worksize

* check inf tile earlier
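Several of the follow-up commits above ("skip tile if mask is inf", "check inf tile earlier") hinge on one cheap test: if every mask entry covering a tile is -inf, the whole tile contributes nothing and its QK^T, softmax, and V accumulation can be skipped. A hedged sketch, assuming the mask tile is available as a flat list of floats (the helper name is hypothetical, not from the ggml code):

```python
import math

def tile_fully_masked(mask_tile):
    # True when every entry is -inf, i.e. no KV position in this
    # tile is visible to any query row, so the tile can be skipped
    # before any dot products are computed.
    return all(m == -math.inf for m in mask_tile)
```

Doing this check before the tile's dot products (rather than after) is what "check inf tile earlier" buys: the skip happens before any FLOPs are spent.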
