ggml-cpu: Use tiled FA for prompt-processing #19012
Merged
am17an merged 6 commits into ggml-org:master, Jan 25, 2026
FA performance on CPU degrades on long contexts because the existing path essentially uses a vector kernel. This PR adds a tiled FA kernel for prompt processing (PP). Tile sizes were tuned on a single-socket, 64-core AMD EPYC machine.
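For readers unfamiliar with the idea, here is a minimal sketch of what "tiled" means in this context, assuming a single head, row-major F32 buffers and no mask. The function name `attn_tiled` and the parameters `tile_q` and `tile_kv` are hypothetical and are not taken from the PR; the actual kernel in `ops.cpp` works on F16 data and ggml tensors. Each K/V tile is visited once per query tile and reused across all of its rows, with a streaming softmax keeping a running maximum and sum per row.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical illustration, not the ggml kernel.
// Q is [n_q x d], K and V are [n_kv x d], all row-major. Returns [n_q x d].
std::vector<float> attn_tiled(const std::vector<float> & Q,
                              const std::vector<float> & K,
                              const std::vector<float> & V,
                              std::size_t n_q, std::size_t n_kv, std::size_t d,
                              std::size_t tile_q, std::size_t tile_kv) {
    assert(tile_q > 0 && tile_kv > 0 && d > 0);
    const float scale   = 1.0f / std::sqrt((float) d);
    const float neg_inf = -std::numeric_limits<float>::infinity();

    std::vector<float> out(n_q * d, 0.0f); // doubles as the unnormalized accumulator
    std::vector<float> s(tile_kv);         // scores of one query row against one K tile

    for (std::size_t q0 = 0; q0 < n_q; q0 += tile_q) {
        const std::size_t q1 = std::min(q0 + tile_q, n_q);
        std::vector<float> M(q1 - q0, neg_inf); // running row maximum
        std::vector<float> S(q1 - q0, 0.0f);    // running row sum of exp(score - M)

        for (std::size_t k0 = 0; k0 < n_kv; k0 += tile_kv) {
            const std::size_t k1 = std::min(k0 + tile_kv, n_kv);

            // The same K/V tile is reused for every query row of the tile.
            for (std::size_t iq = q0; iq < q1; ++iq) {
                float * acc  = &out[iq * d];
                float  m_new = M[iq - q0];

                for (std::size_t ik = k0; ik < k1; ++ik) {
                    float dot = 0.0f;
                    for (std::size_t j = 0; j < d; ++j) {
                        dot += Q[iq * d + j] * K[ik * d + j];
                    }
                    s[ik - k0] = dot * scale;
                    m_new = std::max(m_new, s[ik - k0]);
                }

                // Rescale what was accumulated under the previous maximum.
                const float corr = std::exp(M[iq - q0] - m_new);
                S[iq - q0] *= corr;
                for (std::size_t j = 0; j < d; ++j) {
                    acc[j] *= corr;
                }

                for (std::size_t ik = k0; ik < k1; ++ik) {
                    const float p = std::exp(s[ik - k0] - m_new);
                    S[iq - q0] += p;
                    for (std::size_t j = 0; j < d; ++j) {
                        acc[j] += p * V[ik * d + j];
                    }
                }
                M[iq - q0] = m_new;
            }
        }

        // Final softmax normalization for the rows of this query tile.
        for (std::size_t iq = q0; iq < q1; ++iq) {
            for (std::size_t j = 0; j < d; ++j) {
                out[iq * d + j] /= S[iq - q0];
            }
        }
    }
    return out;
}
```

The result should not depend on the tile sizes beyond floating-point rounding, which makes it easy to compare a tiled run against an untiled one (`tile_q = n_q`, `tile_kv = n_kv`).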
ggerganov
reviewed
Jan 23, 2026
Contributor
Author
Not sure why the server test case is crashing backend sampling on
Member
I can reproduce locally with a CPU-only build. I think there is a bug in the FA kernel that produces NaN in some cases.

diff --git a/ggml/src/ggml-cpu/ops.cpp b/ggml/src/ggml-cpu/ops.cpp
index e4228f232..3f5f97945 100644
--- a/ggml/src/ggml-cpu/ops.cpp
+++ b/ggml/src/ggml-cpu/ops.cpp
@@ -8511,6 +8511,10 @@ static void ggml_compute_forward_flash_attn_ext_tiled(
                 // permute(0, 2, 1, 3)
                 memcpy((char *) dst->data + (i3*ne2*ne1 + i2 + i1*ne1)*nb1, VKQ32 + tq * DV, nb1);
+
+                for (int k = 0; k < DV; k++) {
+                    GGML_ASSERT(!isnan(VKQ32[tq * DV + k]));
+                }
             }

             ir += tile_rows;

Server test that fails:

LLAMA_ARG_BACKEND_SAMPLING=1 LLAMA_LOG_VERBOSITY=10 LLAMA_SERVER_BIN_PATH=/path/to/build-cpu/bin/llama-server ./tests.sh -s unit/test_completion.py::test_completion_prompt_cache

Haven't traced what is the cause yet.
Contributor
Author
@ggerganov the failure turned out to be addressed by the optimization you mentioned anyway. When a row is fully masked, we were still accumulating V, which leads to NaNs. I added a change to skip those rows and confirmed locally that the test passes.
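One plausible way a fully masked row produces NaN, sketched under assumptions: if every masked score is `-inf`, the row maximum is also `-inf`, so `score - max` evaluates to `-inf - (-inf)`, which is NaN, and the NaN propagates through `exp` into the V accumulation. The helper `row_attn` below is hypothetical, uses a value dimension of 1, and is not the code from the PR; returning 0 for a skipped row is also an assumption, since the thread does not say what the kernel writes for such rows.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical single-row softmax-weighted sum (value dimension 1).
// mask[i] is added to scores[i]; -inf means "position not visible".
float row_attn(const std::vector<float> & scores,
               const std::vector<float> & mask,
               const std::vector<float> & v,
               bool skip_fully_masked) {
    assert(scores.size() == mask.size() && mask.size() == v.size());
    const float neg_inf = -std::numeric_limits<float>::infinity();

    // Row maximum of the masked scores.
    float m = neg_inf;
    for (std::size_t i = 0; i < scores.size(); ++i) {
        m = std::max(m, scores[i] + mask[i]);
    }

    // Guard: nothing is visible in this row, so skip the accumulation.
    // Returning 0 here is an assumption made for this sketch.
    if (skip_fully_masked && m == neg_inf) {
        return 0.0f;
    }

    float sum = 0.0f;
    float acc = 0.0f;
    for (std::size_t i = 0; i < scores.size(); ++i) {
        // Without the guard and with m == -inf this is exp(-inf - (-inf)) = exp(NaN).
        const float p = std::exp(scores[i] + mask[i] - m);
        sum += p;
        acc += p * v[i];
    }
    return acc / sum;
}
```

Calling `row_attn` with an all `-inf` mask and `skip_fully_masked = false` yields NaN, while the guarded call returns 0 and rows with at least one visible position are unaffected.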
ggerganov
reviewed
Jan 24, 2026
ggerganov
reviewed
Jan 24, 2026
ggerganov
reviewed
Jan 24, 2026
ggerganov
approved these changes
Jan 25, 2026
shaofeiqi pushed a commit to qualcomm/llama.cpp that referenced this pull request, Feb 6, 2026

* ggml-cpu: Use tiled FA for prompt-processing

  the FA performance is gimped on CPU on long contexts because it essentially uses a vector kernel. This PR adds a tiled FA for PP. Perf tuning for tile sizes done on a AMD EPYC single-socket 64-c machine.

* fix out of bounds for mask
* skip rows where there are all masks
* skip tile if mask is inf
* store mask in worksize
* check inf tile earlier
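The "skip tile if mask is inf" and "check inf tile earlier" commits suggest a tile-level early exit. A minimal sketch of such a check follows, assuming a dense row-major F32 mask of shape [n_q x n_kv]; the helper name `tile_fully_masked` and its signature are hypothetical and do not come from the PR. With a causal mask, every tile strictly above the diagonal passes this test, so its Q·K and V work can be skipped entirely.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical helper: true when every mask entry in the tile
// [q0, q1) x [k0, k1) is -inf, i.e. the tile contributes nothing.
bool tile_fully_masked(const std::vector<float> & mask, std::size_t n_kv,
                       std::size_t q0, std::size_t q1,
                       std::size_t k0, std::size_t k1) {
    assert(k1 <= n_kv && q1 * n_kv <= mask.size());
    const float neg_inf = -std::numeric_limits<float>::infinity();

    for (std::size_t iq = q0; iq < q1; ++iq) {
        for (std::size_t ik = k0; ik < k1; ++ik) {
            if (mask[iq * n_kv + ik] != neg_inf) {
                return false; // at least one visible position: the tile must be computed
            }
        }
    }
    return true;
}
```

Running the check before computing any scores for the tile keeps the cost of a skipped tile to a single pass over its mask entries.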
The code is kept fairly simple to leave room for incremental optimizations. According to perf, most of the time is spent in ggml_vec_dot_f16 and ggml_vec_mad_f32, and about 10% of the time in ggml_lookup_f16_to_f32.

TODO: