ggml-cpu: Use tiled FA for prompt-processing by am17an · Pull Request #19012 · ggml-org/llama.cpp

am17an · 2026-01-22T09:10:50Z

Flash attention (FA) performance on CPU is gimped for long contexts because it essentially uses a vector kernel. This PR adds a tiled FA kernel for prompt processing (PP). Tile sizes were tuned on an AMD EPYC single-socket 64-core machine. The code is kept fairly simple to leave room for incremental optimizations. According to perf, most of the time is spent in ggml_vec_dot_f16 and ggml_vec_mad_f32, and about 10% in ggml_lookup_f16_to_f32.

| Model | Test | t/s `a14b960` | t/s tile-fa-cpu | Speedup |
| --- | --- | ---: | ---: | ---: |
| gpt-oss 20B MXFP4 MoE | pp512 | 237.25 | 244.39 | 1.03 |
| gpt-oss 20B MXFP4 MoE | pp512@d1024 | 205.31 | 224.61 | 1.09 |
| gpt-oss 20B MXFP4 MoE | pp512@d2048 | 171.80 | 209.63 | 1.22 |
| gpt-oss 20B MXFP4 MoE | pp512@d4096 | 134.60 | 185.10 | 1.38 |
| gpt-oss 20B MXFP4 MoE | pp512@d8192 | 59.69 | 143.43 | 2.40 |
| gpt-oss 20B MXFP4 MoE | pp512@d16384 | 25.12 | 97.29 | 3.87 |
| llama 8B Q4_K_M | pp512 | 201.83 | 199.66 | 0.99 |
| llama 8B Q4_K_M | pp512@d1024 | 160.35 | 175.60 | 1.10 |
| llama 8B Q4_K_M | pp512@d2048 | 134.40 | 158.26 | 1.18 |
| llama 8B Q4_K_M | pp512@d4096 | 57.76 | 130.85 | 2.27 |
| llama 8B Q4_K_M | pp512@d8192 | 28.01 | 95.79 | 3.42 |
| llama 8B Q4_K_M | pp512@d16384 | 14.24 | 62.50 | 4.39 |

TODO:

perf tuning on ARM
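
For readers unfamiliar with the distinction, here is a minimal sketch of what tiling means for FA with an online softmax. It is not the code from this PR: the function name `attention_tiled`, the plain `float` row-major layout, the absence of a mask, and the tile sizes are all assumptions made for illustration. The idea it shows is that each K/V tile is reused across a block of query rows, rather than being streamed once per query row.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical illustration, not the ggml kernel. Row-major float buffers:
// Q is [n_q x d], K and V are [n_kv x d], the result is [n_q x d]. No mask.
std::vector<float> attention_tiled(const std::vector<float> & Q,
                                   const std::vector<float> & K,
                                   const std::vector<float> & V,
                                   size_t n_q, size_t n_kv, size_t d,
                                   size_t tile_q, size_t tile_kv) {
    assert(n_kv > 0 && tile_q > 0 && tile_kv > 0);
    const float scale = 1.0f / std::sqrt((float) d);
    std::vector<float> out(n_q * d, 0.0f);

    for (size_t q0 = 0; q0 < n_q; q0 += tile_q) {
        const size_t q1 = std::min(q0 + tile_q, n_q);

        // per-row running max (M) and running sum (S) of the online softmax
        std::vector<float> M(q1 - q0, -std::numeric_limits<float>::infinity());
        std::vector<float> S(q1 - q0, 0.0f);

        for (size_t k0 = 0; k0 < n_kv; k0 += tile_kv) {
            const size_t k1 = std::min(k0 + tile_kv, n_kv);

            // the same K/V tile is visited by every query row of the Q tile
            for (size_t iq = q0; iq < q1; ++iq) {
                float * acc = &out[iq * d];
                float & m = M[iq - q0];
                float & s = S[iq - q0];

                for (size_t ik = k0; ik < k1; ++ik) {
                    float dot = 0.0f;
                    for (size_t c = 0; c < d; ++c) {
                        dot += Q[iq * d + c] * K[ik * d + c];
                    }
                    dot *= scale;

                    const float m_new   = std::max(m, dot);
                    const float rescale = std::exp(m - m_new); // 0 for the first key (m = -inf)
                    const float p       = std::exp(dot - m_new);

                    for (size_t c = 0; c < d; ++c) {
                        acc[c] = acc[c] * rescale + p * V[ik * d + c];
                    }
                    s = s * rescale + p;
                    m = m_new;
                }
            }
        }

        // final normalization by the softmax denominator
        for (size_t iq = q0; iq < q1; ++iq) {
            for (size_t c = 0; c < d; ++c) {
                out[iq * d + c] /= S[iq - q0];
            }
        }
    }
    return out;
}
```

With `tile_q = 1` the loops degenerate to one query row at a time, which is roughly the access pattern of a vector kernel. Larger tiles trade scratch space for K/V reuse in cache, and the sizes that pay off depend on the machine, which is presumably why ARM tuning is listed as a TODO.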

am17an · 2026-01-23T15:24:08Z

Not sure why the server test case is crashing backend sampling on

ggerganov · 2026-01-23T15:30:51Z

I can reproduce locally with a CPU-only build. I think there is a bug in the FA kernel that produces NaN in some cases.

diff --git a/ggml/src/ggml-cpu/ops.cpp b/ggml/src/ggml-cpu/ops.cpp
index e4228f232..3f5f97945 100644
--- a/ggml/src/ggml-cpu/ops.cpp
+++ b/ggml/src/ggml-cpu/ops.cpp
@@ -8511,6 +8511,10 @@ static void ggml_compute_forward_flash_attn_ext_tiled(
 
             // permute(0, 2, 1, 3)
             memcpy((char *) dst->data + (i3*ne2*ne1 + i2 + i1*ne1)*nb1, VKQ32 + tq * DV, nb1);
+
+            for (int k = 0; k < DV; k++) {
+                GGML_ASSERT(!isnan(VKQ32[tq * DV + k]));
+            }
         }
 
         ir += tile_rows;

Server test that fails:

LLAMA_ARG_BACKEND_SAMPLING=1 LLAMA_LOG_VERBOSITY=10 LLAMA_SERVER_BIN_PATH=/path/to/build-cpu/bin/llama-server ./tests.sh -s unit/test_completion.py::test_completion_prompt_cache

Haven't traced the cause yet.

am17an · 2026-01-23T18:11:52Z

@ggerganov the failure turned out to be the same case that the optimization you mentioned would cover anyway. When a row is fully masked, we were still accumulating V, which leads to NaNs. I added a change to skip those rows and confirmed locally that the test passes.
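
To make the failure mode concrete, here is a small sketch of one plausible way a fully masked row turns into NaNs. It is an assumption about the mechanism, not a copy of the kernel: `softmax_weighted_sum` and its `skip_fully_masked` flag are hypothetical names, and the scores are plain `float` with the mask already added. When every score is `-inf`, the row maximum is `-inf` too, so `score - max` evaluates to `-inf - (-inf)`, which is NaN under IEEE 754, and that NaN then flows into the V accumulation.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical single-row illustration. `scores` already has the mask added,
// so masked positions hold -inf. V is [n_kv x d], row-major.
std::vector<float> softmax_weighted_sum(const std::vector<float> & scores,
                                        const std::vector<float> & V,
                                        size_t d, bool skip_fully_masked) {
    const float neg_inf = -std::numeric_limits<float>::infinity();
    assert(V.size() == scores.size() * d);

    std::vector<float> acc(d, 0.0f);

    float row_max = neg_inf;
    for (float s : scores) {
        row_max = std::fmax(row_max, s);
    }

    // Every position is masked: there is nothing to attend to, so leave the
    // output at zero instead of running the accumulation.
    if (skip_fully_masked && row_max == neg_inf) {
        return acc;
    }

    float sum = 0.0f;
    for (size_t ik = 0; ik < scores.size(); ++ik) {
        // with row_max == -inf this is exp(-inf - (-inf)) = exp(NaN) = NaN
        const float p = std::exp(scores[ik] - row_max);
        for (size_t c = 0; c < d; ++c) {
            acc[c] += p * V[ik * d + c];
        }
        sum += p;
    }
    for (float & x : acc) {
        x /= sum;
    }
    return acc;
}
```

Calling it with two masked scores and `skip_fully_masked = false` yields NaN in every output element, while `true` returns zeros. Skipping the row is also cheaper, since no dot products or V accumulation are needed for it.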

* ggml-cpu: Use tiled FA for prompt-processing

  FA performance on CPU is gimped for long contexts because it essentially uses a vector kernel. This PR adds a tiled FA for PP. Perf tuning for tile sizes was done on an AMD EPYC single-socket 64-core machine.
* fix out of bounds for mask
* skip rows where there are all masks
* skip tile if mask is inf
* store mask in worksize
* check inf tile earlier
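
The "skip tile if mask is inf" step can be pictured with the sketch below. It is only an illustration under stated assumptions: the helper names `mask_tile_all_neg_inf` and `count_skippable_tiles` are hypothetical, and the mask is taken to be a row-major additive `float` array (0 for visible, `-inf` for masked), which may differ from how the kernel actually stores it.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical helper: `mask` is a row-major [n_q x n_kv] additive mask in
// float (0 = visible, -inf = masked). Returns true when every entry of the
// tile [q0, q1) x [k0, k1) is -inf, so the tile contributes nothing.
bool mask_tile_all_neg_inf(const std::vector<float> & mask, size_t n_kv,
                           size_t q0, size_t q1, size_t k0, size_t k1) {
    assert(q0 <= q1 && k0 <= k1 && k1 <= n_kv && q1 * n_kv <= mask.size());
    for (size_t iq = q0; iq < q1; ++iq) {
        for (size_t ik = k0; ik < k1; ++ik) {
            const float m = mask[iq * n_kv + ik];
            if (!(std::isinf(m) && m < 0.0f)) {
                return false; // at least one visible position: compute the tile
            }
        }
    }
    return true;
}

// Count how many KV tiles the mask lets one Q tile [q0, q1) skip entirely.
size_t count_skippable_tiles(const std::vector<float> & mask, size_t n_kv,
                             size_t q0, size_t q1, size_t tile_kv) {
    assert(tile_kv > 0);
    size_t skipped = 0;
    for (size_t k0 = 0; k0 < n_kv; k0 += tile_kv) {
        const size_t k1 = k0 + tile_kv < n_kv ? k0 + tile_kv : n_kv;
        if (mask_tile_all_neg_inf(mask, n_kv, q0, q1, k0, k1)) {
            ++skipped;
        }
    }
    return skipped;
}
```

Running the check before any dot products are computed for a tile is the cheap option, since a skipped tile then costs only the mask scan.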

am17an requested a review from ggerganov as a code owner January 22, 2026 09:10

github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) Jan 22, 2026

am17an force-pushed the tile-fa-cpu branch 2 times, most recently from 97afbcb to 41a0718 Compare January 22, 2026 09:14

loci-dev mentioned this pull request Jan 22, 2026

UPSTREAM PR #19012: ggml-cpu: Use tiled FA for prompt-processing auroralabs-loci/llama.cpp#997

Open

ggml-cpu: Use tiled FA for prompt-processing

2f09b2d

am17an force-pushed the tile-fa-cpu branch from 41a0718 to 2f09b2d Compare January 23, 2026 10:23

ggerganov reviewed Jan 23, 2026

fix out of bounds for mask

e30395e

skip rows where there are all masks

693935d

ggerganov reviewed Jan 24, 2026

skip tile if mask is inf

d898d43

ggerganov reviewed Jan 24, 2026

ggerganov reviewed Jan 24, 2026

store mask in worksize

dc30629

am17an force-pushed the tile-fa-cpu branch from c1dbc37 to dc30629 Compare January 24, 2026 15:10

ggerganov approved these changes Jan 25, 2026

check inf tile earlier

17f7db5

am17an merged commit bcb4316 into ggml-org:master Jan 25, 2026
143 of 149 checks passed

am17an deleted the tile-fa-cpu branch January 25, 2026 15:26

am17an mentioned this pull request Jan 30, 2026

ggml-cpu: FA split across kv for faster TG #19209

Merged

am17an merged 6 commits into ggml-org:master from am17an:tile-fa-cpu
