CUDA: fix overflow in FA, tune performance #14840

JohannesGaessler merged 1 commit into ggml-org:master
Conversation
ggml/src/ggml-cuda/fattn-vec-f32.cuh
What is the logic for choosing between int32_t and int64_t here? For example, why is int64_t nb23, but int32_t nb33?
The mask is being broadcast across all attention heads so it's simply smaller than K/V. I suppose you could also use 64 bit for nb33, it should still be fine in terms of register pressure.
More generally, the only offsets that are going to be really large are those that scale with the number of tokens, so the offsets between sequences.
Force-pushed ec05b08 to d4209ee
* origin/master:
  docs : update HOWTO-add-model.md for ModelBase and new model classes (ggml-org#14874)
  ggml : remove invalid portPos specifiers from dot files (ggml-org#14838)
  context : restore preemptive sched reset when LLAMA_SET_ROWS=0 (ggml-org#14870)
  mtmd : fix 32-bit narrowing issue in export-lora and mtmd clip (ggml-org#14503)
  rpc : check for null buffers in get/set/copy tensor endpoints (ggml-org#14868)
  sched : fix multiple evaluations of the same graph with pipeline parallelism (ggml-org#14855)
  musa: upgrade musa sdk to rc4.2.0 (ggml-org#14498)
  sync : ggml
  cmake : fix usage issues (ggml/1257)
  ggml-cpu : remove stdlib include from repack.cpp (ggml/1276)
  context : perform output reorder lazily upon access after sync (ggml-org#14853)
  chat : fix kimi-k2 chat template (ggml-org#14852)
  sycl: fixed semantics of block offset calculation (ggml-org#14814)
  llama : fix MiniCPM inference after Granite Four changes (ggml-org#14850)
  docs: add libcurl-dev install hint for Linux distros (ggml-org#14801)
  metal : fix fusion across different encoders (ggml-org#14849)
  sycl: fix undefined variable in work group size check (ggml-org#14843)
  convert : text-only support for GLM-4.1V-9B-Thinking (ggml-org#14823)
  CUDA: fix overflow in FA, tune performance (ggml-org#14840)
  CUDA: fix compilation with GGML_CUDA_F16 (ggml-org#14837)
Hi, not sure if this is expected, but I'm seeing a regression in PP performance with FA enabled on my RX 6800: vs. previous commit: I also found another, unrelated regression, so keep in mind it may influence benchmark results at the same time (see #14624).
In my testing the new build is consistently faster:
Huh, interesting. I wonder if the ROCm version could explain the difference; I haven't updated in a while (the install directory says 6.0.0, though I'm not sure if that's the specific release or just 6.x.x). I'll try to update and see if it changes anything. Great job on the other regression btw!
After upgrading to ROCm 6.4.2, the PP speed with FA is now comparable between the two commits. It's still lower than ROCm 6.0.0 + b5972 (~620 t/s vs. ~820), but I suppose that means the regression isn't directly related to this PR, but rather to some version-specific "something" that just happened to be triggered by this PR on the older version. vs.
Hi! I recently updated to a newer ROCm (7.0.3) and llama.cpp release and ran into an issue with increased VRAM usage. The new regression is that before this commit, VRAM consumption with FA enabled was constant. With: After rebuilding with this commit, the VRAM usage keeps increasing, and around 20k tokens in, llama.cpp runs out of memory. Tweaking parameters to decrease VRAM usage only delays the failure; at the rate of memory usage growth, it seems that using the full context window would need an extra 6 to 8 GB of memory compared to before. Looking at the log, FA shows up as enabled, and the rate of increase does not seem as steep as when FA is disabled, so that does not seem to be the issue. So I just wanted to make sure whether this is an expected consequence of this change, or whether it could be some sort of memory leak (perhaps "triggered" by the new ROCm version) that may need to be looked into? Thanks! Martin
Due to numerical overflows, the CUDA FlashAttention code on master does not work correctly for very long contexts (on the order of several million tokens across all sequences). This PR uses 64 bit math for the parts of the code susceptible to such problems: the K/V offsets between sequences and the calculation of K/V offsets within a sequence. For the vector kernel there was a performance regression on Pascal when simply casting the offsets to 64 bit; for this reason I'm instead adding a 32 bit offset after each iteration (which turns out to be faster for Pascal/AMD anyway). I am not seeing any performance differences for the other kernels, so for those I'm just casting the offsets to 64 bit. While working on this I noticed that at some point the tile FA kernels seem to have become faster than the vector kernels on my RX 6800, so I'm enabling them for AMD.
Performance changes