hexagon: dma optimizations (mostly fixing regressions)#21137

Merged
max-krasnyansky merged 3 commits into ggml-org:master from qualcomm:hexagon-dma-opts on Mar 29, 2026
Conversation

@max-krasnyansky

Overview

Somehow I missed a significant perf regression when I did the last big DMA update.
I flipped the in-order bit in the DMA descriptors, and it turns out that causes a 3-5 TPS drop, especially for token gen.
We don't really need true in-order processing by the HW anyway, since our pipelines are set up such that we explicitly wait for specific descriptors to complete (i.e. enforcing the ordering that the kernels expect when they do dma_push/pop).
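The ordering argument can be sketched as follows. This is an illustrative model, not the actual ggml-hexagon code: the descriptor layout, ring-queue handling, and the `dma_push`/`dma_pop` signatures are all assumptions. The only idea taken from the PR is that each kernel stage waits on the specific descriptor it depends on, so program order is enforced in software and the HW in-order bit can stay clear.

```c
#include <stdbool.h>
#include <stdint.h>

// Hypothetical descriptor model: real Hexagon user-DMA descriptors
// differ, but the ordering argument only needs these fields.
typedef struct {
    uint32_t id;       // sequence number assigned at push time
    bool     in_order; // HW in-order bit -- deliberately left clear
    bool     done;     // completion flag polled by dma_pop()
} dma_desc_t;

#define DMA_QUEUE_LEN 8

typedef struct {
    dma_desc_t ring[DMA_QUEUE_LEN];
    uint32_t   head; // next id to hand out
    uint32_t   tail; // oldest not-yet-popped descriptor
} dma_queue_t;

// Issue a transfer: the descriptor goes out with in_order = false,
// so the engine is free to complete transfers in any order.
static uint32_t dma_push(dma_queue_t *q) {
    dma_desc_t *d = &q->ring[q->head % DMA_QUEUE_LEN];
    d->id       = q->head++;
    d->in_order = false;
    d->done     = false;
    return d->id;
}

// Stand-in for asynchronous HW completion (any order).
static void dma_complete(dma_queue_t *q, uint32_t id) {
    q->ring[id % DMA_QUEUE_LEN].done = true;
}

// Wait for one *specific* descriptor. Because each stage pops exactly
// the descriptor it depends on, the ordering the kernel needs is
// enforced here, even though the engine itself ran out of order.
static void dma_pop(dma_queue_t *q, uint32_t id) {
    while (!q->ring[id % DMA_QUEUE_LEN].done) {
        // real kernel: low-power wait / poll on the completion status
    }
    if (id == q->tail) q->tail++;
}
```

With this scheme the HW may retire descriptors in any order it likes; correctness only depends on each `dma_pop(id)` happening before the data behind descriptor `id` is consumed.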

This PR also adds a neat little DMA cache that can be used in kernels that may need to re-fetch the same data.
It is now used in the FA kernel for the mask.
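A cache of that kind could look roughly like this. Everything here is hypothetical (type names, the single-slot policy, the buffer size); the real kernel would replace the `memcpy` with an actual DMA transfer into local VTCM memory. The point is simply that a hit skips the transfer entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Hypothetical single-slot cache in front of a DMA fetch.
typedef struct {
    const void *src;      // source address of the cached rows
    size_t      size;     // number of bytes cached
    uint8_t     buf[256]; // local (VTCM-like) buffer
    bool        valid;
    int         fetches;  // transfers actually issued (for inspection)
} dma_cache_t;

// Return a local pointer to `size` bytes at `src`, issuing a copy
// (standing in for a DMA push/pop pair) only on a cache miss.
static const void *dma_cache_fetch(dma_cache_t *c, const void *src, size_t size) {
    assert(size <= sizeof(c->buf));
    if (!(c->valid && c->src == src && c->size == size)) {
        memcpy(c->buf, src, size); // miss: real code would DMA here
        c->src   = src;
        c->size  = size;
        c->valid = true;
        c->fetches++;
    }
    return c->buf; // hit: the rows are already resident locally
}
```

For an FA kernel that walks the same mask rows for every query block, repeated `dma_cache_fetch` calls on the same row range collapse into a single transfer.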

Before/after numbers on Gen3, Gen4, and Gen5:
M=../gguf/Llama-3.2-3B-Instruct-Q4_0.gguf D=HTP0 ./scripts/snapdragon/adb/run-completion.sh -f ../surfing.txt -st -n 64

- Gen5 (before / after)
 prompt eval time =     992.17 ms /   205 tokens (    4.84 ms per token,   206.62 tokens per second)
        eval time =    2654.97 ms /    63 runs   (   42.14 ms per token,    23.73 tokens per second)

 prompt eval time =     979.93 ms /   205 tokens (    4.78 ms per token,   209.20 tokens per second)
        eval time =    2490.30 ms /    63 runs   (   39.53 ms per token,    25.30 tokens per second)

- Gen4 (S25+) (before / after)
 prompt eval time =    1269.23 ms /   205 tokens (    6.19 ms per token,   161.52 tokens per second)
        eval time =    3049.34 ms /    63 runs   (   48.40 ms per token,    20.66 tokens per second)

 prompt eval time =    1264.30 ms /   205 tokens (    6.17 ms per token,   162.14 tokens per second)
        eval time =    2723.60 ms /    63 runs   (   43.23 ms per token,    23.13 tokens per second)

- Gen3 (S24U) (before / after)
 prompt eval time =    1379.95 ms /   205 tokens (    6.73 ms per token,   148.56 tokens per second)
        eval time =    3495.07 ms /    63 runs   (   55.48 ms per token,    18.03 tokens per second)

 prompt eval time =    1390.60 ms /   205 tokens (    6.72 ms per token,   149.01 tokens per second)
        eval time =    2884.50 ms /    63 runs   (   45.79 ms per token,    21.84 tokens per second)

Requirements

I noticed that we were refetching the mask rows over and over.
This simple cache avoids that.
We don't rely on true in-order processing of the DMA descriptors anywhere.
It turns out this mode caused a significant regression of around 3-4 TPS during token gen.
@max-krasnyansky max-krasnyansky requested a review from a team as a code owner March 28, 2026 23:48
@github-actions github-actions bot added ggml changes relating to the ggml tensor library for machine learning Hexagon labels Mar 28, 2026
@max-krasnyansky max-krasnyansky merged commit f5d1c41 into ggml-org:master Mar 29, 2026
44 of 45 checks passed
