hexagon: dma optimizations (mostly fixing regressions)#21137

Merged
max-krasnyansky merged 3 commits into ggml-org:master from qualcomm:hexagon-dma-opts on Mar 29, 2026
Conversation

@max-krasnyansky

Overview

Somehow I missed a significant perf regression when I did the last big DMA update.
I flipped the in-order bit in the DMA descriptors, and it turns out that causes a 3-5 TPS drop, especially for token gen.
We don't really need true in-order processing by the HW anyway, since our pipelines are set up such that we explicitly wait for specific descriptors to complete (i.e. enforcing the ordering that the kernels expect when they do dma_push/pop).
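The ordering argument can be sketched as follows. This is an illustrative model, not the actual ggml-hexagon code: the descriptor layout, ring-queue handling, and the `dma_push`/`dma_pop` signatures are all assumptions. The only idea taken from the PR is that each kernel stage waits on the specific descriptor it depends on, so program order is enforced in software and the HW in-order bit can stay clear.

```c
#include <stdbool.h>
#include <stdint.h>

// Hypothetical descriptor model: real Hexagon user-DMA descriptors
// differ, but the ordering argument only needs these fields.
typedef struct {
    uint32_t id;       // sequence number assigned at push time
    bool     in_order; // HW in-order bit -- deliberately left clear
    bool     done;     // completion flag polled by dma_pop()
} dma_desc_t;

#define DMA_QUEUE_LEN 8

typedef struct {
    dma_desc_t ring[DMA_QUEUE_LEN];
    uint32_t   head; // next id to hand out
    uint32_t   tail; // oldest not-yet-popped descriptor
} dma_queue_t;

// Issue a transfer: the descriptor goes out with in_order = false,
// so the engine is free to complete transfers in any order.
static uint32_t dma_push(dma_queue_t *q) {
    dma_desc_t *d = &q->ring[q->head % DMA_QUEUE_LEN];
    d->id       = q->head++;
    d->in_order = false;
    d->done     = false;
    return d->id;
}

// Stand-in for asynchronous HW completion (any order).
static void dma_complete(dma_queue_t *q, uint32_t id) {
    q->ring[id % DMA_QUEUE_LEN].done = true;
}

// Wait for one *specific* descriptor. Because each stage pops exactly
// the descriptor it depends on, the ordering the kernel needs is
// enforced here, even though the engine itself ran out of order.
static void dma_pop(dma_queue_t *q, uint32_t id) {
    while (!q->ring[id % DMA_QUEUE_LEN].done) {
        // real kernel: low-power wait / poll on the completion status
    }
    if (id == q->tail) q->tail++;
}
```

With this scheme the HW may retire descriptors in any order it likes; correctness only depends on each `dma_pop(id)` happening before the data behind descriptor `id` is consumed.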

This PR also adds a neat little DMA cache that can be used in kernels that may need to re-fetch the same data.
It is now used in the FA kernel for the mask.
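A cache of that kind could look roughly like this. Everything here is hypothetical (type names, the single-slot policy, the buffer size); the real kernel would replace the `memcpy` with an actual DMA transfer into local VTCM memory. The point is simply that a hit skips the transfer entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Hypothetical single-slot cache in front of a DMA fetch.
typedef struct {
    const void *src;      // source address of the cached rows
    size_t      size;     // number of bytes cached
    uint8_t     buf[256]; // local (VTCM-like) buffer
    bool        valid;
    int         fetches;  // transfers actually issued (for inspection)
} dma_cache_t;

// Return a local pointer to `size` bytes at `src`, issuing a copy
// (standing in for a DMA push/pop pair) only on a cache miss.
static const void *dma_cache_fetch(dma_cache_t *c, const void *src, size_t size) {
    assert(size <= sizeof(c->buf));
    if (!(c->valid && c->src == src && c->size == size)) {
        memcpy(c->buf, src, size); // miss: real code would DMA here
        c->src   = src;
        c->size  = size;
        c->valid = true;
        c->fetches++;
    }
    return c->buf; // hit: the rows are already resident locally
}
```

For an FA kernel that walks the same mask rows for every query block, repeated `dma_cache_fetch` calls on the same row range collapse into a single transfer.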

Before/after numbers on Gen3, Gen4, and Gen5:
M=../gguf/Llama-3.2-3B-Instruct-Q4_0.gguf D=HTP0 ./scripts/snapdragon/adb/run-completion.sh -f ../surfing.txt -st -n 64

- Gen5 (before / after)
 prompt eval time =     992.17 ms /   205 tokens (    4.84 ms per token,   206.62 tokens per second)
        eval time =    2654.97 ms /    63 runs   (   42.14 ms per token,    23.73 tokens per second)

 prompt eval time =     979.93 ms /   205 tokens (    4.78 ms per token,   209.20 tokens per second)
        eval time =    2490.30 ms /    63 runs   (   39.53 ms per token,    25.30 tokens per second)

- Gen4 (S25+) (before / after)
 prompt eval time =    1269.23 ms /   205 tokens (    6.19 ms per token,   161.52 tokens per second)
        eval time =    3049.34 ms /    63 runs   (   48.40 ms per token,    20.66 tokens per second)

 prompt eval time =    1264.30 ms /   205 tokens (    6.17 ms per token,   162.14 tokens per second)
        eval time =    2723.60 ms /    63 runs   (   43.23 ms per token,    23.13 tokens per second)

- Gen3 (S24U) (before / after)
 prompt eval time =    1379.95 ms /   205 tokens (    6.73 ms per token,   148.56 tokens per second)
        eval time =    3495.07 ms /    63 runs   (   55.48 ms per token,    18.03 tokens per second)

 prompt eval time =    1390.60 ms /   205 tokens (    6.72 ms per token,   149.01 tokens per second)
        eval time =    2884.50 ms /    63 runs   (   45.79 ms per token,    21.84 tokens per second)

Requirements

I noticed that we were refetching the mask rows over and over.
This simple cache avoids that.
We don't rely on true in-order processing of the DMA descriptors anywhere.
It turns out this mode caused a significant regression of around 3-4 TPS during token gen.
@max-krasnyansky max-krasnyansky requested a review from a team as a code owner March 28, 2026 23:48
@github-actions github-actions bot added ggml changes relating to the ggml tensor library for machine learning Hexagon labels Mar 28, 2026
@max-krasnyansky max-krasnyansky merged commit f5d1c41 into ggml-org:master Mar 29, 2026
44 of 45 checks passed
