hexagon: dma optimizations (mostly fixing regressions) #21137
Merged
max-krasnyansky merged 3 commits into ggml-org:master on Mar 29, 2026
I noticed that we were refetching the mask rows over and over. This simple cache avoids that.
We don't rely on true in-order processing of the DMA descriptors anywhere. It turns out this mode caused a significant regression of around 3-4 TPS during token gen.
lhez
approved these changes
Mar 29, 2026
ggerganov
approved these changes
Mar 29, 2026
Overview
Somehow I missed the significant perf regression when I did the last big DMA update.
I flipped the in-order bit in the DMA descriptors, and it turns out it causes a 3-5 TPS drop, especially for token gen.
We don't really need true in-order processing by the HW anyway, as our pipelines are set up such that we explicitly wait for specific descriptors to complete (i.e. enforcing the ordering that the kernels expect when they do dma_push/pop).
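To illustrate why explicit waits make the HW in-order mode unnecessary, here is a minimal host-side sketch. It is not the code from this PR: every `sim_*` name is hypothetical, `memcpy` stands in for the DMA engine, and the simulated engine deliberately completes descriptors newest-first. The pop side still hands buffers back in push order, because it waits on the specific oldest descriptor.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

// Hypothetical queue depth for this sketch (power of two).
#define SIM_QUEUE_DEPTH 4

// Hypothetical descriptor: one pending copy plus a completion flag.
typedef struct {
    void       *dst;
    const void *src;
    size_t      size;
    bool        done;
} sim_desc;

// Ring of descriptors. push_idx/pop_idx only ever increase; the slot is idx % depth.
typedef struct {
    sim_desc desc[SIM_QUEUE_DEPTH];
    unsigned push_idx;  // next slot to fill
    unsigned pop_idx;   // oldest descriptor not yet popped
} sim_dma_queue;

static void sim_dma_init(sim_dma_queue *q) {
    memset(q, 0, sizeof(*q));
}

// Queue a transfer. Returns false when the ring is full.
static bool sim_dma_push(sim_dma_queue *q, void *dst, const void *src, size_t size) {
    if (q->push_idx - q->pop_idx == SIM_QUEUE_DEPTH) {
        return false;
    }
    sim_desc *d = &q->desc[q->push_idx % SIM_QUEUE_DEPTH];
    d->dst  = dst;
    d->src  = src;
    d->size = size;
    d->done = false;
    q->push_idx++;
    return true;
}

// Stand-in for the hardware: completes ONE pending descriptor per call,
// newest first, to mimic out-of-order completion.
static void sim_dma_engine_step(sim_dma_queue *q) {
    for (unsigned i = q->push_idx; i != q->pop_idx; i--) {
        sim_desc *d = &q->desc[(i - 1) % SIM_QUEUE_DEPTH];
        if (!d->done) {
            memcpy(d->dst, d->src, d->size);
            d->done = true;
            return;
        }
    }
}

// Wait for the oldest descriptor specifically, then retire it.
// Returns its destination buffer, or NULL if nothing is queued.
static void *sim_dma_pop(sim_dma_queue *q) {
    if (q->pop_idx == q->push_idx) {
        return NULL;
    }
    sim_desc *d = &q->desc[q->pop_idx % SIM_QUEUE_DEPTH];
    while (!d->done) {
        sim_dma_engine_step(q);  // real code would poll or wait on the descriptor
    }
    q->pop_idx++;
    return d->dst;
}
```

In this sketch the consumer sees buffers in the order it pushed them regardless of completion order, which is the property the kernels depend on.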
This PR also adds a neat little DMA cache that can be used in kernels that may need to re-fetch the same data.
This is now used in the FA kernel for the mask.
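As a rough illustration of the caching idea (not the implementation in this PR), the sketch below keeps a few fetched rows in local slots keyed by source address, so a repeated request for the same row skips the transfer. All `row_cache_*` names, the slot count, the row size, and the round-robin eviction are assumptions made for the example, and `memcpy` stands in for the actual DMA fetch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Hypothetical sizes chosen for the example only.
#define ROW_CACHE_SLOTS     4
#define ROW_CACHE_ROW_BYTES 64

typedef struct {
    const void *src[ROW_CACHE_SLOTS];                         // source address held by each slot (NULL = empty)
    uint8_t     data[ROW_CACHE_SLOTS][ROW_CACHE_ROW_BYTES];   // local copies of the rows
    unsigned    next_victim;                                  // round-robin eviction cursor
    unsigned    n_fetch;                                      // number of real fetches issued (for inspection)
} row_cache;

static void row_cache_init(row_cache *c) {
    for (unsigned i = 0; i < ROW_CACHE_SLOTS; i++) {
        c->src[i] = NULL;
    }
    c->next_victim = 0;
    c->n_fetch     = 0;
}

// Return a local copy of the row at `src`, fetching it only on a miss.
// Assumes the source data does not change while it is cached and that
// the same address is always requested with the same size.
static const uint8_t *row_cache_get(row_cache *c, const void *src, size_t size) {
    assert(src != NULL && size <= ROW_CACHE_ROW_BYTES);

    // Hit: the row is already resident, no transfer needed.
    for (unsigned i = 0; i < ROW_CACHE_SLOTS; i++) {
        if (c->src[i] == src) {
            return c->data[i];
        }
    }

    // Miss: evict the next slot in round-robin order and fetch into it.
    unsigned slot  = c->next_victim;
    c->next_victim = (slot + 1) % ROW_CACHE_SLOTS;
    memcpy(c->data[slot], src, size);  // stand-in for a DMA transfer
    c->src[slot] = src;
    c->n_fetch++;
    return c->data[slot];
}
```

With this shape, a kernel that revisits the same mask row calls `row_cache_get` each time and only pays for the transfer on the first visit, as long as the row has not been evicted in between.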
Before/after numbers on Gen3,4,5
Requirements