vulkan: Reduce temporary memory usage for TOP_K#17623

Merged
0cc4m merged 1 commit into ggml-org:master from jeffbolznv:topk_memory
Dec 2, 2025

@jeffbolznv
Collaborator

  • Compute the row size of the temp buffer from the output of the first pass.
  • Update the shader addressing math to use the output row size.
  • Pass the output row size as "ncols_output"; the parameter previously named "ncols_output" is now "k".

For the common case of K=40 and src0=(200000,1,1,1), this reduces the temporary buffer from about 3.2MB to about 500KB.
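
As a rough sanity check on those figures, here is a minimal arithmetic sketch. The PR text does not state the element size, the number of temp buffers, or how many input elements each first-pass workgroup consumes. The values below (8 bytes per element for an index/value pair, two ping-pong buffers, 256 elements per workgroup) and the helper names are assumptions chosen because they are consistent with the quoted sizes, not details confirmed by the PR.

```python
import math

# Hypothetical parameters: none of these are stated in the PR description.
BYTES_PER_ELEM = 8    # assumed: one (index, value) pair per element
NUM_TEMP_BUFFERS = 2  # assumed: ping-pong buffers between passes
ELEMS_PER_WG = 256    # assumed: input elements handled per first-pass workgroup


def temp_buffer_bytes(row_elems, nrows=1):
    """Total temp storage when each row holds `row_elems` elements."""
    return row_elems * nrows * BYTES_PER_ELEM * NUM_TEMP_BUFFERS


def first_pass_output_cols(ncols, k):
    """Row size after the first pass if every workgroup keeps its top k."""
    num_workgroups = math.ceil(ncols / ELEMS_PER_WG)
    return num_workgroups * k


ncols, k = 200_000, 40

# Sizing rows by the input width.
before = temp_buffer_bytes(ncols)

# Sizing rows by the first-pass output width ("ncols_output" in the PR).
ncols_output = first_pass_output_cols(ncols, k)
after = temp_buffer_bytes(ncols_output)

print(f"before: {before} bytes")
print(f"ncols_output: {ncols_output}")
print(f"after: {after} bytes")
```

Under these assumptions the row shrinks from 200000 elements to 782 workgroups × 40 = 31280 elements, which matches the roughly 3.2MB to 500KB reduction quoted above.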

@0cc4m 0cc4m merged commit 61bde8e into ggml-org:master Dec 2, 2025
72 of 74 checks passed
khemchand-zetta pushed a commit to khemchand-zetta/llama.cpp that referenced this pull request Dec 4, 2025
Anico2 added a commit to Anico2/llama.cpp that referenced this pull request Jan 15, 2026
blime4 referenced this pull request in blime4/llama.cpp Feb 5, 2026
Labels: ggml (changes relating to the ggml tensor library for machine learning), Vulkan (issues specific to the Vulkan backend)