CUDA: use mmvq for mul-mat-id for small batch sizes#18958
am17an merged 4 commits into ggml-org:master
Conversation
Performance changes
This PR does not provide a universal speedup for <= 8 tokens; please adjust the kernel selection logic to reflect this. Distinguishing FP16 vs. FP32 vs. BF16 vs. quantized should be enough.
GPT-OSS
There was a performance regression for the n=1 (decode) kernel. It's fixed now.
ggml/src/ggml-cuda/mmvq.cu
```cpp
if constexpr (ncols_dst == 1) {
    sample_dst *= !ids_stride; // sample_dst for ids is 0
}
```
I think we could slightly simplify the code if we instead set parameters like stride_sample_x to 0 in host code when ids != nullptr. This should be explained by a comment in the device code and done consistently with MMVF.
Btw, the tests are crashing: https://github.com/ggml-org/llama.cpp/actions/runs/21295510691/job/61300080266?pr=18958#step:3:17810
@ggerganov I'm aware; currently there is still a perf regression for the n=1 case. I'm looking into it and will push once it's fixed.
Should be fixed now (benchmarked on a 5090 vs. master). @ggerganov I remember you saying that these small batch sizes are useful for agent use cases, but I'm not able to understand why.
I think at some point I noticed Claude Code sending requests in parallel. OpenCode does not do it, but I think it's just a matter of time before it starts launching tasks in parallel. You can also run multiple agent sessions at the same time, and this will help in such cases. And also, not directly related to agentic coding, this is useful for tasks with
@jacekpoplawski thanks for reporting. You're not doing anything wrong, it's just that this branch didn't have #19126 which provides the speed-up in master. Rebased now |
JohannesGaessler left a comment
If we're going to add a template specialization anyway, I think it makes more sense to just add a template specialization for MUL_MAT_ID in general. Add a boolean like use_ids and use it instead of the ids pointer in the program logic, as well as to determine how blockIdx should be interpreted.
The problem is that the n_tokens=1 path with ids sees a slowdown, so we would have to add something like this anyway: a template which disambiguates between n=1 and n>1.
JohannesGaessler left a comment
Sorry, I misremembered which model had a performance regression on an RTX 3090. For now I think it's fine to just merge it like this.
* CUDA: use mmvq for mul-mat-id for small batch sizes
* add mmvq too
* Fix perf issue on ampere. Use mmvf mm-id only for non-nvidia GPUs
* templatize multi_token_path

Currently for batch sizes > 1 we immediately move to MMQ, which is suboptimal for small batch sizes. This brings the performance of batched bench in line (previously there was a dip at n_tokens = 2).
Micro-benchmark for test-backend-ops