ggml-cuda: tune RDNA3 Q6_K MMVQ nwarps by ravel7524 · Pull Request #23349 · ggml-org/llama.cpp

ravel7524 · 2026-05-19T16:19:17Z

Overview

This PR changes the RDNA3 MMVQ warp count selection for GGML_TYPE_Q6_K with ncols_dst == 1 from 8 warps to 2 warps.

The change is limited to the MMVQ_PARAMETERS_RDNA3_0 path in ggml/src/ggml-cuda/mmvq.cu.

It does not change quantization logic or model output math. It only changes the launch geometry used for this RDNA3 Q6_K MMVQ case.
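
To make "launch geometry" concrete, here is a minimal sketch of how a warp count translates into threads per block. It is not the code in mmvq.cu: the helper names are hypothetical, the 32-thread warp size assumes RDNA3 runs these kernels in wave32 mode, and the value returned for `ncols_dst != 1` is a placeholder, since this PR only states the 8 to 2 change for `ncols_dst == 1`.

```cpp
// Assumption: RDNA3 executes these kernels in wave32 mode (32 threads per warp).
constexpr int sketch_warp_size = 32;

// Hypothetical helper, not the function in mmvq.cu: warp count for the
// RDNA3 Q6_K case discussed in this PR.
constexpr int sketch_q6_k_nwarps(int ncols_dst, bool patched) {
    // ncols_dst == 1: 8 warps before this PR, 2 warps after it.
    // Any other ncols_dst: placeholder value, untouched by this PR.
    return ncols_dst == 1 ? (patched ? 2 : 8) : 1;
}

// Threads launched per block for a given warp count.
constexpr int sketch_threads_per_block(int nwarps) {
    return nwarps * sketch_warp_size;
}

// Under the wave32 assumption the block shrinks from 256 to 64 threads.
static_assert(sketch_threads_per_block(sketch_q6_k_nwarps(1, false)) == 256, "baseline");
static_assert(sketch_threads_per_block(sketch_q6_k_nwarps(1, true))  ==  64, "patched");
```

The arithmetic performed per output element is the same in both cases; only the number of threads cooperating in each block differs.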

I tested this on an AMD Radeon Pro W7800 (gfx1100) with ROCm/HIP. This improved Q6_K decode throughput in repeated local benchmarks.

Additional information

Test system:

GPU: AMD Radeon Pro W7800 48GB, gfx1100
ROCm: 7.2.3 container
Backend: HIP

Models tested:

Qwen3.6-35B-A3B-UD-Q6_K
Qwen3-30B-A3B-Instruct-2507-Q6_K
Llama-3.2-3B-Instruct-Q6_K
Qwen2.5-1.5B-Instruct-Q6_K

Benchmark results on AMD Radeon Pro W7800 (gfx1100), ROCm/HIP, 3 runs:

| Model | Build | Q6_K `n=1` perf | `tg128` | mixed prompt/decode |
| --- | --- | --- | --- | --- |
| Qwen3.6-35B-A3B-UD-Q6_K | baseline, `8` warps | `71.26 us`, `1.65 TFLOPS` | `76.22 +/- 0.14 / 75.53 +/- 0.09 t/s` | `pp4096+tg128 1154.42 +/- 19.00 t/s` |
| Qwen3.6-35B-A3B-UD-Q6_K | patched, `2` warps | `68.26 us`, `1.72 TFLOPS` | `86.08 +/- 0.14 / 85.68 +/- 0.07 t/s` | `pp4096+tg128 1294.05 +/- 4.38 t/s` |
| Qwen3-30B-A3B-Instruct-2507-Q6_K | baseline, `8` warps | `74.05 us`, `1.59 TFLOPS` | `48.93 +/- 0.40 / 49.16 +/- 0.09 t/s` | `pp4096+tg128 590.07 +/- 0.57 t/s` |
| Qwen3-30B-A3B-Instruct-2507-Q6_K | patched, `2` warps | `70.61 us`, `1.66 TFLOPS` | `53.28 +/- 0.46 / 53.47 +/- 0.09 t/s` | `pp4096+tg128 594.41 +/- 12.46 t/s` |
| Llama-3.2-3B-Instruct-Q6_K | baseline, `8` warps | `70.02 us`, `1.68 TFLOPS` | `125.11 +/- 0.18 / 124.12 +/- 0.15 t/s` | `pp1024+tg128 901.24 +/- 2.03 t/s` |
| Llama-3.2-3B-Instruct-Q6_K | patched, `2` warps | `69.07 us`, `1.70 TFLOPS` | `145.05 +/- 0.14 / 144.48 +/- 0.04 t/s` | `pp1024+tg128 1025.25 +/- 0.60 t/s` |
| Qwen2.5-1.5B-Instruct-Q6_K | baseline, `8` warps | `71.56 us`, `1.64 TFLOPS` | `171.04 +/- 0.32 / 170.22 +/- 0.10 t/s` | `pp1024+tg128 1321.98 +/- 1.54 t/s` |
| Qwen2.5-1.5B-Instruct-Q6_K | patched, `2` warps | `67.61 us`, `1.74 TFLOPS` | `209.50 +/- 0.13 / 208.62 +/- 0.44 t/s` | `pp1024+tg128 1576.68 +/- 1.26 t/s` |

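The benchmark results show consistent decode improvements across two MoE Q6_K models and two dense Q6_K models: `tg128` throughput rises by roughly 9% (Qwen3-30B-A3B-Instruct-2507) to 22% (Qwen2.5-1.5B-Instruct) with the patched build.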
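
The relative gains can be recomputed from the table. The sketch below uses the first `tg128` figure of each row and ignores the reported +/- spread; the struct and function names are hypothetical and exist only for this illustration.

```cpp
#include <cstdio>

// One row per model: first tg128 figure (t/s) from the table above.
struct tg_result {
    const char * model;
    double baseline_tps; // baseline build, 8 warps
    double patched_tps;  // patched build, 2 warps
};

static const tg_result tg_results[] = {
    {"Qwen3.6-35B-A3B-UD-Q6_K",           76.22,  86.08},
    {"Qwen3-30B-A3B-Instruct-2507-Q6_K",  48.93,  53.28},
    {"Llama-3.2-3B-Instruct-Q6_K",       125.11, 145.05},
    {"Qwen2.5-1.5B-Instruct-Q6_K",       171.04, 209.50},
};

// Relative gain of the patched build over the baseline, in percent.
inline double speedup_percent(double baseline, double patched) {
    return (patched / baseline - 1.0) * 100.0;
}

// Prints one line per model with its tg128 gain.
inline void print_tg_speedups() {
    for (const tg_result & r : tg_results) {
        std::printf("%-34s %+.1f%%\n", r.model,
                    speedup_percent(r.baseline_tps, r.patched_tps));
    }
}
```

Calling `print_tg_speedups()` from a `main` function lists the per-model gain. The smallest gain is on Qwen3-30B-A3B-Instruct-2507 and the largest on Qwen2.5-1.5B-Instruct.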

Correctness check:

Both the baseline and patched builds passed `test-backend-ops -o MUL_MAT -b ROCm0 -p q6_K` (11/11).

Requirements

Yes, I have read and agree with the contributing guidelines
AI usage disclosure: YES, AI was used assistively for experiment organization and wording review. I reviewed the code change, benchmark results, and final PR text myself, and I can explain the submitted change.

JohannesGaessler

Please keep the order of types consistent with how they're declared in ggml.h and insert additional return statements instead. Otherwise LGTM.

ravel7524 · 2026-05-19T18:18:09Z

OK, I corrected the order and just added the additional return statement.
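
As an illustration of the review request, the following sketch keeps the `case` labels in declaration order and gives the new type its own `return` instead of reordering labels to share one. It is not the code from mmvq.cu: the enum is a hypothetical subset that only mirrors the relative order of three types in ggml.h, and every return value except the Q6_K one is a placeholder.

```cpp
// Hypothetical subset; only the relative order (Q4_0, then Q6_K, then IQ4_NL)
// is meant to mirror the ggml_type declaration order in ggml.h.
enum sketch_type {
    SKETCH_TYPE_Q4_0,
    SKETCH_TYPE_Q6_K,
    SKETCH_TYPE_IQ4_NL,
    SKETCH_TYPE_OTHER,
};

// Hypothetical warp-count lookup for ncols_dst == 1.
inline int sketch_nwarps(sketch_type type) {
    switch (type) {
        case SKETCH_TYPE_Q4_0:
            return 8; // placeholder value
        case SKETCH_TYPE_Q6_K:
            return 2; // value introduced by this PR
        case SKETCH_TYPE_IQ4_NL:
            return 8; // placeholder value, kept as its own return statement
        default:
            return 1; // placeholder value
    }
}
```

Keeping the labels in declaration order makes it easier to compare the switch against the enum when a new type is added.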

Add handling for GGML_TYPE_Q6_K in mmvq.cu

85b8ac2

10%+ Improvement in tg speeds

ravel7524 requested a review from a team as a code owner May 19, 2026 16:19

JohannesGaessler approved these changes May 19, 2026

Add case for GGML_TYPE_IQ4_NL in mvvq.cu

c2f2807

Adjusted Order to keep consistency

IMbackK approved these changes May 19, 2026

JohannesGaessler reviewed May 19, 2026

Comment thread ggml/src/ggml-cuda/mmvq.cu Outdated

Update ggml/src/ggml-cuda/mmvq.cu

a323144

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

IMbackK approved these changes May 19, 2026

github-actions bot added the "Nvidia GPU" and "ggml" labels May 19, 2026

am17an merged commit b39a7bf into ggml-org:master May 20, 2026
50 checks passed

dbrain pushed a commit to dbrain/hbd-llama-cpp-turboquant that referenced this pull request May 21, 2026

ggml-cuda: tune RDNA3 Q6_K MMVQ nwarps (ggml-org#23349)

0271c00

nyo16 mentioned this pull request May 21, 2026

Bump llama.cpp to 52fb93a2b (30 commits) nyo16/llama_cpp_ex#42

Merged

4 tasks

baramofme pushed a commit to baramofme/llama-cpp-turboquant that referenced this pull request May 23, 2026

ggml-cuda: tune RDNA3 Q6_K MMVQ nwarps (ggml-org#23349)

bc0a0d0

srossitto79 pushed a commit to srossitto79/llama.cpp that referenced this pull request May 23, 2026

ggml-cuda: tune RDNA3 Q6_K MMVQ nwarps (ggml-org#23349)

170f870

carlosfundora pushed a commit to carlosfundora/llama.cpp-1-bit-turbo that referenced this pull request May 24, 2026

ggml-cuda: tune RDNA3 Q6_K MMVQ nwarps (ggml-org#23349)

b66ce7c

(cherry picked from commit b39a7bf)

arthw mentioned this pull request May 26, 2026

ci : reduce (disable SYCL and CANN builds/releases) #23705

Merged

2 tasks

yaohengxu mentioned this pull request May 27, 2026

mmvq Optim: add MMVQ_PARAMETERS_TURING(mmvq_parameter_table_id) for … #23729

Merged

fewtarius pushed a commit to fewtarius/llama.cpp that referenced this pull request May 30, 2026

ggml-cuda: tune RDNA3 Q6_K MMVQ nwarps (ggml-org#23349)

98ebc53

turbo-tan pushed a commit to turbo-tan/llama.cpp-tq3 that referenced this pull request Jun 2, 2026

ggml-cuda: tune RDNA3 Q6_K MMVQ nwarps (ggml-org#23349)

4e35fb6

am17an merged 3 commits into ggml-org:master from ravel7524:master
