ggml-cuda: tune RDNA3 Q6_K MMVQ nwarps #23349

Merged: am17an merged 3 commits into ggml-org:master from ravel7524:master on May 20, 2026

@ravel7524

Contributor

Overview

This PR changes the RDNA3 MMVQ warp count selection for GGML_TYPE_Q6_K with ncols_dst == 1 from 8 warps to 2 warps.

The change is limited to the MMVQ_PARAMETERS_RDNA3_0 path in ggml/src/ggml-cuda/mmvq.cu.

It does not change quantization logic or model output math. It only changes the launch geometry used for this RDNA3 Q6_K MMVQ case.
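To make "launch geometry" concrete, here is a minimal sketch of what such a warp-count selection could look like. It is not the code in mmvq.cu: the enum names, the function names, the `patched` flag, and the placeholder value for unspecified cases are all hypothetical. Only the Q6_K, `ncols_dst == 1`, RDNA3 case and its 8 to 2 change come from this PR. The block shape `(warp_size, nwarps, 1)` and the 32-wide warp are assumptions made for illustration.

```cpp
// Hypothetical stand-ins: only GGML_TYPE_Q6_K, MMVQ_PARAMETERS_RDNA3_0 and
// ncols_dst are named in the PR; everything else here is illustrative.
enum sketch_type     { SKETCH_TYPE_Q6_K, SKETCH_TYPE_OTHER };
enum sketch_table_id { SKETCH_PARAMETERS_RDNA3_0, SKETCH_PARAMETERS_OTHER };

// Placeholder for every case the PR does not describe; not a real ggml value.
constexpr int SKETCH_NWARPS_UNSPECIFIED = 1;

// Warp count for one (type, ncols_dst, table) combination.
// `patched` selects the pre-PR value (8) or the post-PR value (2) for Q6_K.
constexpr int sketch_calc_nwarps(sketch_type type, int ncols_dst,
                                 sketch_table_id table_id, bool patched) {
    return (table_id == SKETCH_PARAMETERS_RDNA3_0 &&
            type == SKETCH_TYPE_Q6_K && ncols_dst == 1)
        ? (patched ? 2 : 8)
        : SKETCH_NWARPS_UNSPECIFIED;
}

// Threads per block, assuming a block shape of (warp_size, nwarps, 1).
constexpr int sketch_threads_per_block(int warp_size, int nwarps) {
    return warp_size * nwarps;
}

// Assuming a 32-wide warp: 8 warps give 256 threads, 2 warps give 64.
static_assert(sketch_threads_per_block(
                  32, sketch_calc_nwarps(SKETCH_TYPE_Q6_K, 1,
                                         SKETCH_PARAMETERS_RDNA3_0, false)) == 256,
              "baseline: 8 warps");
static_assert(sketch_threads_per_block(
                  32, sketch_calc_nwarps(SKETCH_TYPE_Q6_K, 1,
                                         SKETCH_PARAMETERS_RDNA3_0, true)) == 64,
              "patched: 2 warps");
```

Under these assumptions the patch shrinks each thread block from 256 to 64 threads while leaving the per-element arithmetic untouched, which is why the output math is unchanged.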

I tested this on an AMD Radeon Pro W7800 (gfx1100) with ROCm/HIP. This improved Q6_K decode throughput in repeated local benchmarks.

Additional information

Test system:

  • GPU: AMD Radeon Pro W7800 48GB, gfx1100
  • ROCm: 7.2.3 container
  • Backend: HIP

Models tested:

  • Qwen3.6-35B-A3B-UD-Q6_K
  • Qwen3-30B-A3B-Instruct-2507-Q6_K
  • Llama-3.2-3B-Instruct-Q6_K
  • Qwen2.5-1.5B-Instruct-Q6_K

Benchmark results on AMD Radeon Pro W7800 (gfx1100), ROCm/HIP, 3 runs:

| Model | Build | Q6_K n=1 perf | tg128 | mixed prompt/decode |
| --- | --- | --- | --- | --- |
| Qwen3.6-35B-A3B-UD-Q6_K | baseline, 8 warps | 71.26 us, 1.65 TFLOPS | 76.22 +/- 0.14 / 75.53 +/- 0.09 t/s | pp4096+tg128: 1154.42 +/- 19.00 t/s |
| Qwen3.6-35B-A3B-UD-Q6_K | patched, 2 warps | 68.26 us, 1.72 TFLOPS | 86.08 +/- 0.14 / 85.68 +/- 0.07 t/s | pp4096+tg128: 1294.05 +/- 4.38 t/s |
| Qwen3-30B-A3B-Instruct-2507-Q6_K | baseline, 8 warps | 74.05 us, 1.59 TFLOPS | 48.93 +/- 0.40 / 49.16 +/- 0.09 t/s | pp4096+tg128: 590.07 +/- 0.57 t/s |
| Qwen3-30B-A3B-Instruct-2507-Q6_K | patched, 2 warps | 70.61 us, 1.66 TFLOPS | 53.28 +/- 0.46 / 53.47 +/- 0.09 t/s | pp4096+tg128: 594.41 +/- 12.46 t/s |
| Llama-3.2-3B-Instruct-Q6_K | baseline, 8 warps | 70.02 us, 1.68 TFLOPS | 125.11 +/- 0.18 / 124.12 +/- 0.15 t/s | pp1024+tg128: 901.24 +/- 2.03 t/s |
| Llama-3.2-3B-Instruct-Q6_K | patched, 2 warps | 69.07 us, 1.70 TFLOPS | 145.05 +/- 0.14 / 144.48 +/- 0.04 t/s | pp1024+tg128: 1025.25 +/- 0.60 t/s |
| Qwen2.5-1.5B-Instruct-Q6_K | baseline, 8 warps | 71.56 us, 1.64 TFLOPS | 171.04 +/- 0.32 / 170.22 +/- 0.10 t/s | pp1024+tg128: 1321.98 +/- 1.54 t/s |
| Qwen2.5-1.5B-Instruct-Q6_K | patched, 2 warps | 67.61 us, 1.74 TFLOPS | 209.50 +/- 0.13 / 208.62 +/- 0.44 t/s | pp1024+tg128: 1576.68 +/- 1.26 t/s |

The benchmark results show consistent decode improvements across two MoE Q6_K models and two dense Q6_K models.
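For readers who want the relative gains rather than raw t/s, the following sketch computes them from the tg128 figures reported above. It uses only the first of the two tg128 values in each row; the struct and function names are made up for this illustration and do not come from llama.cpp or its benchmark tooling.

```cpp
// Baseline/patched tg128 throughput (t/s), copied from the benchmark results.
struct Tg128Pair {
    const char * model;
    double       baseline_tps;
    double       patched_tps;
};

// Relative improvement of `patched` over `baseline`, in percent.
constexpr double relative_gain_percent(double baseline, double patched) {
    return (patched - baseline) / baseline * 100.0;
}

constexpr Tg128Pair kTg128[] = {
    {"Qwen3.6-35B-A3B-UD-Q6_K",          76.22,  86.08},  // roughly +12.9%
    {"Qwen3-30B-A3B-Instruct-2507-Q6_K", 48.93,  53.28},  // roughly  +8.9%
    {"Llama-3.2-3B-Instruct-Q6_K",       125.11, 145.05}, // roughly +15.9%
    {"Qwen2.5-1.5B-Instruct-Q6_K",       171.04, 209.50}, // roughly +22.5%
};
```

By this measure the tg128 gain ranges from about 9% to about 22% across the four models, with the smaller models gaining the most.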

Correctness check:

  • Baseline and patched runs passed test-backend-ops -o MUL_MAT -b ROCm0 -p q6_K, 11/11.

Requirements

  • Yes, I have read and agree with the contributing guidelines
  • AI usage disclosure: YES, AI was used assistively for experiment organization and wording review. I reviewed the code change, benchmark results, and final PR text myself, and I can explain the submitted change.

10%+ Improvement in tg speeds
@ravel7524 requested a review from a team as a code owner on May 19, 2026, 16:19

@JohannesGaessler JohannesGaessler left a comment

Contributor

Please keep the order of types consistent with how they're declared in ggml.h and insert additional return statements instead. Otherwise LGTM.

Adjusted Order to keep consistency
@ravel7524

Contributor Author

OK, I corrected the order and just added the additional return statement.

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
@github-actions bot added the labels "Nvidia GPU" (issues specific to Nvidia GPUs) and "ggml" (changes relating to the ggml tensor library for machine learning) on May 19, 2026
@am17an am17an merged commit b39a7bf into ggml-org:master May 20, 2026
50 checks passed