fix vulkan ggml_acc only works in 3d but not 4d by ymcki · Pull Request #19426 · ggml-org/llama.cpp

ymcki · 2026-02-08T08:23:04Z

While working on #18792, I discovered that ggml_acc for Vulkan only works in 3D, not 4D.

Fixed it and added a test case to test-backend-ops.

ymcki · 2026-02-09T06:14:29Z

The latest commit seems to fix the potential problems pointed out by Jeff

ggerganov · 2026-02-09T14:11:22Z

The metal failures will be resolved with #19427 after merging this PR.

ggerganov · 2026-02-09T14:17:29Z

Added a TODO for the CUDA implementation to support this.

jeffbolznv · 2026-02-09T15:00:41Z

ggml/src/ggml-vulkan/vulkan-shaders/acc.comp

-    if (ox < p.ne10 && oy < p.ne11 && oz < p.ne12) {
-        data_d[get_doffset() + dst_idx(i00, i01, i02, i03)] = D_TYPE(FLOAT_TYPE(data_a[get_aoffset() + src0_idx(i00, i01, i02, i03)]) + FLOAT_TYPE(data_b[get_boffset() + ox + oy * p.ne10 + oz * p.ne10 * p.ne11]));
+    if (i0 < p.ne10 && i1 < p.ne11 && i2 < p.ne12 && i3 < p.ne13) {
+        data_d[get_doffset() + dst_idx(i00, i01, i02, i03)] = D_TYPE(FLOAT_TYPE(data_a[get_aoffset() + src0_idx(i00, i01, i02, i03)]) + FLOAT_TYPE(data_b[get_boffset() + i0 + i1 * p.nb11 + i2 * p.nb12 + i3 * p.nb13]));


You could use src1_idx here and it would also handle permuted tensors (by applying nb10). Would be nice to test permuted tensors, too.
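To illustrate the point about strides, here is a minimal C++ sketch (not the shader code) of turning four logical indices into a flat element index. The names `Strides4`, `strided_index`, `contiguous_strides`, and `swap_dims01` are hypothetical, and the strides are assumed to be in elements rather than bytes. Because the stride of dimension 0 is applied explicitly, the same function covers both a contiguous tensor and a permuted view of it.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical helper type: per-dimension strides, in elements (not bytes).
using Strides4 = std::array<std::size_t, 4>;

// Flat element index of logical position (i0, i1, i2, i3).
// Applying nb[0] as well is what lets this handle permuted views.
inline std::size_t strided_index(const Strides4 & nb,
                                 std::size_t i0, std::size_t i1,
                                 std::size_t i2, std::size_t i3) {
    return i0 * nb[0] + i1 * nb[1] + i2 * nb[2] + i3 * nb[3];
}

// Strides of a contiguous tensor with extents ne (dimension 0 is fastest).
inline Strides4 contiguous_strides(const std::array<std::size_t, 4> & ne) {
    return {1, ne[0], ne[0] * ne[1], ne[0] * ne[1] * ne[2]};
}

// Strides of the same data viewed with dimensions 0 and 1 swapped.
inline Strides4 swap_dims01(const Strides4 & nb) {
    return {nb[1], nb[0], nb[2], nb[3]};
}
```

With extents `{4, 3, 2, 1}`, position `(1, 2, 1, 0)` in the contiguous layout and position `(2, 1, 1, 0)` in the swapped view resolve to the same element, which an index formula that hard-codes a unit stride for dimension 0 would not get right.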

jeffbolznv · 2026-02-09T15:03:46Z

ggml/src/ggml-vulkan/vulkan-shaders/acc.comp

+    const uint i0 = rem1 % p.nb01;

    uint i00, i01, i02, i03;
    get_indices(idx, i00, i01, i02, i03);


Looking at the CPU reference, seems like there should only be one set of indices and then you can apply offset/sizeof(float) to the final index for src0 and dst.
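The "one set of indices" idea can be sketched in plain C++. This is a simplified illustration, not the actual ggml CPU code: `acc_rows` and its parameter names are hypothetical, src1 is assumed contiguous, and the view strides and offset are assumed to be byte values that are multiples of `sizeof(float)`. Only the indices of src1 are iterated; the same `(i1, i2, i3)`, combined with the view strides and the offset and divided by `sizeof(float)`, locates the matching element in dst.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch: dst = a, then b is added into a strided view of dst.
// ne1[4]       : extents of b (b is assumed contiguous)
// nb1, nb2, nb3: byte strides of the view inside a/dst
// offset       : byte offset of the view's first element
std::vector<float> acc_rows(const std::vector<float> & a,
                            const std::vector<float> & b,
                            const std::size_t ne1[4],
                            std::size_t nb1, std::size_t nb2, std::size_t nb3,
                            std::size_t offset) {
    std::vector<float> dst = a; // elements outside the view keep a's values
    const std::size_t fs = sizeof(float);
    for (std::size_t i3 = 0; i3 < ne1[3]; ++i3)
    for (std::size_t i2 = 0; i2 < ne1[2]; ++i2)
    for (std::size_t i1 = 0; i1 < ne1[1]; ++i1) {
        // one index set: the same (i1, i2, i3) addresses b and the view
        const std::size_t row_b = ((i3 * ne1[2] + i2) * ne1[1] + i1) * ne1[0];
        const std::size_t row_d = (offset + i1 * nb1 + i2 * nb2 + i3 * nb3) / fs;
        for (std::size_t i0 = 0; i0 < ne1[0]; ++i0) {
            dst[row_d + i0] += b[row_b + i0];
        }
    }
    return dst;
}
```

For a 4x3 tensor `a` and a 2x2 tensor `b`, a call such as `acc_rows(a, b, ne1, 16, 48, 48, 20)` would add `b` into the 2x2 block of `a` that starts at element 5, assuming 4-byte floats.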

jeffbolznv · 2026-02-09T15:07:28Z

ggml/src/ggml-vulkan/ggml-vulkan.cpp

+    int offset = dst->op_params[3] / src0_type_size; // offset in bytes

    ggml_vk_op_f32<vk_op_binary_push_constants>(ctx, subctx, src0, src1, nullptr, nullptr, dst, GGML_OP_ACC, {
        (uint32_t)ggml_nelements(src0),


I think this should be src1?

jeffbolznv · 2026-02-10

My suggestion was to change line 9810, not 9807. I think this change is needed to make it possible to remove the bounds check.

ymcki · 2026-02-10

I tried changing 9810 but not 9807 and removing the boundary check; it still fails.

Changing 9810 but not 9807 and keeping the boundary check also fails.

ymcki · 2026-02-10T02:14:12Z

The latest commit should address Jeff's suggestions, except for the boundary check. Without the boundary check, the code simply doesn't pass the tests.

ymcki · 2026-02-11T10:04:27Z

Committed a version without the check that passes my tests, and measured its performance by adding my test cases to test_cases_perf:

  ACC(type=f32,ne_a=[256,17,1,1],ne_b=[256,16,1,1],stride_dim=-1):             73710 runs -    14.36 us/run -       50 kB/run -    3.32 GB/s
  ACC(type=f32,ne_a=[256,17,2,3],ne_b=[256,16,2,3],stride_dim=-1):             81900 runs -    12.65 us/run -      300 kB/run -   22.61 GB/s
  ACC(type=f32,ne_a=[256,17,2,3],ne_b=[128,16,2,3],stride_dim=-1):             81900 runs -    12.72 us/run -      252 kB/run -   18.89 GB/s
  ACC(type=f32,ne_a=[256,17,2,3],ne_b=[256,16,2,3],stride_dim=1):              81890 runs -    12.64 us/run -      305 kB/run -   23.01 GB/s
  ACC(type=f32,ne_a=[256,17,2,3],ne_b=[128,16,2,3],stride_dim=2):              81890 runs -    12.67 us/run -      268 kB/run -   20.18 GB/s
  ACC(type=f32,ne_a=[256,17,2,3],ne_b=[64,16,2,3],stride_dim=3):               81890 runs -    12.75 us/run -      228 kB/run -   17.05 GB/s

However, it seems to be significantly slower than the previous code:

  ACC(type=f32,ne_a=[256,17,1,1],ne_b=[256,16,1,1],stride_dim=-1):            114660 runs -     9.21 us/run -       50 kB/run -    5.18 GB/s
  ACC(type=f32,ne_a=[256,17,2,3],ne_b=[256,16,2,3],stride_dim=-1):            122850 runs -     8.44 us/run -      300 kB/run -   33.89 GB/s
  ACC(type=f32,ne_a=[256,17,2,3],ne_b=[128,16,2,3],stride_dim=-1):            122850 runs -     8.53 us/run -      252 kB/run -   28.19 GB/s
  ACC(type=f32,ne_a=[256,17,2,3],ne_b=[256,16,2,3],stride_dim=1):             122835 runs -     8.54 us/run -      305 kB/run -   34.05 GB/s
  ACC(type=f32,ne_a=[256,17,2,3],ne_b=[128,16,2,3],stride_dim=2):             122835 runs -     8.56 us/run -      268 kB/run -   29.88 GB/s
  ACC(type=f32,ne_a=[256,17,2,3],ne_b=[64,16,2,3],stride_dim=3):              122835 runs -     8.40 us/run -      228 kB/run -   25.88 GB/s
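To put a number on the difference, the per-run slowdown can be worked out directly from the two listings. The sketch below copies the us/run figures from this comment and divides them case by case; `slowdown` and the two array names are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// us/run figures copied from the two benchmark listings, in the same order.
constexpr std::array<double, 6> no_check_us   = {14.36, 12.65, 12.72, 12.64, 12.67, 12.75};
constexpr std::array<double, 6> with_check_us = { 9.21,  8.44,  8.53,  8.54,  8.56,  8.40};

// Hypothetical helper: how many times slower the no-check version is for case i.
inline double slowdown(std::size_t i) {
    return no_check_us[i] / with_check_us[i];
}
```

Across the six cases the ratio falls between roughly 1.48 and 1.56, so the version without the check takes about one and a half times as long per run.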

I think it is better to revert to the previous boundary check version unless Jeff knows how to speed it up.

jeffbolznv · 2026-02-11T16:08:05Z

Sorry, yes, it's better to do it in a single shader and have the branch. I somehow missed that there was an else case and that it was an "inside vs outside" check rather than just an out of bounds check.
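The "inside vs outside" distinction can be shown with a small C++ sketch in which a loop iteration stands in for one shader invocation per output element. This is an illustration under assumptions, not the contents of `acc.comp`: `acc_per_element` is a hypothetical name, strides and offset are taken in elements, src1 is assumed contiguous, and the view strides are assumed to satisfy `s3 >= s2 >= s1 >= 1` so that the decomposition is unique. Each output element either lies inside the region covered by src1 (add) or outside it (copy src0), so the branch picks between two valid results rather than guarding an out-of-bounds access.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch; all strides and the offset are in elements.
// ne_b[4]   : extents of b (assumed contiguous)
// s1, s2, s3: strides of the view inside a/dst for dimensions 1..3
std::vector<float> acc_per_element(const std::vector<float> & a,
                                   const std::vector<float> & b,
                                   const std::size_t ne_b[4],
                                   std::size_t s1, std::size_t s2, std::size_t s3,
                                   std::size_t offset) {
    std::vector<float> dst(a.size());
    for (std::size_t idx = 0; idx < a.size(); ++idx) { // one "thread" per element
        bool inside = idx >= offset;
        std::size_t i0 = 0, i1 = 0, i2 = 0, i3 = 0;
        if (inside) {
            // decompose the position relative to the view's origin
            std::size_t rem = idx - offset;
            i3 = rem / s3; rem %= s3;
            i2 = rem / s2; rem %= s2;
            i1 = rem / s1;
            i0 = rem % s1;
            inside = i0 < ne_b[0] && i1 < ne_b[1] && i2 < ne_b[2] && i3 < ne_b[3];
        }
        if (inside) {
            const std::size_t ib = ((i3 * ne_b[2] + i2) * ne_b[1] + i1) * ne_b[0] + i0;
            dst[idx] = a[idx] + b[ib]; // inside the view: accumulate
        } else {
            dst[idx] = a[idx];         // outside the view: plain copy
        }
    }
    return dst;
}
```

Splitting this into a copy pass plus an add pass would remove the branch but touch the destination twice, which fits the observation in this thread that the single-shader version with the branch was the faster one.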

ymcki · 2026-02-12T02:47:09Z

Reverted to the boundary check version, but kept the test cases added to perf.

0cc4m

LGTM

* fix vulkan ggml_acc only works in 3d but not 4d
* removed clamp in test_acc_block
* use the correct stride and its test case
* cuda : fix "supports op" condition
* change src0 to src1 in ggml_vk_acc. Update acc.comp with jeffbolznv's suggestion except to keep the boundary check
* version without boundary check
* revert back to boundary check version

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
(cherry picked from commit 0e21991)

fix vulkan ggml_acc only works in 3d but not 4d

d4e0103

ymcki requested review from 0cc4m and ggerganov as code owners February 8, 2026 08:23

ymcki mentioned this pull request Feb 8, 2026

Unified delta net handling for Qwen3Next and Kimi Linear models #18792

Closed

ggerganov reviewed Feb 8, 2026

tests/test-backend-ops.cpp (outdated, resolved)

ggerganov mentioned this pull request Feb 8, 2026

metal : fix ACC op #19427

Merged

removed clamp in test_acc_block

3679d22

github-actions bot added the testing, Vulkan, and ggml labels Feb 8, 2026

jeffbolznv reviewed Feb 8, 2026

ggml/src/ggml-vulkan/vulkan-shaders/acc.comp (outdated, resolved)

ggml/src/ggml-vulkan/vulkan-shaders/acc.comp (outdated, resolved)

tests/test-backend-ops.cpp (outdated, resolved)

use the correct stride and its test case

635a72a

cuda : fix "supports op" condition

8045d54

jeffbolznv reviewed Feb 9, 2026

github-actions bot added the Nvidia GPU label Feb 9, 2026

ymcki and others added 3 commits February 10, 2026 10:04

change src0 to src1 in ggml_vk_acc. Update acc.comp with jeffbolznv's suggestion except to keep the boundary check

f9f59b0

Merge branch 'ggml-org:master' into vulkan_acc

570b717

Merge branch 'vulkan_acc' of github.com:ymcki/llama.cpp into vulkan_acc

a3ab50c

sync to latest

version without boundary check

81ddff9

revert back to boundary check version

120a311

Merge branch 'ggml-org:master' into vulkan_acc

9ee0263

jeffbolznv approved these changes Feb 12, 2026

jeffbolznv mentioned this pull request Feb 13, 2026

vulkan: support GGML_OP_SET #19584

Merged

ggerganov approved these changes Feb 13, 2026

0cc4m approved these changes Feb 13, 2026

0cc4m merged commit 0e21991 into ggml-org:master Feb 13, 2026
71 of 78 checks passed

ymcki deleted the vulkan_acc branch February 13, 2026 13:44
