Skip to content

fix vulkan ggml_acc only works in 3d but not 4d#19426

Merged
0cc4m merged 10 commits intoggml-org:masterfrom
ymcki:vulkan_acc
Feb 13, 2026
Merged

fix vulkan ggml_acc only works in 3d but not 4d#19426
0cc4m merged 10 commits intoggml-org:masterfrom
ymcki:vulkan_acc

Conversation

@ymcki
Copy link
Contributor

@ymcki ymcki commented Feb 8, 2026

Make sure to read the contributing guidelines before submitting a PR

Discovered ggml_acc for vulkan only works in 3d not 4d while working on
#18792

So fixed it and also added test case to test-backend-ops

@ggerganov ggerganov mentioned this pull request Feb 8, 2026
@github-actions github-actions bot added testing Everything test related Vulkan Issues specific to the Vulkan backend ggml changes relating to the ggml tensor library for machine learning labels Feb 8, 2026
@ymcki
Copy link
Contributor Author

ymcki commented Feb 9, 2026

The latest commit seems to fix the potential problems pointed out by Jeff

@ggerganov
Copy link
Member

The metal failures will be resolved with #19427 after merging this PR.

@ggerganov
Copy link
Member

Added TODO for the CUDA implementation to support this

if (ox < p.ne10 && oy < p.ne11 && oz < p.ne12) {
data_d[get_doffset() + dst_idx(i00, i01, i02, i03)] = D_TYPE(FLOAT_TYPE(data_a[get_aoffset() + src0_idx(i00, i01, i02, i03)]) + FLOAT_TYPE(data_b[get_boffset() + ox + oy * p.ne10 + oz * p.ne10 * p.ne11]));
if (i0 < p.ne10 && i1 < p.ne11 && i2 < p.ne12 && i3 < p.ne13) {
data_d[get_doffset() + dst_idx(i00, i01, i02, i03)] = D_TYPE(FLOAT_TYPE(data_a[get_aoffset() + src0_idx(i00, i01, i02, i03)]) + FLOAT_TYPE(data_b[get_boffset() + i0 + i1 * p.nb11 + i2 * p.nb12 + i3 * p.nb13]));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could use src1_idx here and it would also handle permuted tensors (by applying nb10). Would be nice to test permuted tensors, too.

const uint i0 = rem1 % p.nb01;

uint i00, i01, i02, i03;
get_indices(idx, i00, i01, i02, i03);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at the CPU reference, seems like there should only be one set of indices and then you can apply offset/sizeof(float) to the final index for src0 and dst.

int offset = dst->op_params[3] / src0_type_size; // offset in bytes

ggml_vk_op_f32<vk_op_binary_push_constants>(ctx, subctx, src0, src1, nullptr, nullptr, dst, GGML_OP_ACC, {
(uint32_t)ggml_nelements(src0),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be src1?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My suggestion was to change line 9810, not 9807. I think this change is needed to make it possible to remove the bound check.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just tried changing 9810 but not 9807 and remove boundary check, it still fails.

Changing 9810 not 9807 and keep boundary check, it also fails.

@github-actions github-actions bot added the Nvidia GPU Issues specific to Nvidia GPUs label Feb 9, 2026
@ymcki
Copy link
Contributor Author

ymcki commented Feb 10, 2026

Latest commit should address Jeff's suggestions except for the boundary check. Without the boundary check, the code simply doesn't pass the tests.

@ymcki
Copy link
Contributor Author

ymcki commented Feb 11, 2026

Committed a version without the check that passes my tests and measured the performance by adding my test cases to test_cases_perf

  ACC(type=f32,ne_a=[256,17,1,1],ne_b=[256,16,1,1],stride_dim=-1):             73710 runs -    14.36 us/run -       50 kB/run -    3.32 GB/s
  ACC(type=f32,ne_a=[256,17,2,3],ne_b=[256,16,2,3],stride_dim=-1):             81900 runs -    12.65 us/run -      300 kB/run -   22.61 GB/s
  ACC(type=f32,ne_a=[256,17,2,3],ne_b=[128,16,2,3],stride_dim=-1):             81900 runs -    12.72 us/run -      252 kB/run -   18.89 GB/s
  ACC(type=f32,ne_a=[256,17,2,3],ne_b=[256,16,2,3],stride_dim=1):              81890 runs -    12.64 us/run -      305 kB/run -   23.01 GB/s
  ACC(type=f32,ne_a=[256,17,2,3],ne_b=[128,16,2,3],stride_dim=2):              81890 runs -    12.67 us/run -      268 kB/run -   20.18 GB/s
  ACC(type=f32,ne_a=[256,17,2,3],ne_b=[64,16,2,3],stride_dim=3):               81890 runs -    12.75 us/run -      228 kB/run -   17.05 GB/s

However, it seems to be significantly slower than previous code:

  ACC(type=f32,ne_a=[256,17,1,1],ne_b=[256,16,1,1],stride_dim=-1):            114660 runs -     9.21 us/run -       50 kB/run -    5.18 GB/s
  ACC(type=f32,ne_a=[256,17,2,3],ne_b=[256,16,2,3],stride_dim=-1):            122850 runs -     8.44 us/run -      300 kB/run -   33.89 GB/s
  ACC(type=f32,ne_a=[256,17,2,3],ne_b=[128,16,2,3],stride_dim=-1):            122850 runs -     8.53 us/run -      252 kB/run -   28.19 GB/s
  ACC(type=f32,ne_a=[256,17,2,3],ne_b=[256,16,2,3],stride_dim=1):             122835 runs -     8.54 us/run -      305 kB/run -   34.05 GB/s
  ACC(type=f32,ne_a=[256,17,2,3],ne_b=[128,16,2,3],stride_dim=2):             122835 runs -     8.56 us/run -      268 kB/run -   29.88 GB/s
  ACC(type=f32,ne_a=[256,17,2,3],ne_b=[64,16,2,3],stride_dim=3):              122835 runs -     8.40 us/run -      228 kB/run -   25.88 GB/s

I think it is better to revert to the previous boundary check version unless Jeff knows how to speed it up.

@jeffbolznv
Copy link
Collaborator

Sorry, yes, it's better to do it in a single shader and have the branch. I somehow missed that there was an else case and that it was an "inside vs outside" check rather than just an out of bounds check.

@ymcki
Copy link
Contributor Author

ymcki commented Feb 12, 2026

Back to boundary check version but keep the test cases added in perf.

Copy link
Collaborator

@0cc4m 0cc4m left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@0cc4m 0cc4m merged commit 0e21991 into ggml-org:master Feb 13, 2026
71 of 78 checks passed
@ymcki ymcki deleted the vulkan_acc branch February 13, 2026 13:44
ronaldmannak pushed a commit to PicoMLX/llama.cpp that referenced this pull request Feb 16, 2026
* fix vulkan ggml_acc only works in 3d but not 4d

* removed clamp in test_acc_block

* use the correct stride and its test case

* cuda : fix "supports op" condition

* change src0 to src1 in ggml_vk_acc. Update acc.comp with jeffbolznv\'s suggestion except to keep the boundary check

* version without boundary check

* revert back to boundary check version

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
(cherry picked from commit 0e21991)
liparetejas pushed a commit to liparetejas/llama.cpp that referenced this pull request Feb 23, 2026
* fix vulkan ggml_acc only works in 3d but not 4d

* removed clamp in test_acc_block

* use the correct stride and its test case

* cuda : fix "supports op" condition

* change src0 to src1 in ggml_vk_acc. Update acc.comp with jeffbolznv\'s suggestion except to keep the boundary check

* version without boundary check

* revert back to boundary check version

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
bartowski1182 pushed a commit to bartowski1182/llama.cpp that referenced this pull request Mar 2, 2026
* fix vulkan ggml_acc only works in 3d but not 4d

* removed clamp in test_acc_block

* use the correct stride and its test case

* cuda : fix "supports op" condition

* change src0 to src1 in ggml_vk_acc. Update acc.comp with jeffbolznv\'s suggestion except to keep the boundary check

* version without boundary check

* revert back to boundary check version

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
ArberSephirotheca pushed a commit to ArberSephirotheca/llama.cpp that referenced this pull request Mar 3, 2026
* fix vulkan ggml_acc only works in 3d but not 4d

* removed clamp in test_acc_block

* use the correct stride and its test case

* cuda : fix "supports op" condition

* change src0 to src1 in ggml_vk_acc. Update acc.comp with jeffbolznv\'s suggestion except to keep the boundary check

* version without boundary check

* revert back to boundary check version

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ggml changes relating to the ggml tensor library for machine learning Nvidia GPU Issues specific to Nvidia GPUs testing Everything test related Vulkan Issues specific to the Vulkan backend

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants