Misc. bug: Exceeding Vulkan driver limits for shared memory #22690

### Name and Version

e77056f9b25c ("CUDA: use fastdiv for batch index split in get_rows (#22650)")

### Operating systems

_No response_

### Which llama.cpp modules do you know to be affected?

_No response_

### Command line

```shell
/home/anholt/src/llama.cpp/build-aarch64/bin/llama-bench -ngl 99 -m /home/anholt/.cache/llama.cpp/ggml-org_Nomic-Embed-Text-V2-GGUF_nomic-embed-text-v2-moe-q8_0.gguf
```

### Problem description & steps to reproduce

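On turnip, once I set `integerDotProduct4x8BitPackedSignedAccelerated` to enable int dot support, I get a Vulkan validation error: the `matmul_q8_0_q8_1_l` pipeline uses 35840 bytes of shared memory, which exceeds the driver's `maxComputeSharedMemorySize` of 32768. The driver then fails to compile that shader. A gdb excerpt is included below.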

### First Bad Commit

_No response_

### Relevant log output

<details>
<summary>Logs</summary>


```console

(gdb) r
Starting program: /home/anholt/src/llama.cpp/build-aarch64/bin/llama-bench -ngl 99 -m /home/anholt/.cache/llama.cpp/ggml-org_Nomic-Embed-Text-V2-GGUF_nomic-embed-text-v2-moe-q8_0.gguf
[...]
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Adreno X1-85 (turnip Mesa driver) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 128 | shared memory: 32768 | int dot: 1 | matrix cores: none
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
[...]
Validation Error: [ VUID-RuntimeSpirv-Workgroup-06530 ] | MessageID = 0xac32b098
vkCreateComputePipelines(): pCreateInfos[0].stage SPIR-V uses 35840 bytes of shared memory, which is more than maxComputeSharedMemorySize (32768).
The Vulkan spec states: The sum of size in bytes for variables and padding in the Workgroup Storage Class in the GLCompute Execution Model must be less than or equal to maxComputeSharedMemorySize (https://docs.vulkan.org/spec/latest/appendices/spirvenv.html#VUID-RuntimeSpirv-Workgroup-06530)
Objects: 1
    [0] VkShaderModule 0x310000000031

TypeToDescriptorTypeSet: Starting with type id 798 opcode 32, dtid 0, trid 798
TypeToDescriptorTypeSet: Starting with type id 168 opcode 32, dtid 0, trid 168
TypeToDescriptorTypeSet: Starting with type id 53 opcode 32, dtid 0, trid 53
MESA: error: Compute shader ((null)) which has workgroup barrier cannot be used because it's impossible to have enough (2) concurrent waves (0 due to shared, 64 due to branchstack).
[Switching to Thread 0xffffecfeb4a0 (LWP 22379)]

Thread 11 "llama-bench" hit Breakpoint 1, ir3_get_reg_independent_max_waves (v=v@entry=0xffffd83a5260, 
    double_threadsize=double_threadsize@entry=false) at ../src/freedreno/ir3/ir3.c:300
300	         exit(1);
(gdb) bt
#0  ir3_get_reg_independent_max_waves (v=v@entry=0xffffd83a5260, double_threadsize=double_threadsize@entry=false)
    at ../src/freedreno/ir3/ir3.c:300
#1  0x0000fffff11da2dc in calc_target_full_pressure (v=0xffffd83a5260, pressure=<optimized out>)
    at ../src/freedreno/ir3/ir3_ra.c:2537
#2  ir3_ra (v=v@entry=0xffffd83a5260) at ../src/freedreno/ir3/ir3_ra.c:2875
#3  0x0000fffff1189ed0 in ir3_compile_shader_nir (compiler=<optimized out>, shader=shader@entry=0xffffd80c59e0, 
    so=so@entry=0xffffd83a5260) at ../src/freedreno/ir3/ir3_compiler_nir.c:6111
#4  0x0000fffff11e5c24 in compile_variant (shader=shader@entry=0xffffd80c59e0, v=v@entry=0xffffd83a5260)
    at ../src/freedreno/ir3/ir3_shader.c:453
#5  0x0000fffff11e6078 in create_variant (shader=0xffffd80c59e0, key=0xffffecfea360, write_disasm=<optimized out>, 
    mem_ctx=0x0) at ../src/freedreno/ir3/ir3_shader.c:629
#6  0x0000fffff10c149c in tu_shader_create (dev=dev@entry=0xaaaab2b6b3c0, shader_out=shader_out@entry=0xffffecfea300, 
    nir=<optimized out>, key=key@entry=0xffffecfea320, info=info@entry=0xffffecfea2f8, ir3_key=<optimized out>, 
    key_data=key_data@entry=0xffffecfea340, key_size=key_size@entry=32, layout=<optimized out>, 
    layout@entry=0xffffd8003fa0, executable_info=<optimized out>, executable_info@entry=false)
    at ../src/freedreno/vulkan/tu_shader.cc:3152
#7  0x0000fffff1078ccc in tu_compute_pipeline_create<(chip)7> (device=0xaaaab2b6b3c0, pipelineCache=0x0, 
    pCreateInfo=0xffffd80b35f8, flags=64, pAllocator=0x0, pPipeline=0xffffecfea900)
    at ../src/freedreno/vulkan/tu_pipeline.cc:4976
#8  tu_CreateComputePipelines<(chip)7> (device=0xaaaab2b6b3c0, pipelineCache=<optimized out>, count=<optimized out>, 
    pCreateInfos=<optimized out>, pAllocator=<optimized out>, pPipelines=<optimized out>)
    at ../src/freedreno/vulkan/tu_pipeline.cc:5058
#9  0x0000ffffef66da2c in vvl::dispatch::Device::CreateComputePipelines (this=0xaaaab2a73e80, device=<optimized out>, 
    pipelineCache=<optimized out>, createInfoCount=1, pCreateInfos=0xffffecfea930, pAllocator=0x0, 
    pPipelines=<optimized out>)
    at /home/anholt/src/Vulkan-ValidationLayers/layers/chassis/dispatch_object_manual.cpp:2310
#10 0x0000ffffef661070 in vulkan_layer_chassis::CreateComputePipelines (device=0xaaaab2b6b3c0, pipelineCache=0x0, 
    createInfoCount=1, pCreateInfos=0xffffecfea930, pAllocator=0x0, pPipelines=0xffffecfea900)
    at /home/anholt/src/Vulkan-ValidationLayers/layers/chassis/chassis_manual.cpp:617
#11 0x0000fffff32697a0 in vk::Device::createComputePipeline<vk::detail::DispatchLoaderDynamic, true> (
    this=0xaaaab2a3b758, pipelineCache=..., createInfo=..., allocator=..., d=...)
    at /usr/include/vulkan/vulkan_funcs.hpp:3830
#12 ggml_vk_create_pipeline_func (device=std::shared_ptr<vk_device_struct> (use count 10, weak count 2) = {...}, 
    pipeline=std::shared_ptr<vk_pipeline_struct> (use count 2, weak count 0) = {...}, spv_size=<optimized out>, 
    spv_data=<optimized out>, entrypoint="main", parameter_count=<optimized out>, wg_denoms=..., 
    specialization_constants=..., disable_robustness=<optimized out>, require_full_subgroups=<optimized out>, 
    required_subgroup_size=<optimized out>) at /home/anholt/src/llama.cpp/ggml/src/ggml-vulkan/ggml-vulkan.cpp:2284
[...]
(gdb) frame 12
list
#12 ggml_vk_create_pipeline_func (device=std::shared_ptr<vk_device_struct> (use count 10, weak count 2) = {...}, 
    pipeline=std::shared_ptr<vk_pipeline_struct> (use count 2, weak count 0) = {...}, spv_size=<optimized out>, 
    spv_data=<optimized out>, entrypoint="main", parameter_count=<optimized out>, wg_denoms=..., 
    specialization_constants=..., disable_robustness=<optimized out>, require_full_subgroups=<optimized out>, 
    required_subgroup_size=<optimized out>) at /home/anholt/src/llama.cpp/ggml/src/ggml-vulkan/ggml-vulkan.cpp:2284
2284	        pipeline->pipeline = device->device.createComputePipeline(VK_NULL_HANDLE, compute_pipeline_create_info).value;
(gdb) p pipeline->name
$1 = "matmul_q8_0_q8_1_l"

```
</details>
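
To illustrate the kind of guard that would avoid this, here is a minimal sketch of choosing a shader variant by its shared memory footprint. It is not taken from `ggml-vulkan.cpp`: `ShaderVariant` and `pick_variant` are hypothetical names, and only the 35840 and 32768 byte figures come from the log above. The idea is to compare each variant's Workgroup storage against the device's `maxComputeSharedMemorySize` before creating the pipeline, and fall back to a smaller tile when the large one does not fit.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical description of one matmul shader variant. The struct and
// its fields are illustrative and do not mirror ggml-vulkan.cpp.
struct ShaderVariant {
    const char *name;
    uint32_t    shared_mem_bytes; // Workgroup storage declared by the SPIR-V
};

// Return the first variant whose shared memory fits within the device
// limit, or nullptr if none fits. Callers are assumed to pass the
// variants ordered from largest tile to smallest.
inline const ShaderVariant *pick_variant(const std::vector<ShaderVariant> &variants,
                                         uint32_t max_compute_shared_memory_size) {
    for (const ShaderVariant &v : variants) {
        if (v.shared_mem_bytes <= max_compute_shared_memory_size) {
            return &v;
        }
    }
    return nullptr;
}
```

With the values from the log, a 35840-byte large variant would be skipped on a device reporting 32768 bytes, and the next smaller variant (if one exists and fits) would be used instead. The sizes of any smaller variants are assumptions here, since the report only gives the figure for `matmul_q8_0_q8_1_l`.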
