In file included from program_source:4:
In file included from /System/Library/Frameworks/MetalPerformancePrimitives.framework/Headers/MetalPerformancePrimitives.h:10:
In file included from /System/Library/Frameworks/MetalPerformancePrimitives.framework/Headers/MPPTensorOpsMatMul2d.h:389:
/System/Library/Frameworks/MetalPerformancePrimitives.framework/Headers/__impl/MPPTensorOpsMatMul2dImpl.h:3765:5: error: static_assert failed due to requirement '(mpp::tensor_ops::matmul2d_descriptor{8, 8, -1, false, false, false, 0}.m % 16) == 0 || (mpp::tensor_ops::matmul2d_descriptor{8, 8, -1, false, false, false, 0}.n % 16) == 0' "At least one of M or N must be a multiple of 16"
static_assert((descriptor.m % 16) == 0 || (descriptor.n % 16) == 0, "At least one of M or N must be a multiple of 16");
^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/System/Library/Frameworks/MetalPerformancePrimitives.framework/Headers/MPPTensorOpsMatMul2d.h:415:24: note: in instantiation of function template specialization 'mpp::tensor_ops::__mutmul2d_detail::__run<{8, 8, -1, false, false, false, 0}, metal::execution_simdgroups<4>, metal::tensor<device half, metal::extents<int, 18446744073709551615, 18446744073709551615>, metal::tensor_handle, metal::tensor_offset>, metal::tensor<device half, metal::extents<int, 18446744073709551615, 18446744073709551615>, metal::tensor_handle, metal::tensor_offset>, metal::cooperative_tensor<float, metal::extents<int, 18446744073709551615, 18446744073709551615>, mpp::tensor_ops::__mutmul2d_detail::__operand_layout<{8, 8, -1, false, false, false, 0}, mpp::tensor_ops::__mutmul2d_detail::__matmul2d_cooperative_operand_index::destination, metal::execution_simdgroups<4>, half, half, float, int>>>' requested here
__mutmul2d_detail::__run<Descriptor, Scope, LeftOperandType,
^
program_source:26:8: note: in instantiation of function template specialization 'mpp::tensor_ops::matmul2d<{8, 8, -1, false, false, false, 0}, metal::execution_simdgroups<4>>::run<metal::tensor<device half, metal::extents<int, 18446744073709551615, 18446744073709551615>, metal::tensor_handle, metal::tensor_offset>, metal::tensor<device half, metal::extents<int, 18446744073709551615, 18446744073709551615>, metal::tensor_handle, metal::tensor_offset>, metal::cooperative_tensor<float, metal::extents<int, 18446744073709551615, 18446744073709551615>, mpp::tensor_ops::__mutmul2d_detail::__operand_layout<{8, 8, -1, false, false, false, 0}, mpp::tensor_ops::__mutmul2d_detail::__matmul2d_cooperative_operand_index::destination, metal::execution_simdgroups<4>, half, half, float, int>>, void>' requested here
mm.run(sB, sA, cT);
^
Name and Version
$ ./build/bin/llama-cli --version
ggml_metal_device_init: testing tensor API for f16 support
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'dummy_kernel', name = 'dummy_kernel'
ggml_metal_library_compile_pipeline: loaded dummy_kernel 0x1042e1ca0 | th_max = 1024 | th_width = 32
ggml_metal_device_init: testing tensor API for bfloat support
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'dummy_kernel', name = 'dummy_kernel'
ggml_metal_library_compile_pipeline: loaded dummy_kernel 0x1042e34a0 | th_max = 1024 | th_width = 32
ggml_metal_library_init: using embedded metal library
do we have tensor: 1ggml_metal_library_init: turbo3 sparse V dequant enabled
ggml_metal_library_init: loaded in 0.011 sec
ggml_metal_rsets_init: creating a residency set collection (keep_alive = 180 s)
ggml_metal_device_init: GPU name: MTL0
ggml_metal_device_init: GPU family: MTLGPUFamilyApple10 (1010)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal4 (5002)
ggml_metal_device_init: simdgroup reduction = true
ggml_metal_device_init: simdgroup matrix mul. = true
ggml_metal_device_init: has unified memory = true
ggml_metal_device_init: has bfloat = true
ggml_metal_device_init: has tensor = true
ggml_metal_device_init: use residency sets = true
ggml_metal_device_init: use shared buffers = true
ggml_metal_device_init: recommendedMaxWorkingSetSize = 26800.60 MB
version: 8648 (3380d3c)
built with AppleClang 21.0.0.21000099 for Darwin arm64
Operating systems
Mac
Which llama.cpp modules do you know to be affected?
llama-server
Command line
Problem description & steps to reproduce
I was testing turbo3 on my M5 MBA (32GB, macOS 26.4) and noticed from the llama logs that the tensor API wasn't being enabled.
I enabled verbose output and saw a static_assert failure in Apple's MetalPerformancePrimitives framework that caused the dummy kernel compile to fail with "At least one of M or N must be a multiple of 16".
I ran the code through Mistral Vibe, and it suggested changing the matmul2d descriptor in the dummy kernel at
llama-cpp-turboquant/ggml/src/ggml-metal/ggml-metal-device.m
Line 732 in 43f7d3d
to
matmul2d_descriptor(16, 16, dynamic_extent)
Recompiling llama with that change got tensor working for me. I then did some performance testing with Claude and found that prefill (prompt tok/s) speed benefitted greatly from the tensor API on my machine. Here are the results I got from pointing Claude Code at the local llama server and prompting it with "hej".
One interesting thing we noted: under turbo3 with tensor, the Swedish "hello" prompt caused Qwen to spiral and either hallucinate a summarization task I hadn't asked for or switch to Chinese and start giving me a lesson on Chinese... in Chinese... which led to the 5.9k generated tokens. These issues went away when I ran either turbo3 without tensor or turbo4 with tensor. I gave the results to Claude and it chalked them up to possibly being an artifact of the +1.06% perplexity causing knock-on effects when paired with Claude Code's 20k+ token system prompt.
I thought this was an interesting fix, but I honestly have no idea whether the fix (workaround?) that Mistral gave me is even close to correct, so I thought I'd file an issue to see what people think. I've included some of my additional log output in case it helps.
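For anyone who wants to see the rule in isolation, here is a minimal plain C++ sketch of the check as I read it from the error message. The `Descriptor` struct and `satisfies_tile_rule` function are hypothetical stand-ins I made up for illustration; they are not the real `mpp::tensor_ops::matmul2d_descriptor` or anything from Apple's headers. Only the condition itself is copied from the quoted static_assert.

```cpp
// Hypothetical stand-in for the descriptor; only m, n, and k matter here.
// This is NOT Apple's matmul2d_descriptor type.
struct Descriptor {
    int m;
    int n;
    int k;  // -1 is how dynamic_extent shows up in the compiler error
};

// Same condition as the static_assert quoted in the error:
//   (descriptor.m % 16) == 0 || (descriptor.n % 16) == 0
constexpr bool satisfies_tile_rule(Descriptor d) {
    return (d.m % 16) == 0 || (d.n % 16) == 0;
}

// The descriptor from the failing dummy kernel: neither 8 is a multiple of 16.
static_assert(!satisfies_tile_rule({8, 8, -1}), "8x8 should be rejected");

// The descriptor from the suggested change.
static_assert(satisfies_tile_rule({16, 16, -1}), "16x16 should be accepted");

// Only one of the two dimensions has to be a multiple of 16.
static_assert(satisfies_tile_rule({8, 16, -1}), "8x16 should be accepted");
```

If this reading is right, the 16x16 change satisfies the assert, and by the same condition 8x16 or 16x8 would too. Whether any of those is the right shape for the dummy kernel's capability probe is exactly what I'm unsure about.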
First Bad Commit
EDIT: GitLens traces the kernel code back to a commit in the original llama project, 5b180c3
Relevant log output
Verbose Log Snippet
Output from enabling verbose metal print