
Misc. bug: Tensor not enabled on M5 Macbook Air #26

@mysteryman612

Name and Version

$ ./build/bin/llama-cli --version
ggml_metal_device_init: testing tensor API for f16 support
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'dummy_kernel', name = 'dummy_kernel'
ggml_metal_library_compile_pipeline: loaded dummy_kernel 0x1042e1ca0 | th_max = 1024 | th_width = 32
ggml_metal_device_init: testing tensor API for bfloat support
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'dummy_kernel', name = 'dummy_kernel'
ggml_metal_library_compile_pipeline: loaded dummy_kernel 0x1042e34a0 | th_max = 1024 | th_width = 32
ggml_metal_library_init: using embedded metal library
do we have tensor: 1ggml_metal_library_init: turbo3 sparse V dequant enabled
ggml_metal_library_init: loaded in 0.011 sec
ggml_metal_rsets_init: creating a residency set collection (keep_alive = 180 s)
ggml_metal_device_init: GPU name: MTL0
ggml_metal_device_init: GPU family: MTLGPUFamilyApple10 (1010)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal4 (5002)
ggml_metal_device_init: simdgroup reduction = true
ggml_metal_device_init: simdgroup matrix mul. = true
ggml_metal_device_init: has unified memory = true
ggml_metal_device_init: has bfloat = true
ggml_metal_device_init: has tensor = true
ggml_metal_device_init: use residency sets = true
ggml_metal_device_init: use shared buffers = true
ggml_metal_device_init: recommendedMaxWorkingSetSize = 26800.60 MB
version: 8648 (3380d3c)
built with AppleClang 21.0.0.21000099 for Darwin arm64

Operating systems

Mac

Which llama.cpp modules do you know to be affected?

llama-server

Command line

./build/bin/llama-server \
  -m ~/.lmstudio/models/0xsero/Qwen-3.5-28B-A3B-REAP-Q4_K_M/Qwen-3.5-28B-A3B-REAP-Q4_K_M.gguf \
  --alias "model-turbo" \
  --jinja -ngl 99 -c 262144 -fa on \
  --cache-type-k turbo3 --cache-type-v turbo3 \
  -np 1 --metrics --host 0.0.0.0 --port 8080

Problem description & steps to reproduce

I was testing turbo3 on my M5 MacBook Air (32 GB, macOS 26.4) and found that the tensor API was not being enabled. This showed up in the llama logs:

ggml_metal_device_init: testing tensor API for f16 support
ggml_metal_library_init_from_source: error compiling source
ggml_metal_device_init: - the tensor API is not supported in this environment - disabling

I enabled verbose output and saw that a static_assert in Apple's MetalPerformancePrimitives framework was failing, which caused the dummy kernel compile to fail with:

error: static_assert failed due to requirement '(mpp::tensor_ops::matmul2d_descriptor{8, 8, -1, false, false, false, 0}.m % 16) == 0 || (mpp::tensor_ops::matmul2d_descriptor{8, 8, -1, false, false, false, 0}.n % 16) == 0' "At least one of M or N must be a multiple of 16"
    static_assert((descriptor.m % 16) == 0 || (descriptor.n % 16) == 0, "At least one of M or N must be a multiple of 16");

I ran the code through Mistral Vibe, which suggested changing the matmul2d_descriptor arguments in the dummy kernel source from

" matmul2d_descriptor(8, 8, dynamic_extent), \n"

to matmul2d_descriptor(16, 16, dynamic_extent).
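For anyone checking the change, here is a minimal C++ sketch of the condition quoted in the static_assert message. It is a standalone restatement of that one predicate, not Apple's header code, and the function name `mpp_tile_constraint_holds` is made up for illustration. It only shows why (8, 8) is rejected and (16, 16) is accepted by this particular check; other constraints in the framework may still apply.

```cpp
// Standalone restatement of the predicate from the error message:
// "At least one of M or N must be a multiple of 16".
// Hypothetical helper name; this is not part of MetalPerformancePrimitives.
constexpr bool mpp_tile_constraint_holds(int m, int n) {
    return (m % 16) == 0 || (n % 16) == 0;
}

// The original dummy kernel descriptor (8, 8) fails the predicate.
static_assert(!mpp_tile_constraint_holds(8, 8), "8x8 should be rejected");

// The suggested descriptor (16, 16) satisfies it.
static_assert(mpp_tile_constraint_holds(16, 16), "16x16 should be accepted");

// Going by the message alone, a single dimension of 16 would also pass.
static_assert(mpp_tile_constraint_holds(8, 16), "8x16 should be accepted");
static_assert(mpp_tile_constraint_holds(16, 8), "16x8 should be accepted");
```

If the predicate is the only thing standing in the way, changing just one of the two dimensions might be a smaller fix, but that is untested here.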

Recompiling llama with that change got tensor working for me. I did some performance testing with Claude and found that prefill speed (prompt tok/s) benefited greatly from the tensor API on my machine. Here are the results from pointing Claude Code at the local llama server and prompting it with "hej":

| Metric | Turbo3 + Tensor | Turbo3 No Tensor | Turbo4 + Tensor |
| --- | --- | --- | --- |
| Prefill tok/s | 391.0 | 283.8 | 372.4 |
| Prefill time | 58.3s | 80.3s | 61.2s |
| Gen tok/s | 32.3 | 35.6 | 36.3 |
| Gen tokens | 5,905 | 171 | 1,056 |
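As a sanity check on the figures above, the sketch below multiplies prefill rate by prefill time to estimate the prompt size in each run, and divides the rates to get the relative prefill speedup. The struct and function names are illustrative only; the numbers are copied from the results. All three runs come out at roughly 22.8k prompt tokens, so the prefill rates look comparable, and turbo3 with tensor is about 1.38x the no-tensor rate.

```cpp
// Prefill figures copied from the results above. Names are illustrative.
struct PrefillRun {
    double tok_per_s;  // "Prefill tok/s"
    double seconds;    // "Prefill time"
};

constexpr PrefillRun turbo3_tensor{391.0, 58.3};
constexpr PrefillRun turbo3_no_tensor{283.8, 80.3};
constexpr PrefillRun turbo4_tensor{372.4, 61.2};

// Estimated prompt size: rate multiplied by elapsed time.
constexpr double prompt_tokens(PrefillRun r) {
    return r.tok_per_s * r.seconds;
}

// Ratio of prefill rates between two runs.
constexpr double prefill_speedup(PrefillRun fast, PrefillRun slow) {
    return fast.tok_per_s / slow.tok_per_s;
}

// Each run processed roughly the same prompt (about 22.8k tokens).
static_assert(prompt_tokens(turbo3_tensor) > 22700.0 &&
              prompt_tokens(turbo3_tensor) < 22900.0, "turbo3 + tensor");
static_assert(prompt_tokens(turbo3_no_tensor) > 22700.0 &&
              prompt_tokens(turbo3_no_tensor) < 22900.0, "turbo3 no tensor");
static_assert(prompt_tokens(turbo4_tensor) > 22700.0 &&
              prompt_tokens(turbo4_tensor) < 22900.0, "turbo4 + tensor");
```

The generation rates are harder to compare this way, since the three runs produced very different numbers of tokens.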

One interesting thing we noticed: under turbo3 with tensor, the Swedish "hello" prompt ("hej") caused Qwen to spiral. It either hallucinated a summarization task I hadn't asked for, or switched to Chinese and started giving me a lesson on Chinese... in Chinese... which led to the 5.9k generated tokens. These issues went away when I ran either turbo3 without tensor or turbo4 with tensor. I gave the results to Claude, and it chalked this up to possibly being an artifact of the +1.06% perplexity causing knock-on effects when paired with Claude Code's 20k+ token system prompt.

I thought this was an interesting fix, but I honestly have no idea whether the fix (workaround?) that Mistral gave me is even close to correct, so I thought I'd file an issue to see what people think. I've included some additional log output in case it helps.

First Bad Commit

EDIT: GitLens traces the kernel code back to commit 5b180c3 in the original llama project.

Relevant log output

Verbose Log Snippet

Output from enabling verbose metal print

In file included from program_source:4:
In file included from /System/Library/Frameworks/MetalPerformancePrimitives.framework/Headers/MetalPerformancePrimitives.h:10:
In file included from /System/Library/Frameworks/MetalPerformancePrimitives.framework/Headers/MPPTensorOpsMatMul2d.h:389:
/System/Library/Frameworks/MetalPerformancePrimitives.framework/Headers/__impl/MPPTensorOpsMatMul2dImpl.h:3765:5: error: static_assert failed due to requirement '(mpp::tensor_ops::matmul2d_descriptor{8, 8, -1, false, false, false, 0}.m % 16) == 0 || (mpp::tensor_ops::matmul2d_descriptor{8, 8, -1, false, false, false, 0}.n % 16) == 0' "At least one of M or N must be a multiple of 16"
    static_assert((descriptor.m % 16) == 0 || (descriptor.n % 16) == 0, "At least one of M or N must be a multiple of 16");
    ^             ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/System/Library/Frameworks/MetalPerformancePrimitives.framework/Headers/MPPTensorOpsMatMul2d.h:415:24: note: in instantiation of function template specialization 'mpp::tensor_ops::__mutmul2d_detail::__run<{8, 8, -1, false, false, false, 0}, metal::execution_simdgroups<4>, metal::tensor<device half, metal::extents<int, 18446744073709551615, 18446744073709551615>, metal::tensor_handle, metal::tensor_offset>, metal::tensor<device half, metal::extents<int, 18446744073709551615, 18446744073709551615>, metal::tensor_handle, metal::tensor_offset>, metal::cooperative_tensor<float, metal::extents<int, 18446744073709551615, 18446744073709551615>, mpp::tensor_ops::__mutmul2d_detail::__operand_layout<{8, 8, -1, false, false, false, 0}, mpp::tensor_ops::__mutmul2d_detail::__matmul2d_cooperative_operand_index::destination, metal::execution_simdgroups<4>, half, half, float, int>>>' requested here
    __mutmul2d_detail::__run<Descriptor, Scope, LeftOperandType,
                       ^
program_source:26:8: note: in instantiation of function template specialization 'mpp::tensor_ops::matmul2d<{8, 8, -1, false, false, false, 0}, metal::execution_simdgroups<4>>::run<metal::tensor<device half, metal::extents<int, 18446744073709551615, 18446744073709551615>, metal::tensor_handle, metal::tensor_offset>, metal::tensor<device half, metal::extents<int, 18446744073709551615, 18446744073709551615>, metal::tensor_handle, metal::tensor_offset>, metal::cooperative_tensor<float, metal::extents<int, 18446744073709551615, 18446744073709551615>, mpp::tensor_ops::__mutmul2d_detail::__operand_layout<{8, 8, -1, false, false, false, 0}, mpp::tensor_ops::__mutmul2d_detail::__matmul2d_cooperative_operand_index::destination, metal::execution_simdgroups<4>, half, half, float, int>>, void>' requested here
    mm.run(sB, sA, cT); 
       ^
$ ./build/bin/llama-server \
  -m ~/.lmstudio/models/lmstudio-community/Ministral-3-3B-Instruct-2512-GGUF/Ministral-3-3B-Instruct-2512-Q4_K_M.gguf \
  --alias "model-turbo" \
  --jinja -ngl 99 -c 262144 -fa on \
  --cache-type-k turbo3 --cache-type-v turbo3 \
  -np 1 --metrics --host 0.0.0.0 --port 8080
ggml_metal_device_init: testing tensor API for f16 support
ggml_metal_library_init_from_source: error compiling source
ggml_metal_device_init: - the tensor API is not supported in this environment - disabling
ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: turbo3 using 4-mag LUT (pre-M5 hardware)
ggml_metal_library_init: loaded in 0.007 sec
ggml_metal_rsets_init: creating a residency set collection (keep_alive = 180 s)
ggml_metal_device_init: GPU name:   MTL0
ggml_metal_device_init: GPU family: MTLGPUFamilyApple10  (1010)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal4  (5002)
ggml_metal_device_init: simdgroup reduction   = true
ggml_metal_device_init: simdgroup matrix mul. = true
ggml_metal_device_init: has unified memory    = true
ggml_metal_device_init: has bfloat            = true
ggml_metal_device_init: has tensor            = false
ggml_metal_device_init: use residency sets    = true
ggml_metal_device_init: use shared buffers    = true
ggml_metal_device_init: recommendedMaxWorkingSetSize  = 26800.60 MB
build: 8631 (ca2524617) with AppleClang 21.0.0.21000099 for Darwin arm64
system info: n_threads = 4, n_threads_batch = 4, total_threads = 10

system_info: n_threads = 4 (n_threads_batch = 4) / 10 | MTL : EMBED_LIBRARY = 1 | CPU : NEON = 1 | ARM_FMA = 1 | FP16_VA = 1 | MATMUL_INT8 = 1 | DOTPROD = 1 | SME = 1 | ACCELERATE = 1 | REPACK = 1 | 

Running without SSL
init: using 9 threads for HTTP server
start: binding port with default address family
main: loading model
srv    load_model: loading model '/Users/user/.lmstudio/models/lmstudio-community/Ministral-3-3B-Instruct-2512-GGUF/Ministral-3-3B-Instruct-2512-Q4_K_M.gguf'
common_init_result: fitting params to device memory, for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on
llama_params_fit_impl: projected to use 8645 MiB of device memory vs. 25558 MiB of free device memory
llama_params_fit_impl: will leave 16912 >= 1024 MiB of free device memory, no changes needed
llama_params_fit: successfully fit params to free device memory
llama_params_fit: fitting params to free memory took 0.11 seconds
llama_model_load_from_file_impl: using device MTL0 (Apple M5) (unknown id) - 25558 MiB free
llama_model_loader: loaded meta data with 50 key-value pairs and 236 tensors from /Users/user/.lmstudio/models/lmstudio-community/Ministral-3-3B-Instruct-2512-GGUF/Ministral-3-3B-Instruct-2512-Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = mistral3
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = mistralai_Ministral 3 3B Instruct 251...
llama_model_loader: - kv   3:                            general.version str              = 2512
llama_model_loader: - kv   4:                           general.finetune str              = Instruct
llama_model_loader: - kv   5:                           general.basename str              = mistralai_Ministral-3
llama_model_loader: - kv   6:                         general.size_label str              = 3B
llama_model_loader: - kv   7:                            general.license str              = apache-2.0
llama_model_loader: - kv   8:                   general.base_model.count u32              = 1
llama_model_loader: - kv   9:                  general.base_model.0.name str              = Ministral 3 3B Base 2512
llama_model_loader: - kv  10:               general.base_model.0.version str              = 2512
llama_model_loader: - kv  11:          general.base_model.0.organization str              = Mistralai
llama_model_loader: - kv  12:              general.base_model.0.repo_url str              = https://huggingface.co/mistralai/Mini...
llama_model_loader: - kv  13:                               general.tags arr[str,1]       = ["mistral-common"]
llama_model_loader: - kv  14:                          general.languages arr[str,11]      = ["en", "fr", "es", "de", "it", "pt", ...
llama_model_loader: - kv  15:                       mistral3.block_count u32              = 26
llama_model_loader: - kv  16:                    mistral3.context_length u32              = 262144
llama_model_loader: - kv  17:                  mistral3.embedding_length u32              = 3072
llama_model_loader: - kv  18:               mistral3.feed_forward_length u32              = 9216
llama_model_loader: - kv  19:              mistral3.attention.head_count u32              = 32
llama_model_loader: - kv  20:           mistral3.attention.head_count_kv u32              = 8
llama_model_loader: - kv  21:  mistral3.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  22:              mistral3.attention.key_length u32              = 128
llama_model_loader: - kv  23:            mistral3.attention.value_length u32              = 128
llama_model_loader: - kv  24:                        mistral3.vocab_size u32              = 131072
llama_model_loader: - kv  25:              mistral3.rope.dimension_count u32              = 128
llama_model_loader: - kv  26:                 mistral3.rope.scaling.type str              = yarn
llama_model_loader: - kv  27:               mistral3.rope.scaling.factor f32              = 16.000000
llama_model_loader: - kv  28:       mistral3.rope.scaling.yarn_beta_fast f32              = 32.000000
llama_model_loader: - kv  29:       mistral3.rope.scaling.yarn_beta_slow f32              = 1.000000
llama_model_loader: - kv  30:  mistral3.rope.scaling.yarn_log_multiplier f32              = 1.000000
llama_model_loader: - kv  31: mistral3.rope.scaling.original_context_length u32              = 16384
llama_model_loader: - kv  32:                    mistral3.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  33:       mistral3.attention.temperature_scale f32              = 0.100000
llama_model_loader: - kv  34:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  35:                         tokenizer.ggml.pre str              = tekken
llama_model_loader: - kv  36:                      tokenizer.ggml.tokens arr[str,131072]  = ["<unk>", "<s>", "</s>", "[INST]", "[...
llama_model_loader: - kv  37:                  tokenizer.ggml.token_type arr[i32,131072]  = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv  38:                      tokenizer.ggml.merges arr[str,269443]  = ["Ġ Ġ", "Ġ t", "e r", "i n", "Ġ ?...
llama_model_loader: - kv  39:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  40:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  41:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  42:            tokenizer.ggml.padding_token_id u32              = 11
llama_model_loader: - kv  43:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  44:               tokenizer.ggml.add_sep_token bool             = false
llama_model_loader: - kv  45:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  46:                    tokenizer.chat_template str              = {#- Default system message if no syst...
llama_model_loader: - kv  47:            tokenizer.ggml.add_space_prefix bool             = false
llama_model_loader: - kv  48:               general.quantization_version u32              = 2
llama_model_loader: - kv  49:                          general.file_type u32              = 15
llama_model_loader: - type  f32:   53 tensors
llama_model_loader: - type q4_K:  156 tensors
llama_model_loader: - type q6_K:   27 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 1.99 GiB (4.99 BPW) 
load: 0 unused tokens
load: printing all EOG tokens:
load:   - 2 ('</s>')
load: special tokens cache size = 1000
load: token to piece cache size = 0.8498 MB
print_info: arch                  = mistral3
print_info: vocab_only            = 0
print_info: no_alloc              = 0
print_info: n_ctx_train           = 262144
print_info: n_embd                = 3072
print_info: n_embd_inp            = 3072
print_info: n_layer               = 26
print_info: n_head                = 32
print_info: n_head_kv             = 8
print_info: n_rot                 = 128
print_info: n_swa                 = 0
print_info: is_swa_any            = 0
print_info: n_embd_head_k         = 128
print_info: n_embd_head_v         = 128
print_info: n_gqa                 = 4
print_info: n_embd_k_gqa          = 1024
print_info: n_embd_v_gqa          = 1024
print_info: f_norm_eps            = 0.0e+00
print_info: f_norm_rms_eps        = 1.0e-05
print_info: f_clamp_kqv           = 0.0e+00
print_info: f_max_alibi_bias      = 0.0e+00
print_info: f_logit_scale         = 0.0e+00
print_info: f_attn_scale          = 0.0e+00
print_info: n_ff                  = 9216
print_info: n_expert              = 0
print_info: n_expert_used         = 0
print_info: n_expert_groups       = 0
print_info: n_group_used          = 0
print_info: causal attn           = 1
print_info: pooling type          = -1
print_info: rope type             = 0
print_info: rope scaling          = yarn
print_info: freq_base_train       = 1000000.0
print_info: freq_scale_train      = 0.0625
print_info: n_ctx_orig_yarn       = 16384
print_info: rope_yarn_log_mul     = 1.0000
print_info: rope_finetuned        = unknown
print_info: model type            = 3B
print_info: model params          = 3.43 B
print_info: general.name          = mistralai_Ministral 3 3B Instruct 2512 BF16
print_info: vocab type            = BPE
print_info: n_vocab               = 131072
print_info: n_merges              = 269443
print_info: BOS token             = 1 '<s>'
print_info: EOS token             = 2 '</s>'
print_info: UNK token             = 0 '<unk>'
print_info: PAD token             = 11 '<pad>'
print_info: LF token              = 1010 'Ċ'
print_info: EOG token             = 2 '</s>'
print_info: max token length      = 150
load_tensors: loading model tensors, this can take a while... (mmap = true, direct_io = false)
load_tensors: offloading output layer to GPU
load_tensors: offloading 25 repeating layers to GPU
load_tensors: offloaded 27/27 layers to GPU
load_tensors:   CPU_Mapped model buffer size =   315.00 MiB
load_tensors:  MTL0_Mapped model buffer size =  2039.54 MiB
............................................................................
common_init_result: added </s> logit bias = -inf
llama_context: constructing llama_context
llama_context: setting new yarn_attn_factor = 1.0000 (mscale == 1.0, mscale_all_dim = 1.0)
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 262144
llama_context: n_ctx_seq     = 262144
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = enabled
llama_context: kv_unified    = false
llama_context: freq_base     = 1000000.0
llama_context: freq_scale    = 0.0625
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M5
ggml_metal_init: picking default device: Apple M5
ggml_metal_init: use fusion         = true
ggml_metal_init: use concurrency    = true
ggml_metal_init: use graph optimize = true
llama_context:        CPU  output buffer size =     0.50 MiB
llama_kv_cache:       MTL0 KV buffer size =  5824.12 MiB
llama_kv_cache: TurboQuant rotation matrices initialized (128x128)
llama_kv_cache: size = 5824.00 MiB (262144 cells,  26 layers,  1/1 seqs), K (turbo3): 2912.00 MiB, V (turbo3): 2912.00 MiB
sched_reserve: reserving ...
sched_reserve: resolving fused Gated Delta Net support:
sched_reserve: fused Gated Delta Net (autoregressive) enabled
sched_reserve: fused Gated Delta Net (chunked) enabled
sched_reserve:       MTL0 compute buffer size =   800.01 MiB
sched_reserve:        CPU compute buffer size =   524.02 MiB
sched_reserve: graph nodes  = 891
sched_reserve: graph splits = 2
sched_reserve: reserve took 15.64 ms, sched copies = 1
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
srv    load_model: initializing slots, n_slots = 1
no implementations specified for speculative decoding
slot   load_model: id  0 | task -1 | speculative decoding context not initialized
slot   load_model: id  0 | task -1 | new slot, n_ctx = 262144
srv    load_model: prompt cache is enabled, size limit: 8192 MiB
srv    load_model: use `--cache-ram 0` to disable the prompt cache
srv    load_model: for more info see https://github.com/ggml-org/llama.cpp/pull/16391
init: chat template, example_format: '[SYSTEM_PROMPT]You are a helpful assistant[/SYSTEM_PROMPT][INST]Hello[/INST]Hi there</s>[INST]How are you?[/INST]'
srv          init: init: chat template, thinking = 1
main: model loaded
main: server is listening on http://0.0.0.0:8080
main: starting the main loop...
srv  update_slots: all slots are idle
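
A side note on the llama_kv_cache line in the log above: the reported K size can be turned into an approximate storage cost per cached value. The sketch below assumes the K cache holds cells x layers x n_embd_k_gqa elements (262144, 26 and 1024 in this log); that layout is an assumption read off the log, not something confirmed in the code, and the names are illustrative. Under that assumption, turbo3 works out to 3.5 bits per element, against 13312 MiB for K alone if the same number of elements were stored as f16.

```cpp
// Values copied from the log above. The element count is an assumption:
// cells * layers * n_embd_k_gqa.
constexpr long long kCells    = 262144;  // "262144 cells"
constexpr long long kLayers   = 26;      // "26 layers"
constexpr long long kRowElems = 1024;    // print_info: n_embd_k_gqa
constexpr double    kTurbo3KMiB = 2912.0;  // "K (turbo3): 2912.00 MiB"

constexpr double kBytesPerMiB = 1024.0 * 1024.0;

// Total number of cached K values under the assumed layout.
constexpr long long k_elements() {
    return kCells * kLayers * kRowElems;
}

// Average storage per element, in bits, for a cache of the given size.
constexpr double bits_per_element(double size_mib) {
    return size_mib * kBytesPerMiB * 8.0 / static_cast<double>(k_elements());
}

// Size in MiB if every element took 2 bytes (f16).
constexpr double f16_size_mib() {
    return static_cast<double>(k_elements()) * 2.0 / kBytesPerMiB;
}

static_assert(bits_per_element(kTurbo3KMiB) > 3.49 &&
              bits_per_element(kTurbo3KMiB) < 3.51, "about 3.5 bits per element");
static_assert(f16_size_mib() > 13311.0 && f16_size_mib() < 13313.0,
              "about 13312 MiB as f16");
```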
