
Optimize Metal Tensor API usage#20962

Open
Developer-Ecosystem-Engineering wants to merge 1 commit into ggml-org:master from Developer-Ecosystem-Engineering:optimize_metal_tensor

Conversation

@Developer-Ecosystem-Engineering

Overview

Separates the Metal Tensor API (matmul2d) path in kernel_mul_mm into its own standalone kernel, gated by GGML_METAL_HAS_TENSOR. The legacy simdgroup_matrix kernel is preserved under #else.

Previously both paths were interleaved via #ifdef blocks within a single kernel, forcing the tensor path to share the legacy kernel's data layout and threadgroup memory scheme. Splitting the kernel enabled memory and dispatch optimizations that weren't possible when the two paths shared code structure.

  • Geomean improvement of ~26%
  • TL-Q4_0: ~6.8%
  • DeepSeek-8B-f16: ~71.9%

Further test matrix below

Additional information

  • Tile dimensions are now configurable via compile-time macros (NRA×NRB), currently defaulting to 64×128 vs the legacy kernel's fixed 64×32.
  • The new dimensions offer better performance across all models tested.
  • Matrix B is read directly from device memory, eliminating the threadgroup staging copy.
  • The new kernel uses cooperative tensor accessors (cT.get_multidimensional_index / cT[i]) for direct per-element device writes.
  • Threadgroup memory holds only dequantized A: NRA × NK_TOTAL × sizeof(fp16).

Requirements

Testing Details

  • Tested on a 16-inch M5 Max (Best)

| Model | pp512 | pp1024 | pp2048 | pp4096 | Model GeoMean |
|---|---|---|---|---|---|
| DeepSeek-8B-f16 | +86.2% | +84.3% | +80.4% | +71.9% | +80.6% |
| L2-7B-Q6_K | +49.8% | +47.8% | +46.3% | +43.7% | +46.9% |
| TL-Q3_K_S | +32.7% | +31.9% | +27.3% | +21.5% | +28.3% |
| TL-Q5_K_M | +28.2% | +29.5% | +25.5% | +19.7% | +25.7% |
| TL-Q2_K | +28.6% | +30.5% | +26.2% | +20.0% | +26.3% |
| G-2B-q8_0 | +27.9% | +27.6% | +28.6% | +27.9% | +28.0% |
| TL-Q6_K | +27.8% | +28.4% | +23.9% | +18.1% | +24.5% |
| Q3-4B-Q8_0 | +23.9% | +24.2% | +22.8% | +20.7% | +22.9% |
| TL-IQ4_XS | +23.5% | +23.6% | +20.5% | +16.6% | +21.0% |
| TL-Q4_K_M | +18.7% | +19.5% | +17.0% | +13.2% | +17.1% |
| TL-Q8_0 | +17.1% | +17.1% | +15.4% | +11.9% | +15.4% |
| TL-Q5_0 | +14.8% | +16.3% | +13.8% | +10.2% | +13.8% |
| TL-Q4_0 | +6.1% | +7.7% | +6.6% | +6.8% | +6.8% |
| Overall GeoMean | | | | | +26.4% |
  • I have read and agree with the contributing guidelines
    Yes
  • AI usage disclosure: Yes. Assistive tooling was used to navigate and better learn the project codebase, and to split the work into phases.

@github-actions github-actions bot added labels ggml (changes relating to the ggml tensor library for machine learning) and Apple Metal (https://en.wikipedia.org/wiki/Metal_(API)) on Mar 24, 2026
@Developer-Ecosystem-Engineering Developer-Ecosystem-Engineering changed the title from "Optimize Metal Tensor API usage for matmul2d" to "Optimize Metal Tensor API usage" on Mar 24, 2026
@H-A-Khan

M5 Max (MBP 16"), 6+12 CPU, 40 GPU, 64 GB

Tested with LLaMA 7B v2 (F16, Q8_0, Q4_0) — same models used in the Apple Silicon M-series benchmark discussion.

PR #20962 results (build c3a1128)

| model | size | params | backend | threads | test | t/s |
|---|---|---|---|---|---|---|
| llama 7B F16 | 12.55 GiB | 6.74 B | MTL,BLAS | 6 | pp512 | 3129.64 ± 4.17 |
| llama 7B F16 | 12.55 GiB | 6.74 B | MTL,BLAS | 6 | tg128 | 35.93 ± 0.15 |
| llama 7B Q8_0 | 6.67 GiB | 6.74 B | MTL,BLAS | 6 | pp512 | 3101.56 ± 8.30 |
| llama 7B Q8_0 | 6.67 GiB | 6.74 B | MTL,BLAS | 6 | tg128 | 68.34 ± 0.08 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | MTL,BLAS | 6 | pp512 | 3246.19 ± 4.60 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | MTL,BLAS | 6 | tg128 | 110.48 ± 0.41 |

Comparison with old pinned build (8e672ef)

| Test | Old (8e672ef) | This PR | Change |
|---|---|---|---|
| F16 pp512 | 1018.30 t/s | 3129.64 t/s | +207% |
| F16 tg128 | 37.58 t/s | 35.93 t/s | -4% |
| Q8_0 pp512 | 1051.59 t/s | 3101.56 t/s | +195% |
| Q8_0 tg128 | 64.61 t/s | 68.34 t/s | +6% |
| Q4_0 pp512 | 987.10 t/s | 3246.19 t/s | +229% |
| Q4_0 tg128 | 102.93 t/s | 110.48 t/s | +7% |

Prompt processing ~3x faster across the board. Text generation unchanged as expected (memory-bandwidth bound). Great work! 🚀

@ggerganov
Member

ggerganov commented Mar 28, 2026

Could you also confirm correctness by running llama-perplexity? Thanks.

Edit: also, no need to compare to the old 8e672ef. Compare to current master.

@Hassan-A-K

Hassan-A-K commented Mar 28, 2026

@ggerganov Perplexity results on M5 Max (MBP 16", 40 GPU, 64 GB) — Wikitext-2 test set:

| Model | Perplexity (PPL) |
|---|---|
| F16 | 5.7966 ± 0.03235 |
| Q8_0 | 5.7978 ± 0.03236 |
| Q4_0 | 5.9622 ± 0.03348 |

All within expected ranges. Correctness looks good.

Edit: running perplexity comparison against current master as well. Results incoming.

@Hassan-A-K

Hassan-A-K commented Mar 28, 2026

@ggerganov Updated results — perplexity + benchmark comparison against current master (c46758d).

Perplexity — Wikitext-2 (M5 Max, MBP 16", 40 GPU, 64 GB)

| Model | Master | PR #20962 |
|---|---|---|
| F16 | 5.7962 | 5.7966 |
| Q8_0 | 5.7974 | 5.7978 |
| Q4_0 | 5.9618 | 5.9622 |

Benchmark vs master

| Test | Master (c46758d) | PR #20962 | Change |
|---|---|---|---|
| F16 pp512 | 1,601.60 t/s | 3,129.64 t/s | +95% |
| F16 tg128 | 36.86 t/s | 35.93 t/s | -3% |
| Q8_0 pp512 | 1,909.17 t/s | 3,101.56 t/s | +62% |
| Q8_0 tg128 | 67.86 t/s | 68.34 t/s | +1% |
| Q4_0 pp512 | 2,052.20 t/s | 3,246.19 t/s | +58% |
| Q4_0 tg128 | 109.56 t/s | 110.48 t/s | +1% |

Huge pp improvement across the board. TG unchanged (memory-bound). No correctness regression.

@Hassan-A-K

Perplexity results — M5, 4+6 CPU, 10 GPU, 24 GB

Ran llama-perplexity on wikitext-2-raw with LLaMA 2 7B on the optimize_metal_tensor branch:

| Model | PPL | ± |
|---|---|---|
| F16 | 5.7845 | 0.03242 |
| Q8_0 | 5.7865 | 0.03243 |
| Q4_0 | 5.9581 | 0.03359 |

F16 and Q8_0 are nearly identical as expected. Q4_0 shows the usual small quantization degradation. No correctness issues observed.

build: c3a1128 (8509)

@Developer-Ecosystem-Engineering
Author

Thank you @Hassan-A-K for the additional details (and confirmation)! Will keep an eye on this for any follow-up requests or questions related to integration.
