
Optimize Metal Tensor API usage#20962

Open
Developer-Ecosystem-Engineering wants to merge 1 commit into ggml-org:master from Developer-Ecosystem-Engineering:optimize_metal_tensor

Conversation

@Developer-Ecosystem-Engineering

Overview

Separates the Metal Tensor API (matmul2d) path in kernel_mul_mm into its own standalone kernel, gated by GGML_METAL_HAS_TENSOR. The legacy simdgroup_matrix kernel is preserved under #else.

Previously both paths were interleaved via #ifdef blocks within a single kernel, forcing the tensor path to share the legacy kernel's data layout and threadgroup memory scheme. Splitting the kernel enabled memory and dispatch optimizations that weren't possible when the two paths shared code structure.

  • Geomean improvement of ~26%
  • TL-Q4_0: ~6.8%
  • DeepSeek-8B-f16: ~71.9%

Further test matrix below

Additional information

  • Tile dimensions are now configurable via compile-time macros (NRA×NRB), currently defaulting to 64×128 vs the legacy kernel's fixed 64×32.
  • The new dimensions offer better performance across all models tested.
  • Matrix B is read directly from device memory, eliminating the threadgroup staging copy.
  • The new kernel uses cooperative tensor accessors (cT.get_multidimensional_index / cT[i]) for direct per-element device writes.
  • Threadgroup memory holds only dequantized A: NRA × NK_TOTAL × sizeof(fp16).

Requirements

Testing Details

  • Tested on a 16-inch M5 Max (Best)

| Model | pp512 | pp1024 | pp2048 | pp4096 | Model GeoMean |
|---|---|---|---|---|---|
| DeepSeek-8B-f16 | +86.2% | +84.3% | +80.4% | +71.9% | +80.6% |
| L2-7B-Q6_K | +49.8% | +47.8% | +46.3% | +43.7% | +46.9% |
| TL-Q3_K_S | +32.7% | +31.9% | +27.3% | +21.5% | +28.3% |
| TL-Q5_K_M | +28.2% | +29.5% | +25.5% | +19.7% | +25.7% |
| TL-Q2_K | +28.6% | +30.5% | +26.2% | +20.0% | +26.3% |
| G-2B-q8_0 | +27.9% | +27.6% | +28.6% | +27.9% | +28.0% |
| TL-Q6_K | +27.8% | +28.4% | +23.9% | +18.1% | +24.5% |
| Q3-4B-Q8_0 | +23.9% | +24.2% | +22.8% | +20.7% | +22.9% |
| TL-IQ4_XS | +23.5% | +23.6% | +20.5% | +16.6% | +21.0% |
| TL-Q4_K_M | +18.7% | +19.5% | +17.0% | +13.2% | +17.1% |
| TL-Q8_0 | +17.1% | +17.1% | +15.4% | +11.9% | +15.4% |
| TL-Q5_0 | +14.8% | +16.3% | +13.8% | +10.2% | +13.8% |
| TL-Q4_0 | +6.1% | +7.7% | +6.6% | +6.8% | +6.8% |
| Overall GeoMean | | | | | +26.4% |
  • I have read and agree with the contributing guidelines
    Yes
  • AI usage disclosure: Yes. Assistive tooling was used to navigate and better learn the project codebase, and to split the work into phases.

@github-actions github-actions bot added labels ggml (changes relating to the ggml tensor library for machine learning) and Apple Metal (https://en.wikipedia.org/wiki/Metal_(API)) on Mar 24, 2026
@Developer-Ecosystem-Engineering Developer-Ecosystem-Engineering changed the title from "Optimize Metal Tensor API usage for matmul2d" to "Optimize Metal Tensor API usage" on Mar 24, 2026
@H-A-Khan

M5 Max (MBP 16"), 6+12 CPU, 40 GPU, 64 GB

Tested with LLaMA 7B v2 (F16, Q8_0, Q4_0) — same models used in the Apple Silicon M-series benchmark discussion.

PR #20962 results (build c3a1128)

| model | size | params | backend | threads | test | t/s |
|---|---|---|---|---|---|---|
| llama 7B F16 | 12.55 GiB | 6.74 B | MTL,BLAS | 6 | pp512 | 3129.64 ± 4.17 |
| llama 7B F16 | 12.55 GiB | 6.74 B | MTL,BLAS | 6 | tg128 | 35.93 ± 0.15 |
| llama 7B Q8_0 | 6.67 GiB | 6.74 B | MTL,BLAS | 6 | pp512 | 3101.56 ± 8.30 |
| llama 7B Q8_0 | 6.67 GiB | 6.74 B | MTL,BLAS | 6 | tg128 | 68.34 ± 0.08 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | MTL,BLAS | 6 | pp512 | 3246.19 ± 4.60 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | MTL,BLAS | 6 | tg128 | 110.48 ± 0.41 |

Comparison with old pinned build (8e672ef)

| Test | Old (8e672ef) | This PR | Change |
|---|---|---|---|
| F16 pp512 | 1018.30 t/s | 3129.64 t/s | +207% |
| F16 tg128 | 37.58 t/s | 35.93 t/s | -4% |
| Q8_0 pp512 | 1051.59 t/s | 3101.56 t/s | +195% |
| Q8_0 tg128 | 64.61 t/s | 68.34 t/s | +6% |
| Q4_0 pp512 | 987.10 t/s | 3246.19 t/s | +229% |
| Q4_0 tg128 | 102.93 t/s | 110.48 t/s | +7% |

Prompt processing ~3x faster across the board. Text generation unchanged as expected (memory-bandwidth bound). Great work! 🚀

@ggerganov
Member

ggerganov commented Mar 28, 2026

Could you also confirm correctness by running llama-perplexity? Thanks.

Edit: also, no need to compare to the old 8e672ef. Compare to current master.

@Hassan-A-K

Hassan-A-K commented Mar 28, 2026

@ggerganov Perplexity results on M5 Max (MBP 16", 40 GPU, 64 GB) — Wikitext-2 test set:

| Model | Perplexity (PPL) |
|---|---|
| F16 | 5.7966 ± 0.03235 |
| Q8_0 | 5.7978 ± 0.03236 |
| Q4_0 | 5.9622 ± 0.03348 |

All within expected ranges. Correctness looks good.

Edit: running perplexity comparison against current master as well. Results incoming.

@Hassan-A-K

Hassan-A-K commented Mar 28, 2026

@ggerganov Updated results — perplexity + benchmark comparison against current master (c46758d).

Perplexity — Wikitext-2 (M5 Max, MBP 16", 40 GPU, 64 GB)

| Model | Master | PR #20962 |
|---|---|---|
| F16 | 5.7962 | 5.7966 |
| Q8_0 | 5.7974 | 5.7978 |
| Q4_0 | 5.9618 | 5.9622 |

Benchmark vs master

| Test | Master (c46758d) | PR #20962 | Change |
|---|---|---|---|
| F16 pp512 | 1,601.60 t/s | 3,129.64 t/s | +95% |
| F16 tg128 | 36.86 t/s | 35.93 t/s | -3% |
| Q8_0 pp512 | 1,909.17 t/s | 3,101.56 t/s | +62% |
| Q8_0 tg128 | 67.86 t/s | 68.34 t/s | +1% |
| Q4_0 pp512 | 2,052.20 t/s | 3,246.19 t/s | +58% |
| Q4_0 tg128 | 109.56 t/s | 110.48 t/s | +1% |

Huge pp improvement across the board. TG unchanged (memory-bound). No correctness regression.

@Hassan-A-K

Perplexity results — M5, 4+6 CPU, 10 GPU, 24 GB

Ran llama-perplexity on wikitext-2-raw with LLaMA 2 7B on the optimize_metal_tensor branch:

| Model | PPL | ± |
|---|---|---|
| F16 | 5.7845 | 0.03242 |
| Q8_0 | 5.7865 | 0.03243 |
| Q4_0 | 5.9581 | 0.03359 |

F16 and Q8_0 are nearly identical as expected. Q4_0 shows the usual small quantization degradation. No correctness issues observed.

build: c3a1128 (8509)

@Developer-Ecosystem-Engineering
Author

Thank you @Hassan-A-K for the additional details (and confirmation)! Will keep an eye on this for any follow-up requests or questions related to integration.
