Optimize Metal Tensor API usage #20962
Developer-Ecosystem-Engineering wants to merge 1 commit into ggml-org:master from
Conversation
Separates the Metal Tensor API (matmul2d) path in kernel_mul_mm into its own standalone kernel, gated by GGML_METAL_HAS_TENSOR. The legacy simdgroup_matrix kernel is preserved under #else. Previously both paths were interleaved via #ifdef blocks within a single kernel, forcing the tensor path to share the legacy kernel's data layout and threadgroup memory scheme. Splitting the kernel enabled memory and dispatch optimizations that weren't possible when the two paths shared code structure.
M5 Max (MBP 16"), 6+12 CPU, 40 GPU, 64 GB
Tested with LLaMA 7B v2 (F16, Q8_0, Q4_0) — same models used in the Apple Silicon M-series benchmark discussion.
PR #20962 results (build c3a1128)
Comparison with old pinned build (8e672ef)
Prompt processing ~3x faster across the board. Text generation unchanged as expected (memory-bandwidth bound). Great work! 🚀
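The note that text generation (TG) is memory-bandwidth bound can be illustrated with a back-of-envelope calculation; the bandwidth and model-size figures below are hypothetical placeholders, not measured M5 Max specs.

```python
# Text generation reads (roughly) every weight once per generated token,
# so tokens/s is capped near memory_bandwidth / model_size, regardless of
# how fast the matmul kernels get. Numbers below are illustrative only.
bandwidth_gb_s = 400.0   # hypothetical unified-memory bandwidth, GB/s
model_gb = 13.5          # ~7B params at F16 (2 bytes/param)
tg_upper_bound = bandwidth_gb_s / model_gb  # tokens/s ceiling
print(round(tg_upper_bound, 1))
```

Prompt processing, by contrast, batches many tokens per weight read, so it is compute-bound and benefits directly from faster matmul kernels.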
Could you also confirm correctness by running perplexity?
Edit: also, no need to compare to the old 8e672ef. Compare to current master.
@ggerganov Perplexity results on M5 Max (MBP 16", 40 GPU, 64 GB) — Wikitext-2 test set:
All within expected ranges. Correctness looks good.
Edit: running perplexity comparison against current master as well. Results incoming.
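For readers unfamiliar with the metric: perplexity is the exponential of the mean negative log-likelihood over the evaluated tokens, so lower is better and matching values across builds indicate unchanged numerics. A minimal sketch with made-up per-token values:

```python
import math

# Perplexity = exp(mean NLL) over the evaluated tokens.
# The per-token negative log-likelihoods below are made up for illustration.
nlls = [2.1, 1.8, 2.4, 2.0]
ppl = math.exp(sum(nlls) / len(nlls))
print(round(ppl, 2))
```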
@ggerganov Updated results — perplexity + benchmark comparison against current master.
Perplexity — Wikitext-2 (M5 Max, MBP 16", 40 GPU, 64 GB)
Benchmark vs master
Huge pp improvement across the board. TG unchanged (memory-bound). No correctness regression.
Perplexity results — M5, 4+6 CPU, 10 GPU, 24 GB
Ran perplexity on the same three model variants (F16, Q8_0, Q4_0).
F16 and Q8_0 are nearly identical as expected. Q4_0 shows the usual small quantization degradation. No correctness issues observed.
build: c3a1128 (8509)
Thank you @Hassan-A-K for the additional details (and confirmation)! Will keep an eye on this for any follow-up requests or questions related to integration.
Overview
Geomean improvement of ~26%
TL-Q4_0 ~6.8%
DeepSeek-8B-f16 ~71.9%
Further test matrix below
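A geometric mean is the standard way to aggregate per-configuration speedups, since it treats a 2x gain and a 0.5x loss symmetrically. A minimal sketch — the speedup values below are illustrative placeholders, not the PR's actual test matrix:

```python
import math

# Geometric mean of per-configuration speedups (illustrative values).
speedups = [1.068, 1.719, 1.30, 1.05]
geomean = math.exp(sum(math.log(s) for s in speedups) / len(speedups))
print(f"geomean speedup: {geomean:.2f}x")
```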
Additional information
Requirements
Testing Details
Yes