metal : allow ops to run concurrently by ggerganov · Pull Request #15929 · ggml-org/llama.cpp

ggerganov · 2025-09-10T19:07:19Z

While queueing the graph nodes, keep track of the memory ranges (intervals) that we read data from and write data to. Using this information, we can determine whether a new node can safely run concurrently with all the concurrent ops queued before it:

It should not read data from a memory range that a previous node is writing to
It should not write data to a memory range that a previous node is reading from or writing to
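The two rules above can be sketched as an overlap test on tracked memory ranges. This is a minimal illustration, not the code from this PR: the `MemRange`, `Access`, and `ConcurrentGroup` names, the `buffer_id` field, and the half-open byte ranges are all assumptions made for the example.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical types for illustration only; not the actual ggml-metal structures.
enum class Access { Read, Write };

struct MemRange {
    int      buffer_id; // which buffer the range lives in (assumed identifier)
    uint64_t begin;     // first byte, inclusive
    uint64_t end;       // one past the last byte
    Access   access;    // how the op touches this range
};

// Two half-open ranges overlap only if they share a buffer and intersect.
static bool overlaps(const MemRange & a, const MemRange & b) {
    return a.buffer_id == b.buffer_id && a.begin < b.end && b.begin < a.end;
}

// Ranges touched by the ops in the current concurrent group.
struct ConcurrentGroup {
    std::vector<MemRange> ranges;

    // Read/read overlap is harmless; any overlap that involves a write is a conflict.
    bool can_run_concurrently(const std::vector<MemRange> & node) const {
        for (const auto & n : node) {
            for (const auto & r : ranges) {
                const bool both_reads = n.access == Access::Read && r.access == Access::Read;
                if (!both_reads && overlaps(n, r)) {
                    return false;
                }
            }
        }
        return true;
    }

    // Adds the node's ranges. On a conflict the tracked ranges are reset first,
    // which models a barrier before the node. Returns true if the node joined
    // the existing group.
    bool add(const std::vector<MemRange> & node) {
        const bool ok = can_run_concurrently(node);
        if (!ok) {
            ranges.clear();
        }
        ranges.insert(ranges.end(), node.begin(), node.end());
        return ok;
    }
};
```

In this sketch, two ops that read the same source and write to disjoint destinations join the same group, while an op that reads a range written by an earlier member of the group forces a reset.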

This feature can be disabled by setting the GGML_METAL_CONCURRENCY_DISABLE=1 environment variable.

The improvement depends on the order of the nodes in the graph. With the original node order, some models did not benefit much from this logic, so this PR also introduces logic for optimizing (reordering) the graph to improve concurrency, in a similar way as in #15850. The benefits are large for TG (text generation) and decent for PP (prompt processing).
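To illustrate why node order matters, here is a small greedy reordering sketch. It is not the algorithm implemented in this PR or in #15850: the `Node` type, the use of tensor ids instead of byte ranges, and the `window` look-ahead parameter are assumptions chosen to keep the example short. A later node is pulled forward only if it conflicts with neither the current group nor any node it would jump over, so the relative order of dependent nodes is preserved.

```cpp
#include <cstddef>
#include <set>
#include <vector>

// Hypothetical node description: ids of the tensors it reads and writes.
struct Node {
    int           id;
    std::set<int> reads;
    std::set<int> writes;
};

static bool intersects(const std::set<int> & a, const std::set<int> & b) {
    for (int x : a) {
        if (b.count(x) > 0) {
            return true;
        }
    }
    return false;
}

// True if a and b must keep their relative order (RAW, WAR or WAW hazard).
static bool conflict(const Node & a, const Node & b) {
    return intersects(a.writes, b.reads) ||
           intersects(a.reads,  b.writes) ||
           intersects(a.writes, b.writes);
}

// Greedy reorder: for each unscheduled node, look ahead up to `window` nodes
// and pull forward those that can run concurrently with the current group.
std::vector<int> reorder(const std::vector<Node> & nodes, size_t window = 8) {
    std::vector<bool> done(nodes.size(), false);
    std::vector<int>  order;

    for (size_t i = 0; i < nodes.size(); ++i) {
        if (done[i]) {
            continue;
        }
        done[i] = true;
        order.push_back(nodes[i].id);

        std::vector<size_t> group   = {i}; // nodes scheduled together
        std::vector<size_t> skipped;       // nodes left in place for later

        for (size_t j = i + 1; j < nodes.size() && j <= i + window; ++j) {
            if (done[j]) {
                continue;
            }
            bool ok = true;
            for (size_t g : group)   { ok = ok && !conflict(nodes[g], nodes[j]); }
            for (size_t s : skipped) { ok = ok && !conflict(nodes[s], nodes[j]); }
            if (ok) {
                done[j] = true;
                order.push_back(nodes[j].id);
                group.push_back(j);
            } else {
                skipped.push_back(j);
            }
        }
    }
    return order;
}
```

For a chain such as matmul A, add A, matmul B, add B, where each add depends only on its own matmul, this sketch places the two matmuls next to each other, followed by the two adds, so each pair can be encoded concurrently.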

TODO:

Env to disable graph optimization
More comments about the implemented logic
Stats?

Example

For example, before this patch, the graph of one layer of gpt-oss-20b is executed like this:

(concurrent) means that the node runs in parallel with the previous one

0.02.203.953 D ggml_metal_encode_node: node[  897] - ADD          
0.02.203.954 D ggml_metal_encode_node: node[  898] - RMS_NORM     
0.02.203.955 D ggml_metal_encode_node:               fuse 2 ops
0.02.203.955 D ggml_metal_encode_node: node[  900] - MUL_MAT      
0.02.203.956 D ggml_metal_encode_node: node[  901] - ADD          
0.02.203.957 D ggml_metal_encode_node: node[  903] - ROPE         
0.02.203.957 D ggml_metal_encode_node: node[  904] - MUL_MAT      (concurrent)
0.02.203.958 D ggml_metal_encode_node: node[  905] - ADD          
0.02.203.959 D ggml_metal_encode_node: node[  907] - ROPE         
0.02.203.959 D ggml_metal_encode_node: node[  908] - MUL_MAT      (concurrent)
0.02.203.960 D ggml_metal_encode_node: node[  909] - ADD          
0.02.203.961 D ggml_metal_encode_node: node[  912] - SET_ROWS     (concurrent)
0.02.203.961 D ggml_metal_encode_node: node[  914] - SET_ROWS     
0.02.203.962 D ggml_metal_encode_node: node[  921] - FLASH_ATTN_EXT 
0.02.203.966 D ggml_metal_encode_node: node[  923] - MUL_MAT      
0.02.203.966 D ggml_metal_encode_node: node[  924] - ADD          
0.02.203.968 D ggml_metal_encode_node: node[  925] - ADD          
0.02.203.968 D ggml_metal_encode_node: node[  926] - RMS_NORM     
0.02.203.969 D ggml_metal_encode_node:               fuse 2 ops
0.02.203.969 D ggml_metal_encode_node: node[  929] - MUL_MAT      
0.02.203.970 D ggml_metal_encode_node: node[  930] - ADD          
0.02.203.971 D ggml_metal_encode_node: node[  931] - ARGSORT      
0.02.203.972 D ggml_metal_encode_node: node[  933] - MUL_MAT_ID   
0.02.203.972 D ggml_metal_encode_node: node[  934] - ADD_ID       
0.02.203.973 D ggml_metal_encode_node: node[  935] - MUL_MAT_ID   (concurrent)
0.02.203.974 D ggml_metal_encode_node: node[  936] - ADD_ID       
0.02.203.974 D ggml_metal_encode_node: node[  937] - GLU          
0.02.203.975 D ggml_metal_encode_node: node[  938] - MUL_MAT_ID   
0.02.203.976 D ggml_metal_encode_node: node[  939] - ADD_ID       
0.02.203.976 D ggml_metal_encode_node: node[  941] - GET_ROWS     (concurrent)
0.02.203.977 D ggml_metal_encode_node: node[  943] - SOFT_MAX     
0.02.203.978 D ggml_metal_encode_node: node[  945] - MUL          
0.02.203.978 D ggml_metal_encode_node: node[  950] - ADD          
0.02.203.979 D ggml_metal_encode_node:               fuse 3 ops

After this patch, the nodes are reordered and executed like this:

0.02.119.870 D ggml_metal_encode_node: node[  897] - ADD          
0.02.119.871 D ggml_metal_encode_node: node[  898] - RMS_NORM     
0.02.119.872 D ggml_metal_encode_node:               fuse 2 ops
0.02.119.872 D ggml_metal_encode_node: node[  900] - MUL_MAT      
0.02.119.873 D ggml_metal_encode_node: node[  901] - MUL_MAT      (concurrent)
0.02.119.874 D ggml_metal_encode_node: node[  902] - MUL_MAT      (concurrent)
0.02.119.875 D ggml_metal_encode_node: node[  903] - ADD          
0.02.119.875 D ggml_metal_encode_node: node[  905] - ADD          (concurrent)
0.02.119.876 D ggml_metal_encode_node: node[  907] - ADD          (concurrent)
0.02.119.877 D ggml_metal_encode_node: node[  909] - ROPE         
0.02.119.877 D ggml_metal_encode_node: node[  910] - ROPE         (concurrent)
0.02.119.878 D ggml_metal_encode_node: node[  913] - SET_ROWS     
0.02.119.879 D ggml_metal_encode_node: node[  914] - SET_ROWS     (concurrent)
0.02.119.880 D ggml_metal_encode_node: node[  921] - FLASH_ATTN_EXT 
0.02.119.883 D ggml_metal_encode_node: node[  923] - MUL_MAT      
0.02.119.884 D ggml_metal_encode_node: node[  924] - ADD          
0.02.119.885 D ggml_metal_encode_node: node[  925] - ADD          
0.02.119.891 D ggml_metal_encode_node: node[  926] - RMS_NORM     
0.02.119.892 D ggml_metal_encode_node:               fuse 2 ops
0.02.119.892 D ggml_metal_encode_node: node[  929] - MUL_MAT      
0.02.119.893 D ggml_metal_encode_node: node[  930] - ADD          
0.02.119.893 D ggml_metal_encode_node: node[  931] - ARGSORT      
0.02.119.894 D ggml_metal_encode_node: node[  933] - MUL_MAT_ID   
0.02.119.894 D ggml_metal_encode_node: node[  934] - MUL_MAT_ID   (concurrent)
0.02.119.895 D ggml_metal_encode_node: node[  935] - ADD_ID       
0.02.119.896 D ggml_metal_encode_node: node[  936] - ADD_ID       (concurrent)
0.02.119.897 D ggml_metal_encode_node: node[  937] - GLU          
0.02.119.911 D ggml_metal_encode_node: node[  938] - MUL_MAT_ID   
0.02.119.915 D ggml_metal_encode_node: node[  940] - ADD_ID       
0.02.119.917 D ggml_metal_encode_node: node[  941] - GET_ROWS     (concurrent)
0.02.119.920 D ggml_metal_encode_node: node[  943] - SOFT_MAX     
0.02.119.921 D ggml_metal_encode_node: node[  945] - MUL          
0.02.119.923 D ggml_metal_encode_node: node[  950] - ADD          
0.02.119.924 D ggml_metal_encode_node:               fuse 3 ops

Perf

Model	Test	t/s master	t/s gg/metal-concurrent-graphs	Speedup
gemma3 1B Q4_0	pp512	10347.13	10927.45	1.06
gemma3 1B Q4_0	pp2048	11105.25	11289.86	1.02
gemma3 1B Q4_0	pp4096	11278.28	11428.73	1.01
gemma3 1B Q4_0	tg128	204.67	225.84	1.10
gemma3 270M Q4_0	pp512	36085.32	37940.85	1.05
gemma3 270M Q4_0	pp2048	40402.50	41045.04	1.02
gemma3 270M Q4_0	pp4096	42624.23	43358.27	1.02
gemma3 270M Q4_0	tg128	333.98	392.28	1.17
gemma3 4B Q4_0	pp512	2664.39	2738.56	1.03
gemma3 4B Q4_0	pp2048	2837.75	2876.11	1.01
gemma3 4B Q4_0	pp4096	2823.03	2859.76	1.01
gemma3 4B Q4_0	tg128	124.45	137.87	1.11
gpt-oss 20B MXFP4 MoE	pp512	2262.85	2303.68	1.02
gpt-oss 20B MXFP4 MoE	pp2048	2660.63	2661.22	1.00
gpt-oss 20B MXFP4 MoE	pp4096	2653.64	2662.97	1.00
gpt-oss 20B MXFP4 MoE	tg128	120.91	133.14	1.10
qwen2 3B Q4_0	pp512	3019.72	3108.98	1.03
qwen2 3B Q4_0	pp2048	3239.79	3265.49	1.01
qwen2 3B Q4_0	pp4096	3055.22	3081.93	1.01
qwen2 3B Q4_0	tg128	152.27	167.65	1.10
qwen2 7B Q8_0	pp512	1427.79	1455.75	1.02
qwen2 7B Q8_0	pp2048	1500.79	1510.12	1.01
qwen2 7B Q8_0	pp4096	1445.67	1454.36	1.01
qwen2 7B Q8_0	tg128	75.93	78.35	1.03
qwen3 0.6B Q8_0	pp512	13398.86	13937.51	1.04
qwen3 0.6B Q8_0	pp2048	13190.99	13393.04	1.02
qwen3 0.6B Q8_0	pp4096	11061.29	11260.44	1.02
qwen3 0.6B Q8_0	tg128	245.57	274.48	1.12
qwen3moe 30B.A3B Q4_0	pp512	2119.15	2148.71	1.01
qwen3moe 30B.A3B Q4_0	pp2048	2447.42	2468.89	1.01
qwen3moe 30B.A3B Q4_0	pp4096	2183.56	2202.32	1.01
qwen3moe 30B.A3B Q4_0	tg128	91.51	101.70	1.11
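The Speedup column appears to be the branch throughput divided by the master throughput, rounded to two decimals. A minimal sketch of that arithmetic follows; the function names are made up for this example.

```cpp
#include <cmath>

// Ratio of branch throughput to master throughput (tokens per second).
double speedup(double tps_master, double tps_branch) {
    return tps_branch / tps_master;
}

// Round to two decimals, matching the precision used in the table.
double round2(double x) {
    return std::round(x * 100.0) / 100.0;
}
```

For the gemma3 1B Q4_0 tg128 row, 225.84 / 204.67 is about 1.103, which rounds to the 1.10 shown in the table.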

calvin2021y · 2025-09-12T06:45:34Z

Hi @ggerganov, thanks for the nice work.

This PR can no longer be merged into the master branch:

Applied patch to 'ggml/src/ggml-metal/CMakeLists.txt' cleanly.
Performing three-way merge...
error: ggml/src/ggml-metal/ggml-metal-common.cpp: does not exist in index
error: cannot read the current contents of 'ggml/src/ggml-metal/ggml-metal-common.cpp'
error: ggml/src/ggml-metal/ggml-metal-common.cpp: patch does not apply
Performing three-way merge...
error: ggml/src/ggml-metal/ggml-metal-common.h: does not exist in index
error: cannot read the current contents of 'ggml/src/ggml-metal/ggml-metal-common.h'
error: ggml/src/ggml-metal/ggml-metal-common.h: patch does not apply
Applied patch to 'ggml/src/ggml-metal/ggml-metal.m' cleanly.

ggerganov · 2025-09-12T14:28:47Z

@calvin2021y The branch is now rebased on latest master. Would appreciate feedback if you give it a try.

calvin2021y · 2025-09-13T08:01:39Z

I get a 1% tps speedup with this patch. Will try more models and update later.

ggml-ci

* metal : run graphs ops concurrently ggml-ci
* cont : add flags for debugging and disabling concurrency ggml-ci
* cont : refactor and handle fusing ggml-ci
* cont : simplify - no need to use GPU address ggml-ci
* cont : prepare mem ranges for reuse + add ggml-metal-common.cpp ggml-ci
* cont : avoid redundant keywords in cpp [no ci]
* metal : reorder graph for better concurrency ggml-ci
* metal : fix race on mem pool buffers ggml-ci
* cont : add env GGML_METAL_GRAPH_OPTIMIZE_DISABLE ggml-ci
* cont : refactor, optimize, add comments ggml-ci
* cont : refactor ggml-metal.m ggml-ci
* minor : update logs [no ci]

github-actions bot added the ggml and Apple Metal labels Sep 10, 2025

ggerganov force-pushed the gg/metal-concurrent-graphs branch from 0a6f0eb to 417df40 September 12, 2025 14:27

ggerganov added 11 commits September 13, 2025 12:45

metal : run graphs ops concurrently

2fb1552

ggml-ci

cont : add flags for debugging and disabling concurrency

a3519fd

ggml-ci

cont : refactor and handle fusing

74d2961

ggml-ci

cont : simplify - no need to use GPU address

acd1404

ggml-ci

cont : prepare mem ranges for reuse + add ggml-metal-common.cpp

f7aeab9

ggml-ci

cont : avoid redundant keywords in cpp [no ci]

1c9d3f3

metal : reorder graph for better concurrency

89cca2a

ggml-ci

metal : fix race on mem pool buffers

a3f17d6

ggml-ci

cont : add env GGML_METAL_GRAPH_OPTIMIZE_DISABLE

0b58636

ggml-ci

cont : refactor, optimize, add comments

907616d

ggml-ci

cont : refactor ggml-metal.m

faffbec

ggml-ci

ggerganov force-pushed the gg/metal-concurrent-graphs branch from 17cf93d to faffbec September 13, 2025 09:50

minor : update logs [no ci]

e502db1

ggerganov merged commit f161463 into master Sep 13, 2025
1 check passed

ggerganov deleted the gg/metal-concurrent-graphs branch September 13, 2025 10:54

ggerganov mentioned this pull request Sep 13, 2025

metal : remove memory pools #15966

Merged

2 tasks

ggerganov mentioned this pull request Oct 3, 2025

metal : fix loop bound in ggml_mem_ranges #16412

Merged

ggerganov mentioned this pull request Feb 12, 2026

metal : improve concurrency #19555

Merged

ggerganov merged 12 commits into master from gg/metal-concurrent-graphs