metal : allow ops to run concurrently#15929
Merged
Hi @ggerganov, thanks for the nice work. This PR can no longer be merged into the master branch:
Applied patch to 'ggml/src/ggml-metal/CMakeLists.txt' cleanly.
Performing three-way merge...
error: ggml/src/ggml-metal/ggml-metal-common.cpp: does not exist in index
error: cannot read the current contents of 'ggml/src/ggml-metal/ggml-metal-common.cpp'
error: ggml/src/ggml-metal/ggml-metal-common.cpp: patch does not apply
Performing three-way merge...
error: ggml/src/ggml-metal/ggml-metal-common.h: does not exist in index
error: cannot read the current contents of 'ggml/src/ggml-metal/ggml-metal-common.h'
error: ggml/src/ggml-metal/ggml-metal-common.h: patch does not apply
Applied patch to 'ggml/src/ggml-metal/ggml-metal.m' cleanly.
0a6f0eb to
417df40
@calvin2021y The branch is now rebased on latest
I get a 1% tps speedup with this patch. Will try more models and update later.
ggml-ci
17cf93d to
faffbec
blime4 referenced this pull request in blime4/llama.cpp on Feb 5, 2026
* metal : run graphs ops concurrently ggml-ci
* cont : add flags for debugging and disabling concurrency ggml-ci
* cont : refactor and handle fusing ggml-ci
* cont : simplify - no need to use GPU address ggml-ci
* cont : prepare mem ranges for reuse + add ggml-metal-common.cpp ggml-ci
* cont : avoid redundant keywords in cpp [no ci]
* metal : reorder graph for better concurrency ggml-ci
* metal : fix race on mem pool buffers ggml-ci
* cont : add env GGML_METAL_GRAPH_OPTIMIZE_DISABLE ggml-ci
* cont : refactor, optimize, add comments ggml-ci
* cont : refactor ggml-metal.m ggml-ci
* minor : update logs [no ci]
While queueing the graph nodes, keep track of the memory intervals/ranges from which we read data and to which we write data. Using this information, for the new node we can determine if it can safely run concurrently with all the concurrent ops prior to it:
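A minimal sketch of this kind of check, assuming each op is described only by the byte intervals it reads and writes. The `mem_range`, `node`, `conflicts` and `schedule` names are hypothetical and are not taken from ggml-metal; the actual patch may track ranges differently:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical types for illustration; not the structures used in ggml-metal.
struct mem_range {
    uint64_t begin;    // first byte of the interval (inclusive)
    uint64_t end;      // one past the last byte (exclusive)
    bool     is_write; // true if the op writes to this interval
};

struct node {
    std::vector<mem_range> ranges; // all intervals the op reads or writes
};

// Two accesses conflict when the intervals overlap and at least one is a write.
static bool conflicts(const mem_range & a, const mem_range & b) {
    const bool overlap = a.begin < b.end && b.begin < a.end;
    return overlap && (a.is_write || b.is_write);
}

// For each node, report whether it can join the current concurrent set (true)
// or has to start a new set behind a barrier (false).
static std::vector<bool> schedule(const std::vector<node> & nodes) {
    std::vector<mem_range> active; // intervals touched since the last barrier
    std::vector<bool>      concurrent;

    for (const node & n : nodes) {
        // the first node of a set has nothing to run alongside
        bool ok = !active.empty();
        for (const mem_range & r : n.ranges) {
            for (const mem_range & a : active) {
                if (conflicts(r, a)) {
                    ok = false;
                }
            }
        }
        if (!ok) {
            active.clear(); // barrier: earlier ops finish before this one starts
        }
        active.insert(active.end(), n.ranges.begin(), n.ranges.end());
        concurrent.push_back(ok);
    }
    return concurrent;
}
```

With this sketch, two ops that read the same input and write disjoint outputs end up in the same concurrent set, while a third op that reads both outputs is placed behind a barrier. Overlapping reads never conflict; only read/write and write/write overlaps do.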
This feature can be disabled with the GGML_METAL_CONCURRENCY_DISABLE=1 env.
The improvement depends on the order of the nodes in the graph. Some models do not currently benefit much from this logic, but utilizing a graph optimization approach similar to #15850 should improve things.
Introduced logic for optimizing the graph to improve concurrency in a similar way as in #15850. The benefits are large for TG and decent for PP.
TODO:
Example
For example, before this patch, the graph of one layer of gpt-oss-20b is executed like this:
(concurrent) means that the node runs in parallel with the previous one.
After this patch, the nodes are reordered and executed like this:
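As a rough illustration of the reordering idea only (a toy graph, not the gpt-oss-20b graph from this patch), the sketch below greedily pulls later nodes forward into the current concurrent group when they conflict neither with the group nor with any node they would jump over. It identifies tensors by integer ids rather than memory ranges, and `toy_node`, `reads`, `conflicts` and `reorder` are hypothetical names, so treat it as an assumption about the general approach rather than the actual implementation:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical toy node: tensors are identified by integer ids.
struct toy_node {
    std::vector<int> srcs; // ids of the tensors the op reads
    int              dst;  // id of the tensor the op writes
};

static bool reads(const toy_node & n, int id) {
    return std::find(n.srcs.begin(), n.srcs.end(), id) != n.srcs.end();
}

// Two nodes conflict if one reads what the other writes,
// or if both write the same tensor.
static bool conflicts(const toy_node & a, const toy_node & b) {
    return reads(b, a.dst) || reads(a, b.dst) || a.dst == b.dst;
}

// Returns a new execution order as indices into `nodes`.
static std::vector<int> reorder(const std::vector<toy_node> & nodes) {
    const int n = (int) nodes.size();
    std::vector<bool> used(n, false);
    std::vector<int>  order;

    for (int i = 0; i < n; ++i) {
        if (used[i]) {
            continue;
        }
        used[i] = true;
        order.push_back(i);

        std::vector<int> group = { i }; // nodes that will run concurrently
        std::vector<int> skipped;       // nodes left in place for a later group

        for (int j = i + 1; j < n; ++j) {
            if (used[j]) {
                continue;
            }
            bool ok = true;
            for (int g : group)   { ok = ok && !conflicts(nodes[g], nodes[j]); }
            for (int s : skipped) { ok = ok && !conflicts(nodes[s], nodes[j]); }
            if (ok) {
                used[j] = true;
                order.push_back(j);
                group.push_back(j);
            } else {
                skipped.push_back(j);
            }
        }
    }
    return order;
}
```

For a toy attention-like layer with nodes q, rope(q), k, rope(k), v, attn in that order, this places the three independent projections next to each other, followed by the two rope ops, followed by attn. Checking against the skipped nodes is what keeps the result a valid ordering: a node is never moved ahead of something it depends on.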
Perf