Make graph_max_nodes vary by ubatch size #17794
Conversation
ggerganov left a comment
We need to eventually improve the Qwen3 Next graph and reduce the number of nodes - there are a lot of unnecessary ops in it.
And also, variable graph size is always going to have various drawbacks. The graph should not change for different batch sizes. It's important to figure out how to do that if we want to have long-term support for linear attention in llama.cpp.
Yeah, that's my plan after merging all the ops and fixing all the major bugs.
The problem is that recurrent models by design require chunking. I'm not sure how to do chunking without exploding the graph unless we allow an operation like "REPEAT_SUBGRAPH", where all the operations are guaranteed to be performed on tensors of the same shape and characteristics for each repeat. At least that's the only idea I was able to come up with.
* Make graph_max_nodes vary by ubatch size for models where chunking might explode the graph

* Update src/llama-context.h

  Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Add missing const

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Until now, graph node counts have been largely constant for GGML graphs. However, with some hybrid attention models, the need for chunking, which arises from the O(n^3) complexity of the recurrent updates in the gating functions (for which we use SOLVE_TRI), makes the graph size dependent on the number of chunks, which in turn depends on the size of the ubatch. This patch passes the ubatch size / context size to the function so that, for models that need it, we can dynamically calculate the required number of max nodes.

Fixes #17578