Make graph_max_nodes vary by ubatch size #17794
Conversation
ggerganov left a comment
We need to eventually improve the Qwen3 Next graph and reduce the number of nodes - there are a lot of unnecessary ops in it.
And also, variable graph size is always going to have various drawbacks. The graph should not change for different batch sizes. It's important to figure out how to do that if we want to have long-term support for linear attention in llama.cpp.
Yeah, that's my plan after merging all the ops and fixing all the major bugs.
The problem is that recurrent models by design require chunking. I'm not sure how to do chunking without exploding the graph unless we allow an operation like "REPEAT_SUBGRAPH", where all the operations are guaranteed to be performed on tensors of the same shape and characteristics for each repeat. At least that's the only idea I was able to come up with.
* Make graph_max_nodes vary by ubatch size for models where chunking might explode the graph

* Update src/llama-context.h

  Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Add missing const

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Until now, graph node counts have been largely constant for GGML graphs. However, with some hybrid attention models, the need for chunking, which arises from the O(n^3) complexity of the recurrent updates in the gating functions (for which we use SOLVE_TRI), makes the graph size dependent on the number of chunks, which in turn depends on the size of the ubatch. This patch passes the ubatch size / context size to the function so that, for models that need it, we can dynamically calculate the required number of max nodes.

Fixes #17578