
ggml : add ggml_build_forward_select #18550

Merged
ggerganov merged 5 commits into master from gg/graph-avoid-branches-3 on Jan 19, 2026

Conversation

@ggerganov (Member) commented Jan 2, 2026

target #18547
alt #18549

  • Add GGML_TENSOR_FLAG_COMPUTE flag indicating that a tensor in the graph must be computed
  • Add new ggml_build_forward_select() call:
    GGML_API struct ggml_tensor * ggml_build_forward_select(
            struct ggml_cgraph  * cgraph,
            struct ggml_tensor ** tensors,
            int                   n_tensors,
            int                   idx);

All provided tensors are built forward into the graph. Only tensors[idx] and its ancestry are marked for computing via the new flag value.

This new logic allows us to construct graphs that compute different things while sharing the same topology. This is needed to avoid unwanted graph reallocations (#17617).
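
The selection mechanism can be illustrated with a small self-contained sketch (this is not the actual ggml implementation; the struct and helper names here are made up for illustration): all candidate outputs live in one graph with a fixed topology, and only the chosen output and its ancestry get the compute flag.

```c
#include <stddef.h>

#define MAX_SRC      2
#define FLAG_COMPUTE 1

// Toy stand-in for a graph node: parent links plus a flags field.
struct node {
    struct node *src[MAX_SRC]; // parent nodes (NULL if unused)
    int          flags;
};

// Recursively mark a node and everything it depends on for computing.
static void mark_compute(struct node *t) {
    if (t == NULL || (t->flags & FLAG_COMPUTE)) {
        return;
    }
    t->flags |= FLAG_COMPUTE;
    for (int i = 0; i < MAX_SRC; i++) {
        mark_compute(t->src[i]);
    }
}

// Analogue of ggml_build_forward_select(): every tensor is part of the
// graph (same topology either way), but only tensors[idx]'s subtree is
// marked for computing.
static struct node *forward_select(struct node **tensors, int n, int idx) {
    (void) n; // in real ggml, all n tensors are expanded into the cgraph
    mark_compute(tensors[idx]);
    return tensors[idx];
}
```

Running this on a two-branch graph shows the key property: the unselected branch remains in the graph (so allocation and topology are unchanged) but carries no compute flag.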

TODOs:

@github-actions github-actions bot added model Model specific Nvidia GPU Issues specific to Nvidia GPUs Vulkan Issues specific to the Vulkan backend ggml changes relating to the ggml tensor library for machine learning SYCL https://en.wikipedia.org/wiki/SYCL - GPU programming language Apple Metal https://en.wikipedia.org/wiki/Metal_(API) Ascend NPU issues specific to Ascend NPUs OpenCL Issues specific to the OpenCL backend IBM zDNN issues specific to IBM zDNN Accelerator labels Jan 2, 2026
@jeffbolznv (Collaborator)

Just want to make sure I understand how this is used - it would still be two separate graphs, they'd just be able to reuse allocations (i.e. ggml-alloc would decide they match)?

I think ggml_can_fuse and ggml_can_fuse_subgraph would need to be updated to make sure all nodes are computed. And any backend-specific fusion logic.

@ggerganov (Member, Author)

> Just want to make sure I understand how this is used - it would still be two separate graphs, they'd just be able to reuse allocations (i.e. ggml-alloc would decide they match)?

Yes, for example the graph when the input is token ids (batch.token != null) and the graph when the input is directly embedding vectors (batch.embd != null) are still different, but with this extra logic the scheduler will not need to reallocate them because all nodes remain the same. It's just a different subset of the nodes being marked for computing.

> I think ggml_can_fuse and ggml_can_fuse_subgraph would need to be updated to make sure all nodes are computed. And any backend-specific fusion logic.

Not yet sure that it's really necessary to do so - at least I can't think of a fail case so far. Note that the GGML_TENSOR_FLAG_COMPUTE flag is controlled only through ggml_build_forward_select().

@max-krasnyansky (Member) left a comment

Looks good to me.

@am17an (Contributor) commented Jan 3, 2026

We would need to check how this behaves with CUDA graphs, since the computation itself is changing.

@taronaeo (Contributor) commented Jan 3, 2026

cc: @AlekseiNikiforovIBM @Andreas-Krebbel

Give us a week or so to check on this :)

@ggerganov force-pushed the gg/graph-avoid-branches-3 branch from e7b6c35 to da5d289 on January 3, 2026
@ggerganov force-pushed the gg/graph-avoid-branches-3 branch 2 times, most recently from 9922d3a to 9f8a79c on January 4, 2026
@am17an (Contributor) commented Jan 5, 2026

For CUDA graphs I think adding a check for flags in ggml_graph_node_has_matching_properties should be enough. This would trigger an update to the graph.
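
A minimal sketch of that idea (illustrative struct and field names, not the actual ggml-cuda code): the cached-graph comparison treats any flag difference as a property mismatch, which forces the captured CUDA graph to be updated.

```c
// Illustrative stand-ins for the cached node properties and the live tensor.
struct cached_props { int flags; /* ... op, shape, buffer address, ... */ };
struct live_tensor  { int flags; /* ... */ };

// Sketch of the extra check discussed: if the flags (e.g. the compute flag)
// changed since the CUDA graph was captured, the node no longer matches
// and the capture must be refreshed.
static int node_has_matching_properties(const struct live_tensor  *node,
                                        const struct cached_props *cached) {
    if (node->flags != cached->flags) {
        return 0; // flag change invalidates the captured graph
    }
    // ... the existing op/shape/address comparisons would follow here ...
    return 1;
}
```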

@AlekseiNikiforovIBM (Contributor)

> cc: @AlekseiNikiforovIBM @Andreas-Krebbel
>
> Give us a week or so to check on this :)

LGTM

@taronaeo (Contributor) left a comment

Ack for IBM zDNN backend :)

    }

    if ((cgraph->nodes[i]->flags & GGML_TENSOR_FLAG_COMPUTE) == 0) {
        continue;
Contributor:

If the last node or nodes are not flagged, the loop would end without the final command submission. This would need some way to ensure a final submit if submitted_nodes > 0.
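
The concern can be sketched like this (hypothetical helper and counter names; this is not the actual backend code): when non-computed nodes are simply skipped, the submit that normally happens on the last node may never run, so a trailing submit guarded by submitted_nodes > 0 is needed.

```c
#define FLAG_COMPUTE 1

struct toy_node { int flags; };

// Hypothetical submission counter used by the sketch below.
static int n_submits = 0;

static void submit_batch(void) { n_submits++; }

// Sketch: skip nodes that are not marked for compute, batch the rest,
// and make sure a partially filled batch is still submitted at the end.
static void run_graph(const struct toy_node *nodes, int n, int batch_size) {
    int submitted_nodes = 0;
    for (int i = 0; i < n; i++) {
        if ((nodes[i].flags & FLAG_COMPUTE) == 0) {
            continue; // present in the graph, but not computed
        }
        submitted_nodes++;
        if (submitted_nodes == batch_size) {
            submit_batch();
            submitted_nodes = 0;
        }
    }
    if (submitted_nodes > 0) {
        submit_batch(); // final submit even when the trailing nodes were skipped
    }
}
```

Without the trailing submit, a graph whose last flagged node does not fill a batch would end with queued but unsubmitted work.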

@ggerganov (Member, Author) commented Jan 16, 2026

Moved the check over here: eafbd13

Hm, just noticed this is maybe not enough. Will take another look.

Collaborator:

Maybe update this logic:

    // If the last op in the cgraph isn't backend GPU, the command buffer doesn't get closed properly
    while (last_node > 0 && ggml_vk_is_empty(cgraph->nodes[last_node])) {
        last_node -= 1;
    }
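
The suggested update could look roughly like this (assumed flag constant and a stand-in for ggml_vk_is_empty; illustrative only): the walk-back also skips nodes that are in the graph but not marked for compute, so the command buffer is closed on the last node that actually runs.

```c
#define FLAG_COMPUTE 1

struct toy_node { int flags; int empty; };

// Stand-in for ggml_vk_is_empty() in this sketch.
static int is_empty(const struct toy_node *n) { return n->empty; }

// Sketch of the updated walk-back: skip both empty nodes and nodes that
// are present but not marked for compute, so the close/submit logic
// attaches to the last node that is really executed.
static int find_last_node(const struct toy_node *nodes, int n) {
    int last_node = n - 1;
    while (last_node > 0 &&
           (is_empty(&nodes[last_node]) ||
            (nodes[last_node].flags & FLAG_COMPUTE) == 0)) {
        last_node -= 1;
    }
    return last_node;
}
```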

@ggerganov (Member, Author):

Updated like this: 9696c3e

@ggerganov (Member, Author):

Waiting for one final ack on the Vulkan changes and will proceed to merge.

Contributor:

Looks good now.

@reeselevine (Contributor) left a comment

WebGPU update looks good to me. We always do a final submission if commands > 0, so there shouldn't be a problem like the one noted for the Vulkan backend above.

@ggerganov force-pushed the gg/graph-avoid-branches-3 branch from 9f8a79c to 3f6df60 on January 11, 2026
@ggerganov force-pushed the gg/graph-avoid-branches-3 branch 2 times, most recently from 0e73cf9 to fc2dd15 on January 14, 2026
Base automatically changed from gg/llama-reserve to master on January 15, 2026
@ggerganov requested a review from ngxson as a code owner on January 15, 2026
@ggerganov force-pushed the gg/graph-avoid-branches-3 branch 2 times, most recently from 1d13a3d to 637e779 on January 16, 2026
@ggerganov force-pushed the gg/graph-avoid-branches-3 branch from 637e779 to 8d8647e on January 16, 2026
@ggerganov (Member, Author)

> I think ggml_can_fuse and ggml_can_fuse_subgraph would need to be updated to make sure all nodes are computed. And any backend-specific fusion logic.

@jeffbolznv This should be enough I think: 3646af9. Or maybe I am missing something regarding "And any backend-specific fusion logic."

@am17an CUDA graphs check updated in ced0693

@jeffbolznv (Collaborator)

> I think ggml_can_fuse and ggml_can_fuse_subgraph would need to be updated to make sure all nodes are computed. And any backend-specific fusion logic.
>
> @jeffbolznv This should be enough I think: 3646af9. Or maybe I am missing something regarding "And any backend-specific fusion logic."

ggml_can_fuse_subgraph_ext needs the same check. All vulkan fusions go through either ggml_can_fuse or ggml_can_fuse_subgraph, so we shouldn't need anything else for ggml-vulkan's fusion.
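
The check being discussed can be sketched like this (illustrative names; the real helpers are ggml_can_fuse, ggml_can_fuse_subgraph, and ggml_can_fuse_subgraph_ext): a candidate run of nodes is only fusable if every node in it is marked for compute, so fusion never pulls in work the scheduler intends to skip.

```c
#define FLAG_COMPUTE 1

struct toy_node { int flags; };

// Sketch of the extra fusion condition: refuse to fuse a run of nodes
// unless all of them are actually going to be computed.
static int can_fuse_run(const struct toy_node *nodes, int start, int len) {
    for (int i = start; i < start + len; i++) {
        if ((nodes[i].flags & FLAG_COMPUTE) == 0) {
            return 0; // a skipped node breaks the run
        }
    }
    // ... the real helpers also check op patterns, shapes, and views ...
    return 1;
}
```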

@ggerganov force-pushed the gg/graph-avoid-branches-3 branch from 8d8647e to 574a1db on January 16, 2026
@0cc4m (Contributor) left a comment

The Vulkan changes are fine now, thank you.

@ggerganov merged commit 365a3e8 into master on January 19, 2026
75 of 78 checks passed
@ggerganov deleted the gg/graph-avoid-branches-3 branch on January 19, 2026
