
ggml : add ggml_build_forward_select #18550

Merged
ggerganov merged 5 commits into master from gg/graph-avoid-branches-3 on Jan 19, 2026

Conversation

@ggerganov (Member) commented Jan 2, 2026

target #18547
alt #18549

  • Add GGML_TENSOR_FLAG_COMPUTE flag indicating that a tensor in the graph must be computed
  • Add new ggml_build_forward_select() call:
    GGML_API struct ggml_tensor * ggml_build_forward_select(
            struct ggml_cgraph  * cgraph,
            struct ggml_tensor ** tensors,
            int                   n_tensors,
            int                   idx);

All provided tensors are built forward into the graph. Only tensors[idx] and its ancestry are marked for computing via the new flag value.

This new logic allows us to construct graphs that compute different things while sharing the same topology. This is needed to avoid unwanted graph reallocations (#17617).
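
The selection mechanism can be illustrated with a small self-contained sketch (this is not the actual ggml implementation; the struct and helper names here are made up for illustration): all candidate outputs live in one graph with a fixed topology, and only the chosen output and its ancestry get the compute flag.

```c
#include <stddef.h>

#define MAX_SRC      2
#define FLAG_COMPUTE 1

// Toy stand-in for a graph node: parent links plus a flags field.
struct node {
    struct node *src[MAX_SRC]; // parent nodes (NULL if unused)
    int          flags;
};

// Recursively mark a node and everything it depends on for computing.
static void mark_compute(struct node *t) {
    if (t == NULL || (t->flags & FLAG_COMPUTE)) {
        return;
    }
    t->flags |= FLAG_COMPUTE;
    for (int i = 0; i < MAX_SRC; i++) {
        mark_compute(t->src[i]);
    }
}

// Analogue of ggml_build_forward_select(): every tensor is part of the
// graph (same topology either way), but only tensors[idx]'s subtree is
// marked for computing.
static struct node *forward_select(struct node **tensors, int n, int idx) {
    (void) n; // in real ggml, all n tensors are expanded into the cgraph
    mark_compute(tensors[idx]);
    return tensors[idx];
}
```

Running this on a two-branch graph shows the key property: the unselected branch remains in the graph (so allocation and topology are unchanged) but carries no compute flag.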

TODOs:

@github-actions github-actions bot added model Model specific Nvidia GPU Issues specific to Nvidia GPUs Vulkan Issues specific to the Vulkan backend ggml changes relating to the ggml tensor library for machine learning SYCL https://en.wikipedia.org/wiki/SYCL - GPU programming language Apple Metal https://en.wikipedia.org/wiki/Metal_(API) Ascend NPU issues specific to Ascend NPUs OpenCL Issues specific to the OpenCL backend IBM zDNN issues specific to IBM zDNN Accelerator labels Jan 2, 2026
@jeffbolznv (Collaborator)

Just want to make sure I understand how this is used - it would still be two separate graphs, they'd just be able to reuse allocations (i.e. ggml-alloc would decide they match)?

I think ggml_can_fuse and ggml_can_fuse_subgraph would need to be updated to make sure all nodes are computed. And any backend-specific fusion logic.

@ggerganov (Member, Author)

> Just want to make sure I understand how this is used - it would still be two separate graphs, they'd just be able to reuse allocations (i.e. ggml-alloc would decide they match)?

Yes, for example the graph when the input is token ids (batch.token != null) and the graph when the input is directly embedding vectors (batch.embd != null) are still different, but with this extra logic the scheduler will not need to reallocate them because all nodes remain the same. It's just a different subset of the nodes being marked for computing.

> I think ggml_can_fuse and ggml_can_fuse_subgraph would need to be updated to make sure all nodes are computed. And any backend-specific fusion logic.

Not yet sure that it's really necessary to do so - at least I can't think of a fail case so far. Note that the GGML_TENSOR_FLAG_COMPUTE flag is controlled only through ggml_build_forward_select().

@max-krasnyansky (Member) left a comment

Looks good to me.

@am17an (Contributor) commented Jan 3, 2026

We would need to check how this behaves with CUDA graphs, since the computation itself is changing.

@taronaeo (Contributor) commented Jan 3, 2026

cc: @AlekseiNikiforovIBM @Andreas-Krebbel

Give us a week or so to check on this :)

@ggerganov force-pushed the gg/graph-avoid-branches-3 branch from e7b6c35 to da5d289 on January 3, 2026
@ggerganov force-pushed the gg/graph-avoid-branches-3 branch 2 times, most recently from 9922d3a to 9f8a79c on January 4, 2026
@am17an (Contributor) commented Jan 5, 2026

For CUDA graphs I think adding a check for flags in ggml_graph_node_has_matching_properties should be enough. This would trigger an update to the graph.
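
A minimal sketch of that idea (illustrative struct and field names, not the actual ggml-cuda code): the cached-graph comparison treats any flag difference as a property mismatch, which forces the captured CUDA graph to be updated.

```c
// Illustrative stand-ins for the cached node properties and the live tensor.
struct cached_props { int flags; /* ... op, shape, buffer address, ... */ };
struct live_tensor  { int flags; /* ... */ };

// Sketch of the extra check discussed: if the flags (e.g. the compute flag)
// changed since the CUDA graph was captured, the node no longer matches
// and the capture must be refreshed.
static int node_has_matching_properties(const struct live_tensor  *node,
                                        const struct cached_props *cached) {
    if (node->flags != cached->flags) {
        return 0; // flag change invalidates the captured graph
    }
    // ... the existing op/shape/address comparisons would follow here ...
    return 1;
}
```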

@AlekseiNikiforovIBM (Contributor)

> cc: @AlekseiNikiforovIBM @Andreas-Krebbel
>
> Give us a week or so to check on this :)

LGTM

@taronaeo (Contributor) left a comment

Ack for IBM zDNN backend :)

    }

    if ((cgraph->nodes[i]->flags & GGML_TENSOR_FLAG_COMPUTE) == 0) {
        continue;
Contributor:

If the last node or nodes are not flagged, the loop would end without the final command submission. This would need some way to ensure a final submit if submitted_nodes > 0.
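
The concern can be sketched like this (hypothetical helper and counter names; this is not the actual backend code): when non-computed nodes are simply skipped, the submit that normally happens on the last node may never run, so a trailing submit guarded by submitted_nodes > 0 is needed.

```c
#define FLAG_COMPUTE 1

struct toy_node { int flags; };

// Hypothetical submission counter used by the sketch below.
static int n_submits = 0;

static void submit_batch(void) { n_submits++; }

// Sketch: skip nodes that are not marked for compute, batch the rest,
// and make sure a partially filled batch is still submitted at the end.
static void run_graph(const struct toy_node *nodes, int n, int batch_size) {
    int submitted_nodes = 0;
    for (int i = 0; i < n; i++) {
        if ((nodes[i].flags & FLAG_COMPUTE) == 0) {
            continue; // present in the graph, but not computed
        }
        submitted_nodes++;
        if (submitted_nodes == batch_size) {
            submit_batch();
            submitted_nodes = 0;
        }
    }
    if (submitted_nodes > 0) {
        submit_batch(); // final submit even when the trailing nodes were skipped
    }
}
```

Without the trailing submit, a graph whose last flagged node does not fill a batch would end with queued but unsubmitted work.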

@ggerganov (Member, Author) commented Jan 16, 2026

Moved the check over here: eafbd13

Hm, just noticed this is maybe not enough. Will take another look.

Collaborator:

Maybe update this logic:

    // If the last op in the cgraph isn't backend GPU, the command buffer doesn't get closed properly
    while (last_node > 0 && ggml_vk_is_empty(cgraph->nodes[last_node])) {
        last_node -= 1;
    }
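
The suggested update could look roughly like this (assumed flag constant and a stand-in for ggml_vk_is_empty; illustrative only): the walk-back also skips nodes that are in the graph but not marked for compute, so the command buffer is closed on the last node that actually runs.

```c
#define FLAG_COMPUTE 1

struct toy_node { int flags; int empty; };

// Stand-in for ggml_vk_is_empty() in this sketch.
static int is_empty(const struct toy_node *n) { return n->empty; }

// Sketch of the updated walk-back: skip both empty nodes and nodes that
// are present but not marked for compute, so the close/submit logic
// attaches to the last node that is really executed.
static int find_last_node(const struct toy_node *nodes, int n) {
    int last_node = n - 1;
    while (last_node > 0 &&
           (is_empty(&nodes[last_node]) ||
            (nodes[last_node].flags & FLAG_COMPUTE) == 0)) {
        last_node -= 1;
    }
    return last_node;
}
```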

@ggerganov (Member, Author):

Updated like this: 9696c3e

@ggerganov (Member, Author):

Waiting for one final ack on the Vulkan changes and will proceed to merge.

Contributor:

Looks good now.

@reeselevine (Contributor) left a comment

WebGPU update looks good to me. We always do a final submission if commands > 0, so there shouldn't be a problem like the one noted for the Vulkan backend above.

@ggerganov force-pushed the gg/graph-avoid-branches-3 branch from 9f8a79c to 3f6df60 on January 11, 2026
@ggerganov force-pushed the gg/graph-avoid-branches-3 branch 2 times, most recently from 0e73cf9 to fc2dd15 on January 14, 2026
Base automatically changed from gg/llama-reserve to master on January 15, 2026
@ggerganov requested a review from ngxson as a code owner on January 15, 2026
@ggerganov force-pushed the gg/graph-avoid-branches-3 branch 2 times, most recently from 1d13a3d to 637e779 on January 16, 2026
@ggerganov force-pushed the gg/graph-avoid-branches-3 branch from 637e779 to 8d8647e on January 16, 2026
@ggerganov (Member, Author)

> I think ggml_can_fuse and ggml_can_fuse_subgraph would need to be updated to make sure all nodes are computed. And any backend-specific fusion logic.

@jeffbolznv This should be enough I think: 3646af9. Or maybe I am missing something regarding "And any backend-specific fusion logic."

@am17an CUDA graphs check updated in ced0693

@jeffbolznv (Collaborator)

> I think ggml_can_fuse and ggml_can_fuse_subgraph would need to be updated to make sure all nodes are computed. And any backend-specific fusion logic.
>
> @jeffbolznv This should be enough I think: 3646af9. Or maybe I am missing something regarding "And any backend-specific fusion logic."

ggml_can_fuse_subgraph_ext needs the same check. All vulkan fusions go through either ggml_can_fuse or ggml_can_fuse_subgraph, so we shouldn't need anything else for ggml-vulkan's fusion.
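
The check being discussed can be sketched like this (illustrative names; the real helpers are ggml_can_fuse, ggml_can_fuse_subgraph, and ggml_can_fuse_subgraph_ext): a candidate run of nodes is only fusable if every node in it is marked for compute, so fusion never pulls in work the scheduler intends to skip.

```c
#define FLAG_COMPUTE 1

struct toy_node { int flags; };

// Sketch of the extra fusion condition: refuse to fuse a run of nodes
// unless all of them are actually going to be computed.
static int can_fuse_run(const struct toy_node *nodes, int start, int len) {
    for (int i = start; i < start + len; i++) {
        if ((nodes[i].flags & FLAG_COMPUTE) == 0) {
            return 0; // a skipped node breaks the run
        }
    }
    // ... the real helpers also check op patterns, shapes, and views ...
    return 1;
}
```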

@ggerganov force-pushed the gg/graph-avoid-branches-3 branch from 8d8647e to 574a1db on January 16, 2026
@0cc4m (Contributor) left a comment

The Vulkan changes are fine now, thank you.

@ggerganov merged commit 365a3e8 into master on January 19, 2026
75 of 78 checks passed
@ggerganov deleted the gg/graph-avoid-branches-3 branch on January 19, 2026
