ggml-virtgpu: cgraph caching #19708

kpouget · 2026-02-18T09:39:51Z

kpouget
Feb 18, 2026

Hello @taronaeo (+ @0cc4m if you have some hints),

I'm trying to get the ggml-virtgpu running with ggml-cuda, but I'm hitting a GPU OOM.

Here is what's going on:

During backend_backend_graph_compute (host side), I allocate a new ggml_cgraph object on every call:

apir_deserialize_graph:
  ggml_cgraph *  graph = ggml_new_graph_custom(ctx, n_nodes, false);

The cgraph is then reconstructed from what the guest side sent.

But the CUDA backend does the equivalent of this (Claude-generated illustration):

  void* graph_key = ggml_cuda_graph_get_key(cgraph);           // Get nodes[0] address
  ggml_cuda_graph* graph = cuda_ctx->cuda_graph(graph_key);    // Cache lookup

  if (graph->instance == nullptr) {
      // CACHE MISS: Create new CUDA graph (~6-8MB allocation)
      create_and_instantiate_cuda_graph(cgraph, graph);
  } else {
      // CACHE HIT: Validate if cached graph can be reused
      if (ggml_cuda_graph_update_required(cuda_ctx, cgraph)) {
          // Structure changed: Update or recreate graph
          update_or_recreate_cuda_graph(cgraph, graph);
      } else {
          // Perfect match: Reuse existing optimized graph
          launch_cached_cuda_graph(graph);
      }
  }

The problem is that ggml-virtgpu hands over a new cgraph, and therefore a new cgraph->nodes[0] key, on every call, so a new CUDA graph is allocated every time.
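To make the growth pattern concrete, here is a minimal sketch of a cache keyed by the address of the first node. Everything in it (`fake_node`, `fake_cgraph`, `pointer_keyed_cache`, the two helper functions) is a hypothetical stand-in written for this illustration, not the real ggml or ggml-cuda code. It only shows that rebuilding a logically identical graph at a new address produces a new cache entry each time.

```cpp
#include <cstddef>
#include <memory>
#include <unordered_map>
#include <vector>

// Hypothetical stand-ins for this sketch; not the real ggml types.
struct fake_node   { int op = 0; };
struct fake_cgraph { std::vector<fake_node *> nodes; };

// Cache keyed by the address of the first node, as in the illustration above.
struct pointer_keyed_cache {
    std::unordered_map<const void *, int> entries;  // value: number of launches

    void compute(const fake_cgraph & g) {
        const void * key = g.nodes[0];
        entries[key] += 1;  // operator[] inserts a new entry on a miss
    }
};

// Rebuild the "same" graph n times, as a deserializer would, and count entries.
inline std::size_t entries_after_rebuilds(int n) {
    pointer_keyed_cache cache;
    // Old nodes are kept alive so that every rebuild gets a distinct address.
    std::vector<std::unique_ptr<fake_node>> keep_alive;
    for (int i = 0; i < n; ++i) {
        keep_alive.push_back(std::make_unique<fake_node>());
        fake_cgraph g;
        g.nodes.push_back(keep_alive.back().get());
        cache.compute(g);
    }
    return cache.entries.size();
}

// Reuse one graph object n times and count entries.
inline std::size_t entries_after_reuse(int n) {
    pointer_keyed_cache cache;
    fake_node node;
    fake_cgraph g;
    g.nodes.push_back(&node);
    for (int i = 0; i < n; ++i) {
        cache.compute(g);
    }
    return cache.entries.size();
}
```

With this sketch, `entries_after_rebuilds(n)` grows linearly with `n`, while `entries_after_reuse(n)` stays at one entry.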

I've been trying to see what can be cached in the cgraph, but without success.
And I tried to release the cgraph and its memory, but it seems (according to Claude investigations) that the graph GPU memory is never reclaimed in ggml-cuda...

Do you have any clue how ggml-virtgpu could cache the cgraph object structure, but still update it to perform the right cgraph_compute ... ?
I mean, which parts of the object must be updated, which parts can stay cached?

0cc4m · 2026-02-19T09:53:26Z

0cc4m
Feb 19, 2026
Collaborator

I think @gaugarg-nv has recently worked on CUDA graphs. I don't know anything about how this works in CUDA.

3 replies

gaugarg-nv Feb 19, 2026
Collaborator

I'm not familiar with ggml-virtgpu, so I can't comment on how to cache cgraph.

One workaround for you is to disable CUDA graph by setting GGML_CUDA_DISABLE_GRAPHS=1 env variable.

I think it will be good to detect such cases in ggml-cuda and disable CUDA graphs, instead of OOM. If too many graph instances are created without ever getting used more than once, we can just disable CUDA graphs.
CC: @am17an @JohannesGaessler
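
The detection idea could be sketched roughly as follows. This is only an illustration under simplifying assumptions: `graph_reuse_tracker`, its fields, and the threshold are hypothetical names, and it counts consecutive cache misses rather than tracking per-instance usage. It is not how ggml-cuda implements anything today.

```cpp
#include <cstddef>

// Hypothetical tracker; names and threshold are illustrative only.
struct graph_reuse_tracker {
    std::size_t max_unused_instances;    // give up after this many consecutive misses
    std::size_t consecutive_misses = 0;
    bool        graphs_disabled    = false;

    explicit graph_reuse_tracker(std::size_t limit) : max_unused_instances(limit) {}

    // Call once per graph_compute; cache_hit says whether a cached
    // instance was found for this cgraph.
    // Returns true if graph capture should still be attempted.
    bool on_compute(bool cache_hit) {
        if (graphs_disabled) {
            return false;  // once disabled, stay disabled
        }
        if (cache_hit) {
            consecutive_misses = 0;
            return true;
        }
        if (++consecutive_misses >= max_unused_instances) {
            graphs_disabled = true;
        }
        return !graphs_disabled;
    }
};
```

A caller would consult `on_compute()` before creating a new graph instance and fall back to regular kernel launches once it returns false.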

am17an Feb 19, 2026
Collaborator

Yes, that's probably a good idea. It would also be nice that the memory would be reclaimed. Currently we store the cgraphs in an unordered_map which doesn't delete anything; an LRU-type fixed-size vector would perhaps be better.
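
For reference, a bounded LRU cache can be sketched as below. Note that this uses the common `std::list` plus `std::unordered_map` layout rather than a vector, and `lru_cache` is a hypothetical name. In a real backend the `Value` type would be whatever owns the instantiated graph, so that its destructor frees the GPU resources on eviction. That part is an assumption and is not shown here.

```cpp
#include <cstddef>
#include <list>
#include <unordered_map>
#include <utility>

// Hypothetical bounded LRU cache; not taken from ggml-cuda.
template <typename Key, typename Value>
class lru_cache {
  public:
    explicit lru_cache(std::size_t capacity) : capacity_(capacity) {}

    // Returns a pointer to the cached value, or nullptr on a miss.
    // A hit moves the entry to the front (most recently used).
    Value * get(const Key & key) {
        auto it = index_.find(key);
        if (it == index_.end()) {
            return nullptr;
        }
        items_.splice(items_.begin(), items_, it->second);
        return &it->second->second;
    }

    // Inserts or overwrites; evicts the least recently used entry when full.
    void put(const Key & key, Value value) {
        auto it = index_.find(key);
        if (it != index_.end()) {
            it->second->second = std::move(value);
            items_.splice(items_.begin(), items_, it->second);
            return;
        }
        if (capacity_ == 0) {
            return;
        }
        if (items_.size() == capacity_) {
            index_.erase(items_.back().first);
            items_.pop_back();  // the evicted Value is destroyed here
        }
        items_.emplace_front(key, std::move(value));
        index_[key] = items_.begin();
    }

    std::size_t size() const { return items_.size(); }

  private:
    using item_list = std::list<std::pair<Key, Value>>;

    std::size_t capacity_;
    item_list   items_;  // front = most recently used
    std::unordered_map<Key, typename item_list::iterator> index_;
};
```

The capacity bounds how many graph instances can be alive at once, so a backend that never reuses a key costs at most `capacity` instances instead of growing without limit.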

kpouget Feb 19, 2026
Author

> It would also be nice that the memory would be reclaimed

Would be nice indeed. That was my first attempt (cleaning up the cgraph), but it didn't help.

> One workaround for you is to disable CUDA graph by setting GGML_CUDA_DISABLE_GRAPHS=1 env variable.

nice, I'll try that ASAP

am17an · 2026-02-19T12:28:46Z

am17an
Feb 19, 2026
Collaborator

Earlier we did not cache the keys, rather just compared the graph properties. The key caching allows us to store multiple graphs in case of splits, which speeds up common use cases like tensor offload etc. Is there a structural reason why this backend returns a new cgraph every time?

2 replies

kpouget Feb 19, 2026
Author

> Is there a structural reason why this backend returns a new cgraph every time?

ggml-virtgpu is actually split in two: one GGML backend running in the VM, and one "API Remoting" backend running on the host.
Both sides exchange the relevant data to make everything work, but in the current implementation the cgraph object is reconstructed every time: same logical content, different memory objects.

I tried to implement a caching mechanism, but could not get anything to work, and I lack the knowledge to really understand what changes and what remains constant in the object between two cgraph_compute calls ...

am17an Feb 19, 2026
Collaborator

I see. Since CUDA graphs are being recreated after graph_compute, you should just disable them for now, as they wouldn't provide a speedup. A possible solution would be to store the graphs via the first node's node->name, but I'm unsure how reliable that is, apart from the cost of the string comparison in the fast path.
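
A name-keyed lookup along those lines might look like the sketch below. The types and the node names are hypothetical stand-ins chosen for the example, and nothing here mirrors the actual ggml-cuda cache. As noted above, two different graphs could share a first-node name, so a real implementation would presumably still need the existing "update required" check on a hit.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-ins for this sketch; not the real ggml types.
struct named_node   { std::string name; };
struct named_cgraph { std::vector<named_node> nodes; };

// Cache keyed by the first node's name instead of its address.
struct name_keyed_cache {
    std::unordered_map<std::string, int> launches;

    // Returns true on a cache hit, false when a new entry had to be created.
    bool compute(const named_cgraph & g) {
        const std::string & key = g.nodes[0].name;
        const bool hit = launches.find(key) != launches.end();
        launches[key] += 1;
        return hit;
    }
};

// Builds a fresh graph object with the same logical content each time,
// the way a deserializer would. The node names are made up for the example.
inline named_cgraph rebuild_graph() {
    named_cgraph g;
    g.nodes.push_back(named_node{"first-node"});
    g.nodes.push_back(named_node{"second-node"});
    return g;
}
```

Here two calls with separately rebuilt graphs resolve to the same entry, at the price of hashing and comparing a string on every lookup.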

kpouget · 2026-02-20T08:57:45Z

kpouget
Feb 20, 2026
Author

ok, it works better with GGML_CUDA_DISABLE_GRAPHS=1, nice :)

5 replies

gaugarg-nv Feb 20, 2026
Collaborator

I have a PR #19754 open that should also address this issue.

kpouget Mar 6, 2026
Author

just for information, the PR doesn't solve my issue. I'll try to investigate in the next weeks what's going on.

I hit this crash with b8163

/src/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:96: CUDA error
/app/llama.cpp/build/bin/libggml-base.so.0(+0x16608)[0x7f20c406f608]
/app/llama.cpp/build/bin/libggml-base.so.0(ggml_print_backtrace+0x1e6)[0x7f20c406f9d6]
/app/llama.cpp/build/bin/libggml-base.so.0(ggml_abort+0x11d)[0x7f20c406fb5d]
/usr/bin/libggml-cuda.so(+0x188c03)[0x7f20ba40cc03]
/usr/bin/libggml-cuda.so(+0x19b662)[0x7f20ba41f662]
/usr/bin/libggml-cuda.so(+0x19c32d)[0x7f20ba42032d]
/app/llama.cpp/build/bin/libggml-virtgpu-backend.so(_Z29backend_backend_graph_computeP12apir_encoderP12apir_decoderP18virgl_apir_context+0x16e)[0x7f20c40fdcee]
/app/llama.cpp/build/bin/libggml-virtgpu-backend.so(apir_backend_dispatcher+0x68)[0x7f20c40fda08]
/usr/libexec/virgl_render_server[0x40549b]
/usr/libexec/virgl_render_server[0x406766]
/usr/libexec/virgl_render_server[0x403365]
/usr/libexec/virgl_render_server[0x4026b3]
/lib64/libc.so.6(+0x2a610)[0x7f20c5962610]
/lib64/libc.so.6(__libc_start_main+0x80)[0x7f20c59626c0]
/usr/libexec/virgl_render_server[0x4026e5]
srv          stop: cancel task, id_task = 918

but when I set GGML_CUDA_DISABLE_GRAPHS=1 it works

[figure: prompt-processing benchmark]

[figure: token-generation benchmark]

gaugarg-nv Mar 8, 2026
Collaborator

Can you paste the error log? It should print a more descriptive CUDA error and the place where the error occurred.

kpouget Mar 8, 2026
Author

this (see above) is the error message that was captured, nothing more

/src/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:96: CUDA error
/app/llama.cpp/build/bin/libggml-base.so.0(+0x16608)[0x7f20c406f608]
/app/llama.cpp/build/bin/libggml-base.so.0(ggml_print_backtrace+0x1e6)[0x7f20c406f9d6]
/app/llama.cpp/build/bin/libggml-base.so.0(ggml_abort+0x11d)[0x7f20c406fb5d]
/usr/bin/libggml-cuda.so(+0x188c03)[0x7f20ba40cc03]
/usr/bin/libggml-cuda.so(+0x19b662)[0x7f20ba41f662]
/usr/bin/libggml-cuda.so(+0x19c32d)[0x7f20ba42032d]
/app/llama.cpp/build/bin/libggml-virtgpu-backend.so(_Z29backend_backend_graph_computeP12apir_encoderP12apir_decoderP18virgl_apir_context+0x16e)[0x7f20c40fdcee]

that's the GGML_ABORT call, so it seems that the GGML_LOG_ERROR messages weren't logged :/

gaugarg-nv Mar 9, 2026
Collaborator

Maybe try using the -v or -lv 2 command-line option.
