Replies: 3 comments 10 replies
I think @gaugarg-nv has recently worked on CUDA graphs. I don't know anything about how this works in CUDA.
Earlier, we did not cache the keys; we just compared the graph properties. Caching the keys lets us store multiple graphs when the graph is split, which speeds up common use cases such as tensor offload. Is there a structural reason why this backend returns a new cgraph every time?
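A minimal sketch of why key caching helps with splits, assuming (hypothetically) that the cache key is the identity of the graph's first node, in the spirit of the `cgraph->nodes[0]` key discussed in this thread. `Node` and `GraphCache` are made-up stand-ins, not ggml types, and a cache miss stands in for capturing and instantiating a CUDA graph.

```python
# Hypothetical stand-ins, NOT ggml types: a node is any object and a
# graph is a list of nodes. The cache key is the identity of nodes[0],
# mimicking a key taken from the address of cgraph->nodes[0].

class Node:
    def __init__(self, op):
        self.op = op


class GraphCache:
    """Keeps one entry per key, so several splits can coexist."""

    def __init__(self):
        self.entries = {}  # key -> number of times the entry was replayed
        self.misses = 0    # each miss stands for a capture + instantiate

    def compute(self, nodes):
        key = id(nodes[0])
        if key not in self.entries:
            self.misses += 1
            self.entries[key] = 0
        self.entries[key] += 1


# Two splits whose node objects stay alive across calls (e.g. the part of
# the model before and after an offloaded tensor).
split0 = [Node("mul_mat"), Node("add")]
split1 = [Node("rms_norm"), Node("mul_mat")]

cache = GraphCache()
for _ in range(3):  # three "tokens"
    cache.compute(split0)
    cache.compute(split1)

print(len(cache.entries), cache.misses)  # 2 2
```

Because each split keeps the same node objects between calls, both splits hit their own entry after the first run instead of evicting each other.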
ok, it works better with


Hello @taronaeo (+ @0cc4m if you have some hints),
I'm trying to get the `ggml-virtgpu` backend running with `ggml-cuda`, but I'm hitting a GPU OOM. Here is what's going on:

- during `backend_backend_graph_compute` (host side), I allocate a new `ggml_cgraph` object every time, and the cgraph is reconstructed from what the guest side sent;
- but CUDA does an equivalent of this (Claude-generated illustration)
- the problem is that `ggml-virtgpu` gives a new `cgraph` and a new `cgraph->nodes[0]` key, so a new graph is allocated every time.

I've been trying to see what can be cached in the `cgraph`, but without success. I also tried to release the `cgraph` and its memory, but it seems (according to Claude's investigation) that the graph's GPU memory is never reclaimed in `ggml-cuda`...

Do you have any clue how `ggml-virtgpu` could cache the cgraph object structure but still update it to perform the right `cgraph_compute`? In other words, which parts of the object must be updated, and which parts can stay cached?
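The symptom described above can be modelled with a small hypothetical sketch (not ggml code). `PointerKeyedCache` and `rebuild_from_guest` are illustrative names, and each cache entry keeps a reference to the nodes it was built from, standing in for GPU memory that is never released. Rebuilding the graph on every call produces a new key and a new entry each time, while reusing one persistent host-side object keeps a single entry.

```python
# Hypothetical model of the problem (NOT ggml code): the cache is keyed by
# the identity of the first node, and each entry keeps a reference to what
# it was built from, standing in for GPU memory that is never reclaimed.

class Node:
    def __init__(self, op):
        self.op = op


class PointerKeyedCache:
    def __init__(self):
        self.entries = {}  # id(first node) -> nodes the entry was built from

    def compute(self, nodes):
        key = id(nodes[0])
        if key not in self.entries:
            self.entries[key] = nodes  # "allocate" a new cached graph


def rebuild_from_guest(ops):
    # Mimics the host side decoding the guest's message into fresh objects.
    return [Node(op) for op in ops]


ops = ["mul_mat", "add", "soft_max"]

# 1) A fresh graph per call: every call misses, so the cache keeps growing.
leaky = PointerKeyedCache()
for _ in range(5):
    leaky.compute(rebuild_from_guest(ops))

# 2) One host-side graph reused across calls: the key stays stable.
stable = PointerKeyedCache()
persistent = rebuild_from_guest(ops)
for _ in range(5):
    # A real backend would refresh per-call fields here (hypothetical step).
    stable.compute(persistent)

print(len(leaky.entries), len(stable.entries))  # 5 1
```

The sketch only shows why the key has to stay stable across calls; it does not say which node fields `ggml-cuda` re-checks when the same key is seen again.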