Name and Version
CUDA version 8.6, build b8571
Operating systems
Linux
GGML backends
RPC
Hardware
RTX 3060
Models
OpenAI ChatGpt OSS
Problem description & steps to reproduce
When model layers are split across local and RPC backends, the RPC backend leaks memory and periodically writes the log message: ggml_backend_cuda_graph_compute: CUDA graph warmup complete. The local backends produce neither these messages nor memory leaks. Whether the message is related to the leak is unknown, but it is a visible difference. To reproduce, run the same task repeatedly without restarting the backend: the warmup messages appear and NVIDIA tools show memory usage growing by a few megabytes each time. Nothing similar happens on the local backend.
Observation: when the local backend is stopped (aborted), the local devices show no memory in use, but the remote device still shows some memory occupied. The occupied amount appears to correspond closely to the leaked memory.
Remote command line: ./rpc-server -c -p port -H address
Local parameters: --tensor-split 37,28 --device CUDA0,RPC0
Environment variable: LLAMA_ARG_RPC=host:port
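To put numbers on the growth, something like the following could be run on the remote host between repetitions of the task. It is only a sketch: it assumes nvidia-smi is on the PATH and that the first listed GPU is the one used by rpc-server. The helper names (parse_used_mib, growth, sample_growth) are made up for illustration and are not part of llama.cpp.

```python
import subprocess
import time

# nvidia-smi query that prints used memory in MiB, one line per GPU.
QUERY = ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"]


def parse_used_mib(output: str) -> list[int]:
    """Parse the query output: one integer (MiB) per non-empty line."""
    return [int(line.strip()) for line in output.splitlines() if line.strip()]


def read_used_mib() -> list[int]:
    """Run nvidia-smi and return used memory per GPU in MiB."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_used_mib(result.stdout)


def growth(samples: list[int]) -> list[int]:
    """Differences between consecutive samples; positive values mean growth."""
    return [later - earlier for earlier, later in zip(samples, samples[1:])]


def sample_growth(count: int, interval_s: float, gpu_index: int = 0) -> list[int]:
    """Take `count` samples of one GPU, `interval_s` seconds apart.

    gpu_index=0 is an assumption; pick the GPU that rpc-server actually uses.
    """
    samples = []
    for i in range(count):
        samples.append(read_used_mib()[gpu_index])
        if i + 1 < count:
            time.sleep(interval_s)
    return growth(samples)
```

Calling sample_growth(10, 30) while the same task is repeated from the local side would return nine deltas. A steady run of small positive values while the task repeats would match the leak described above.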
First Bad Commit
No response
Relevant log output
Logs