Eval bug: Memory leak on RPC CUDA backend #21265

@karambaso

Name and Version

CUDA version 8.6, build b8571

Operating systems

Linux

GGML backends

RPC

Hardware

RTX 3060

Models

OpenAI ChatGpt OSS

Problem description & steps to reproduce

When model layers are split across a local backend and an RPC backend, the RPC backend leaks memory and periodically logs the message: ggml_backend_cuda_graph_compute: CUDA graph warmup complete. The local backend produces neither the message nor the leak. Whether the message is related to the leak is unknown, but it is a visible difference between the two. To reproduce, run the same task repeatedly without restarting the backend: the warmup messages appear, and the Nvidia tools show memory usage growing by a few megabytes each time. Nothing similar happens on the local backend.

Observation: when the local backend is stopped (aborted), the local devices show no memory in use, but the remote device still shows some memory occupied. The occupied amount appears to correspond closely to the leaked memory.
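The per-run growth described above could be quantified on the RPC host with a small script like the one below. This is only a sketch, not part of the original report: it assumes nvidia-smi is on the PATH, that its memory.used query prints one plain integer (MiB) per GPU, and that the GPU of interest is index 0. The helper names (parse_memory_used, collect_samples, looks_like_leak) and the run_task callback are hypothetical.

```python
import subprocess


def parse_memory_used(csv_text):
    """Parse one integer (MiB) per non-empty line, one line per GPU."""
    return [int(line.strip()) for line in csv_text.splitlines() if line.strip()]


def query_memory_used():
    """Ask nvidia-smi for the memory currently used on every GPU, in MiB."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_memory_used(result.stdout)


def collect_samples(runs, run_task, query=query_memory_used, gpu_index=0):
    """Run the task `runs` times and record used memory after each run.

    `run_task` is a caller-supplied callable (hypothetical) that submits the
    same request to the local backend once and waits for it to finish.
    """
    samples = []
    for _ in range(runs):
        run_task()
        samples.append(query()[gpu_index])
    return samples


def growth_per_run(samples):
    """Differences between consecutive readings, in MiB."""
    return [later - earlier for earlier, later in zip(samples, samples[1:])]


def looks_like_leak(samples, min_deltas=3):
    """Heuristic only: every consecutive reading grew, over enough runs."""
    deltas = growth_per_run(samples)
    return len(deltas) >= min_deltas and all(delta > 0 for delta in deltas)
```

Run on the machine hosting rpc-server, collect_samples(10, run_task) followed by growth_per_run(...) would give the per-run increase in MiB. A reading taken after the local backend is stopped could then be compared with the sum of those increases to check the observation above.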

Remote command line: ./rpc-server -c -p port -H address
Local parameters: --tensor-split 37,28 --device CUDA0,RPC0
Environment variable: LLAMA_ARG_RPC=host:port
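For readers gauging how much of the model lands on the remote device, the split above can be turned into fractions. This sketch assumes the --tensor-split values are relative proportions that are normalized over the devices listed in --device, in the same order. The function name is hypothetical.

```python
def split_fractions(weights):
    """Normalize relative split weights so they sum to 1."""
    total = sum(weights)
    return [weight / total for weight in weights]


devices = ["CUDA0", "RPC0"]
fractions = split_fractions([37, 28])
for device, fraction in zip(devices, fractions):
    print(f"{device}: {fraction:.1%}")
```

Under that assumption, roughly 57% of the split goes to CUDA0 and roughly 43% to RPC0.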

First Bad Commit

No response

Labels

bug
