Your current environment
vLLM main (latest) and v0.8.5.post1, tested on H100 multi-GPU (tp=2).
How would you like to use vllm
I'm working on integrating NVIDIA's cuda-checkpoint tool with vLLM for near-zero cold starts (related to RFC #34303). During multi-GPU testing, I discovered that forked worker processes retain the parent's CUDA primary context for GPU 0, even when the worker is assigned to GPU 1.
Describe the bug
When vLLM uses the fork multiprocessing method (the default on Linux), child worker processes inherit the parent's active CUDA primary contexts for all devices. A worker assigned to GPU 1 ends up with two active primary contexts: GPU 0 (inherited from parent) and GPU 1 (its own).
This causes two problems:
- Wasted GPU memory - the stale GPU 0 context in the GPU 1 worker holds driver-level allocations that never get freed
- NVIDIA cuda-checkpoint failures - cuda-checkpoint --action restore fails with "invalid argument" because it tries to restore both contexts in the worker process, but the GPU 0 context is stale and cannot be restored
Reproduction
# In a forked worker process assigned to GPU 1:
import ctypes, torch
libcuda = ctypes.CDLL("libcuda.so.1")
libcuda.cuInit(0)
for dev_id in range(torch.cuda.device_count()):
    dev = ctypes.c_int()
    libcuda.cuDeviceGet(ctypes.byref(dev), dev_id)
    flags = ctypes.c_uint()
    state = ctypes.c_int()
    libcuda.cuDevicePrimaryCtxGetState(dev, ctypes.byref(flags), ctypes.byref(state))
    print(f"GPU {dev_id}: active={state.value != 0}")
# Output:
# GPU 0: active=True <-- STALE, inherited from parent
# GPU 1: active=True <-- worker's actual device
Root cause
In vllm/v1/worker/gpu_worker.py, Worker.init_device() calls torch.accelerator.set_device_index(self.device) to set the worker's device, but never releases inherited primary contexts from other devices. The parent process may have initialized CUDA on GPU 0 before forking, and that context persists in the child.
Fix
Call cuDevicePrimaryCtxRelease() for all non-assigned devices after setting the worker's device. I have a PR ready with the fix and a test.
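To make the proposed fix concrete, here is a minimal sketch of what the release step could look like. It is not the code from the PR. The helper name release_inherited_primary_contexts and its parameters are hypothetical; only cuDeviceGet, cuDevicePrimaryCtxGetState, and cuDevicePrimaryCtxRelease are real CUDA driver API entry points. It assumes a single release is enough to drop the inherited reference, which may not hold if the parent retained the context more than once.

```python
import ctypes


def release_inherited_primary_contexts(libcuda, assigned_device, device_count):
    """Release active primary contexts on every device except assigned_device.

    `libcuda` is any object exposing the CUDA driver entry points, normally
    ctypes.CDLL("libcuda.so.1") after cuInit(0) has been called.
    Returns the device ordinals for which a release call succeeded.
    Hypothetical helper for illustration; not part of vLLM.
    """
    released = []
    for dev_id in range(device_count):
        if dev_id == assigned_device:
            continue  # never touch the worker's own device

        dev = ctypes.c_int()
        if libcuda.cuDeviceGet(ctypes.byref(dev), dev_id) != 0:
            continue  # device not visible; nothing to release

        flags = ctypes.c_uint()
        active = ctypes.c_int()
        rc = libcuda.cuDevicePrimaryCtxGetState(
            dev, ctypes.byref(flags), ctypes.byref(active)
        )
        if rc != 0 or active.value == 0:
            continue  # no active primary context on this device

        # Assumption: one release drops the reference inherited from the parent.
        if libcuda.cuDevicePrimaryCtxRelease(dev) == 0:
            released.append(dev_id)
    return released
```

In a worker assigned to GPU 1, this would be called after the device is set, with ctypes.CDLL("libcuda.so.1") as libcuda, 1 as assigned_device, and torch.cuda.device_count() as device_count. Re-running the reproduction snippet afterwards is a reasonable way to check whether GPU 0 still reports an active context.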
Impact
This is a correctness issue for any external tooling that enumerates per-process CUDA contexts (cuda-checkpoint, GPU memory profilers, container checkpoint/restore). It also causes a small but unnecessary memory waste per worker process.