[Bug]: Forked workers retain stale CUDA primary contexts from parent process #42873

@lokashrinav

Your current environment

vLLM main (latest) and v0.8.5.post1, tested on H100 multi-GPU (tp=2).

How would you like to use vllm

I'm working on integrating NVIDIA's cuda-checkpoint tool with vLLM for near-zero cold starts (related to RFC #34303). During multi-GPU testing, I discovered that forked worker processes retain the parent's CUDA primary context for GPU 0, even when the worker is assigned to GPU 1.

Before submitting a new issue...

  • I have searched existing issues
  • I have read the relevant documentation

Describe the bug

When vLLM uses the fork multiprocessing method (the default on Linux), child worker processes inherit the parent's active CUDA primary contexts for all devices. A worker assigned to GPU 1 ends up with two active primary contexts: GPU 0 (inherited from parent) and GPU 1 (its own).

This causes two problems:

  1. Wasted GPU memory: the stale GPU 0 context in the GPU 1 worker holds driver-level allocations that are never freed.
  2. NVIDIA cuda-checkpoint failures: cuda-checkpoint --action restore fails with "invalid argument" because it tries to restore both contexts in the worker process, and the stale GPU 0 context cannot be restored.

Reproduction

# Run inside a forked worker process assigned to GPU 1.
import ctypes
import torch

libcuda = ctypes.CDLL("libcuda.so.1")
assert libcuda.cuInit(0) == 0  # 0 is CUDA_SUCCESS

for dev_id in range(torch.cuda.device_count()):
    dev = ctypes.c_int()  # CUdevice handle for this ordinal
    libcuda.cuDeviceGet(ctypes.byref(dev), dev_id)
    flags = ctypes.c_uint()
    state = ctypes.c_int()  # nonzero when the primary context is active
    libcuda.cuDevicePrimaryCtxGetState(dev, ctypes.byref(flags), ctypes.byref(state))
    print(f"GPU {dev_id}: active={state.value != 0}")
# Output:
#   GPU 0: active=True   <-- STALE, inherited from parent
#   GPU 1: active=True   <-- worker's actual device

Root cause

In vllm/v1/worker/gpu_worker.py, Worker.init_device() calls torch.accelerator.set_device_index(self.device) to set the worker's device, but never releases inherited primary contexts from other devices. The parent process may have initialized CUDA on GPU 0 before forking, and that context persists in the child.

Fix

Call cuDevicePrimaryCtxRelease() for all non-assigned devices after setting the worker's device. I have a PR ready with the fix and a test.
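For illustration, the idea can be sketched roughly as follows. This is not the code from the PR: the function name release_stale_primary_contexts and its parameters are hypothetical, and the driver handle is passed in so the logic can be exercised without a GPU. It also assumes a single cuDevicePrimaryCtxRelease() call is enough to drop the inherited reference, which may not hold if the context was retained more than once.

```python
import ctypes

CUDA_SUCCESS = 0


def release_stale_primary_contexts(libcuda, assigned_device: int, device_count: int) -> list[int]:
    """Release active primary contexts on every device except assigned_device.

    libcuda is a ctypes handle to the CUDA driver (or any object exposing the
    same three functions). Returns the device ordinals that were released.
    Hypothetical helper, not vLLM's actual implementation.
    """
    released = []
    for dev_id in range(device_count):
        if dev_id == assigned_device:
            continue  # never touch the worker's own device

        dev = ctypes.c_int()  # CUdevice handle
        if libcuda.cuDeviceGet(ctypes.byref(dev), dev_id) != CUDA_SUCCESS:
            continue

        flags = ctypes.c_uint()
        active = ctypes.c_int()
        rc = libcuda.cuDevicePrimaryCtxGetState(
            dev, ctypes.byref(flags), ctypes.byref(active)
        )
        if rc != CUDA_SUCCESS or active.value == 0:
            continue  # nothing inherited on this device

        # Drops one reference; the driver resets the context once the
        # last reference is gone.
        if libcuda.cuDevicePrimaryCtxRelease(dev) == CUDA_SUCCESS:
            released.append(dev_id)
    return released


# Possible use in a worker, after its own device has been set:
#   libcuda = ctypes.CDLL("libcuda.so.1")
#   libcuda.cuInit(0)
#   release_stale_primary_contexts(libcuda, assigned_device=1, device_count=2)
```

Skipping devices whose primary context is already inactive keeps the call safe to run in workers started with spawn, where nothing is inherited.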

Impact

This is a correctness issue for any external tooling that enumerates per-process CUDA contexts (cuda-checkpoint, GPU memory profilers, container checkpoint/restore). It also wastes a small amount of GPU memory in each worker process.
