[Bug]: Forked workers retain stale CUDA primary contexts from parent process #42873

## Your current environment

Reproduced on vLLM `main` (latest) and v0.8.5.post1, on a multi-GPU H100 setup with tensor parallelism (tp=2).

## How would you like to use vllm

I'm working on integrating NVIDIA's `cuda-checkpoint` tool with vLLM for near-zero cold starts (related to RFC #34303). During multi-GPU testing, I discovered that forked worker processes retain the parent's CUDA primary context for GPU 0, even when the worker is assigned to GPU 1.

## Before submitting a new issue...

- [x] I have searched existing issues
- [x] I have read the relevant documentation

## Describe the bug

When vLLM uses the `fork` multiprocessing method (the default on Linux), child worker processes inherit the parent's active CUDA primary contexts for **all** devices. A worker assigned to GPU 1 ends up with **two** active primary contexts: GPU 0 (inherited from parent) and GPU 1 (its own).

This causes two problems:

1. **Wasted GPU memory** - the stale GPU 0 context in the GPU 1 worker holds driver-level allocations that never get freed
2. **NVIDIA cuda-checkpoint failures** - `cuda-checkpoint --action restore` fails with "invalid argument" because it tries to restore both contexts in the worker process, but the GPU 0 context is stale and cannot be restored

### Reproduction

```python
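# Run inside a forked worker process assigned to GPU 1.
import ctypes

import torch

libcuda = ctypes.CDLL("libcuda.so.1")
assert libcuda.cuInit(0) == 0, "cuInit failed"

for dev_id in range(torch.cuda.device_count()):
    dev = ctypes.c_int()  # CUdevice handle for this ordinal
    assert libcuda.cuDeviceGet(ctypes.byref(dev), dev_id) == 0, "cuDeviceGet failed"
    flags = ctypes.c_uint()
    state = ctypes.c_int()  # nonzero if the primary context is active
    rc = libcuda.cuDevicePrimaryCtxGetState(dev, ctypes.byref(flags), ctypes.byref(state))
    assert rc == 0, f"cuDevicePrimaryCtxGetState failed with CUresult {rc}"
    print(f"GPU {dev_id}: active={state.value != 0}")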
# Observed output:
#   GPU 0: active=True   <-- STALE, inherited from parent
#   GPU 1: active=True   <-- worker's actual device
```

### Root cause

In `vllm/v1/worker/gpu_worker.py`, `Worker.init_device()` calls `torch.accelerator.set_device_index(self.device)` to set the worker's device, but never releases inherited primary contexts from other devices. The parent process may have initialized CUDA on GPU 0 before forking, and that context persists in the child.

### Fix

Call `cuDevicePrimaryCtxRelease()` for all non-assigned devices after setting the worker's device. I have a PR ready with the fix and a test.
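
For discussion, here is a minimal sketch of the release step described above. It is not the code from the PR. The helper name `release_stale_primary_contexts` and its parameters are hypothetical, and the driver library is passed in as an argument so the logic can be exercised without a GPU. Two assumptions to flag: ctypes resolves the unversioned `cuDevicePrimaryCtxRelease` symbol (newer CUDA headers map that name to a `_v2` entry point), and a single release is enough to deactivate the inherited context, which may not hold if it was retained more than once.

```python
import ctypes

CUDA_SUCCESS = 0  # CUresult value for success


def release_stale_primary_contexts(libcuda, device_count, keep_device):
    """Release active primary contexts on every device except keep_device.

    `libcuda` is a ctypes handle to the CUDA driver (or any object exposing
    the same three functions). Returns the device ordinals for which a
    release call succeeded. Hypothetical helper, for illustration only.
    """
    released = []
    for ordinal in range(device_count):
        if ordinal == keep_device:
            continue  # never touch the worker's own device

        dev = ctypes.c_int()  # CUdevice handle
        if libcuda.cuDeviceGet(ctypes.byref(dev), ordinal) != CUDA_SUCCESS:
            continue

        flags = ctypes.c_uint()
        active = ctypes.c_int()
        rc = libcuda.cuDevicePrimaryCtxGetState(
            dev, ctypes.byref(flags), ctypes.byref(active)
        )
        if rc != CUDA_SUCCESS or active.value == 0:
            continue  # nothing inherited on this device

        # Assumption: one release drops the inherited reference.
        if libcuda.cuDevicePrimaryCtxRelease(dev) == CUDA_SUCCESS:
            released.append(ordinal)
    return released
```

In a worker assigned to GPU 1 this would be called as `release_stale_primary_contexts(ctypes.CDLL("libcuda.so.1"), torch.cuda.device_count(), keep_device=1)` after the device has been set. Re-running the reproduction snippet afterwards is one way to check whether GPU 0 still reports an active context.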

### Impact

This is a correctness issue for any external tooling that enumerates per-process CUDA contexts (cuda-checkpoint, GPU memory profilers, container checkpoint/restore). It also wastes a small amount of GPU memory in every worker process.

