[data] OOM killer kicks in but vLLM gpu processes are not cleaned up #54364
Description
What happened + What you expected to happen
The issue is that when a stage holds a vLLM engine and the OOM killer kicks in, the vLLM GPU worker processes associated with the killed actor are not cleaned up. This is separate from issue #53124, which was mitigated by using Ray as the distributed executor backend.
This issue occurs regardless of whether we use distributed_executor_backend = "ray" or "mp". With "ray", the respawned actor may request a different GPU on a multi-GPU node, but the old process still lingers on its GPU and keeps its memory allocated.
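Until this is fixed, one way to spot the lingering workers is to cross-check nvidia-smi's compute-process list against live PIDs and then kill orphans by hand. A minimal sketch; the helper names are my own, and the CSV parsing assumes the output shape of `nvidia-smi --query-compute-apps=pid,used_memory --format=csv`, not anything Ray or vLLM provides:

```python
import csv
import io
import os
import subprocess

def parse_compute_apps(csv_text):
    """Parse `nvidia-smi --query-compute-apps=pid,used_memory --format=csv`
    output into a list of (pid, used_mib) tuples."""
    rows = list(csv.reader(io.StringIO(csv_text.strip())))
    apps = []
    for row in rows[1:]:  # skip the CSV header row
        pid = int(row[0].strip())
        mem_mib = int(row[1].strip().split()[0])  # "1024 MiB" -> 1024
        apps.append((pid, mem_mib))
    return apps

def pid_alive(pid):
    """True if a process with this PID exists; signal 0 probes without killing."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # exists, but owned by another user
    return True

def gpu_processes():
    """Return (pid, used_mib) for every compute process nvidia-smi reports."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-compute-apps=pid,used_memory", "--format=csv"],
        text=True,
    )
    return parse_compute_apps(out)
```

In this failure mode the leaked worker is still alive (it is holding GPU memory), so the useful signal is a compute process whose PID no longer maps to any running Ray actor; the list from `gpu_processes()` gives you the PIDs to inspect.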
Versions / Dependencies
N/A
Reproduction script
import ray
from vllm import LLM

class UDF:
    def __init__(self):
        self.memory = []
        self.llm = LLM(
            model="unsloth/Llama-3.2-1B-Instruct",
            enforce_eager=True,
            # With "mp", the zombie process and the new process collide on the same GPU.
            # With "ray", the new actor may land on another GPU and avoid a collision
            # at first, but it can eventually hit the same GPU.
            distributed_executor_backend="ray",
        )

    def __call__(self, batch):
        # 400M four-byte characters, roughly 1.6 GB of heap per batch
        GIANT_OBJECT = "🤗" * 400_000_000
        self.memory.append(GIANT_OBJECT)
        return batch

ds = ray.data.range(2000)
ds = ds.map_batches(UDF, batch_size=2, concurrency=1, num_gpus=1)
ds = ds.materialize()
print(ds.take_all())

In this repro, the UDF instantiates an LLM engine in its constructor. Its __call__ method creates a roughly 1.6 GB string on every batch and appends it to actor state, steadily growing the heap until it triggers a CPU OOM.
When the CPU OOM killer kicks in and the actor is restarted, the GPU that was occupied by the previous, dead process remains occupied. I ran this on 4xL40S with 380GB of VRAM.
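One partial mitigation while the cleanup bug stands is to watch the actor's own resident memory inside __call__ and fail fast before the kernel OOM killer gets involved, the idea being that an ordinary Python exception lets the process tear down its vLLM workers instead of being SIGKILLed with them orphaned. A hedged, stdlib-only sketch; the budget value and helper names are illustrative, not a Ray or vLLM API:

```python
import resource
import sys

# ru_maxrss is reported in kilobytes on Linux but in bytes on macOS.
_RU_MAXRSS_UNIT = 1 if sys.platform == "darwin" else 1024

def rss_bytes():
    """Peak resident set size of the current process, in bytes."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss * _RU_MAXRSS_UNIT

def check_heap_budget(budget_bytes, current=None):
    """Raise MemoryError once the heap exceeds the budget, well before the
    kernel OOM killer would fire. `current` is overridable for testing."""
    used = rss_bytes() if current is None else current
    if used > budget_bytes:
        raise MemoryError(f"actor heap {used} B exceeds budget {budget_bytes} B")
    return used
```

Calling `check_heap_budget(...)` at the top of the UDF's __call__ turns the runaway heap into a regular task failure; whether that is enough to avoid the leaked GPU process depends on how the actor is then restarted, so it is a workaround sketch, not a fix.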
Issue Severity
Medium: It is a significant difficulty but I can work around it.