[data] OOM killer kicks in but vLLM gpu processes are not cleaned up #54364

@kouroshHakha

What happened + What you expected to happen

The issue is that when a stage runs a vLLM engine and the OOM killer kicks in, the vLLM GPU process associated with the killed process is not cleaned up. This is separate from issue #53124, which was mitigated by using ray as the distributed executor backend.

This issue occurs regardless of whether distributed_executor_backend is set to ray or mp. With ray, the respawned actor may request a new GPU in a multi-GPU cluster, but the old process still lingers on its GPU.

Versions / Dependencies

N/A

Reproduction script

import ray
from vllm import LLM


class UDF:
    
    def __init__(self):
        self.memory = []
        self.llm = LLM(
            model="unsloth/Llama-3.2-1B-Instruct",
            enforce_eager=True,
            # With "mp", the zombie process and the new process collide on the same GPU.
            # With "ray", the new process can land on another GPU and avoid a collision,
            # but it can eventually hit the same GPU.
            distributed_executor_backend="ray",
        )

        
        
    def __call__(self, batch):
        # ~1.6 GB per batch: 400M characters at 4 bytes each in a CPython str
        GIANT_OBJECT = "🤗" * 400_000_000
        self.memory.append(GIANT_OBJECT)
        
        return batch
    
    
ds = ray.data.range(2000)
ds = ds.map_batches(UDF, batch_size=2, concurrency=1, num_gpus=1)
ds = ds.materialize()

print(ds.take_all())

In this repro, the UDF instantiates an LLM engine in its constructor. Each call creates a roughly 1.6 GB object (400M characters at 4 bytes each) and appends it to self.memory, growing the UDF's heap until it triggers a CPU OOM.
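As a quick sanity check on the per-batch allocation, the sketch below measures how many bytes CPython uses per character for the repro's string and scales that to 400M characters. The helper names `bytes_per_char` and `batch_size_gb` are hypothetical and not part of the repro; the measurement assumes CPython's compact string representation.

```python
import sys


def bytes_per_char(ch: str, n: int = 1000) -> int:
    """Marginal bytes CPython uses per character when `ch` is repeated.

    Taking the difference between two lengths cancels the fixed object header.
    """
    return (sys.getsizeof(ch * (2 * n)) - sys.getsizeof(ch * n)) // n


def batch_size_gb(ch: str, count: int) -> float:
    """Approximate size in GB (10^9 bytes) of `ch * count`, excluding the header."""
    return bytes_per_char(ch) * count / 1e9


if __name__ == "__main__":
    # The emoji is outside the Basic Multilingual Plane, so each character
    # takes 4 bytes in a CPython str.
    print(bytes_per_char("🤗"))
    print(batch_size_gb("🤗", 400_000_000))
```

With batch_size=2 and 2000 rows, the UDF is called up to 1000 times, so the heap grows by this amount on every call until the OOM killer intervenes.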

When the CPU OOM killer kicks in and the actor is restarted, the GPU that was occupied by the previous dead process remains occupied. I ran this on 4xL40S with 380GB of VRAM.
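One way to confirm the lingering process is to compare the PIDs that hold GPU memory against the PIDs you expect to be alive after the restart. The sketch below is an illustration only: it assumes `nvidia-smi` is on the PATH and that its reported PIDs are visible in the current PID namespace (this may not hold inside containers). The helper names and the sample PIDs are hypothetical.

```python
from __future__ import annotations

import subprocess

# nvidia-smi query for compute processes, one "pid, used_memory_MiB" row per process.
QUERY = [
    "nvidia-smi",
    "--query-compute-apps=pid,used_memory",
    "--format=csv,noheader,nounits",
]


def parse_compute_apps(csv_text: str) -> dict[int, int]:
    """Map PID -> total GPU memory in MiB from the CSV rows of QUERY."""
    usage: dict[int, int] = {}
    for line in csv_text.splitlines():
        parts = [p.strip() for p in line.split(",")]
        # Skip blank rows and rows where a field is not numeric (e.g. "[N/A]").
        if len(parts) != 2 or not all(p.isdigit() for p in parts):
            continue
        pid, mib = int(parts[0]), int(parts[1])
        # A process that spans several GPUs appears once per GPU, so sum the rows.
        usage[pid] = usage.get(pid, 0) + mib
    return usage


def unexpected_gpu_pids(usage: dict[int, int], expected: set[int]) -> list[int]:
    """PIDs holding GPU memory that are not in the expected set."""
    return sorted(pid for pid in usage if pid not in expected)


def query_gpu_processes() -> dict[int, int]:
    """Run nvidia-smi and parse its output (requires an NVIDIA driver)."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_compute_apps(out.stdout)


if __name__ == "__main__":
    # Made-up sample: 4242 stands in for the process left behind by the killed
    # actor, 5151 for the process started by the restarted actor.
    sample = "4242, 2950\n5151, 2948\n"
    print(unexpected_gpu_pids(parse_compute_apps(sample), expected={5151}))
```

Running `query_gpu_processes()` before and after the actor restart, and passing the new actor's process IDs as `expected`, should surface any process that still holds GPU memory.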

Issue Severity

Medium: It is a significant difficulty but I can work around it.

Labels

P1: Issue that should be fixed within a few weeks
bug: Something that is supposed to be working; but isn't
core: Issues that should be addressed in Ray Core
data: Ray Data-related issues