[Data][LLM] Support multi-node setup for ray.data.llm

### Description

At the moment, `vLLMEngineProcessorConfig` from `ray.data.llm` requires all GPUs to be located on the same node (see details below). This feature request proposes removing that restriction to support cross-node deployment.

With `distributed_executor_backend` set to `"ray"`:
```
        "tensor_parallel_size": 2,
        "pipeline_parallel_size": 3,
        "distributed_executor_backend": "ray"
```
=> `{'GPU': 1.0, 'CPU': 1.0} * 6 (STRICT_PACK)`

With `distributed_executor_backend` left unset:
```
        "tensor_parallel_size": 2,
        "pipeline_parallel_size": 3,
        # "distributed_executor_backend": "ray"
```
=> `[{GPU: 6, CPU: 1}]`

With `distributed_executor_backend` left unset and `resources_per_bundle` provided:
```
    engine_kwargs={
        ...
        "tensor_parallel_size": 2,
        "pipeline_parallel_size": 3,
        # "distributed_executor_backend": "ray"
    },
    resources_per_bundle={'num_gpus': 1},
```
=> `[{CPU: 1, num_gpus: 6}]`

#### How does it happen

Currently, `vllm_engine_stage.py` aggregates [all TP and PP](https://github.com/ray-project/ray/blob/f973fe59032e20a80a7ed5cbc75b87eee37a2b45/python/ray/llm/_internal/batch/stages/vllm_engine_stage.py#L654) requirements and generates `ray_remote_args` through [_ray_scheduling_strategy_fn](https://github.com/ray-project/ray/blob/f973fe59032e20a80a7ed5cbc75b87eee37a2b45/python/ray/llm/_internal/batch/stages/vllm_engine_stage.py#L669), which forces the [STRICT_PACK](https://github.com/ray-project/ray/blob/f973fe59032e20a80a7ed5cbc75b87eee37a2b45/python/ray/llm/_internal/batch/stages/vllm_engine_stage.py#L615) strategy. `STRICT_PACK` puts [all bundles on the same node](https://docs.ray.io/en/latest/ray-core/scheduling/placement-group.html#placement-strategy).
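
To make the behaviour concrete, here is a minimal sketch of what the aggregation amounts to for the `"ray"` backend case. It is a simplified model, not the actual code in `vllm_engine_stage.py`; the function name `current_placement` is hypothetical.

```python
from typing import Dict, List, Tuple


def current_placement(
    tp_size: int, pp_size: int
) -> Tuple[List[Dict[str, float]], str]:
    """Simplified model of the current behaviour (hypothetical helper).

    One single-GPU bundle is requested per worker, and all bundles are
    tied together with STRICT_PACK, so they must land on one node.
    """
    num_workers = tp_size * pp_size
    bundles = [{"GPU": 1.0, "CPU": 1.0} for _ in range(num_workers)]
    return bundles, "STRICT_PACK"


# TP=2, PP=3 yields six single-GPU bundles that must share a node.
bundles, strategy = current_placement(tp_size=2, pp_size=3)
```

With `tp_size=2` and `pp_size=3` this reproduces the `{'GPU': 1.0, 'CPU': 1.0} * 6 (STRICT_PACK)` request shown above, which can only be scheduled on a node with at least 6 GPUs.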

#### Request

Ideally, the workers of each TP group should be placed on the same node, while PP stages should be allowed to run on different nodes.

### Use case

1. If TP*PP is too large, there might be no single node that satisfies the requirement (a node usually has at most 8 GPUs).
2. If the user only has access to small nodes (with 1 or 2 GPUs each), they should still be able to configure PP.
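
Both cases can be illustrated with a small feasibility check. This is a sketch that only counts GPUs per node (CPU and other resources are ignored), and the helper names are hypothetical.

```python
from typing import List


def fits_on_one_node(node_gpus: List[int], tp_size: int, pp_size: int) -> bool:
    """Current behaviour: some single node must hold all TP*PP GPUs."""
    return any(gpus >= tp_size * pp_size for gpus in node_gpus)


def fits_per_stage(node_gpus: List[int], tp_size: int, pp_size: int) -> bool:
    """Requested behaviour: each PP stage (tp_size GPUs) must fit on one
    node, but different stages may be placed on different nodes."""
    stages_available = sum(gpus // tp_size for gpus in node_gpus)
    return stages_available >= pp_size


# Three nodes with 2 GPUs each, TP=2, PP=3.
small_cluster = [2, 2, 2]
single_node_ok = fits_on_one_node(small_cluster, tp_size=2, pp_size=3)
per_stage_ok = fits_per_stage(small_cluster, tp_size=2, pp_size=3)
```

For the three 2-GPU nodes, `single_node_ok` is `False` while `per_stage_ok` is `True`: the cluster has enough GPUs in total, but only a per-stage placement can use them.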

### Potential solution

1. Update `_ray_scheduling_strategy_fn` to generate something similar to `[num_gpus: tp_size] * pp_size, strategy='PACK'`, since TP usually requires high bandwidth to minimize latency.
2. (Be cautious) If the above is not enough, introduce `resources` in addition to `resources_per_bundle` so that users can provide detailed resource requirements.
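
A rough sketch of the bundle layout that the first option describes follows. The function name `proposed_placement` is hypothetical, and the one CPU per bundle is an assumption carried over from the examples above. Because a Ray placement-group bundle is always placed on a single node, one bundle per PP stage keeps each TP group together, while `PACK` (unlike `STRICT_PACK`) lets bundles spill onto other nodes. How the engine workers then map onto these bundles is outside this sketch.

```python
from typing import Dict, List, Tuple


def proposed_placement(
    tp_size: int, pp_size: int
) -> Tuple[List[Dict[str, float]], str]:
    """One bundle per PP stage, each holding the GPUs of one TP group
    (hypothetical helper, not the actual Ray implementation)."""
    bundles = [{"GPU": float(tp_size), "CPU": 1.0} for _ in range(pp_size)]
    return bundles, "PACK"


bundles, strategy = proposed_placement(tp_size=2, pp_size=3)

# The result could then be handed to Ray, for example:
#   import ray
#   pg = ray.util.placement_group(bundles, strategy=strategy)
```

With `tp_size=2` and `pp_size=3` this gives three 2-GPU bundles, which fit on three 2-GPU nodes as well as on a single 6-GPU node.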

Internal Jira: https://anyscale1.atlassian.net/browse/CI-1255