[Data][LLM] Support multi-node TP/PP for ray.data.llm #56779
kouroshHakha merged 9 commits into ray-project:master
Conversation
Hey @kouroshHakha, just posting this draft to share results and to get early feedback from you as I rebase onto master. There are a couple of things I’d appreciate your thoughts on.
…data.llm; introduce benchmarks

Signed-off-by: jeffreyjeffreywang <jeffjeffreywang@gmail.com>
Force-pushed from 4dbea04 to 7a8f0b6
kouroshHakha left a comment
I feel like we should deprecate `resources_per_bundle` with this change, since using a placement group definition per Ray actor in the pool is more expressive than just specifying the bundle or resources per bundle.
python/ray/llm/_internal/batch/processor/serve_deployment_proc.py
Outdated
```python
# Placement group config for TP/PP.
placement_group_config: Optional[Dict[str, Any]] = Field(
    default=None,
    description=(
        "Custom Ray placement group configuration for scheduling vLLM engine workers. "
        "Each bundle should define its resource requirements, such as 'CPU' and 'GPU'. "
        "Note: When using vLLM's Ray distributed executor backend, each bundle must be restricted "
        "to a single GPU. This configuration is only applicable when the Ray distributed executor backend is enabled."
    ),
)
```
We need to document better what this structure represents and how it should be constructed. Is it only the bundles? Is it bundles + strategy? Is it a placement group object?
I'd create a schema data class for placement_group_config here:
```python
from typing import List, Literal
from pydantic import BaseModel, ConfigDict

class BundleSchema(BaseModel):
    model_config = ConfigDict(extra="allow")  # allow arbitrary custom resources
    CPU: int = 1
    GPU: int = 1

class PlacementGroupSchema(BaseModel):
    bundles: List[BundleSchema]
    strategy: Literal["PACK", "STRICT_PACK", "SPREAD", "STRICT_SPREAD"] = "PACK"
```
and then I'd cast the passed dict to this data type. We can then later do `placement_group(**pg.dict())` or something similar to construct the real placement group. This allows you to skip all those validations you are doing further down the line.
@kouroshHakha let me know if the latest explanation is clearer to you. Introduced the schema classes as suggested to perform validations and apply default values.
+1. Why is this change here?
Addressed separately in #57061. Removed this change from the latest revision to keep this PR focused on the TP/PP changes.
/gemini review
@cursoragent review the benchmark code very closely. See what can be improved for the first version of it.
Code Review
This PR is a great addition, enabling multi-node TP/PP for ray.data.llm, which is a significant feature for scaling LLM inference. The changes are well-implemented, especially the introduction of placement_group_config and relaxing the placement strategy to PACK. The inclusion of a comprehensive benchmark suite and thorough unit/integration tests (including multi-node and failure cases) is excellent and greatly increases confidence in the changes. I have a few suggestions for minor improvements, mostly around code duplication in the new benchmark script and small code style enhancements.
@nrghosh could you work with @jeffreyjeffreywang to set up the new release test and hook it up with the CI?
- Default placement strategy → PACK
- Schema validation with BundleSchema and PlacementGroupSchema
- Deprecate resources_per_bundle → placement_group_config
- Update tests, remove single-GPU-per-bundle restriction
- Deprecation warnings + ensure backwards compatibility

Based on feedback from kouroshHakha in ray-project#56779

Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>
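The deprecation path described in this commit could look roughly like the following sketch. The function name and exact translation are assumptions for illustration, not the PR's actual implementation:

```python
import warnings
from typing import Optional

def resolve_placement_group_config(
    placement_group_config: Optional[dict],
    resources_per_bundle: Optional[dict],
    num_bundles: int,
) -> Optional[dict]:
    """Back-compat shim: translate the deprecated field into the new one."""
    if resources_per_bundle is not None:
        warnings.warn(
            "resources_per_bundle is deprecated; use placement_group_config instead.",
            DeprecationWarning,
        )
        # Only fall back to the legacy field when the new one is unset.
        if placement_group_config is None:
            placement_group_config = {
                "bundles": [dict(resources_per_bundle) for _ in range(num_bundles)],
                "strategy": "PACK",
            }
    return placement_group_config
```

The key property is that existing `resources_per_bundle` users keep working (with a warning), while the new field always wins when both are provided.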
Hey @nrghosh, could you help validate that
Quick comments prior to setting up CI / release testing
- Looks like the llm gpu unit test `test_vllm_engine_processor_placement_group` is unhappy:
```
E AssertionError: assert {'accelerator...gpus': 1, ...} == {'accelerator..._batch': True}
[2025-09-28T08:52:18Z] E Omitting 3 identical items, use -vv to show
[2025-09-28T08:52:18Z] E Left contains 2 more items:
[2025-09-28T08:52:18Z] E {'num_cpus': 1, 'num_gpus': 1}
[2025-09-28T08:52:18Z] E Right contains 1 more item:
[2025-09-28T08:52:18Z] E {'resources': {'CPU': 1, 'GPU': 1}}
[2025-09-28T08:52:18Z] E Full diff:
[2025-09-28T08:52:18Z] E {...
[2025-09-28T08:52:18Z] E ...Full output truncated (8 lines hidden), use '-vv' to show
[2025-09-28T08:52:18Z] python/ray/llm/tests/batch/gpu/processor/test_vllm_engine_proc.py:86: AssertionError
```
- lint / pre_commit checker is also unhappy (`./ci/lint/lint.sh pre_commit`)
Force-pushed from 67fe896 to 5992c0d
- Ray Data requires separate `num_cpus`/`num_gpus` keywords but was receiving a `resources` dict
- The unit test expected resources in a format that contradicted Ray Data's requirements
- Python docstrings were incomplete, so the linter complained

Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>
Force-pushed from 5992c0d to 254ddd8
/gemini review
Code Review
This pull request enables multi-node tensor and pipeline parallelism for ray.data.llm by introducing a configurable placement_group_config for the vLLMEngineProcessorConfig and changing the default placement strategy to PACK. This is a great enhancement for scaling LLM batch inference. The changes are well-implemented, and the addition of a new benchmarking module and corresponding tests is very valuable.
I have a couple of points of feedback. First, there's an unused parameter in build_serve_deployment_processor which could be misleading. Second, one of the new test cases for placement groups seems to have an incorrect configuration that will likely lead to test failures. Addressing these points will improve the clarity and robustness of the code.
python/ray/llm/_internal/batch/processor/serve_deployment_proc.py
Outdated
```python
# Keep only non-CPU/GPU custom resources, if any.
if resource_counter:
    ray_remote_args["resources"] = dict(resource_counter)
```
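As the fix-up commit in this thread notes, Ray Data expects separate `num_cpus`/`num_gpus` keywords rather than CPU/GPU entries inside a `resources` dict. A standalone sketch of that split (the function name is hypothetical; only the custom-resource handling mirrors the snippet above):

```python
from collections import Counter

def bundle_to_remote_args(bundle: dict) -> dict:
    """Split a PG bundle into Ray remote args: CPU/GPU become
    num_cpus/num_gpus; anything else stays a custom resource."""
    resource_counter = Counter(bundle)
    ray_remote_args = {}
    if "CPU" in resource_counter:
        ray_remote_args["num_cpus"] = resource_counter.pop("CPU")
    if "GPU" in resource_counter:
        ray_remote_args["num_gpus"] = resource_counter.pop("GPU")
    # Keep only non-CPU/GPU custom resources, if any.
    if resource_counter:
        ray_remote_args["resources"] = dict(resource_counter)
    return ray_remote_args
```

This is exactly the shape mismatch behind the `test_vllm_engine_processor_placement_group` assertion failure quoted earlier: `{'num_cpus': 1, 'num_gpus': 1}` versus `{'resources': {'CPU': 1, 'GPU': 1}}`.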
Bug: Empty Placement Group Config Causes Resource Allocation Issues
An empty `placement_group_config` results in an empty `bundles` list after validation. For the Ray backend, `ray.util.placement_group` fails. For other backends, it allocates zero CPUs/GPUs, preventing the vLLM engine from starting. Additionally, non-Ray backends sum resources across all bundles, which may not align with the intent of distinct bundle allocations.
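One way to guard against the empty-config case described above is a small validation step; this is a sketch only (in the PR the check may instead live in the schema classes), with the function name assumed:

```python
def validate_placement_group_config(pg_config: dict) -> dict:
    """Reject configs that would yield zero bundles or resourceless bundles."""
    bundles = pg_config.get("bundles", [])
    if not bundles:
        raise ValueError("placement_group_config must define at least one bundle.")
    for i, bundle in enumerate(bundles):
        if bundle.get("CPU", 0) <= 0 and bundle.get("GPU", 0) <= 0:
            raise ValueError(f"Bundle {i} requests neither CPU nor GPU.")
    return pg_config
```

Failing fast here avoids both the `ray.util.placement_group` error on the Ray backend and the silent zero-resource allocation on the others.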
Signed-off-by: jeffreyjeffreywang <jeffjeffreywang@gmail.com> Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com> Co-authored-by: jeffreyjeffreywang <jeffjeffreywang@gmail.com> Co-authored-by: Nikhil Ghosh <nikhil@anyscale.com> Co-authored-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>

Why are these changes needed?
Currently, `vLLMEngineProcessorConfig` in `ray.data.llm` enforces that all GPUs must be colocated on a single node, preventing users from scaling tensor/pipeline parallelism across multiple nodes. This PR removes that limitation by dropping the `STRICT_PACK` placement group constraint, thereby enabling multi-node deployments. Moreover, this PR introduces a new benchmarking module for evaluating `ray.data.llm` processors.
Benchmark
To validate that Ray Data LLM does not introduce unreasonable overhead to multi-node TP/PP batch inference, we ran the following benchmark on A10G GPUs (`g5.12xlarge` instances) with `batch_size=64`, drawing 10,000 samples from https://huggingface.co/datasets/Crystalcareai/Code-feedback-sharegpt-renamed. Results show that throughput drops when the model is distributed across multiple nodes. The degradation is most pronounced when TP spans multiple nodes, while configurations that keep TP within a single node show only slight declines. Optimal performance is achieved with single-node configurations, confirming that cross-node communication introduces significant overhead that grows with higher tensor parallelism. To reproduce, execute the following command:
Design writeup: https://docs.google.com/document/d/1Db9TDwagXaW5lOWncmkbqq4g769ajDP-MVvURdicZBk/edit?usp=sharing
Related issue number
Closes #55491
Checks
- I've signed off every commit (`git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
Note
Adds placement-group-based scheduling to the vLLM processor (enabling multi-node TP/PP) and introduces a new Ray Data LLM batch benchmarking module.
- Add `placement_group_config` to `vLLMEngineProcessorConfig` and plumb it through to `vLLMEngineStage` for custom PG scheduling; the default strategy is now `PACK`.
- Remove `resources` in `map_batches` and compute GPUs/CPUs from the PG config (or a fallback), enabling multi-node TP/PP with `distributed_executor_backend="ray"`.
- Deprecate `ProcessorConfig.resources_per_bundle`; set `num_gpus` via Ray remote args instead of `resources`.
- Add `python/ray/llm/_internal/batch/benchmark/` with a CLI to benchmark vLLM/Serve processors and a ShareGPT-backed dataset loader.

Written by Cursor Bugbot for commit d3bb6cb. This will update automatically on new commits.