[serve.llm] Fixed DP DSV3 issues #55802
Conversation
```python
def __init__(self, llm_config: "LLMConfig"):
    """Base class for connector backends.

    Args:
        kv_transfer_config: Configuration for the KV transfer.
    """
    ...
    assert (
        kv_transfer_config is not None
    ), "In Connector backend, kv_transfer_config is not set"
```
better to validate it early in the constructor, and validate only once?
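A minimal sketch of that suggestion, assuming a `ConnectorBackend` class with a `kv_transfer_config` attribute on the config (both names inferred from the snippet above, not the PR's actual code):

```python
# Hypothetical sketch: validate kv_transfer_config once, in the constructor,
# instead of re-asserting it in every method that touches it.
class ConnectorBackend:
    def __init__(self, llm_config):
        kv_transfer_config = getattr(llm_config, "kv_transfer_config", None)
        if kv_transfer_config is None:
            raise ValueError(
                "In Connector backend, kv_transfer_config is not set"
            )
        self.llm_config = llm_config
        # Validated once here; later methods can rely on it being present.
        self.kv_transfer_config = kv_transfer_config
```

With this shape, downstream methods never need their own `assert`, and a misconfigured backend fails fast at construction time.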
```python
        "NIXL_SIDE_CHANNEL_PORT_BASE", vllm_utils.get_open_port()
    )
)
# If dp_rank is set, we should use the
# base port + dp_rank as the side channel port
dp_rank = self.llm_config.engine_kwargs.get("data_parallel_rank", 0)
port = base_port + dp_rank
```
IIUC this is to avoid race conditions? If get_open_port() works perfectly we don't need to add the dp_rank? Maybe add a comment to make it explicit.
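To make the race concrete: `get_open_port()` only guarantees the port was free at the moment it was checked, so multiple DP workers starting concurrently can all observe the same free port. A hedged sketch of the deterministic alternative (the env-var default of `52000` here is illustrative, not from the PR):

```python
import os

def side_channel_port(engine_kwargs: dict) -> int:
    # Shared base port across all DP workers; each worker then offsets by its
    # own data_parallel_rank, so ports are deterministic and collision-free
    # even when workers start at the same time.
    base_port = int(os.environ.get("NIXL_SIDE_CHANNEL_PORT_BASE", "52000"))
    dp_rank = engine_kwargs.get("data_parallel_rank", 0)
    return base_port + dp_rank
```

With the rank offset, no worker ever needs to win a check-then-bind race: rank `r` always gets `base_port + r`.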
```python
llm_config: LLMConfig,
*,
name_prefix: Optional[str] = None,
options_override: Optional[dict] = None,
```
QQ: what do you have in mind for this?
placement groups / deployment name (full name) etc.
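For illustration, a minimal sketch of how an `options_override` might be consumed when building the deployment — the helper name, default name format, and keys here are assumptions, not the PR's actual code:

```python
from typing import Optional

# Hypothetical helper: start from defaults derived from the name prefix, then
# let options_override win for anything the caller pins down explicitly
# (e.g. a placement group or the full deployment name).
def build_deployment_options(
    name_prefix: Optional[str] = None,
    options_override: Optional[dict] = None,
) -> dict:
    options = {"name": f"{name_prefix or 'LLMServer'}:deployment"}
    if options_override:
        options.update(options_override)
    return options
```

The dict-merge design keeps the override fully generic: any option the caller supplies simply replaces the computed default.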
```python
child_actor_bundles: List[Dict[str, float]],
replica_actor_bundle: Dict[str, float],
) -> List[Dict[str, float]]:
"""Sum up the bundles from replica actor bundles with the first bundle from child actor bundles.
```
Not fully getting the intention here: the placement strategy is STRICT_PACK (at least for TP only), why do we need this?
It was hanging when the deployment bundles were

[{CPU: 1, GPU: 0}] + [{GPU: 1}] * tp

Also, in the case of PACK, replicas are not guaranteed to be scheduled on the same node as their child RayWorkers, which always confounded me. Modifying the placement to this merged form ensures the replica actor is scheduled on the same node as its own RayWorker.
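A sketch of what that merge could look like, assuming the signature quoted above (the function body here is an illustration of the described behavior, not the PR's implementation):

```python
from typing import Dict, List

def merge_replica_and_child_bundles(
    child_actor_bundles: List[Dict[str, float]],
    replica_actor_bundle: Dict[str, float],
) -> List[Dict[str, float]]:
    """Fold the replica actor's resources into the first child bundle.

    With one merged bundle for the replica and its first RayWorker, the
    placement group forces both onto the same node, avoiding the hang seen
    with [{CPU: 1, GPU: 0}] + [{GPU: 1}] * tp.
    """
    if not child_actor_bundles:
        return [dict(replica_actor_bundle)]
    merged_first = dict(child_actor_bundles[0])
    for resource, amount in replica_actor_bundle.items():
        merged_first[resource] = merged_first.get(resource, 0.0) + amount
    return [merged_first] + [dict(b) for b in child_actor_bundles[1:]]
```

For a TP=2 deployment this turns `[{CPU: 1}] + [{GPU: 1}] * 2` into `[{CPU: 1, GPU: 1}, {GPU: 1}]`, so there is no separate CPU-only bundle that PACK could place on a different node.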
```python
child_actor_bundles: List[Dict[str, float]],
replica_actor_bundle: Dict[str, float],
```
- Premerge tests: failing all permutations of `python/ray/llm/tests/serve/cpu/configs/test_models.py::TestModelConfig`, which makes sense because this PR changes the shape of the PG bundles. The test expects the old two-bundle form (CPU-only head + GPU worker) and fails since we're merging them.

Would it make sense to gate this logic behind a flag for the DP path? And/or add new copies of the tests that check the new shape.
In server_models.py this could look like:

```python
collocate = self.experimental_configs.get(
    "collocate_replica_and_child", False
)
if collocate:
    pg_bundles = self._merge_replica_actor_and_child_actor_bundles(
        child_actor_bundles, replica_actor_resources
    )
else:
    pg_bundles = [replica_actor_resources] + child_actor_bundles
```
- Linting

I actually think collocating the replica and child is always desired, isn't it?
|
Premerge assertion failure
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
Signed-off-by: jugalshah291 <shah.jugal291@gmail.com>
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
Signed-off-by: Douglas Strodtman <douglas@anyscale.com>
Extends the port collision fix to Tensor Parallelism (TP) and Pipeline Parallelism (PP) scenarios. The previous fix (PR ray-project#55802) only addressed Data Parallelism by using the explicit data_parallel_rank.

Changes:
- base.py: Added _compute_port_offset() method with fallback logic
  - Priority 1: Use data_parallel_rank if set (DP case)
  - Priority 2: Hash replica_tag for a deterministic offset (TP/PP case)
  - Fallback: Return 0
- nixl_connector.py: Use _compute_port_offset() instead of dp_rank
- lmcache_connector_v1.py: Add numeric port support with offset logic

Fixes port collision errors in TP/PP deployments:
- Multiple workers no longer bind to the same port
- Prevents NIXL_ERR_BACKEND and ZMQ errors
- Enables successful deployment with pipeline_parallel_size > 1

Reproduction: Deployed Ray Serve with pipeline_parallel_size=2 and NIXL on Ray 3.0.0.dev0 (8 x L4 GPU cluster). Before the fix, all workers used an identical port (e.g., 52910), causing NIXL_ERR_BACKEND. Logs showed: 'Creating v1 connector with engine_id: ...-52910 [repeated 3x]'. After the fix, each worker receives a unique port via replica tag hashing, eliminating collisions.

Related: ray-project#55775

Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>
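The fallback order described in that commit could be sketched as follows — the function name matches the commit's `_compute_port_offset()`, but the body, the parameter names, and the modulo bound of 1000 are assumptions for illustration:

```python
import hashlib

def compute_port_offset(engine_kwargs: dict, replica_tag: str) -> int:
    """Sketch of the described fallback logic.

    1. Use data_parallel_rank when set (DP case).
    2. Otherwise hash the replica tag for a deterministic offset (TP/PP case).
    3. Otherwise fall back to 0.
    """
    dp_rank = engine_kwargs.get("data_parallel_rank")
    if dp_rank is not None:
        return int(dp_rank)
    if replica_tag:
        digest = hashlib.sha256(replica_tag.encode()).hexdigest()
        # Bounded so base_port + offset stays inside the valid port range.
        return int(digest, 16) % 1000
    return 0
```

Hashing the replica tag keeps the TP/PP offset stable across restarts of the same replica, while distinct replicas are very likely to map to distinct offsets.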
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
Fix Data Parallel Resource Allocation and KV Transfer for DSv3
Summary
Fixes resource allocation conflicts and KV transfer backend configuration for data parallel deployments in DSv3.
Key Changes
- Use `base_port + dp_rank` for the data parallel case
- Pass `LLMConfig` instead of just the transfer config for better context. This allows more expressive setup methods, similar to what is needed for port collision handling.
- Add an `options_override` parameter for runtime configuration flexibility

Release tests passed: https://buildkite.com/ray-project/release/builds/54545