[serve.llm] Fixed DP DSV3 issues #55802
Conversation
```python
def __init__(self, llm_config: "LLMConfig"):
    """Base class for connector backends.

    Args:
        kv_transfer_config: Configuration for the KV transfer.
    """
    ...
    assert (
        kv_transfer_config is not None
    ), "In Connector backend, kv_transfer_config is not set"
```
better to validate it early in the constructor, and validate only once?
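A minimal sketch of that suggestion, assuming a `ConnectorBackend` class with a `kv_transfer_config` attribute on the config (both names inferred from the snippet above, not the PR's actual code):

```python
# Hypothetical sketch: validate kv_transfer_config once, in the constructor,
# instead of re-asserting it in every method that touches it.
class ConnectorBackend:
    def __init__(self, llm_config):
        kv_transfer_config = getattr(llm_config, "kv_transfer_config", None)
        if kv_transfer_config is None:
            raise ValueError(
                "In Connector backend, kv_transfer_config is not set"
            )
        self.llm_config = llm_config
        # Validated once here; later methods can rely on it being present.
        self.kv_transfer_config = kv_transfer_config
```

With this shape, downstream methods never need their own `assert`, and a misconfigured backend fails fast at construction time.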
```python
        "NIXL_SIDE_CHANNEL_PORT_BASE", vllm_utils.get_open_port()
    )
)
# If dp_rank is set, we should use the
# base port + dp_rank as the side channel port
dp_rank = self.llm_config.engine_kwargs.get("data_parallel_rank", 0)
port = base_port + dp_rank
```
IIUC this is to avoid race conditions? If get_open_port() works perfectly we don't need to add the dp_rank? Maybe add a comment to make it explicit.
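To make the race concrete: `get_open_port()` only guarantees the port was free at the moment it was checked, so multiple DP workers starting concurrently can all observe the same free port. A hedged sketch of the deterministic alternative (the env-var default of `52000` here is illustrative, not from the PR):

```python
import os

def side_channel_port(engine_kwargs: dict) -> int:
    # Shared base port across all DP workers; each worker then offsets by its
    # own data_parallel_rank, so ports are deterministic and collision-free
    # even when workers start at the same time.
    base_port = int(os.environ.get("NIXL_SIDE_CHANNEL_PORT_BASE", "52000"))
    dp_rank = engine_kwargs.get("data_parallel_rank", 0)
    return base_port + dp_rank
```

With the rank offset, no worker ever needs to win a check-then-bind race: rank `r` always gets `base_port + r`.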
```python
llm_config: LLMConfig,
*,
name_prefix: Optional[str] = None,
options_override: Optional[dict] = None,
```
QQ: what do you have in mind for this?
placement groups / deployment name (full name) etc.
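For illustration, a minimal sketch of how an `options_override` might be consumed when building the deployment — the helper name, default name format, and keys here are assumptions, not the PR's actual code:

```python
from typing import Optional

# Hypothetical helper: start from defaults derived from the name prefix, then
# let options_override win for anything the caller pins down explicitly
# (e.g. a placement group or the full deployment name).
def build_deployment_options(
    name_prefix: Optional[str] = None,
    options_override: Optional[dict] = None,
) -> dict:
    options = {"name": f"{name_prefix or 'LLMServer'}:deployment"}
    if options_override:
        options.update(options_override)
    return options
```

The dict-merge design keeps the override fully generic: any option the caller supplies simply replaces the computed default.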
```python
child_actor_bundles: List[Dict[str, float]],
replica_actor_bundle: Dict[str, float],
) -> List[Dict[str, float]]:
"""Sum up the bundles from replica actor bundles with the first bundle from child actor bundles.
```
Not fully getting the intention here: the placement strategy is STRICT_PACK (at least for TP only), why do we need this?
It was hanging when the deployment bundles were

[{CPU: 1, GPU: 0}] + [{GPU: 1}] * tp

Also, in the case of PACK, replicas are not guaranteed to be scheduled on the same node as their child RayWorkers, which always confounded me. Modifying the placement to this merged form ensures the replica actor is scheduled on the same node as its own RayWorker.
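A sketch of what that merge could look like, assuming the signature quoted above (the function body here is an illustration of the described behavior, not the PR's implementation):

```python
from typing import Dict, List

def merge_replica_and_child_bundles(
    child_actor_bundles: List[Dict[str, float]],
    replica_actor_bundle: Dict[str, float],
) -> List[Dict[str, float]]:
    """Fold the replica actor's resources into the first child bundle.

    With one merged bundle for the replica and its first RayWorker, the
    placement group forces both onto the same node, avoiding the hang seen
    with [{CPU: 1, GPU: 0}] + [{GPU: 1}] * tp.
    """
    if not child_actor_bundles:
        return [dict(replica_actor_bundle)]
    merged_first = dict(child_actor_bundles[0])
    for resource, amount in replica_actor_bundle.items():
        merged_first[resource] = merged_first.get(resource, 0.0) + amount
    return [merged_first] + [dict(b) for b in child_actor_bundles[1:]]
```

For a TP=2 deployment this turns `[{CPU: 1}] + [{GPU: 1}] * 2` into `[{CPU: 1, GPU: 1}, {GPU: 1}]`, so there is no separate CPU-only bundle that PACK could place on a different node.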
```python
child_actor_bundles: List[Dict[str, float]],
replica_actor_bundle: Dict[str, float],
```
- Premerge tests: failing all permutations of `python/ray/llm/tests/serve/cpu/configs/test_models.py::TestModelConfig`, which makes sense because this PR changes the shape of the PG bundles. The test expects the old two-bundle form (CPU-only head + GPU worker) and fails since we're merging them.

Would it make sense to gate this logic behind a flag for the DP path? And/or add new copies of the tests that check the new shape.
In server_models.py this could look like:

```python
collocate = self.experimental_configs.get(
    "collocate_replica_and_child", False
)
if collocate:
    pg_bundles = self._merge_replica_actor_and_child_actor_bundles(
        child_actor_bundles, replica_actor_resources
    )
else:
    pg_bundles = [replica_actor_resources] + child_actor_bundles
```
- Linting

I actually think collocating the replica and child is always desired, isn't it?
|
Premerge assertion failure
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
Signed-off-by: jugalshah291 <shah.jugal291@gmail.com>
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
Signed-off-by: Douglas Strodtman <douglas@anyscale.com>
Extends the port collision fix to Tensor Parallelism (TP) and Pipeline Parallelism (PP) scenarios. The previous fix (PR ray-project#55802) only addressed Data Parallelism by using the explicit data_parallel_rank.

Changes:
- base.py: Added _compute_port_offset() method with fallback logic
  - Priority 1: Use data_parallel_rank if set (DP case)
  - Priority 2: Hash replica_tag for a deterministic offset (TP/PP case)
  - Fallback: Return 0
- nixl_connector.py: Use _compute_port_offset() instead of dp_rank
- lmcache_connector_v1.py: Add numeric port support with offset logic

Fixes port collision errors in TP/PP deployments:
- Multiple workers no longer bind to the same port
- Prevents NIXL_ERR_BACKEND and ZMQ errors
- Enables successful deployment with pipeline_parallel_size > 1

Reproduction: Deployed Ray Serve with pipeline_parallel_size=2 and NIXL on Ray 3.0.0.dev0 (8 x L4 GPU cluster). Before the fix, all workers used an identical port (e.g., 52910), causing NIXL_ERR_BACKEND. Logs showed: 'Creating v1 connector with engine_id: ...-52910 [repeated 3x]'. After the fix, each worker receives a unique port via replica tag hashing, eliminating collisions.

Related: ray-project#55775

Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>
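The fallback order described in that commit could be sketched as follows — the function name matches the commit's `_compute_port_offset()`, but the body, the parameter names, and the modulo bound of 1000 are assumptions for illustration:

```python
import hashlib

def compute_port_offset(engine_kwargs: dict, replica_tag: str) -> int:
    """Sketch of the described fallback logic.

    1. Use data_parallel_rank when set (DP case).
    2. Otherwise hash the replica tag for a deterministic offset (TP/PP case).
    3. Otherwise fall back to 0.
    """
    dp_rank = engine_kwargs.get("data_parallel_rank")
    if dp_rank is not None:
        return int(dp_rank)
    if replica_tag:
        digest = hashlib.sha256(replica_tag.encode()).hexdigest()
        # Bounded so base_port + offset stays inside the valid port range.
        return int(digest, 16) % 1000
    return 0
```

Hashing the replica tag keeps the TP/PP offset stable across restarts of the same replica, while distinct replicas are very likely to map to distinct offsets.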
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
Fix Data Parallel Resource Allocation and KV Transfer for DSv3
Summary
Fixes resource allocation conflicts and KV transfer backend configuration for data parallel deployments in DSv3.
Key Changes
- Use `base_port + dp_rank` for the data parallel case
- Pass `LLMConfig` instead of just the transfer config for better context. This allows more expressive setup methods, similar to what is needed for port collision handling.
- Add an `options_override` parameter for runtime configuration flexibility

Release tests passed: https://buildkite.com/ray-project/release/builds/54545