[bugfix][serve][llm] Fix port collisions for TP/PP with NIXL/LMCache #57771
Merged
kouroshHakha merged 4 commits into ray-project:master from nrghosh/pp-tp-kv-port-offset on Oct 22, 2025
Conversation
Extends the port collision fix to Tensor Parallelism (TP) and Pipeline Parallelism (PP) scenarios. The previous fix (PR ray-project#55802) only addressed Data Parallelism by using the explicit data_parallel_rank.

Changes:
- base.py: Added a _compute_port_offset() method with fallback logic (see the sketch below)
  - Priority 1: Use data_parallel_rank if set (DP case)
  - Priority 2: Hash replica_tag for a deterministic offset (TP/PP case)
  - Fallback: Return 0
- nixl_connector.py: Use _compute_port_offset() instead of dp_rank
- lmcache_connector_v1.py: Add numeric port support with offset logic

Fixes port collision errors in TP/PP deployments:
- Multiple workers no longer bind to the same port
- Prevents NIXL_ERR_BACKEND and ZMQ errors
- Enables successful deployment with pipeline_parallel_size > 1

Reproduction: Deployed Ray Serve with pipeline_parallel_size=2 and NIXL on Ray 3.0.0.dev0 (8 x L4 GPU cluster). Before the fix, all workers used an identical port (e.g., 52910), causing NIXL_ERR_BACKEND. Logs showed: 'Creating v1 connector with engine_id: ...-52910 [repeated 3x]'. After the fix, each worker receives a unique port via replica tag hashing, eliminating collisions.

Related: ray-project#55775

Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>
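A minimal sketch of the hash-based fallback described above, for illustration only; the function signature and MAX_PORT_OFFSET bound are assumptions, and the review below ends up replacing hashing with replica_rank:

```python
import hashlib
from typing import Optional

# Illustrative bound on the offset range; not from the actual patch.
MAX_PORT_OFFSET = 1000


def compute_port_offset(data_parallel_rank: Optional[int], replica_tag: str) -> int:
    """Sketch of the fallback chain from the PR summary."""
    # Priority 1: explicit DP rank (DP case, set by DPServer).
    if data_parallel_rank is not None:
        return data_parallel_rank
    # Priority 2: deterministic hash of the replica tag (TP/PP case).
    if replica_tag:
        digest = hashlib.sha256(replica_tag.encode()).hexdigest()
        return int(digest, 16) % MAX_PORT_OFFSET
    # Fallback: no offset.
    return 0
```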
nrghosh (Contributor, Author) commented on Oct 17, 2025
The llm_serve_vllm_integration_tests::test_deepseek_model release test is failing with
KeyError: Deployment(name='LLMServer:deepseek-ai--DeepSeek-V2-Lite', app='default')
but it passes locally on the feature branch:
PASSED
============================================================================ 1 passed in 57.16s ============================================================================
(ServeReplica:default:LLMServer:deepseek-ai--DeepSeek-V2-Lite pid=224669) (EngineCore_DP0 pid=225159) (RayWorkerWrapper pid=225270) WARNING 10-16 17:45:05 [fused_moe.py:798] Using default MoE config. Performance might be sub-optimal! Config file not found at ['/home/ray/anaconda3/lib/python3.11/site-packages/vllm/model_executor/layers/fused_moe/configs/E=64,N=704,device_name=NVIDIA_L4.json'] [repeated 4x across cluster]
(ServeReplica:default:LLMServer:deepseek-ai--DeepSeek-V2-Lite pid=224669) (EngineCore_DP0 pid=225159) (RayWorkerWrapper pid=225271) WARNING 10-16 17:44:59 [symm_mem.py:58] SymmMemCommunicator: Device capability 8.9 not supported, communicator is not available.
(ServeReplica:default:LLMServer:deepseek-ai--DeepSeek-V2-Lite pid=224669) (EngineCore_DP0 pid=225159) (RayWorkerWrapper pid=225271) INFO 10-16 17:44:59 [custom_all_reduce.py:35] Skipping P2P check and trusting the driver's P2P report.
(ServeReplica:default:LLMServer:deepseek-ai--DeepSeek-V2-Lite pid=224669) (EngineCore_DP0 pid=225159) (RayWorkerWrapper pid=225271) WARNING 10-16 17:44:59 [custom_all_reduce.py:154] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(ServeReplica:default:LLMServer:deepseek-ai--DeepSeek-V2-Lite pid=224669) (EngineCore_DP0 pid=225159) (RayWorkerWrapper pid=225271) INFO 10-16 17:44:59 [parallel_state.py:1208] rank 1 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 1, EP rank 1
(ServeReplica:default:LLMServer:deepseek-ai--DeepSeek-V2-Lite pid=224669) (EngineCore_DP0 pid=225159) (RayWorkerWrapper pid=225271) WARNING 10-16 17:44:59 [topk_topp_sampler.py:66] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
(ServeReplica:default:LLMServer:deepseek-ai--DeepSeek-V2-Lite pid=224669) (EngineCore_DP0 pid=225159) (RayWorkerWrapper pid=225271) INFO 10-16 17:44:59 [gpu_model_runner.py:2602] Starting to load model deepseek-ai/DeepSeek-V2-Lite...
(ServeReplica:default:LLMServer:deepseek-ai--DeepSeek-V2-Lite pid=224669) (EngineCore_DP0 pid=225159) (RayWorkerWrapper pid=225271) INFO 10-16 17:44:59 [gpu_model_runner.py:2634] Loading model from scratch...
(ServeReplica:default:LLMServer:deepseek-ai--DeepSeek-V2-Lite pid=224669) (EngineCore_DP0 pid=225159) (RayWorkerWrapper pid=225271) INFO 10-16 17:44:59 [utils.py:125] Hidden layers were unevenly partitioned: [14,13]. This can be manually overridden using the VLLM_PP_LAYER_PARTITION environment variable
(ServeReplica:default:LLMServer:deepseek-ai--DeepSeek-V2-Lite pid=224669) (EngineCore_DP0 pid=225159) (RayWorkerWrapper pid=225271) INFO 10-16 17:44:59 [cuda.py:297] Using Triton MLA backend on V1 engine.
(ServeReplica:default:LLMServer:deepseek-ai--DeepSeek-V2-Lite pid=224669) (EngineCore_DP0 pid=225159) (RayWorkerWrapper pid=225271) INFO 10-16 17:45:00 [weight_utils.py:392] Using model weights format ['*.safetensors']
(ServeReplica:default:LLMServer:deepseek-ai--DeepSeek-V2-Lite pid=224669) (EngineCore_DP0 pid=225159) (RayWorkerWrapper pid=225271) INFO 10-16 17:45:06 [gpu_worker.py:298] Available KV cache memory: 12.54 GiB [repeated 4x across cluster]
(ServeReplica:default:LLMServer:deepseek-ai--DeepSeek-V2-Lite pid=224669) (EngineCore_DP0 pid=225159) (RayWorkerWrapper pid=225270) WARNING 10-16 17:45:07 [cudagraph_dispatcher.py:106] cudagraph dispatching keys are not initialized. No cudagraph will be used. [repeated 2x across cluster]
(base) ray@ip-10-0-167-142:~/default/work/ray$
(base) ray@ip-10-0-167-142:~/default/work/ray$ git status
On branch nrghosh/pp-tp-kv-port-offset
nothing to commit, working tree clean
python/ray/llm/_internal/serve/deployments/llm/vllm/kv_transfer_backends/base.py
```diff
@@ -35,6 +35,38 @@ def _get_unique_suffix(self, len: int = 6) -> str:
         """
         return "".join(random.choices(string.ascii_letters + string.digits, k=len))

+    def _compute_port_offset(self) -> int:
```
Contributor
we should just use the replica rank to do this I feel like.
Contributor
Author
yep
now _compute_port_offset() uses replica_rank from the replica context instead of the hashing approach
so now the logic is:
- Use data_parallel_rank if explicitly set (DP deployments via DPServer)
- Fall back to replica_rank from the serve context (TP/PP deployments)
- Return 0 as final fallback
- Use replica_rank API instead of hashing approach
- Simplify LMCache connector by just keeping string approach
- Update comments / lint

Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>
nrghosh added a commit to nrghosh/ray that referenced this pull request on Oct 24, 2025
Multiplies replica_rank by tensor_parallel_size to prevent port collisions when scaling to 2+ replicas with TP≥2.

Problem: PR ray-project#57771 fixed inter-replica port collisions by using replica_rank instead of defaulting to 0. However, it didn't account for the port space needed by TP workers within each replica. vLLM workers add their tp_rank (0, 1, ..., tp_size-1) to the base port at bind time (vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py:790). Without proper spacing, consecutive replicas have overlapping port ranges:
- Replica 0, TP Worker 1: base + 0 + 1 = 50001
- Replica 1, TP Worker 0: base + 1 + 0 = 50001 ← collision

Solution: Space replicas tp_size ports apart to reserve room for all TP workers (see the sketch below):
- Replica 0 uses ports: [base, base+1, ..., base+(tp_size-1)]
- Replica 1 uses ports: [base+tp_size, base+tp_size+1, ...]

Impact:
- Fixes port collisions when autoscaling to 2+ replicas with TP≥2
- Backward compatible: TP=1 multiplies by 1 (no-op)
- DP deployments unchanged: vLLM handles spacing
- Single-replica deployments unchanged: no other replica to collide with

Related: PR ray-project#57771, ray-project#55775

Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>
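A sketch of the spacing arithmetic this commit message describes; the helper name and parameters are illustrative, not the merged code:

```python
def spaced_port_offset(replica_rank: int, tp_size: int) -> int:
    # Each replica's TP workers add tp_rank (0..tp_size-1) to the base
    # port at bind time, so replicas must be spaced tp_size ports apart
    # to get non-overlapping ranges. TP=1 multiplies by 1, a no-op.
    return replica_rank * tp_size


base = 50000
# Before spacing: replica 0 / worker 1 and replica 1 / worker 0 both bound 50001.
# After spacing (tp_size=2):
assert base + spaced_port_offset(0, 2) + 1 == 50001  # replica 0, TP worker 1
assert base + spaced_port_offset(1, 2) + 0 == 50002  # replica 1, TP worker 0
```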
landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request on Nov 17, 2025
…ay-project#57771) Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>
Aydin-ab pushed a commit to Aydin-ab/ray-aydin that referenced this pull request on Nov 19, 2025
…ay-project#57771) Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com> Signed-off-by: Aydin Abiar <aydin@anyscale.com>
Future-Outlier pushed a commit to Future-Outlier/ray that referenced this pull request on Dec 7, 2025
…ay-project#57771) Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com> Signed-off-by: Future-Outlier <eric901201@gmail.com>
Blaze-DSP pushed a commit to Blaze-DSP/ray that referenced this pull request on Dec 18, 2025
…ay-project#57771) Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>
peterxcli pushed a commit to peterxcli/ray that referenced this pull request on Feb 25, 2026
…ay-project#57771) Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com> Signed-off-by: peterxcli <peterxcli@gmail.com>
Description
Extends the port collision fix to Tensor Parallelism (TP) and Pipeline Parallelism (PP) scenarios. The previous fix (PR #55802) only addressed Data Parallelism by using the explicit data_parallel_rank.

Changes:
- base.py: Added a _compute_port_offset() method with fallback logic
  - Priority 1: Use data_parallel_rank if set (DP case)
  - Priority 2: Use replica_rank for a deterministic offset (TP/PP case)
- nixl_connector.py: Use _compute_port_offset() instead of direct dp_rank access
- lmcache_connector_v1.py: Simplified to use string-based port naming with a random suffix

Fixes port collision errors in TP/PP deployments:
- Multiple workers no longer bind to the same port
- Prevents NIXL_ERR_BACKEND and ZMQ errors
- Enables successful deployment with PP size > 1

Reproduction:
Deployed Ray Serve with pipeline_parallel_size=2 and NIXL on Ray 3.0.0.dev0 (8 x L4 GPU cluster); a minimal config sketch follows. Before the fix, all workers used an identical port (e.g., 52910), causing NIXL_ERR_BACKEND. Logs showed: Creating v1 connector with engine_id: ...-52910 [repeated 3x]. After the fix, each worker receives a unique port via replica_rank, eliminating collisions.
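A hedged sketch of that reproduction. The deployment shape follows Ray Serve LLM's LLMConfig / build_openai_app pattern, but the model id and the exact kv_transfer_config values are illustrative assumptions, not the script used for the PR:

```python
from ray import serve
from ray.serve.llm import LLMConfig, build_openai_app

llm_config = LLMConfig(
    model_loading_config=dict(model_id="Qwen/Qwen2.5-7B-Instruct"),  # illustrative model
    engine_kwargs=dict(
        pipeline_parallel_size=2,  # PP > 1 is what triggered the collision
        kv_transfer_config=dict(   # enable the NIXL KV connector
            kv_connector="NixlConnector",
            kv_role="kv_both",
        ),
    ),
)

# Before the fix, every worker of this deployment computed the same
# side-channel port (e.g., 52910); with the fix, offsets differ per replica.
serve.run(build_openai_app({"llm_configs": [llm_config]}))
```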
Related issues
Addresses #55775
Addresses vllm-project/vllm#20980
Additional context
Code Changes
NIXL Connector - Before:
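The original snippet wasn't captured on this page; a plausible "before" sketch, with hypothetical variable names, in which the offset comes only from the engine's DP rank:

```python
# Before (sketch): the connector read the engine's DP rank directly.
# TP/PP deployments never set it, so every replica used offset 0 and
# all workers computed the same side-channel port.
engine_kwargs: dict = {}   # hypothetical: a TP/PP deployment sets no data_parallel_rank
base_port = 52910          # the colliding port from the repro logs
side_channel_port = base_port + engine_kwargs.get("data_parallel_rank", 0)  # always 52910
```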
NIXL Connector - After:
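And a matching "after" sketch (again with hypothetical names): the connector routes through the new shared helper instead of reading the DP rank directly:

```python
# After (sketch): the helper falls back to the Serve replica rank when
# no DP rank is set, so each replica gets a distinct base port.
def _compute_port_offset() -> int:   # stand-in for the new base.py helper
    data_parallel_rank = None        # not set for TP/PP deployments
    replica_rank = 1                 # e.g., the second replica
    return data_parallel_rank if data_parallel_rank is not None else replica_rank


side_channel_port = 52910 + _compute_port_offset()  # 52911: no collision
```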
_compute_port_offset() Implementation:
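The merged implementation wasn't captured on this page; below is a sketch consistent with the priority order discussed in the review thread. The exact context attribute (replica_rank on serve.get_replica_context()) is an assumption:

```python
from typing import Optional

from ray import serve


def _compute_port_offset(data_parallel_rank: Optional[int] = None) -> int:
    """Deterministic per-replica port offset (sketch, not the merged code)."""
    # Priority 1: explicit DP rank, set by DPServer for DP deployments.
    if data_parallel_rank is not None:
        return data_parallel_rank
    # Priority 2: the replica's rank from the Serve replica context
    # (covers TP/PP deployments, which never set a DP rank).
    try:
        return serve.get_replica_context().replica_rank  # assumed attribute name
    except Exception:
        # Final fallback: not running inside a Serve replica.
        return 0
```

LMCache Connector - Simplified approach:

Likewise a sketch rather than the captured diff: the LMCache connector keeps its string-based naming and relies on a random suffix (mirroring the existing _get_unique_suffix() helper shown in the diff above) instead of numeric offset arithmetic:

```python
import random
import string


def _unique_suffix(k: int = 6) -> str:
    return "".join(random.choices(string.ascii_letters + string.digits, k=k))


# Uniqueness comes from the suffix, so no per-replica port math is needed.
base_rpc_name = "lmcache_rpc_port"  # illustrative configured value
unique_rpc_name = f"{base_rpc_name}_{_unique_suffix()}"
```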
Backward Compatibility
- Existing DP deployments continue to use data_parallel_rank (priority 1)
- TP/PP deployments use replica_rank from Ray Serve (priority 2)