[Compiled Graph] Unstable Network Conditions Cause Hang with vLLM v1 API in Cross-Node Pipeline Parallelism #58426
Description
We are experiencing the same Ray hang with the vLLM v1 API as described in "Ray hangs with vllm". After extensive testing, we have gathered the following observations:
- The issue does not occur in a single-node setup (e.g., TP4 PP2).
- The hang typically manifests after several hours of load testing vLLM.
- The problem occurs regardless of whether VLLM_USE_RAY_COMPILED_DAG_CHANNEL_TYPE is set to nccl or shm.
- In the same environment with identical configuration, vLLM v0 (which does not use Ray Compiled Graph) runs without issues, while vLLM v1 (which enables Ray Compiled Graph by default) encounters this hang.
We therefore suspected a fundamental issue in the Ray Compiled Graph's implementation of cross-node communication, where certain edge cases are not handled properly, leading to a hang.
To investigate, we modified the Ray source code to add more detailed logging. Our investigation revealed the root cause:
The Ray Compiled Graph workflow, during cross-node data transfer, does not account for network jitter or OS scheduling variance. This can cause request reordering, where PushMutableObject requests from different processes within the same vLLM TP group arrive out of order (some processes have not yet finished the previous round while others have already started the next round), ultimately resulting in a hang.
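The hazard can be illustrated with a toy model. This is a hedged sketch, not Ray's actual code: `arrival_order_is_safe` and the `(rank, round)` tuples are our own illustrative constructs that encode the ordering invariant a FIFO, single-threaded handler implicitly relies on.

```python
# Toy model of the reordering hazard (a sketch, not Ray's actual code).
# A single handler thread drains a FIFO of (rank, round) push requests;
# completing a round-N push requires the driver to have read that rank's
# round N-1 data, which in turn requires the peers' round N-1 pushes to
# have been handled. If a rank's round-N push arrives ahead of its peers'
# round N-1 pushes, the handler blocks on it and the queued peers starve.

def arrival_order_is_safe(arrivals):
    """Return True iff the FIFO never makes the handler wait on a round
    that is ahead of a still-queued request from an earlier round."""
    pending = list(arrivals)
    while pending:
        _rank, rnd = pending[0]
        if any(r < rnd for _, r in pending[1:]):
            return False  # handler would block; queued peers never run
        pending.pop(0)
    return True

# In order: all round-1 pushes arrive before rank 12's round-2 push.
print(arrival_order_is_safe(
    [(12, 1), (13, 1), (14, 1), (15, 1), (12, 2)]))   # True

# Jittered: rank 12's round-2 push overtakes ranks 13-15's round-1
# pushes -> the handler wedges, a simulated hang.
print(arrival_order_is_safe(
    [(12, 1), (12, 2), (13, 1), (14, 1), (15, 1)]))   # False
```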
Root Cause Analysis
Test Scenario: vLLM 0.9.1, Configuration: TP4 PP4.
Node1 (Ray head node): rank0-rank3, rank4-rank7
Node2: rank8-rank11, rank12-rank15
Symptom: After hours of load testing, the system hangs. Log analysis indicates that the hang occurs because the HandlePushMutableObject thread in node1's raylet, responsible for processing PushMutableObject requests, is permanently blocked.
Detailed Sequence of Events Leading to Hang:
============ at T=34ms ====================================================
Rank 12-15 operate normally.
Main thread: read -> compute -> write succeeds.
PollWriterClosure thread: Sends PushMutableObject gRPC request to node1, and receives a reply successfully.
============ at T=63ms ====================================================
Rank12:
Main thread: read -> compute -> write succeeds.
PollWriterClosure thread: Sends PushMutableObject gRPC request to node1, and receives a reply successfully.
Rank 13, 14, 15:
Main thread: read -> compute -> write succeeds.
PollWriterClosure thread: Sends PushMutableObject gRPC request to node1, but does not receive a reply yet (blocks waiting).
============ at T=113ms ===================================================
Rank12:
Main thread: read -> compute -> write succeeds.
PollWriterClosure thread: Sends the next PushMutableObject request for the new round to node1, and receives a reply. (This request is processed successfully by node1 because the driver had already read rank12's data from the previous round at T=63ms.)
Rank 13, 14, 15:
Main thread: read -> compute -> write succeeds.
PollWriterClosure thread: Still blocked waiting for the reply from their requests sent @63ms.
============ at T=123ms (The Deadlock) ====================================
Rank12:
Main thread: read -> compute -> write succeeds.
PollWriterClosure thread: Sends another PushMutableObject request to node1.
CRITICAL: On node1, the HandlePushMutableObject thread now tries to process this new request from rank12, which requires confirming that rank12's data from the previous round was successfully read. However, the driver process is still in a loop, continuously attempting to read the data of ranks 13-15 from the T=63ms round. Rank12's data from the T=113ms round therefore remains unread, and the HandlePushMutableObject thread blocks waiting for it.
Rank 13, 14, 15:
Main thread: read -> compute -> write blocks, because their data from the previous round (@113ms) has not yet been consumed.
PollWriterClosure thread: Still blocked waiting for the reply from their original requests (@63ms).
================================================================
The HandlePushMutableObject logic in the node1 raylet is designed to receive gRPC requests concurrently but process them on a single thread. Our logging of the PushMutableObject requests' send and receive timestamps confirmed that the requests from rank13, rank14, and rank15 (sent at T=63ms) were not received by the node1 raylet until after the T=123ms mark. By then, the raylet's processing thread was already blocked handling rank12's request from the T=123ms round, preventing it from processing any other incoming requests. The handler and the driver were thus each waiting on the other, a circular wait resulting in a permanent hang.
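The circular wait can be reproduced in miniature with two threads. This is a standalone sketch of the dependency cycle described above, not Ray's implementation; the event names are our own, and timeouts are used only so the demo terminates instead of actually hanging.

```python
import threading

# Toy model of the bilateral deadlock (a sketch, not Ray's code).
# The "handler" thread must finish rank 12's new-round push before it
# will service ranks 13-15's queued pushes; the "driver" thread must
# read ranks 13-15's old-round data (which needs those queued pushes)
# before rank 12's new round becomes readable. Each side waits on the
# other; with plain blocking waits, neither would ever proceed.

peers_pushed = threading.Event()  # ranks 13-15's old-round pushes handled
rank12_read = threading.Event()   # driver has read rank 12's new round

def handler():
    # Blocked on rank 12's push: needs the driver to read its data first.
    if not rank12_read.wait(timeout=0.2):
        print("handler: blocked on rank 12 push (driver never read it)")
        return
    peers_pushed.set()

def driver():
    # Blocked reading ranks 13-15: needs the handler to serve their pushes.
    if not peers_pushed.wait(timeout=0.2):
        print("driver: blocked reading ranks 13-15 (pushes never handled)")
        return
    rank12_read.set()

t1 = threading.Thread(target=handler)
t2 = threading.Thread(target=driver)
t1.start(); t2.start(); t1.join(); t2.join()
# Both threads time out: neither event is ever set.
```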
Summary
The essential failure is that the Ray Compiled Graph implementation assumes strict, in-order arrival of PushMutableObject requests from all processes within a vLLM TP group across execution rounds. This assumption is violated under real-world conditions like network jitter and OS scheduling, leading to the described deadlock.
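One possible direction for a fix, sketched here as toy Python (Ray's actual handler is C++; this is purely illustrative and `handle_pushes`/`data_read` are hypothetical names): park a push whose previous-round data is still unread instead of blocking the single handler thread on it, so later queue entries keep flowing and parked pushes are retried as the reader advances.

```python
from collections import deque

def handle_pushes(arrivals, data_read):
    """Process push requests without ever blocking the handler thread.

    data_read(rank, rnd) -> True once the driver has read that rank's
    round-`rnd` payload. A push whose previous round is unread is parked
    rather than blocked on, so requests queued behind it still run.
    """
    completed, parked = [], deque()
    for req in arrivals:
        parked.append(req)
        made_progress = True
        while made_progress:
            made_progress = False
            # Retry every parked push; completing one may let the driver
            # advance, which in turn makes other parked pushes ready.
            for _ in range(len(parked)):
                rank, rnd = parked.popleft()
                if rnd == 1 or data_read(rank, rnd - 1):
                    completed.append((rank, rnd))
                    made_progress = True
                else:
                    parked.append((rank, rnd))
    return completed, list(parked)

# Usage: rounds 1 of all ranks have been read; rank 12's round-3 push
# arrives early. It is parked, not blocking, so other pushes complete.
read = {(12, 1), (13, 1), (14, 1), (15, 1)}
done, waiting = handle_pushes(
    [(12, 2), (13, 2), (12, 3)], lambda r, n: (r, n) in read)
print(done)     # [(12, 2), (13, 2)]
print(waiting)  # [(12, 3)]
```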