[HiCache] Fix deadlock when creating new group #15805
stmatengss merged 1 commit into sgl-project:main from
Conversation
Signed-off-by: Xuchun Shang <xuchun.shang@gmail.com>
|
Nice Man! |
|
/tag-and-rerun-ci |
|
@xiezhq-hermann This is another PP + HiCache compatibility bug. PTAL |
|
We should merge this first, and then #15175 |
ShangmingCai
left a comment
@yizhang2077 please take a look at this PR when you have time.
|
Tested ok, with #15175 |
|
/rerun-failed-ci |
1 similar comment
|
/rerun-failed-ci |
|
The unit-test-backend-8-gpu-h200 (1) (pull_request) test failure is unrelated to this PR. It's due to the storage backend not being configured. |
LGTM |
Will a retry fix it? |
+ from sglang.srt.distributed.parallel_state import (
+     create_custom_parallel_group,
+ )

  group_ranks = torch.distributed.get_process_group_ranks(tp_group)
- self.prefetch_tp_group = torch.distributed.new_group(
-     group_ranks, backend="gloo"
- )
+ self.prefetch_tp_group = create_custom_parallel_group(
+     group_ranks=group_ranks, backend="gloo"
+ )
We need another approve here from @xiezhq-hermann
|
/rerun-failed-ci |
1 similar comment
|
/rerun-failed-ci |
|
@xiezhq-hermann CI is green now. Do you have time to quickly check this PR? |
|
Just to clarify a bit, does this problem occur only with pipeline parallelism? |
From my perspective, this issue arises whenever multiple TP groups need to be created while the storage backend is enabled.
Signed-off-by: Xuchun Shang <xuchun.shang@gmail.com>
Motivation
This PR fixes a critical deadlock (hang) issue during the initialization of HiCacheController when running sglang in a multi-node distributed environment (e.g., 2 nodes with 8 GPUs each).
When launching the server with --enable-hierarchical-cache --hicache-storage-backend on multiple nodes, the initialization process hangs at torch.distributed.new_group.
Root Cause Analysis
torch.distributed.new_group is a collective operation. It requires all ranks in the default world group to call it in the same order with the exact same arguments (specifically the ranks list).
In the previous implementation of HiCacheController (PP=2, TP=8):
Node 0 (Ranks 0-7) called new_group(ranks=[0...7]).
Node 1 (Ranks 8-15) called new_group(ranks=[8...15]).
Because the arguments differed across ranks, the internal distributed store could not match the creation requests, causing a deadlock where Node 0 waited for Node 1 to acknowledge the creation of Group A, while Node 1 was attempting to create Group B.
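The mismatch can be illustrated without launching any processes. The sketch below is illustrative only (the helper names `buggy_ranks` and `safe_rank_lists` are hypothetical, not sglang APIs): it shows why the old per-node rank derivation hands `new_group` different arguments on different ranks, and the standard fix pattern, which `create_custom_parallel_group` presumably follows, where every rank enumerates all subgroup rank lists in the same deterministic order.

```python
# Deadlock scenario from the PR: 2 nodes x 8 GPUs (PP=2, TP=8).
# torch.distributed.new_group is collective over the default group, so every
# rank must call it with identical rank lists in identical order.

WORLD_SIZE = 16
TP_SIZE = 8

def buggy_ranks(rank: int) -> list[int]:
    # Old behavior (hypothetical reconstruction): each rank passes only its
    # own TP peers, so Node 0 and Node 1 call new_group with different args.
    start = (rank // TP_SIZE) * TP_SIZE
    return list(range(start, start + TP_SIZE))

def safe_rank_lists() -> list[list[int]]:
    # Fixed pattern: every rank derives the SAME ordered list of all TP
    # subgroups, calls new_group for each, and keeps the group containing
    # its own rank.
    return [list(range(s, s + TP_SIZE)) for s in range(0, WORLD_SIZE, TP_SIZE)]

# Rank 0 and rank 8 disagree under the buggy scheme -> creation requests
# cannot be matched by the distributed store -> deadlock.
assert buggy_ranks(0) == list(range(0, 8))
assert buggy_ranks(8) == list(range(8, 16))
assert buggy_ranks(0) != buggy_ranks(8)

# Under the safe scheme every rank derives the identical ordered list.
assert safe_rank_lists() == [list(range(0, 8)), list(range(8, 16))]
```

In a real run, each rank would loop over `safe_rank_lists()`, call `torch.distributed.new_group(ranks, backend="gloo")` for every entry, and store the handle whose rank list contains its own rank.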
Reported by: @whybeyoung
CC @stmatengss
Modifications
Accuracy Tests
Benchmarking and Profiling
Checklist