[HiCache] Fix deadlock when creating new group#15805

Merged
stmatengss merged 1 commit into sgl-project:main from openanolis:Xuchun/hicache on Dec 29, 2025
Conversation

@XucSh
Collaborator

@XucSh XucSh commented Dec 25, 2025

Motivation

This PR fixes a critical deadlock (hang) issue during the initialization of HiCacheController when running sglang in a multi-node distributed environment (e.g., 2 nodes with 8 GPUs each).

When launching the server with --enable-hierarchical-cache --hicache-storage-backend on multiple nodes, the initialization process hangs at torch.distributed.new_group.

Root Cause Analysis: torch.distributed.new_group is a collective operation. Every rank in the default (world) group must call it, in the same order and with the same arguments (specifically the ranks list), even if that rank will not be a member of the new group.

In the previous implementation of HiCacheController (PP2 TP8):

Node 0 (Ranks 0-7) called new_group(ranks=[0...7]).

Node 1 (Ranks 8-15) called new_group(ranks=[8...15]).

Because the arguments differed across ranks, the internal distributed store could not match the creation requests, causing a deadlock where Node 0 waited for Node 1 to acknowledge the creation of Group A, while Node 1 was attempting to create Group B.
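The rule above can be sketched as follows. This is a minimal illustration, not the implementation of create_custom_parallel_group in sglang: the helper names `contiguous_rank_lists` and `create_my_group` are hypothetical, the contiguous group layout is an assumption, and the group constructor is passed in as a callable so the pattern can be shown without a running process group.

```python
from typing import Callable, List, Optional, Sequence, TypeVar

G = TypeVar("G")


def contiguous_rank_lists(world_size: int, group_size: int) -> List[List[int]]:
    """Split ranks 0..world_size-1 into contiguous groups (an assumed layout)."""
    if world_size % group_size != 0:
        raise ValueError("world_size must be a multiple of group_size")
    return [
        list(range(start, start + group_size))
        for start in range(0, world_size, group_size)
    ]


def create_my_group(
    my_rank: int,
    all_rank_lists: Sequence[Sequence[int]],
    new_group: Callable[[List[int]], G],
) -> Optional[G]:
    """Call new_group for every rank list, in the same order on every rank.

    Each rank takes part in the creation of every group and keeps only the
    handle of the group it belongs to.
    """
    my_group: Optional[G] = None
    for ranks in all_rank_lists:
        # Every rank makes this call, whether or not it is in `ranks`.
        group = new_group(list(ranks))
        if my_rank in ranks:
            my_group = group
    return my_group
```

With PyTorch, `new_group` could be something like `lambda ranks: torch.distributed.new_group(ranks, backend="gloo")`. In the 2-node, 16-rank case described above, every rank would then issue the call for [0...7] first and for [8...15] second, so the creation requests line up across nodes.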

reported by: @whybeyoung
CC @stmatengss

Modifications

Accuracy Tests

Benchmarking and Profiling

Checklist

Signed-off-by: Xuchun Shang <xuchun.shang@gmail.com>
@whybeyoung
Collaborator

Nice, man!

@XucSh
Collaborator Author

XucSh commented Dec 25, 2025

/tag-and-rerun-ci

@XucSh changed the title from "[HiCache] Fix deadlock when create new group" to "[HiCache] Fix deadlock when creating new group" on Dec 25, 2025
@stmatengss
Collaborator

@xiezhq-hermann It is another bug for PP + HiCache compatibility. PTAL

@stmatengss
Collaborator

We should merge this first, and then #15175

Collaborator

@ShangmingCai ShangmingCai left a comment

@yizhang2077 please take a look at this PR when you have time.

@whybeyoung
Collaborator

Tested ok, with #15175

@XucSh
Collaborator Author

XucSh commented Dec 26, 2025

/rerun-failed-ci

@XucSh
Collaborator Author

XucSh commented Dec 26, 2025

/rerun-failed-ci

@XucSh
Collaborator Author

XucSh commented Dec 26, 2025

The unit-test-backend-8-gpu-h200 (1) (pull_request) test failure is unrelated to this PR. It's due to the storage backend not being configured.

@yizhang2077
Collaborator

@yizhang2077 please take a look at this PR when you have time.

LGTM

@ShangmingCai
Collaborator

The unit-test-backend-8-gpu-h200 (1) (pull_request) test failure is unrelated to this PR. It's due to the storage backend not being configured.

Will a retry fix it?

Comment on lines +312 to +318
+ from sglang.srt.distributed.parallel_state import (
+     create_custom_parallel_group,
+ )

  group_ranks = torch.distributed.get_process_group_ranks(tp_group)
- self.prefetch_tp_group = torch.distributed.new_group(
-     group_ranks, backend="gloo"
+ self.prefetch_tp_group = create_custom_parallel_group(
+     group_ranks=group_ranks, backend="gloo"
Collaborator

We need another approval here from @xiezhq-hermann

@xiezhq-hermann xiezhq-hermann self-assigned this Dec 26, 2025
@XucSh
Collaborator Author

XucSh commented Dec 26, 2025

/rerun-failed-ci

@stmatengss
Collaborator

/rerun-failed-ci

@ShangmingCai
Collaborator

@xiezhq-hermann CI is green now. Do you have time to quickly check this PR?

@xiezhq-hermann
Collaborator

Just to clarify a bit, does this problem occur only with pipeline parallelism?

@XucSh
Collaborator Author

XucSh commented Dec 29, 2025

Just to clarify a bit, does this problem occur only with pipeline parallelism?

From my perspective, this issue arises whenever multiple TP groups need to be created while a storage backend is enabled.

@stmatengss stmatengss merged commit 4ab66d9 into sgl-project:main Dec 29, 2025
384 of 403 checks passed
@XucSh XucSh deleted the Xuchun/hicache branch December 30, 2025 05:37
YChange01 pushed a commit to YChange01/sglang that referenced this pull request Jan 13, 2026
Signed-off-by: Xuchun Shang <xuchun.shang@gmail.com>