[HiCache] Fix deadlock when creating new group #15805
stmatengss merged 1 commit into sgl-project:main from
Conversation
Signed-off-by: Xuchun Shang <xuchun.shang@gmail.com>
|
Nice Man! |
|
/tag-and-rerun-ci |
|
@xiezhq-hermann This is another PP + HiCache compatibility bug. PTAL |
|
We should merge this first, and then #15175 |
ShangmingCai
left a comment
@yizhang2077 please take a look at this PR when you have time.
|
Tested ok, with #15175 |
|
/rerun-failed-ci |
1 similar comment
|
/rerun-failed-ci |
|
The unit-test-backend-8-gpu-h200 (1) (pull_request) test failure is unrelated to this PR. It's due to the storage backend not being configured. |
LGTM |
Will a retry fix it? |
+ from sglang.srt.distributed.parallel_state import (
+     create_custom_parallel_group,
+ )

  group_ranks = torch.distributed.get_process_group_ranks(tp_group)
- self.prefetch_tp_group = torch.distributed.new_group(
-     group_ranks, backend="gloo"
- )
+ self.prefetch_tp_group = create_custom_parallel_group(
+     group_ranks=group_ranks, backend="gloo"
+ )
We need another approve here from @xiezhq-hermann
|
/rerun-failed-ci |
1 similar comment
|
/rerun-failed-ci |
|
@xiezhq-hermann CI is green now. Do you have time to quickly check this PR? |
|
Just to clarify a bit, does this problem occur only with pipeline parallelism? |
From my perspective, this issue arises whenever multiple TP groups need to be created while the storage backend is enabled.
Signed-off-by: Xuchun Shang <xuchun.shang@gmail.com>
Motivation
This PR fixes a critical deadlock (hang) issue during the initialization of HiCacheController when running sglang in a multi-node distributed environment (e.g., 2 nodes with 8 GPUs each).
When launching the server with --enable-hierarchical-cache --hicache-storage-backend on multiple nodes, the initialization process hangs at torch.distributed.new_group.
Root Cause Analysis
torch.distributed.new_group is a collective operation. It requires all ranks in the default world group to call it in the same order with the exact same arguments (specifically the ranks list).
In the previous implementation of HiCacheController (PP=2, TP=8):
Node 0 (Ranks 0-7) called new_group(ranks=[0...7]).
Node 1 (Ranks 8-15) called new_group(ranks=[8...15]).
Because the arguments differed across ranks, the internal distributed store could not match the creation requests, causing a deadlock where Node 0 waited for Node 1 to acknowledge the creation of Group A, while Node 1 was attempting to create Group B.
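The mismatch can be illustrated without launching any processes. The sketch below is illustrative only (the helper names `buggy_ranks` and `safe_rank_lists` are hypothetical, not sglang APIs): it shows why the old per-node rank derivation hands `new_group` different arguments on different ranks, and the standard fix pattern, which `create_custom_parallel_group` presumably follows, where every rank enumerates all subgroup rank lists in the same deterministic order.

```python
# Deadlock scenario from the PR: 2 nodes x 8 GPUs (PP=2, TP=8).
# torch.distributed.new_group is collective over the default group, so every
# rank must call it with identical rank lists in identical order.

WORLD_SIZE = 16
TP_SIZE = 8

def buggy_ranks(rank: int) -> list[int]:
    # Old behavior (hypothetical reconstruction): each rank passes only its
    # own TP peers, so Node 0 and Node 1 call new_group with different args.
    start = (rank // TP_SIZE) * TP_SIZE
    return list(range(start, start + TP_SIZE))

def safe_rank_lists() -> list[list[int]]:
    # Fixed pattern: every rank derives the SAME ordered list of all TP
    # subgroups, calls new_group for each, and keeps the group containing
    # its own rank.
    return [list(range(s, s + TP_SIZE)) for s in range(0, WORLD_SIZE, TP_SIZE)]

# Rank 0 and rank 8 disagree under the buggy scheme -> creation requests
# cannot be matched by the distributed store -> deadlock.
assert buggy_ranks(0) == list(range(0, 8))
assert buggy_ranks(8) == list(range(8, 16))
assert buggy_ranks(0) != buggy_ranks(8)

# Under the safe scheme every rank derives the identical ordered list.
assert safe_rank_lists() == [list(range(0, 8)), list(range(8, 16))]
```

In a real run, each rank would loop over `safe_rank_lists()`, call `torch.distributed.new_group(ranks, backend="gloo")` for every entry, and store the handle whose rank list contains its own rank.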
Reported by: @whybeyoung
CC @stmatengss
Modifications
Accuracy Tests
Benchmarking and Profiling
Checklist