[PD] Fix the infinite loop in deocde resolve_pending_reqs by ShangmingCai · Pull Request #20371 · sgl-project/sglang

ShangmingCai · 2026-03-11T14:35:46Z

Motivation

Fix #20252
The bug can only be reproduced when using sglang-router. mini-lb rejects the request immediately if the prefill server is dead, so the request never enters pending_reqs.

Fix plan:

Fix the infinite loop in decode resolve_pending_reqs
Reduce max_retries: retrying for 20 seconds is too long. sglang-router also has its own retry mechanism (it can handle the case where the request really needs to be completed), so we don't need to retry 20 times and block scheduling. (Also consider running _resolve_pending_reqs asynchronously in the next PR.)
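
To illustrate the second point, here is a minimal sketch of a bounded retry pass, not the actual sglang code. `PendingReq`, `resolve_pending`, `try_resolve`, and the `max_retries=3` default are hypothetical names and values chosen for this example. The idea is that each scheduling pass tries every pending request once, and a request that keeps failing is handed back as failed instead of being retried indefinitely.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class PendingReq:
    # Hypothetical stand-in for a decode-side pending request.
    rid: str
    bootstrap_addr: str
    retries: int = 0


def resolve_pending(
    pending: List[PendingReq],
    try_resolve: Callable[[PendingReq], bool],
    max_retries: int = 3,  # assumed value, not taken from the PR
) -> Tuple[List[PendingReq], List[PendingReq], List[PendingReq]]:
    """Run one pass over the pending requests.

    Returns (resolved, still_pending, failed). Every request is tried
    exactly once per pass, so the pass always terminates.
    """
    resolved, still_pending, failed = [], [], []
    for req in pending:
        if try_resolve(req):
            resolved.append(req)
            continue
        req.retries += 1
        if req.retries >= max_retries:
            # Give up so the caller can abort the request; an upstream
            # router with its own retry logic can resend it elsewhere.
            failed.append(req)
        else:
            still_pending.append(req)
    return resolved, still_pending, failed
```

With a prefill server that never answers, a request leaves the pending list after `max_retries` passes rather than blocking scheduling.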


Signed-off-by: Shangming Cai <csmthu@gmail.com>

ShangmingCai · 2026-03-11T14:36:23Z

/rerun-stage stage-c-test-8-gpu-h20

github-actions · 2026-03-11T14:36:51Z

✅ Triggered stage-c-test-8-gpu-h20 to run independently (skipping dependencies).

github-actions · 2026-03-11T14:36:57Z

🔗 View workflow run

gemini-code-assist · 2026-03-11T14:37:00Z

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!


ShangmingCai · 2026-03-11T15:03:33Z

/rerun-stage stage-c-test-8-gpu-h20

github-actions · 2026-03-11T15:04:09Z

✅ Triggered stage-c-test-8-gpu-h20 to run independently (skipping dependencies).

github-actions · 2026-03-11T15:04:15Z

🔗 View workflow run


ShangmingCai · 2026-03-11T15:14:10Z

/rerun-stage stage-c-test-8-gpu-h20

github-actions · 2026-03-11T15:14:35Z

✅ Triggered stage-c-test-8-gpu-h20 to run independently (skipping dependencies).

github-actions · 2026-03-11T15:14:41Z

🔗 View workflow run

hnyls2002 · 2026-03-11T21:07:29Z

Actually, this bug was introduced by me; I only considered the case of a single prefill instance. Your fix is right: we should group all the requests by bootstrap address.
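
A rough sketch of the grouping idea, under stated assumptions: requests are modeled as `(rid, bootstrap_addr)` pairs, and `fetch_info` is a hypothetical callback that returns `None` when a prefill server is unreachable. None of these names come from the sglang code base. Each distinct bootstrap address is queried once, so one dead prefill instance does not hold up requests bound for a healthy one.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Optional, Tuple


def group_by_bootstrap_addr(reqs: List[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Map each bootstrap address to the request ids that target it."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for rid, addr in reqs:
        groups[addr].append(rid)
    return dict(groups)


def resolve_groups(
    reqs: List[Tuple[str, str]],
    fetch_info: Callable[[str], Optional[dict]],  # hypothetical lookup
) -> Tuple[List[str], List[str]]:
    """Query each distinct address once and split requests by outcome.

    Returns (ready, unresolved) lists of request ids.
    """
    ready: List[str] = []
    unresolved: List[str] = []
    for addr, rids in group_by_bootstrap_addr(reqs).items():
        info = fetch_info(addr)
        if info is not None:
            ready.extend(rids)
        else:
            unresolved.extend(rids)
    return ready, unresolved
```

The unresolved ids can then be retried or aborted per address, independently of the groups that succeeded.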

hnyls2002 · 2026-03-11T21:14:42Z

BTW, can we remove ensure_parallel_info in the receiver? I think that if we gathered the parallel info retrieval in one place (before adding the request to the queue), the code logic would be much cleaner.

@ShangmingCai


[PD] Fix the infinite loop in deocde resolve_pending_reqs

a77c2fe

Signed-off-by: Shangming Cai <csmthu@gmail.com>

ShangmingCai requested review from ByronHsu and hnyls2002 as code owners March 11, 2026 14:35

ShangmingCai mentioned this pull request Mar 11, 2026

[Bug] Large scale PD Disagression bug : cascading failure in Decode/prefill servers when corresponding Prefill/decode servers go offline #20252

Closed

5 tasks

ShangmingCai added 3 commits March 11, 2026 22:48

update metric

508b565

Signed-off-by: Shangming Cai <csmthu@gmail.com>

reduce max_retries

5f89e72

Signed-off-by: Shangming Cai <csmthu@gmail.com>

revert

a3f3296

Signed-off-by: Shangming Cai <csmthu@gmail.com>

reduce max_retries

b2fad09

Signed-off-by: Shangming Cai <csmthu@gmail.com>

ShangmingCai assigned ShangmingCai and hnyls2002 Mar 11, 2026

hnyls2002 approved these changes Mar 11, 2026

View reviewed changes

hnyls2002 merged commit af4c289 into main Mar 11, 2026
87 of 95 checks passed

hnyls2002 deleted the fix_decode_loop branch March 11, 2026 21:11

ShangmingCai mentioned this pull request Mar 13, 2026

[PD] Make pending reqs resolving more robust #20505

Merged

5 tasks

liubiyongge pushed a commit to liubiyongge/sglang that referenced this pull request Mar 13, 2026

[PD] Fix the infinite loop in deocde resolve_pending_reqs (sgl-projec…

a03cf1e

…t#20371) Signed-off-by: Shangming Cai <csmthu@gmail.com>

whybeyoung pushed a commit to whybeyoung/sglang that referenced this pull request Mar 14, 2026

[PD] Fix the infinite loop in deocde resolve_pending_reqs (sgl-projec…

2975def

…t#20371) Signed-off-by: Shangming Cai <csmthu@gmail.com>

Wangzheee pushed a commit to Wangzheee/sglang that referenced this pull request Mar 21, 2026

[PD] Fix the infinite loop in deocde resolve_pending_reqs (sgl-projec…

724d487

…t#20371) Signed-off-by: Shangming Cai <csmthu@gmail.com>

HanHan009527 pushed a commit to bytedance-iaas/sglang that referenced this pull request Apr 3, 2026

[PD] Fix the infinite loop in deocde resolve_pending_reqs (sgl-projec…

9c19a83

…t#20371) Signed-off-by: Shangming Cai <csmthu@gmail.com>

JustinTong0323 pushed a commit to JustinTong0323/sglang that referenced this pull request Apr 7, 2026

[PD] Fix the infinite loop in deocde resolve_pending_reqs (sgl-projec…

62fc6f3

…t#20371) Signed-off-by: Shangming Cai <csmthu@gmail.com>

yhyang201 pushed a commit to yhyang201/sglang that referenced this pull request Apr 22, 2026

[PD] Fix the infinite loop in deocde resolve_pending_reqs (sgl-projec…

379c30f

…t#20371) Signed-off-by: Shangming Cai <csmthu@gmail.com>

hnyls2002 merged 5 commits into main from fix_decode_loop
