[PD] Fix the infinite loop in decode resolve_pending_reqs #20371

Merged: hnyls2002 merged 5 commits into main from fix_decode_loop on Mar 11, 2026
@ShangmingCai (Collaborator) commented Mar 11, 2026

Motivation

Fix #20252
The bug reproduces only with sglang-router. mini-lb rejects a request immediately if the prefill server is dead, so the request never enters pending_reqs.

Fix plan:

  • Fix the infinite loop in decode resolve_pending_reqs
  • Reduce max_retries; 20 seconds is too long. sglang-router also has its own retry mechanism (it can handle the case where the request really needs to complete), so we don't need to retry 20 times and block scheduling in the meantime. (Also consider running _resolve_pending_reqs asynchronously in a follow-up PR.)
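
To illustrate why a smaller retry budget matters, here is a minimal sketch of a bounded, blocking retry loop. It is not the code in this PR: `retry_bounded`, `attempt`, and `interval_s` are hypothetical names, and the one-second interval is only inferred from "20 times" corresponding to roughly "20 seconds" above.

```python
import time
from typing import Callable


def retry_bounded(
    attempt: Callable[[], bool],
    max_retries: int,
    interval_s: float,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call `attempt` up to `max_retries` times, sleeping between failures.

    Hypothetical helper: the caller is blocked for at most
    (max_retries - 1) * interval_s seconds of sleep, plus the attempts themselves.
    """
    for i in range(max_retries):
        if attempt():
            return True
        # Do not sleep after the final failed attempt.
        if i + 1 < max_retries:
            sleep(interval_s)
    return False
```

With a loop of this shape, the retry count directly bounds how long the caller (here, the scheduler) can be stalled by a single unreachable peer, which is why lowering it shortens the worst case.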

Signed-off-by: Shangming Cai <csmthu@gmail.com>
@ShangmingCai (Collaborator, Author) commented:

/rerun-stage stage-c-test-8-gpu-h20

@github-actions (Contributor) commented:

✅ Triggered stage-c-test-8-gpu-h20 to run independently (skipping dependencies).

Signed-off-by: Shangming Cai <csmthu@gmail.com>
Signed-off-by: Shangming Cai <csmthu@gmail.com>
Signed-off-by: Shangming Cai <csmthu@gmail.com>
Signed-off-by: Shangming Cai <csmthu@gmail.com>
@hnyls2002 (Collaborator) commented:

Actually, it is a bug introduced by me; I only considered one prefill instance. Your fix is right, we should group all the requests by the bootstrap address.
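
The grouping idea can be sketched as follows. This is an illustration under stated assumptions, not the merged implementation: `PendingReq`, `resolve_once`, and `is_reachable` are hypothetical names, and the reachability probe stands in for whatever check the decode side actually performs against a prefill bootstrap server.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class PendingReq:
    """Hypothetical stand-in for a request waiting in pending_reqs."""
    rid: str
    bootstrap_addr: str  # bootstrap address of the prefill instance for this request


def resolve_once(
    pending: List[PendingReq],
    is_reachable: Callable[[str], bool],
) -> Tuple[List[PendingReq], List[PendingReq]]:
    """Make one finite pass over the pending requests.

    Requests are grouped by bootstrap address, and each address is probed
    exactly once, so a dead prefill instance only holds back its own group.
    Returns (resolved, unresolved).
    """
    by_addr: Dict[str, List[PendingReq]] = defaultdict(list)
    for req in pending:
        by_addr[req.bootstrap_addr].append(req)

    resolved: List[PendingReq] = []
    unresolved: List[PendingReq] = []
    for addr, reqs in by_addr.items():
        (resolved if is_reachable(addr) else unresolved).extend(reqs)
    return resolved, unresolved
```

Because the pass iterates over a fixed set of groups and never re-enters itself, it terminates regardless of how many prefill instances are involved. Whatever remains in `unresolved` can be retried on a later pass or dropped once a retry budget is exhausted.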

@hnyls2002 merged commit af4c289 into main on Mar 11, 2026
87 of 95 checks passed
@hnyls2002 deleted the fix_decode_loop branch on March 11, 2026 21:11
@hnyls2002 (Collaborator) commented:

BTW, can we remove the ensure_parallel_info in the receiver? I thought that if we could gather the parallel info retrieval in the same place (before adding it to the queue), then the code logic would be much cleaner.

@ShangmingCai

liubiyongge pushed a commit to liubiyongge/sglang that referenced this pull request Mar 13, 2026
whybeyoung pushed a commit to whybeyoung/sglang that referenced this pull request Mar 14, 2026
Wangzheee pushed a commit to Wangzheee/sglang that referenced this pull request Mar 21, 2026
HanHan009527 pushed a commit to bytedance-iaas/sglang that referenced this pull request Apr 3, 2026
JustinTong0323 pushed a commit to JustinTong0323/sglang that referenced this pull request Apr 7, 2026
yhyang201 pushed a commit to yhyang201/sglang that referenced this pull request Apr 22, 2026
Development

Successfully merging this pull request may close these issues.

[Bug] Large scale PD Disaggregation bug: cascading failure in Decode/prefill servers when corresponding Prefill/decode servers go offline

2 participants