[PP] Fix recv tensor dict potential race condition#20341
Conversation
Signed-off-by: Shangming Cai <csmthu@gmail.com>
Summary of Changes: This pull request addresses a potential race condition in pipeline parallelism (PP) communication by introducing a typed messaging system for tensor dictionaries. By tagging messages and temporarily stashing unexpected ones, it ensures that inter-stage tensor exchanges are processed in the correct order even when messages arrive out of sequence.
Code Review
This pull request introduces a typed messaging system for pipeline parallelism to address a potential race condition when receiving tensor dictionaries, tagging messages with msg_kind and handling out-of-order messages via _pp_recv_typed_dict. However, a medium-severity vulnerability was identified in the message stashing logic due to a lack of proper resource constraints and error handling, which could lead to Denial of Service (DoS) through memory exhaustion or infinite loops. Additionally, consider avoiding in-place modification of the tensor dictionary being sent to improve code safety.
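To illustrate the bounded-stash pattern the review asks for, here is a minimal sketch. `TypedReceiver`, `MAX_STASH`, and the message shapes are hypothetical names invented for this example, not the actual code in `scheduler_pp_mixin.py`; the bound addresses the reviewer's memory-exhaustion concern by failing loudly instead of stashing forever.

```python
from collections import deque

MAX_STASH = 64  # hypothetical bound; the real limit is an open design choice


class TypedReceiver:
    """Receive messages of an expected kind, stashing out-of-order ones."""

    def __init__(self):
        self.stash = {}  # msg_kind -> deque of payloads that arrived early

    def recv_typed(self, expected_kind, recv_fn):
        # Serve from the stash first if a matching message arrived earlier.
        if self.stash.get(expected_kind):
            return self.stash[expected_kind].popleft()
        stashed = 0
        while True:
            kind, payload = recv_fn()
            if kind == expected_kind:
                return payload
            # Unexpected kind: keep it for a later recv_typed call, but
            # bound the stash so a flood of wrong-kind messages cannot
            # grow memory without limit or spin forever.
            stashed += 1
            if stashed > MAX_STASH:
                raise RuntimeError(
                    f"stash overflow while waiting for {expected_kind!r}"
                )
            self.stash.setdefault(kind, deque()).append(payload)
```

With this shape, an "output" receive that encounters a "proxy" message first simply parks it, and the next "proxy" receive is served from the stash without touching the transport.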
/rerun-stage stage-c-test-8-gpu-h20

✅ Triggered
I tested the code and got an error:
@chris0927 can you share your start commands? I cannot reproduce either bug — not the key error and not the memory leak. Which release version of SGLang are you on? The main branch? Can you share the commit hash as well?
@ShangmingCai 0.5.9, with only this file modified: sglang-0.5.9/python/sglang/srt/managers/scheduler_pp_mixin.py. Run command for rank 1: python -m sglang.launch_server --dist-init-addr 192.168.9.85:9999 --nnodes 2 --node-rank 1 --model-path /home/GLM-5-FP8 --served-model-name glm-5 --tp 8 --pp-size 2 --trust-remote-code --tool-call-parser glm47 --reasoning-parser glm45 --chunked-prefill-size 2048 --max-running-requests 16 --context-length 202752 --host 0.0.0.0 --port 8000 --attention-backend flashinfer
@ShangmingCai I pulled the file from https://github.com/sgl-project/sglang/blob/aeda4a72a60b3d71c90e143fd7d2c48739012f85/python/sglang/srt/managers/scheduler_pp_mixin.py.
@chris0927 Thanks for the verification. You can also try running some stress tests on this version. If the key error doesn't happen again, we can merge this PR into main as soon as possible. I will run more tests tomorrow.
@ShangmingCai After testing for 2 days, the 'key error' almost never occurred, but this error appeared:
I have the same error. This shape mismatch issue can be reproduced with the multi-turn hicache benchmark at around 700-800 requests.
@ShangmingCai Could you look at this shape mismatch issue, or should I start a new issue? If I change pp-size from 2 to 1, the issue disappears, so I think it is related to PP.
Hi, I am hitting this issue here as well.
This should be another issue, unless it is related to |
/tag-and-rerun-ci |
/rerun-stage stage-c-test-4-gpu-h100

✅ Triggered


Motivation
This PR adds a message kind tag to distinguish proxy tensor dict receives from output tensor dict receives without harming performance, which helps when PP communication hits a race condition in rare cases.
Related issues: #19686 #19750
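A minimal sketch of the tagging idea described above, using hypothetical helper names (`tag_for_send`, `split_on_recv`, `__msg_kind__`) rather than the PR's actual identifiers; copying the dict on the send side also avoids the in-place modification the review flagged.

```python
# Hypothetical kind labels for the two receive paths the PR distinguishes.
PROXY_DICT = "proxy_tensor_dict"
OUTPUT_DICT = "output_tensor_dict"


def tag_for_send(tensor_dict, msg_kind):
    """Attach a message kind without mutating the caller's dict."""
    tagged = dict(tensor_dict)  # shallow copy keeps the original untouched
    tagged["__msg_kind__"] = msg_kind
    return tagged


def split_on_recv(tagged):
    """Recover (msg_kind, payload) from a received tagged dict."""
    payload = dict(tagged)
    kind = payload.pop("__msg_kind__")
    return kind, payload
```

The receiver can then compare the recovered kind against the one it expects and stash mismatches for the other receive path, instead of mis-assigning keys when two receives race.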
Modifications
Accuracy Tests
Benchmarking and Profiling
Checklist
Review Process
/tag-run-ci-label,/rerun-failed-ci,/tag-and-rerun-ci