[CI] Improve PP consistency check success rate #20838
Conversation
Signed-off-by: Shangming Cai <csmthu@gmail.com>
Summary of Changes (Gemini Code Assist): This pull request focuses on improving the stability and success rate of the pipeline parallelism (PP) consistency checks in the CI pipeline. By increasing the dataset size for the accuracy tests, the changes aim to yield more consistent results. Concurrently, the tolerance for acceptable accuracy drops in the consistency check has been adjusted, which should reduce intermittent CI failures and improve the overall robustness of the testing process.
/rerun-stage stage-c-test-4-gpu-h100
✅ Triggered
Code Review
This pull request aims to improve the consistency of pipeline parallelism (PP) tests by increasing the number of questions used in the gsm8k benchmark and fixing an incorrect percentage in an assertion message. The changes are correct and achieve the stated goal. My review includes suggestions to replace the newly introduced magic numbers with constants to improve code maintainability and prevent potential inconsistencies in the future. Specifically, I've pointed out the repeated use of num_questions=512 and the hardcoded accuracy threshold and its corresponding percentage in error messages.
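The reviewer's suggestion about magic numbers could look like the following sketch. The constant names, the 0.02 threshold value, and the function signature are assumptions for illustration, not taken from the actual diff:

```python
# Hypothetical constants for the PP consistency test; names and values
# are illustrative and may differ from the real test file.

# Number of gsm8k questions used for the accuracy benchmark.
NUM_QUESTIONS = 512

# Maximum tolerated accuracy drop between the PP and baseline runs.
ACCURACY_DROP_THRESHOLD = 0.02


def check_pp_consistency(baseline_acc: float, pp_acc: float) -> None:
    """Assert that the PP accuracy stays within the tolerated drop."""
    diff = baseline_acc - pp_acc
    # Deriving the percentage in the message from the constant keeps the
    # error text consistent with the actual check, which was the
    # reviewer's point about the incorrect percentage in the assertion.
    assert diff <= ACCURACY_DROP_THRESHOLD, (
        f"PP accuracy dropped by {diff:.2%}, exceeding the "
        f"{ACCURACY_DROP_THRESHOLD:.0%} threshold"
    )
```

With this shape, changing the tolerance in one place updates both the check and its error message.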
/rerun-ut test/registered/distributed/test_pp_single_node.py
✅ Triggered

/rerun-ut test/registered/distributed/test_pp_single_node.py
✅ Triggered
https://github.com/sgl-project/sglang/actions/runs/23285613694/job/67708248542
Accuracy diff seems a lot better now.

Compared to the previous successful run, the estimated elapsed time increased from 580s to 640s, about 10% more per run, but in return, the success rate should be 100% now.
Motivation
The current PP consistency check is flaky, since the benchmark accuracy of the model is not stable. If the diff is greater than 2%, we need to rerun the whole suite, which wastes our CI resources.
Modifications
num_questions changed from 200 to 512. More requests help us get a more stable accuracy, so the PP accuracy diff will be smaller, which will improve the success rate.
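Why more questions help can be sketched with the standard error of a measured accuracy: treating each answer as an independent Bernoulli trial with success probability p, the accuracy estimate over n questions has standard error sqrt(p(1-p)/n), so going from 200 to 512 questions shrinks run-to-run noise by roughly sqrt(512/200) ≈ 1.6x. A quick illustration (the 0.8 accuracy value is an assumed ballpark, not a measured gsm8k score):

```python
import math


def accuracy_standard_error(p: float, n: int) -> float:
    """Standard error of an accuracy estimate over n questions,
    modeling each answer as an independent Bernoulli trial."""
    return math.sqrt(p * (1 - p) / n)


# Assumed ballpark benchmark accuracy.
p = 0.8
se_200 = accuracy_standard_error(p, 200)  # ~0.028
se_512 = accuracy_standard_error(p, 512)  # ~0.018
```

Since the PP-vs-baseline diff compares two such noisy estimates, tightening each one makes a spurious >2% diff much less likely.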
Accuracy Tests
Benchmarking and Profiling
Checklist
Review Process
/tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci