
[CI] Improve PP consistency check success rate#20838

Merged
ShangmingCai merged 2 commits into main from clean_pp_test
Mar 19, 2026

Conversation

@ShangmingCai
Collaborator

@ShangmingCai ShangmingCai commented Mar 18, 2026

Motivation

The current PP consistency check is flaky because the model's benchmark accuracy is not stable. If the diff is greater than 2%, we have to rerun the whole suite, which wastes CI resources.

  • num_questions changed from 200 to 512; more requests give a more stable accuracy estimate, so the PP accuracy diff shrinks and the success rate improves
  • Fix the error message: "1%" -> "2%"
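The intuition behind the first bullet can be sanity-checked with a quick back-of-the-envelope calculation (not part of the PR): the standard error of an accuracy measured over n questions is sqrt(p(1-p)/n), so raising n from 200 to 512 shrinks run-to-run noise by a factor of sqrt(200/512), about 0.62. The accuracy value below is hypothetical.

```python
import math

def accuracy_stderr(p: float, n: int) -> float:
    # Standard error of a measured accuracy p over n independent questions.
    return math.sqrt(p * (1 - p) / n)

p = 0.85  # hypothetical GSM8K accuracy, for illustration only
se_200 = accuracy_stderr(p, 200)
se_512 = accuracy_stderr(p, 512)
print(f"stderr @200: {se_200:.4f}, @512: {se_512:.4f}, ratio: {se_512 / se_200:.3f}")
```

The ratio is independent of p, so whatever the model's true accuracy, the expected PP diff noise drops by roughly a third.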

Modifications

Accuracy Tests

Benchmarking and Profiling

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.

Signed-off-by: Shangming Cai <csmthu@gmail.com>
@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request focuses on improving the stability and success rate of the pipeline parallelism (PP) consistency checks within the CI pipeline. By increasing the dataset size for accuracy tests, the changes aim to yield more consistent results. Concurrently, the error message for the accuracy-drop threshold in the consistency check has been corrected, which should reduce the likelihood of intermittent CI failures and enhance the overall robustness of the testing process.

Highlights

  • Test Stability Improvement: The number of questions used in the GSM8K accuracy tests has been increased from 200 to 512. This change aims to provide a more stable and reliable accuracy measurement, thereby improving the success rate of PP consistency checks.
  • Consistency Check Tolerance Adjustment: The error message for PP accuracy drops has been updated, changing the reported threshold from '1%' to '2%'. This adjustment reflects a revised tolerance for accuracy variations, contributing to fewer false positives in CI checks.



@ShangmingCai
Collaborator Author

/rerun-stage stage-c-test-4-gpu-h100

@github-actions
Contributor

✅ Triggered stage-c-test-4-gpu-h100 to run independently (skipping dependencies).

@github-actions
Contributor

🔗 View workflow run

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request aims to improve the consistency of pipeline parallelism (PP) tests by increasing the number of questions used in the gsm8k benchmark and fixing an incorrect percentage in an assertion message. The changes are correct and achieve the stated goal. My review includes suggestions to replace the newly introduced magic numbers with constants to improve code maintainability and prevent potential inconsistencies in the future. Specifically, I've pointed out the repeated use of num_questions=512 and the hardcoded accuracy threshold and its corresponding percentage in error messages.
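The "magic numbers to constants" suggestion can be sketched as follows. This is an illustrative example, not the repository's actual code: hoisting the repeated values into module-level constants keeps the threshold and its error message from drifting apart again, which is exactly the "1%" vs "2%" mismatch this PR had to fix.

```python
# Hypothetical constants for the values the review flags as repeated.
NUM_QUESTIONS = 512
PP_ACCURACY_DIFF_THRESHOLD = 0.02  # 2%

def check_pp_consistency(baseline_acc: float, pp_acc: float) -> None:
    # Assert that PP accuracy stays within the threshold of the baseline,
    # deriving the percentage in the message from the same constant.
    diff = abs(baseline_acc - pp_acc)
    assert diff <= PP_ACCURACY_DIFF_THRESHOLD, (
        f"PP accuracy diff {diff:.4f} exceeds "
        f"{PP_ACCURACY_DIFF_THRESHOLD:.0%} threshold"
    )

check_pp_consistency(0.853, 0.846)  # passes: diff 0.007 <= 0.02
```

Because the message is formatted from `PP_ACCURACY_DIFF_THRESHOLD` itself, changing the tolerance in one place updates both the check and what CI reports on failure.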

Comment thread test/registered/distributed/test_pp_single_node.py
Comment thread test/registered/distributed/test_pp_single_node.py
@ShangmingCai
Collaborator Author

/rerun-ut test/registered/distributed/test_pp_single_node.py

@github-actions
Contributor

✅ Triggered /rerun-ut on 4-gpu-h100 runner:

cd test/ && python3 registered/distributed/test_pp_single_node.py

@github-actions
Contributor

🔗 View workflow run

@ShangmingCai
Collaborator Author

/rerun-ut test/registered/distributed/test_pp_single_node.py

@github-actions
Contributor

✅ Triggered /rerun-ut on 4-gpu-h100 runner:

cd test/ && python3 registered/distributed/test_pp_single_node.py

@github-actions
Contributor

🔗 View workflow run

@ShangmingCai
Collaborator Author

https://github.com/sgl-project/sglang/actions/runs/23285613694/job/67708248542
[screenshots: CI run accuracy results]

Accuracy diff seems a lot better now.

@ShangmingCai
Collaborator Author

Compared to the previous successful run:
[screenshot] (https://github.com/sgl-project/sglang/actions/runs/23176040059/job/67719743331?pr=19669)

The estimated elapsed time increased from 580s to 640s.
[screenshot: elapsed time comparison]

About 10% more time per run, but in return, the success rate should be 100% now.

@ShangmingCai
Collaborator Author

[screenshot: full H100 CI results]

Full H100 CI pass.

@ShangmingCai ShangmingCai merged commit 4c52b7f into main Mar 19, 2026
63 of 69 checks passed
@ShangmingCai ShangmingCai deleted the clean_pp_test branch March 19, 2026 11:12
Wangzheee pushed a commit to Wangzheee/sglang that referenced this pull request Mar 21, 2026
Signed-off-by: Shangming Cai <csmthu@gmail.com>
0-693 pushed a commit to 0-693/sglang that referenced this pull request Mar 25, 2026
Signed-off-by: Shangming Cai <csmthu@gmail.com>
dutsc pushed a commit to dutsc/sglang that referenced this pull request Mar 30, 2026
Signed-off-by: Shangming Cai <csmthu@gmail.com>
JustinTong0323 pushed a commit to JustinTong0323/sglang that referenced this pull request Apr 7, 2026
Signed-off-by: Shangming Cai <csmthu@gmail.com>
yhyang201 pushed a commit to yhyang201/sglang that referenced this pull request Apr 22, 2026
Signed-off-by: Shangming Cai <csmthu@gmail.com>
