Add B200 CI test workflow #9604
Summary of Changes
Hello @csahithi, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request introduces initial tests for the B200 GPU runner by populating the per-commit-8-gpu-b200 test suite. The primary goal is to establish a foundational CI test workflow for Blackwell testing, ensuring that critical functionalities related to FP4 operations are validated on the target hardware.
Highlights
- Expanded B200 Test Suite: The per-commit-8-gpu-b200 test suite, previously empty, has been updated to include test_fp4_gemm.py and test_fp4_quantize.py. These additions are crucial for establishing initial test coverage for the B200 runner.
Code Review
This pull request aims to add a new CI test workflow for B200 runners by populating a new test suite. The changes involve adding two new test files to the per-commit-8-gpu-b200 suite in test/srt/run_suite.py. My review identified a critical issue where the paths to these new test files are incorrect, which would likely cause the new CI workflow to fail. I have provided a code suggestion to correct these paths.
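A path problem like the one flagged above can be caught before CI runs. The following is only a minimal sketch: the `TestFile` type, its fields, and the `missing_files` helper are hypothetical stand-ins, not the actual contents of test/srt/run_suite.py. Only the suite name and the two file names come from this PR.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class TestFile:
    # Hypothetical stand-in for a suite entry; the field names are assumptions.
    name: str
    estimated_time: float = 60.0


# Illustrative registry; only the suite name and file names come from the PR.
suites = {
    "per-commit-8-gpu-b200": [
        TestFile("test_fp4_gemm.py"),
        TestFile("test_fp4_quantize.py"),
    ],
}


def missing_files(suite_name, root):
    """Return the suite entries whose paths do not exist relative to root."""
    root = Path(root)
    return [t.name for t in suites[suite_name] if not (root / t.name).is_file()]
```

Calling `missing_files("per-commit-8-gpu-b200", "test/srt")` from the repository root would list any entries that do not resolve to a file, which is the failure mode the review describes.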
thanks! LGTM now.
- The sgl-kernel tests should go to separate files (https://github.com/sgl-project/sglang/blob/main/.github/workflows/pr-test-sgl-kernel.yml). You can create a new pr-test-sgl-kernel-blackwell.yml or reuse the existing one.
- Do not run things in the parent folder (../../python/sglang/test/test_fp4_moe.py). You can import or copy it under test/srt.
@merrymercy Thanks for the comments. Updated the PR with these changes. Created a new pr-test-sgl-kernel-blackwell.yml - https://github.com/sgl-project/sglang/actions/runs/17411110681/job/49438550895?pr=9604
8b98ca0 to 57bbcbd
For the sake of convenience, pasting the B200 job directly here: https://github.com/sgl-project/sglang/actions/runs/17411110648/job/49428467799?pr=9604
https://github.com/sgl-project/sglang/actions/runs/17411110648/job/49539116290?pr=9604 this PR does not change ".github/workflows/pr-test-amd.yml"
Hi @zhyncs, could you please help resolve the conflict and merge this? Thanks!
Signed-off-by: Sahithi Chigurupati <chigurupati.sahithi@gmail.com>
0b70562 to d45aa52
@zhyncs I resolved the merge conflicts, could this PR pls be merged now? Thanks!
cd test/srt
python3 run_suite.py --suite per-commit-8-gpu-deepep

unit-test-backend-8-gpu-b200:
This was separated out into pr-test-blackwell.yml
@@ -0,0 +1,79 @@
name: PR Test (Blackwell)
What is the benefit of separating the test out in a new yml file?
The idea is to add additional tests for Blackwell in the future, including an e2e suite similar to pr-test-amd.yml.
timeout-minutes: 60
run: |
cd test/srt
python3 run_suite.py --suite per-commit-8-gpu-b200 --auto-partition-id 0 --auto-partition-size 1
Currently we have per-commit unit testing.
Nightly should be testing something bigger.
Yes, similar to the above, the idea is to add a lot more tests going forward, including, for example, end-to-end tests. This is just a starting point.
Currently, we want some basic CI at least to be enabled on blackwell (there is no CI running on B200 nodes at the moment). This PR serves that purpose to just add basic CI testing which is why it is crucial to get this merged ASAP.
this job is actually running on B200 CI
Re: "there is no CI running on B200 nodes at the moment"
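For readers unfamiliar with the `--auto-partition-id 0 --auto-partition-size 1` flags in the workflow step above, here is a rough sketch of what such partitioning could look like. It assumes the flags select one of N buckets balanced by estimated runtime; the function below is illustrative and may not match the logic in run_suite.py.

```python
def auto_partition(files, rank, size):
    """Greedily split (name, estimated_seconds) pairs into `size` buckets
    and return the names assigned to bucket `rank`."""
    buckets = [[] for _ in range(size)]
    totals = [0.0] * size
    # Place the longest tests first, each into the currently lightest bucket.
    for name, seconds in sorted(files, key=lambda f: f[1], reverse=True):
        lightest = totals.index(min(totals))
        buckets[lightest].append(name)
        totals[lightest] += seconds
    return buckets[rank]
```

Under this assumption, a partition size of 1 with id 0 means the single B200 job runs every test in the suite, and adding more jobs later would only require raising the size and giving each job its own id.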
PIP_SUFFIX="--break-system-packages"
pip uninstall sgl-kernel -y $PIP_SUFFIX || true
pip uninstall ~gl-kernel -y $PIP_SUFFIX || true
pip install torch==2.8.0 torchvision torchaudio --index-url https://download.pytorch.org/whl/test/cu129 $PIP_SUFFIX
maybe we shouldn't hardcode the torch & cuda versions here?
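One way to act on this suggestion would be to read the versions from the environment and fall back to the current values. This is a sketch only: `TORCH_VERSION` and `TORCH_INDEX_URL` are hypothetical variable names, not ones the repository defines, and the defaults are simply the values hardcoded in the hunk above.

```python
import os
import shlex


def torch_install_command(env=None):
    """Build the pip install argument list, letting the environment
    override the torch version and wheel index."""
    env = os.environ if env is None else env
    # TORCH_VERSION and TORCH_INDEX_URL are hypothetical names (assumption).
    version = env.get("TORCH_VERSION", "2.8.0")
    index_url = env.get(
        "TORCH_INDEX_URL", "https://download.pytorch.org/whl/test/cu129"
    )
    suffix = env.get("PIP_SUFFIX", "--break-system-packages")
    args = ["pip", "install", f"torch=={version}", "torchvision", "torchaudio",
            "--index-url", index_url]
    # PIP_SUFFIX may hold several flags, so split it like a shell would.
    args += shlex.split(suffix)
    return args
```

The returned list could be passed to `subprocess.run(..., check=True)`, and a workflow could then set the two variables in one place instead of editing the install script.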
Motivation
Add test suite to run on B200 runner for blackwell testing
Modifications
pr-test-blackwell
Accuracy Tests
Tested the newly added test suite on b200
per-commit-8-gpu-b200.txt