Add B200 CI test workflow #9604

Closed
csahithi wants to merge 1 commit into sgl-project:main from csahithi:b200-testing

csahithi (Collaborator) commented Aug 25, 2025

Motivation

Add a test suite to run on the B200 runner for Blackwell testing.

Modifications

  • Added a new GitHub workflow, pr-test-blackwell
  • Updated the B200 test suite (which is currently empty)

Accuracy Tests

Tested the newly added test suite on B200:
per-commit-8-gpu-b200.txt

Benchmarking and Profiling

Checklist

gemini-code-assist (bot, Contributor) left a comment

Summary of Changes

Hello @csahithi, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces initial tests for the B200 GPU runner by populating the per-commit-8-gpu-b200 test suite. The primary goal is to establish a foundational CI test workflow for Blackwell testing, ensuring that critical functionalities related to FP4 operations are validated on the target hardware.

Highlights

  • Expanded B200 Test Suite: The per-commit-8-gpu-b200 test suite, previously empty, has been updated to include test_fp4_gemm.py and test_fp4_quantize.py. These additions are crucial for establishing initial test coverage for the B200 runner.

gemini-code-assist (bot, Contributor) left a comment

Code Review

This pull request aims to add a new CI test workflow for B200 runners by populating a new test suite. The changes involve adding two new test files to the per-commit-8-gpu-b200 suite in test/srt/run_suite.py. My review identified a critical issue where the paths to these new test files are incorrect, which would likely cause the new CI workflow to fail. I have provided a code suggestion to correct these paths.

Comment thread test/srt/run_suite.py Outdated (2 threads)
Comment thread .github/workflows/pr-test-blackwell.yml Outdated (3 threads)

@kushanam (Collaborator)

thanks! LGTM now.
cc: @zhyncs

Comment thread test/srt/run_suite.py Outdated
Contributor

  1. The sgl-kernel tests should go in separate files (https://github.com/sgl-project/sglang/blob/main/.github/workflows/pr-test-sgl-kernel.yml). You can create a new pr-test-sgl-kernel-blackwell.yml or reuse the existing one.
  2. Do not run things from the parent folder (../../python/sglang/test/test_fp4_moe.py). You can import it or copy it under test/srt.
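
Editor's sketch of the second point: a small guard that flags suite entries pointing outside the suite directory. The function name `check_suite_paths` and the idea of checking entries this way are hypothetical and not part of the PR or of run_suite.py; it only illustrates the rule that suite entries should live under test/srt.

```shell
# Hypothetical helper (not part of the PR): flag suite entries that resolve
# outside the suite directory, such as ../../python/sglang/test/test_fp4_moe.py.
check_suite_paths() {
  rc=0
  for entry in "$@"; do
    case "$entry" in
      /*|..|../*|*/..|*/../*)
        # Absolute paths and any ".." component leave the suite directory.
        echo "outside suite directory: $entry" >&2
        rc=1
        ;;
    esac
  done
  return "$rc"
}

# Example: the first entry is accepted, the second is rejected.
check_suite_paths test_fp4_moe.py && echo "ok"
check_suite_paths ../../python/sglang/test/test_fp4_moe.py || echo "rejected"
```

A check like this could run before the suite starts, so a misplaced entry fails fast instead of surfacing as a missing-file error mid-run.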

Collaborator Author

@merrymercy Thanks for the comments. Updated the PR with these changes. Created a new pr-test-sgl-kernel-blackwell.yml - https://github.com/sgl-project/sglang/actions/runs/17411110681/job/49438550895?pr=9604

csahithi force-pushed the b200-testing branch 2 times, most recently from 8b98ca0 to 57bbcbd (September 2, 2025 17:23)
Fridge003 self-assigned this Sep 3, 2025

nWEIdia commented Sep 3, 2025

For the sake of convenience: pasting the B200 job directly here: https://github.com/sgl-project/sglang/actions/runs/17411110648/job/49428467799?pr=9604

nWEIdia commented Sep 3, 2025

https://github.com/sgl-project/sglang/actions/runs/17411110648/job/49539116290?pr=9604
This PR does not change .github/workflows/pr-test-amd.yml, so I guess we can ignore half of the 6 failures?

Comment thread scripts/ci/ci_install_dependency.sh Outdated

nWEIdia commented Sep 4, 2025

Hi @zhyncs, could you please help resolve the conflict and merge this? Thanks!

Signed-off-by: Sahithi Chigurupati <chigurupati.sahithi@gmail.com>
csahithi (Collaborator, Author) commented Sep 4, 2025

@zhyncs I resolved the merge conflicts, could this PR pls be merged now? Thanks!

cd test/srt
python3 run_suite.py --suite per-commit-8-gpu-deepep

unit-test-backend-8-gpu-b200:

@csahithi Why did we remove unit-test-backend-8-gpu-b200 from pr-test?

This job is actually running on B200 CI.

Collaborator Author

This was separated out into pr-test-blackwell.yml

@@ -0,0 +1,79 @@
name: PR Test (Blackwell)

What is the benefit of separating the test out in a new yml file?

Collaborator Author

The idea is to add more Blackwell tests in the future, including an e2e suite similar to pr-test-amd.yml.

timeout-minutes: 60
run: |
cd test/srt
python3 run_suite.py --suite per-commit-8-gpu-b200 --auto-partition-id 0 --auto-partition-size 1

Currently we have per-commit unit testing.
Nightly should be testing something bigger.

Collaborator Author

Yes, similar to the above: the idea is to add many more tests going forward, for example end-to-end tests. This is just a starting point.

Collaborator Author

Currently, we want at least some basic CI enabled on Blackwell (there is no CI running on B200 nodes at the moment). This PR serves that purpose by adding basic CI testing, which is why it is crucial to get it merged ASAP.

Re: "there is no CI running on B200 nodes at the moment": this job is actually running on B200 CI.

PIP_SUFFIX="--break-system-packages"
pip uninstall sgl-kernel -y $PIP_SUFFIX || true
pip uninstall ~gl-kernel -y $PIP_SUFFIX || true
pip install torch==2.8.0 torchvision torchaudio --index-url https://download.pytorch.org/whl/test/cu129 $PIP_SUFFIX

maybe we shouldn't hardcode the torch & cuda versions here?
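
Editor's sketch of one way to act on this suggestion: read the versions from environment variables and fall back to the values currently pinned in the diff. The variable names `TORCH_VERSION`, `CUDA_TAG`, `TORCH_INDEX_URL`, and `INSTALL_CMD` are assumptions made for illustration and do not come from ci_install_dependency.sh. The command is only printed here, not executed.

```shell
# Hypothetical overrides; defaults mirror the values pinned in the diff above.
TORCH_VERSION="${TORCH_VERSION:-2.8.0}"
CUDA_TAG="${CUDA_TAG:-cu129}"
PIP_SUFFIX="${PIP_SUFFIX:---break-system-packages}"

# The index URL is derived from the CUDA tag instead of being written out twice.
TORCH_INDEX_URL="https://download.pytorch.org/whl/test/${CUDA_TAG}"

# Dry run: assemble and print the install command for review.
INSTALL_CMD="pip install torch==${TORCH_VERSION} torchvision torchaudio --index-url ${TORCH_INDEX_URL} ${PIP_SUFFIX}"
echo "${INSTALL_CMD}"
```

With this shape, a workflow could set `TORCH_VERSION` or `CUDA_TAG` per runner while the script keeps a single default in one place.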

Fridge003 closed this Nov 3, 2025

7 participants