[diffusion][CI]: route multimodal component accuracy through run_suite by Ratish1 · Pull Request #21960 · sgl-project/sglang

Ratish1 · 2026-04-02T16:19:09Z

Motivation

Related: #18709.

The multimodal diffusion component-accuracy jobs were temporarily wired with explicit workflow-side pytest / torchrun commands so they could pass CI correctly.

That fixed the immediate CI failures, but it left these jobs outside the normal multimodal runner path. The follow-up goal here is to make python/sglang/multimodal_gen/test/run_suite.py the only entrypoint again, while preserving the execution behavior that component-accuracy actually needs:

- 1-GPU accuracy must keep file/process isolation
- 2-GPU accuracy must launch with distributed world size 2
- existing multimodal server/unit suites should remain unchanged

This PR brings component-accuracy back under run_suite.py without broadening the runner design for unrelated suites.

Modifications

1. Add component-accuracy suites to `run_suite.py`

New suites:

component-accuracy-1-gpu
component-accuracy-2-gpu

2. Add a narrow accuracy-only execution branch in `run_suite.py`

The default multimodal runner path is preserved for existing suites:

collect test items
partition by item
run plain pytest

For the two component-accuracy suites only:

- partition by file
- run each file separately
- launch 2-GPU accuracy with:
  - python -m torch.distributed.run --nproc_per_node=2 -m pytest

This keeps the behavior that was required for correctness:

1-GPU accuracy keeps file-level process isolation
2-GPU accuracy gets the distributed launcher it needs
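
As a rough illustration of the accuracy-only branch described above, the sketch below builds one command per test file and switches to the distributed launcher for the 2-GPU suite. The suite names and the launcher invocation come from this description; the `ACCURACY_SUITES` mapping and the `build_commands` helper are hypothetical names used only for illustration and are not taken from `run_suite.py`.

```python
import sys

# Suite names are from the PR description; the mapping itself is an assumption.
ACCURACY_SUITES = {
    "component-accuracy-1-gpu": 1,
    "component-accuracy-2-gpu": 2,
}


def build_commands(suite: str, files: list[str]) -> list[list[str]]:
    """Return one command per test file so each file runs in its own process."""
    nproc = ACCURACY_SUITES[suite]
    if nproc > 1:
        # Multi-GPU accuracy: run pytest under the torch distributed launcher.
        prefix = [
            sys.executable, "-m", "torch.distributed.run",
            f"--nproc_per_node={nproc}", "-m", "pytest",
        ]
    else:
        # Single-GPU accuracy: plain pytest, still one process per file.
        prefix = [sys.executable, "-m", "pytest"]
    return [prefix + [path] for path in files]
```

Each returned command could then be passed to `subprocess.run` in turn, which is what gives file-level process isolation in this sketch.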

3. Move multimodal component-accuracy jobs back to `run_suite.py`

Updated:

.github/workflows/pr-test-multimodal-gen.yml

The workflow no longer invokes explicit file-level pytest / torchrun commands for component accuracy.

Instead, both jobs call:

python3 sglang/multimodal_gen/test/run_suite.py --suite component-accuracy-1-gpu ...
python3 sglang/multimodal_gen/test/run_suite.py --suite component-accuracy-2-gpu ...

with partitions [0, 1], so the runner remains the single entrypoint and autopartition is preserved.
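
To make the file-level partitioning concrete, here is a minimal sketch of how a list of suite files might be split across the two partitions. It assumes a simple round-robin split over sorted paths. The real runner's autopartition logic may balance differently (for example by estimated runtime), and `partition_files` is a hypothetical helper, not a function from `run_suite.py`.

```python
def partition_files(
    files: list[str], partition_id: int, partition_size: int
) -> list[str]:
    """Pick the files for one partition using a round-robin split (assumed)."""
    if not 0 <= partition_id < partition_size:
        raise ValueError(
            f"partition_id {partition_id} is out of range for size {partition_size}"
        )
    # Sort first so every CI job computes the same assignment.
    ordered = sorted(files)
    return ordered[partition_id::partition_size]
```

With partitions [0, 1], every file lands in exactly one partition, so the two CI jobs together cover the whole suite without overlap.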

High-Level Flow

flowchart TD
    A[pr-test-multimodal-gen.yml] --> B[run_suite.py]

    B --> C{Suite type}

    C -->|unit / 1-gpu / 2-gpu| D[Collect pytest items]
    D --> E[Partition by item]
    E --> F[Run plain pytest]

    C -->|component-accuracy-1-gpu| G[Resolve suite files]
    G --> H[Partition by file]
    H --> I[Run one file per pytest process]

    C -->|component-accuracy-2-gpu| J[Resolve suite files]
    J --> K[Partition by file]
    K --> L[Run one file per distributed pytest process]
    L --> M[python -m torch.distributed.run --nproc_per_node=2 -m pytest]

Accuracy Tests

Speed Tests and Profiling

Checklist

Format your code according to the Format code with pre-commit.
Add unit tests according to the Run and add unit tests.
Update documentation according to Write documentations.
Provide accuracy and speed benchmark results according to Test the accuracy and Benchmark the speed.
Follow the SGLang code style guidance.

Review and Merge Process

Ping Merge Oncalls to start the process. See the PR Merge Process.
Get approvals from CODEOWNERS and other reviewers.
Trigger CI tests with comments or contact authorized users to do so.
- Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.

gemini-code-assist · 2026-04-02T16:19:14Z

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

mickqian · 2026-04-03T06:39:10Z

thanks so much, really appreciate this. could you also make sure this test is robust and not flaky

mickqian · 2026-04-03T06:39:24Z

/tag-and-rerun-ci

Ratish1 · 2026-04-03T06:49:13Z

> thanks so much, really appreciate this. could you also make sure this test is robust and not flaky

I ran the accuracy tests locally with this PR and they worked fine, so it should be robust.

Ratish1 · 2026-04-03T08:14:31Z

Hey @mickqian, I figured out why multimodal-gen-test-2-gpu (0) (pull_request) is failing.

I fixed wan2_2_i2v_a14b_2gpu to explicitly use num_gpus=2. It was previously falling back to the default of 1, which made the 2-GPU CI shard appear stuck and eventually fail.
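
The failure mode above (a 2-GPU case silently inheriting a default of 1) can be sketched as follows. This is only an illustration: `CaseConfig` and `gpu_mismatches` are hypothetical names, and the actual test-case definitions in the repository may look quite different. Only the case name and the `num_gpus` setting come from the comment.

```python
from dataclasses import dataclass


@dataclass
class CaseConfig:
    """Hypothetical test-case config; num_gpus falls back to 1 when omitted."""
    case_id: str
    num_gpus: int = 1


def gpu_mismatches(cases: list[CaseConfig], required_gpus: int) -> list[str]:
    """Return the ids of cases whose num_gpus does not match the suite."""
    return [c.case_id for c in cases if c.num_gpus != required_gpus]


# Omitting num_gpus leaves the default in place, which is the bug described.
before_fix = CaseConfig("wan2_2_i2v_a14b_2gpu")
# Setting it explicitly matches the 2-GPU suite.
after_fix = CaseConfig("wan2_2_i2v_a14b_2gpu", num_gpus=2)
```

A check like `gpu_mismatches` could, under these assumptions, flag such a case before a shard is launched rather than after it hangs.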

Ratish1 · 2026-04-03T13:43:37Z

/tag-and-rerun-ci

yhyang201 · 2026-04-04T04:06:35Z

/rerun-failed-ci

yhyang201 · 2026-04-04T05:52:18Z

/rerun-failed-ci

Ratish1 · 2026-04-04T11:53:37Z

/tag-and-rerun-ci

yhyang201 · 2026-04-04T17:12:04Z

/rerun-failed-ci

yhyang201 · 2026-04-04T18:20:19Z

/rerun-failed-ci

yhyang201 · 2026-04-04T19:28:54Z

/rerun-failed-ci

yhyang201 · 2026-04-04T20:42:21Z

/rerun-failed-ci

Ratish1 · 2026-04-05T09:30:18Z

Hey @mickqian, I routed the tests through run_suite.py and made sure the component accuracy is stable. Lmk if this looks good to you. Thanks.

yhyang201 · 2026-04-09T01:37:36Z

/rerun-failed-ci

yhyang201 · 2026-04-10T01:39:49Z

/rerun-failed-ci

yhyang201 · 2026-04-10T06:14:29Z

/rerun-failed-ci

yhyang201 · 2026-04-10T08:48:38Z

/rerun-failed-ci

yhyang201 · 2026-04-10T13:30:17Z

/rerun-failed-ci

BBuf · 2026-04-10T15:05:59Z

https://github.com/sgl-project/sglang/actions/runs/24191047395/job/70754459149?pr=21960

[diffusion][CI]: route multimodal component accuracy through run_suite

f63a8c6

Ratish1 requested review from Fridge003, Kangyan-Zhou, bingxche, ispobock, merrymercy, mickqian, ping1jing2 and yhyang201 as code owners April 2, 2026 16:19

github-actions Bot added the Multi-modal (multi-modal language model) and diffusion (SGLang Diffusion) labels Apr 2, 2026

fix conflict

cfd2402

Ratish1 mentioned this pull request Apr 3, 2026

[diffusion][CI]: Add individual component accuracy CI for diffusion models #18709

Merged

5 tasks

add sleep after server shutdown

ee86709

github-actions Bot added the run-ci label Apr 3, 2026

fix 2 gpu wan case

57b20aa

Ratish1 removed the run-ci label Apr 3, 2026

upd

3e27bc8

github-actions Bot added the run-ci label Apr 3, 2026

Ratish1 added 2 commits April 3, 2026 20:23

fix

79c7fff

fix

e38fab4

Merge remote-tracking branch 'upstream/main' into fix/ci-acc

29da4c6

fix LTX-2 comp accuracy

145771f

github-actions Bot added the run-ci label Apr 4, 2026

Ratish1 added 4 commits April 8, 2026 23:11

fix conflict

810e36a

match diffusers path

50877ff

upd

25dff16

clear transformer caches before staged accuracy teardown

8038eb6

Ratish1 added 5 commits April 9, 2026 12:04

fix LTX2.3

31324be

Merge remote-tracking branch 'upstream/main' into fix/ci-acc

2211ca8

fix

ccd0538

fix

fccf952

Fix component accuracy loading and teardown

69672dd

BBuf approved these changes Apr 10, 2026

BBuf merged commit cf5ad12 into sgl-project:main Apr 10, 2026
403 of 456 checks passed

Ratish1 deleted the fix/ci-acc branch April 10, 2026 15:07

Ratish1 mentioned this pull request Apr 10, 2026

[CI] dynamic load-balanced partitioning for diffusion CI #15528

Merged

6 tasks

Fridge003 pushed a commit that referenced this pull request Apr 11, 2026

[diffusion][CI]: route multimodal component accuracy through run_suite (#21960)

e92853f

pyc96 pushed a commit to pyc96/sglang that referenced this pull request Apr 14, 2026

[diffusion][CI]: route multimodal component accuracy through run_suite (sgl-project#21960)

b2af378

yhyang201 pushed a commit to yhyang201/sglang that referenced this pull request Apr 22, 2026

[diffusion][CI]: route multimodal component accuracy through run_suite (sgl-project#21960)

3054ba4

4 participants
