[diffusion][CI]: route multimodal component accuracy through run_suite#21960

Merged
BBuf merged 19 commits into sgl-project:main from Ratish1:fix/ci-acc
Apr 10, 2026

@Ratish1
Collaborator

@Ratish1 Ratish1 commented Apr 2, 2026

Motivation

Related to #18709.

The multimodal diffusion component-accuracy jobs were temporarily wired with explicit workflow-side pytest / torchrun commands so they could pass CI correctly.

That fixed the immediate CI failures, but it left these jobs outside the normal multimodal runner path. The follow-up goal here is to make python/sglang/multimodal_gen/test/run_suite.py the only entrypoint again, while preserving the execution behavior that component-accuracy actually needs:

  • 1-GPU accuracy must keep file/process isolation
  • 2-GPU accuracy must launch with distributed world size 2
  • existing multimodal server/unit suites should remain unchanged

This PR brings component-accuracy back under run_suite.py without broadening the runner design for unrelated suites.

Modifications

1. Add component-accuracy suites to run_suite.py

New suites:

  • component-accuracy-1-gpu
  • component-accuracy-2-gpu

2. Add a narrow accuracy-only execution branch in run_suite.py

The default multimodal runner path is preserved for existing suites:

  • collect test items
  • partition by item
  • run plain pytest

For the two component-accuracy suites only:

  • partition by file
  • run each file separately
  • launch 2-GPU accuracy with:
    • python -m torch.distributed.run --nproc_per_node=2 -m pytest

This keeps the behavior that was required for correctness:

  • 1-GPU accuracy keeps file-level process isolation
  • 2-GPU accuracy gets the distributed launcher it needs
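The branch described above can be sketched roughly as follows. This is a minimal illustration, not the code in run_suite.py: the names `ACCURACY_SUITES` and `build_commands` are hypothetical, and the default path is simplified to a single pytest invocation (the real runner collects and partitions individual test items first). Only the launcher command line is taken from the description.

```python
import sys
from typing import List

# Hypothetical constant; the suite names themselves come from this PR.
ACCURACY_SUITES = {"component-accuracy-1-gpu", "component-accuracy-2-gpu"}


def build_commands(suite: str, files: List[str]) -> List[List[str]]:
    """Return one argv list per process that should be launched."""
    if suite not in ACCURACY_SUITES:
        # Default path (simplified): one plain pytest run over all targets.
        return [[sys.executable, "-m", "pytest", *files]]

    if suite == "component-accuracy-2-gpu":
        # Distributed launcher with world size 2, as stated in the description.
        prefix = [
            sys.executable, "-m", "torch.distributed.run",
            "--nproc_per_node=2", "-m", "pytest",
        ]
    else:
        # 1-GPU accuracy: plain pytest, but still one process per file.
        prefix = [sys.executable, "-m", "pytest"]

    # File-level isolation: every file gets its own command (own process).
    return [prefix + [path] for path in files]
```

Each returned list could then be passed to `subprocess.run`, so a crash or leaked GPU state in one file cannot affect the next.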

3. Move multimodal component-accuracy jobs back to run_suite.py

Updated:

  • .github/workflows/pr-test-multimodal-gen.yml

The workflow no longer invokes explicit file-level pytest / torchrun commands for component accuracy.

Instead, both jobs call:

  • python3 sglang/multimodal_gen/test/run_suite.py --suite component-accuracy-1-gpu ...
  • python3 sglang/multimodal_gen/test/run_suite.py --suite component-accuracy-2-gpu ...

with partitions [0, 1], so the runner remains the single entrypoint and autopartition is preserved.
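As an illustration of file-level partitioning across the two shards, here is a small sketch. It assumes a simple round-robin split over sorted file names; the actual autopartition logic in run_suite.py may weigh files differently (for example by estimated runtime), and `partition_files` is a hypothetical helper name.

```python
from typing import List, Sequence


def partition_files(
    files: Sequence[str], partition_id: int, total_partitions: int
) -> List[str]:
    """Pick the files that belong to one CI shard (round-robin assumption)."""
    if total_partitions < 1 or not 0 <= partition_id < total_partitions:
        raise ValueError("partition_id must be in [0, total_partitions)")
    # Sort first so every shard sees the same ordering and the split is stable.
    return sorted(files)[partition_id::total_partitions]
```

With partitions [0, 1], shard 0 would take every even-indexed file and shard 1 every odd-indexed one, so each file runs exactly once across the two jobs.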

High-Level Flow

flowchart TD
    A[pr-test-multimodal-gen.yml] --> B[run_suite.py]

    B --> C{Suite type}

    C -->|unit / 1-gpu / 2-gpu| D[Collect pytest items]
    D --> E[Partition by item]
    E --> F[Run plain pytest]

    C -->|component-accuracy-1-gpu| G[Resolve suite files]
    G --> H[Partition by file]
    H --> I[Run one file per pytest process]

    C -->|component-accuracy-2-gpu| J[Resolve suite files]
    J --> K[Partition by file]
    K --> L[Run one file per distributed pytest process]
    L --> M[python -m torch.distributed.run --nproc_per_node=2 -m pytest]

Accuracy Tests

Speed Tests and Profiling

Checklist

Review and Merge Process

  1. Ping Merge Oncalls to start the process. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
  4. After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.

@gemini-code-assist
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@github-actions github-actions Bot added the Multi-modal (multi-modal language model) and diffusion (SGLang Diffusion) labels Apr 2, 2026
@mickqian
Collaborator

mickqian commented Apr 3, 2026

Thanks so much, really appreciate this. Could you also make sure this test is robust and not flaky?

@mickqian
Collaborator

mickqian commented Apr 3, 2026

/tag-and-rerun-ci

@github-actions github-actions Bot added the run-ci label Apr 3, 2026
@Ratish1
Collaborator Author

Ratish1 commented Apr 3, 2026

> Thanks so much, really appreciate this. Could you also make sure this test is robust and not flaky?

I ran the accuracy tests locally with this PR and they passed, so it should be robust.

@Ratish1
Collaborator Author

Ratish1 commented Apr 3, 2026

Hey @mickqian, I figured out why multimodal-gen-test-2-gpu (0) (pull_request) is failing.

I fixed wan2_2_i2v_a14b_2gpu to explicitly use num_gpus=2. It was previously falling back to the default of 1, which made the 2-GPU CI shard appear stuck and eventually fail.
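To make that failure mode concrete, here is a hedged sketch of how a silent default can place a 1-GPU case in a 2-GPU shard, along with a guard that would surface the mismatch early. `CaseConfig` and `find_gpu_mismatches` are hypothetical names for illustration only; they do not reflect the actual test configuration classes in the repository.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class CaseConfig:
    """Hypothetical test-case config; num_gpus silently defaults to 1."""
    case_id: str
    num_gpus: int = 1


def find_gpu_mismatches(cases: Iterable[CaseConfig], expected_gpus: int) -> List[str]:
    """Return the ids of cases whose num_gpus differs from the suite's GPU count."""
    return [c.case_id for c in cases if c.num_gpus != expected_gpus]


# Before the fix: num_gpus is omitted, so the case falls back to 1.
before = [CaseConfig("wan2_2_i2v_a14b_2gpu")]
# After the fix: num_gpus is set explicitly.
after = [CaseConfig("wan2_2_i2v_a14b_2gpu", num_gpus=2)]
```

Calling `find_gpu_mismatches(before, expected_gpus=2)` flags the case, while the `after` list yields no mismatches. A check like this, run when a suite is assembled, would fail fast instead of letting the shard hang.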

@Ratish1 Ratish1 removed the run-ci label Apr 3, 2026
@Ratish1
Collaborator Author

Ratish1 commented Apr 3, 2026

/tag-and-rerun-ci

@github-actions github-actions Bot added the run-ci label Apr 3, 2026
@yhyang201
Collaborator

/rerun-failed-ci

1 similar comment
@yhyang201
Collaborator

/rerun-failed-ci

@Ratish1
Collaborator Author

Ratish1 commented Apr 4, 2026

/tag-and-rerun-ci

@github-actions github-actions Bot added the run-ci label Apr 4, 2026
@yhyang201
Collaborator

/rerun-failed-ci

3 similar comments
@yhyang201
Collaborator

/rerun-failed-ci

@yhyang201
Collaborator

/rerun-failed-ci

@yhyang201
Collaborator

/rerun-failed-ci

@Ratish1
Collaborator Author

Ratish1 commented Apr 5, 2026

Hey @mickqian, I routed the tests through run_suite.py and made sure the component-accuracy runs are stable. Let me know if this looks good to you. Thanks.

@yhyang201
Collaborator

/rerun-failed-ci

@yhyang201
Collaborator

/rerun-failed-ci

3 similar comments
@yhyang201
Collaborator

/rerun-failed-ci

@yhyang201
Collaborator

/rerun-failed-ci

@yhyang201
Collaborator

/rerun-failed-ci

@BBuf
Collaborator

BBuf commented Apr 10, 2026

@BBuf BBuf merged commit cf5ad12 into sgl-project:main Apr 10, 2026
403 of 456 checks passed
@Ratish1 Ratish1 deleted the fix/ci-acc branch April 10, 2026 15:07
Labels

diffusion (SGLang Diffusion), Multi-modal (multi-modal language model), run-ci

4 participants