[diffusion][CI]: route multimodal component accuracy through run_suite #21960
BBuf merged 19 commits into sgl-project:main
Conversation
Thanks so much, really appreciate this. Could you also make sure this test is robust and not flaky?
I tested the accuracy tests locally with this PR and they worked fine; it should be robust.
Hey @mickqian, figured out why multimodal-gen-test-2-gpu (0) (pull_request) is failing. Fixed wan2_2_i2v_a14b_2gpu to explicitly use num_gpus=2; it was previously falling back to the default of 1, which made the 2-GPU CI shard appear stuck and eventually fail.
Hey @mickqian, I routed the test through the run_suite file and made sure the component accuracy is stable. Let me know if this looks good to you. Thanks.
Motivation
Related #18709.
The multimodal diffusion component-accuracy jobs were temporarily wired with explicit workflow-side pytest/torchrun commands so they could pass CI correctly.
That fixed the immediate CI failures, but it left these jobs outside the normal multimodal runner path. The follow-up goal here is to make python/sglang/multimodal_gen/test/run_suite.py the only entrypoint again, while preserving the execution behavior that component-accuracy actually needs:
This PR brings component-accuracy back under run_suite.py without broadening the runner design for unrelated suites.
Modifications
1. Add component-accuracy suites to run_suite.py
New suites:
component-accuracy-1-gpu
component-accuracy-2-gpu
2. Add a narrow accuracy-only execution branch in run_suite.py
The default multimodal runner path is preserved for existing suites:
pytest
For the two component-accuracy suites only:
python -m torch.distributed.run --nproc_per_node=2 -m pytest
This keeps the behavior that was required for correctness:
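To make the two launch modes in this section concrete, here is a minimal sketch of how a per-file command could be chosen by suite. It is an illustration only, not the code in run_suite.py: the names `ACCURACY_SUITE_NPROC` and `build_command` are hypothetical, the process counts are inferred from the suite names, and the test file path in the usage note is a placeholder.

```python
import sys
from typing import Dict, List

# Hypothetical mapping: suite names are taken from this PR, while the
# process counts are an assumption inferred from the "-1-gpu" / "-2-gpu" suffixes.
ACCURACY_SUITE_NPROC: Dict[str, int] = {
    "component-accuracy-1-gpu": 1,
    "component-accuracy-2-gpu": 2,
}


def build_command(suite: str, test_file: str) -> List[str]:
    """Return the argv for running a single test file under the given suite."""
    nproc = ACCURACY_SUITE_NPROC.get(suite)
    if nproc is not None and nproc > 1:
        # Multi-GPU accuracy suite: launch pytest through the distributed runner.
        return [
            sys.executable, "-m", "torch.distributed.run",
            f"--nproc_per_node={nproc}",
            "-m", "pytest", test_file,
        ]
    # Every other suite (including single-GPU accuracy): plain pytest.
    return [sys.executable, "-m", "pytest", test_file]
```

For example, `build_command("component-accuracy-2-gpu", "test_example.py")` yields a `torch.distributed.run` invocation with `--nproc_per_node=2`, while any other suite name falls through to plain `pytest`. Keeping the distributed launch behind an explicit suite lookup is one way to avoid changing behavior for unrelated suites.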
3. Move multimodal component-accuracy jobs back to run_suite.py
Updated:
.github/workflows/pr-test-multimodal-gen.yml
The workflow no longer invokes explicit file-level pytest/torchrun commands for component accuracy.
Instead, both jobs call:
python3 sglang/multimodal_gen/test/run_suite.py --suite component-accuracy-1-gpu ...
python3 sglang/multimodal_gen/test/run_suite.py --suite component-accuracy-2-gpu ...
with partitions [0, 1], so the runner remains the single entrypoint and autopartition is preserved.
High-Level Flow
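As a rough illustration of splitting a suite's files across partitions [0, 1], the sketch below assigns files round-robin after sorting. This is an assumption for illustration: the runner's actual autopartition logic is not shown in this PR description and may balance differently (for example, by estimated runtime). The function name `partition_files` and the file names are hypothetical.

```python
from typing import List


def partition_files(files: List[str], partition_id: int, total_partitions: int) -> List[str]:
    """Return the files assigned to one partition (hypothetical round-robin scheme)."""
    if total_partitions < 1:
        raise ValueError("total_partitions must be at least 1")
    if not 0 <= partition_id < total_partitions:
        raise ValueError("partition_id must be in [0, total_partitions)")
    # Sort first so every CI shard computes the same assignment
    # regardless of the order in which files were discovered.
    return sorted(files)[partition_id::total_partitions]


# Placeholder file names, not the real suite contents.
suite_files = ["test_c.py", "test_a.py", "test_b.py"]
shard_0 = partition_files(suite_files, 0, 2)
shard_1 = partition_files(suite_files, 1, 2)
```

Each file lands in exactly one shard, so every shard can run its files one at a time in separate pytest processes without overlap.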
flowchart TD
    A[pr-test-multimodal-gen.yml] --> B[run_suite.py]
    B --> C{Suite type}
    C -->|unit / 1-gpu / 2-gpu| D[Collect pytest items]
    D --> E[Partition by item]
    E --> F[Run plain pytest]
    C -->|component-accuracy-1-gpu| G[Resolve suite files]
    G --> H[Partition by file]
    H --> I[Run one file per pytest process]
    C -->|component-accuracy-2-gpu| J[Resolve suite files]
    J --> K[Partition by file]
    K --> L[Run one file per distributed pytest process]
    L --> M[python -m torch.distributed.run --nproc_per_node=2 -m pytest]
Accuracy Tests
Speed Tests and Profiling
Checklist
Review and Merge Process
/tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci