[CI] Move CUDA 12.8 GPU tests from per-commit trunk to periodic#175067
seemethere wants to merge 3 commits into gh/seemethere/128/base from
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/175067
Note: Links to docs will display an error until the docs builds have been completed. ⏳ No Failures, 33 Pending. As of commit ab2dcce with merge base 996c7d8. This comment was automatically generated by Dr. CI and updates every 15 minutes.
Both CUDA 12.8 and 13.0 are shipping wheel targets, but their trunk CI test suites have 85-90% failure correlation -- they almost always fail together. Over a 30-day analysis window, CUDA 12.8 never uniquely caught a regression that 13.0 missed. CUDA 13.0 is kept per-commit because it is the newest shipping version and the most likely to surface novel breakage from new CUDA runtime behavior. CUDA 12.8 is mature and well-understood; regressions there can tolerate the ~8-hour periodic detection window. The 12.8 build job remains in trunk because cross-compile-linux-test depends on its artifacts.

Changes:
- trunk.yml: remove CUDA 12.8 test job and no-ops build
- periodic.yml: add default (5 GPU shards) and distributed (3 multi-GPU shards) to existing CUDA 12.8 periodic entry

Estimated savings: ~1,270 GPU-hours/week. See P2188981399 for the full CI workflow analysis.

ghstack-source-id: c9cb69b
Pull-Request: #175067
Supporting Data

Failure correlation (30-day analysis, Jan 15 - Feb 15, 2026)
Per-commit compute removed from trunk
Why keep 13.0 per-commit instead of 12.8?

Nightly wheels ship 4 CUDA versions: cu126, cu128, cu129, cu130. CUDA 13.0 is the newest and most likely to surface novel breakage from new CUDA runtime behavior. CUDA 12.8 is mature -- the ~8-hour periodic detection window is acceptable for the rare 12.8-only regression.
@pytorchbot merge
Merge started: Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team
@pytorchbot merge -f 'This is just moving stuff around / removing dead benchmarks'
The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job was waiting for more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.
Merge started: Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team
@pytorchbot cherry-pick --onto release/2.11 -c critical
## Summary

Move CUDA 12.8 GPU tests from per-commit trunk CI to periodic (~3x/day on weekdays).

Both CUDA 12.8 and 13.0 are shipping wheel targets (nightly ships cu126, cu128, cu129, cu130), but their trunk CI test suites have **85-90% failure correlation** -- they almost always fail together. Over a 30-day analysis window covering 97 reverts and 38 significant regression events, **CUDA 12.8 never uniquely caught a regression that 13.0 missed**.

CUDA 13.0 is kept per-commit because:

- It is the **newest** shipping CUDA version
- Most likely to surface **novel breakage** from new CUDA runtime behavior
- Forward-looking CI should protect what's coming, not what's already stable

CUDA 12.8 is moved to periodic because:

- It is **mature and well-understood** -- breakage is less likely and less urgent
- The rare 12.8-only regression can tolerate the ~8-hour periodic detection window
- The 12.8 build job **remains in trunk** because `cross-compile-linux-test` depends on its artifacts

**Estimated savings: ~1,270 GPU-hours/week (~5,080 GPU-hours/month)**

This is the #2 savings opportunity from a broader CI workflow analysis (P2188981399) covering 128 PR+trunk jobs over 30 days. Combined with #175066 (CycleGAN skip, ~310 GPU-hours/week), total savings from this stack: **~1,580 GPU-hours/week (~6,320 GPU-hours/month)**.

### Changes

- `trunk.yml`: remove CUDA 12.8 test job (5 default + 3 distributed + 1 pr_time_benchmarks + 1 libtorch shards) and no-ops build
- `periodic.yml`: add default (5 GPU shards on g6.4xlarge) and distributed (3 multi-GPU shards on g4dn.12xlarge) to existing CUDA 12.8 periodic entry

## Test Plan

- CUDA 12.8 GPU tests continue to run in periodic (3x/day weekdays)
- CUDA 13.0 per-commit coverage is unchanged
- Cross-compile-linux-test continues to work (12.8 build job kept)

Pull Request resolved: #175067
Approved by: https://github.com/malfet
ghstack dependencies: #175066
(cherry picked from commit ef0353f)
Cherry picking #175067: The cherry pick PR is at #175300, and it is recommended to link a critical cherry pick PR with an issue. The following tracker issues are updated:

Details for Dev Infra team: Raised by workflow job
[CI] Move CUDA 12.8 GPU tests from per-commit trunk to periodic (#175067)

Pull Request resolved: #175067
Approved by: https://github.com/malfet
ghstack dependencies: #175066
(cherry picked from commit ef0353f)

Co-authored-by: Eli Uriegas <eliuriegas@meta.com>
Stack from ghstack (oldest at bottom):
Summary
Move CUDA 12.8 GPU tests from per-commit trunk CI to periodic (~3x/day on weekdays).
Both CUDA 12.8 and 13.0 are shipping wheel targets (nightly ships cu126, cu128, cu129, cu130), but their trunk CI test suites have 85-90% failure correlation -- they almost always fail together. Over a 30-day analysis window covering 97 reverts and 38 significant regression events, CUDA 12.8 never uniquely caught a regression that 13.0 missed.
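The correlation figure above can be sanity-checked in spirit from per-commit pass/fail outcomes. The sketch below uses made-up data and a simple conditional-failure metric; all outcomes and numbers are illustrative assumptions, not data from the P2188981399 analysis.

```python
# Illustrative only: estimate how often two CI jobs fail together, given
# per-commit pass/fail outcomes. The outcome lists below are made up.

def failure_correlation(a, b):
    """Of the commits where job `a` failed, the fraction where job `b` also failed."""
    a_fail = [i for i, ok in enumerate(a) if not ok]
    if not a_fail:
        return 0.0
    both = sum(1 for i in a_fail if not b[i])
    return both / len(a_fail)

# True = job passed on that commit, False = failed (hypothetical outcomes).
cu128 = [True, True, False, True, False, False, True, True, False, True]
cu130 = [True, True, False, True, False, False, True, True, True, True]

print(failure_correlation(cu128, cu130))  # 0.75

# Uniquely-caught regressions: commits where 12.8 failed but 13.0 passed.
unique_128 = [i for i, (x, y) in enumerate(zip(cu128, cu130)) if not x and y]
print(unique_128)  # [8]
```

In the real 30-day data this conditional-failure rate was 85-90%, and the `unique_128` list was empty, which is the whole argument for demoting 12.8.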
CUDA 13.0 is kept per-commit because:

- It is the newest shipping CUDA version
- Most likely to surface novel breakage from new CUDA runtime behavior
- Forward-looking CI should protect what's coming, not what's already stable

CUDA 12.8 is moved to periodic because:

- It is mature and well-understood -- breakage is less likely and less urgent
- The rare 12.8-only regression can tolerate the ~8-hour periodic detection window
- The 12.8 build job remains in trunk because cross-compile-linux-test depends on its artifacts

Estimated savings: ~1,270 GPU-hours/week (~5,080 GPU-hours/month)
This is the #2 savings opportunity from a broader CI workflow analysis (P2188981399) covering 128 PR+trunk jobs over 30 days. Combined with #175066 (CycleGAN skip, ~310 GPU-hours/week), total savings from this stack: ~1,580 GPU-hours/week (~6,320 GPU-hours/month).
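As a back-of-envelope check, the savings arithmetic is roughly (shards removed from trunk × GPU-hours per shard × trunk runs per week) minus the cost of the new periodic runs. Every input below is a hypothetical round number chosen for illustration; the actual estimate comes from the P2188981399 analysis, not this sketch.

```python
# Hypothetical inputs -- NOT the numbers from the actual CI analysis.
trunk_runs_per_week = 100      # assumed commits triggering trunk CI per week
periodic_runs_per_week = 15    # ~3x/day on weekdays
removed_shards = 10            # 5 default + 3 distributed + 1 pr_time_benchmarks + 1 libtorch
added_periodic_shards = 8      # 5 default + 3 distributed added to periodic
gpu_hours_per_shard = 1.5      # assumed average GPU-hours per test shard

removed = removed_shards * gpu_hours_per_shard * trunk_runs_per_week
added = added_periodic_shards * gpu_hours_per_shard * periodic_runs_per_week
net = removed - added
print(net)  # 1320.0 net GPU-hours/week under these assumptions
```

With these assumed inputs the net lands near the ~1,270 GPU-hours/week quoted above, but that resemblance is by construction of the example.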
Changes
- trunk.yml: remove CUDA 12.8 test job (5 default + 3 distributed + 1 pr_time_benchmarks + 1 libtorch shards) and no-ops build
- periodic.yml: add default (5 GPU shards on g6.4xlarge) and distributed (3 multi-GPU shards on g4dn.12xlarge) to existing CUDA 12.8 periodic entry

Test Plan

- CUDA 12.8 GPU tests continue to run in periodic (3x/day weekdays)
- CUDA 13.0 per-commit coverage is unchanged
- Cross-compile-linux-test continues to work (12.8 build job kept)
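For a sense of the shape of the periodic.yml change, PyTorch workflows typically pass a test-matrix of shard configs to a reusable test job. The fragment below is an illustrative sketch under that convention, not the actual diff; the runner labels and exact keys are assumptions.

```yaml
# Sketch only -- runner labels and keys are assumed, not copied from the PR.
# Shards of the CUDA 12.8 test job being added to the periodic entry:
test-matrix: |
  { include: [
    { config: "default", shard: 1, num_shards: 5, runner: "linux.g6.4xlarge.nvidia.gpu" },
    { config: "default", shard: 2, num_shards: 5, runner: "linux.g6.4xlarge.nvidia.gpu" },
    { config: "default", shard: 3, num_shards: 5, runner: "linux.g6.4xlarge.nvidia.gpu" },
    { config: "default", shard: 4, num_shards: 5, runner: "linux.g6.4xlarge.nvidia.gpu" },
    { config: "default", shard: 5, num_shards: 5, runner: "linux.g6.4xlarge.nvidia.gpu" },
    { config: "distributed", shard: 1, num_shards: 3, runner: "linux.g4dn.12xlarge.nvidia.gpu" },
    { config: "distributed", shard: 2, num_shards: 3, runner: "linux.g4dn.12xlarge.nvidia.gpu" },
    { config: "distributed", shard: 3, num_shards: 3, runner: "linux.g4dn.12xlarge.nvidia.gpu" },
  ]}
```

The distributed shards target multi-GPU instances (g4dn.12xlarge carries 4 GPUs) while the default shards run on single-GPU g6.4xlarge runners, matching the shard counts listed in the Changes section.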
cc @pytorch/pytorch-dev-infra