[CI] Move CUDA 12.8 GPU tests from per-commit trunk to periodic#175300

Merged
atalman merged 1 commit into release/2.11 from cherry-pick-175067-by-pytorch_bot_bot_ on Feb 19, 2026
Conversation

@pytorchbot (Collaborator)

Stack from ghstack (oldest at bottom):

cc @pytorch/pytorch-dev-infra

## Summary

Move CUDA 12.8 GPU tests from per-commit trunk CI to periodic (~3x/day on weekdays).

Both CUDA 12.8 and 13.0 are shipping wheel targets (nightly ships cu126, cu128, cu129, cu130), but their trunk CI test suites have **85-90% failure correlation** -- they almost always fail together. Over a 30-day analysis window covering 97 reverts and 38 significant regression events, **CUDA 12.8 never uniquely caught a regression that 13.0 missed**.
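The correlation claim above can be sketched as a small check: given per-commit pass/fail outcomes for the two trunk jobs, measure how often they fail together and whether cu128 ever fails alone. The function and the toy data below are illustrative, not taken from the actual 30-day analysis:

```python
# Hypothetical sketch of the failure-correlation check described above.
# Inputs are per-commit booleans: True means the job failed on that commit.

def failure_overlap(cu128_fail, cu130_fail):
    """Return (overlap ratio of joint failures, count of cu128-only failures)."""
    both = sum(a and b for a, b in zip(cu128_fail, cu130_fail))
    either = sum(a or b for a, b in zip(cu128_fail, cu130_fail))
    only_128 = sum(a and not b for a, b in zip(cu128_fail, cu130_fail))
    overlap = both / either if either else 1.0
    return overlap, only_128

# Toy data for 10 commits: failures almost always co-occur.
cu128 = [False, True, False, True, False, False, True, False, False, False]
cu130 = [False, True, False, True, False, False, True, False, True, False]
overlap, unique = failure_overlap(cu128, cu130)
print(f"overlap: {overlap:.0%}, cu128-only regressions: {unique}")  # overlap: 75%, cu128-only regressions: 0
```

A high overlap with zero cu128-only failures is the situation where demoting the redundant job costs essentially no detection coverage.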

CUDA 13.0 is kept per-commit because:
- It is the **newest** shipping CUDA version
- Most likely to surface **novel breakage** from new CUDA runtime behavior
- Forward-looking CI should protect what's coming, not what's already stable

CUDA 12.8 is moved to periodic because:
- It is **mature and well-understood** -- breakage is less likely and less urgent
- The rare 12.8-only regression can tolerate the ~8-hour periodic detection window
- The 12.8 build job **remains in trunk** because `cross-compile-linux-test` depends on its artifacts

**Estimated savings: ~1,270 GPU-hours/week (~5,080 GPU-hours/month)**

This is the #2 savings opportunity from a broader CI workflow analysis (P2188981399) covering 128 PR+trunk jobs over 30 days. Combined with #175066 (CycleGAN skip, ~310 GPU-hours/week), total savings from this stack: **~1,580 GPU-hours/week (~6,320 GPU-hours/month)**.
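The combined figures above follow from simple addition (assuming the monthly numbers are just 4x the weekly ones):

```python
# Arithmetic behind the savings estimates quoted above.
cu128_weekly = 1270     # GPU-hours/week saved by this PR
cyclegan_weekly = 310   # GPU-hours/week saved by #175066 (CycleGAN skip)
stack_weekly = cu128_weekly + cyclegan_weekly
print(stack_weekly, stack_weekly * 4, cu128_weekly * 4)  # 1580 6320 5080
```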

### Changes
- `trunk.yml`: remove the CUDA 12.8 test jobs (5 default + 3 distributed + 1 pr_time_benchmarks + 1 libtorch shard); the build job is kept as a no-op so dependent jobs still receive its artifacts
- `periodic.yml`: add default (5 GPU shards on g6.4xlarge) and distributed (3 multi-GPU shards on g4dn.12xlarge) to existing CUDA 12.8 periodic entry
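For orientation, a periodic test-matrix entry in PyTorch workflows has roughly the shape sketched below. The job name, runner labels, and exact syntax here are guesses for illustration only; the real `periodic.yml` is authoritative:

```yaml
# Hypothetical sketch only -- real job names and runner labels differ.
linux-cuda12_8-test:
  test-matrix: |
    { include: [
      { config: "default", shard: 1, num_shards: 5, runner: "linux.g6.4xlarge" },
      { config: "distributed", shard: 1, num_shards: 3, runner: "linux.g4dn.12xlarge" },
    ]}
```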

## Test Plan

- CUDA 12.8 GPU tests continue to run in periodic (3x/day weekdays)
- CUDA 13.0 per-commit coverage is unchanged
- Cross-compile-linux-test continues to work (12.8 build job kept)

Pull Request resolved: #175067
Approved by: https://github.com/malfet
ghstack dependencies: #175066

(cherry picked from commit ef0353f)
@pytorchbot pytorchbot requested a review from a team as a code owner February 19, 2026 01:27

pytorch-bot bot commented Feb 19, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/175300

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit d807fca with merge base 0fd766e:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@atalman (Contributor) left a comment:

lgtm

@atalman atalman merged commit d80a584 into release/2.11 Feb 19, 2026
110 checks passed