
[CI] Move CUDA 12.8 GPU tests from per-commit trunk to periodic#175067

Closed
seemethere wants to merge 3 commits into gh/seemethere/128/base from gh/seemethere/128/head

Conversation

@seemethere (Member) commented Feb 16, 2026

Stack from ghstack (oldest at bottom):

Summary

Move CUDA 12.8 GPU tests from per-commit trunk CI to periodic (~3x/day on weekdays).

Both CUDA 12.8 and 13.0 are shipping wheel targets (nightly ships cu126, cu128, cu129, cu130), but their trunk CI test suites have 85-90% failure correlation -- they almost always fail together. Over a 30-day analysis window covering 97 reverts and 38 significant regression events, CUDA 12.8 never uniquely caught a regression that 13.0 missed.
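The "never uniquely caught" claim reduces to a simple per-commit comparison. A minimal sketch of that check, with made-up records standing in for the real 30-day job data:

```python
# Sketch of the "unique catch" check behind this PR: across reverted
# commits, did the CUDA 12.8 jobs ever fail where the CUDA 13.0 jobs
# passed? The records below are illustrative placeholders, not real data.

# (commit sha, cu128_failed, cu130_failed) for commits later reverted
revert_events = [
    ("abc123", True,  True),   # both streams flagged the regression
    ("def456", False, True),   # only 13.0 flagged it
    ("789aaa", True,  True),
    ("bbb000", False, False),  # caught by some other job entirely
]

unique_128_catches = [
    sha for sha, cu128_failed, cu130_failed in revert_events
    if cu128_failed and not cu130_failed
]
print(len(unique_128_catches))  # 0 for this sample, matching the finding
```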

CUDA 13.0 is kept per-commit because:

  • It is the newest shipping CUDA version
  • Most likely to surface novel breakage from new CUDA runtime behavior
  • Forward-looking CI should protect what's coming, not what's already stable

CUDA 12.8 is moved to periodic because:

  • It is mature and well-understood -- breakage is less likely and less urgent
  • The rare 12.8-only regression can tolerate the ~8-hour periodic detection window
  • The 12.8 build job remains in trunk because cross-compile-linux-test depends on its artifacts

Estimated savings: ~1,270 GPU-hours/week (~5,080 GPU-hours/month)

This is the #2 savings opportunity from a broader CI workflow analysis (P2188981399) covering 128 PR+trunk jobs over 30 days. Combined with #175066 (CycleGAN skip, ~310 GPU-hours/week), total savings from this stack: ~1,580 GPU-hours/week (~6,320 GPU-hours/month).
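The headline figure can be roughly reproduced from the shard counts, durations, and runner GPU counts given in the supporting data; the trunk commit rate used here (~40/week) is an assumption for illustration, not a number from the analysis:

```python
# Back-of-envelope for the ~1,270 GPU-hours/week estimate. Shard counts,
# durations, and runner GPU counts are from the PR's supporting data;
# the trunk commit rate is an ASSUMED figure for illustration only.

GPUS_PER_RUNNER = {"g6.4xlarge": 1, "g4dn.12xlarge": 4}

# (shards, avg minutes per shard, runner) for jobs leaving per-commit trunk
jobs = [
    (5, 125, "g6.4xlarge"),     # default test shards
    (3, 205, "g4dn.12xlarge"),  # distributed test shards (multi-GPU)
]

gpu_minutes = sum(s * mins * GPUS_PER_RUNNER[r] for s, mins, r in jobs)
gpu_hours_per_run = gpu_minutes / 60            # ~51 GPU-hours per commit

trunk_commits_per_week = 40                     # assumption
periodic_runs_per_week = 3 * 5                  # 3x/day, weekdays only

saved = gpu_hours_per_run * (trunk_commits_per_week - periodic_runs_per_week)
print(round(saved))                             # lands near ~1,270
```

Under these assumptions the net savings is the per-commit cost times the difference between trunk commits and the new periodic cadence.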

Changes

  • trunk.yml: remove CUDA 12.8 test job (5 default + 3 distributed + 1 pr_time_benchmarks + 1 libtorch shards) and no-ops build
  • periodic.yml: add default (5 GPU shards on g6.4xlarge) and distributed (3 multi-GPU shards on g4dn.12xlarge) to existing CUDA 12.8 periodic entry

Test Plan

  • CUDA 12.8 GPU tests continue to run in periodic (3x/day weekdays)
  • CUDA 13.0 per-commit coverage is unchanged
  • Cross-compile-linux-test continues to work (12.8 build job kept)

cc @pytorch/pytorch-dev-infra

@seemethere seemethere requested a review from a team as a code owner February 16, 2026 03:21
@pytorch-bot bot commented Feb 16, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/175067

Note: Links to docs will display an error until the docs builds have been completed.

⏳ No Failures, 33 Pending

As of commit ab2dcce with merge base 996c7d8:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

seemethere added a commit that referenced this pull request Feb 16, 2026
Both CUDA 12.8 and 13.0 are shipping wheel targets, but their trunk CI
test suites have 85-90% failure correlation -- they almost always fail
together. Over a 30-day analysis window, CUDA 12.8 never uniquely
caught a regression that 13.0 missed.

CUDA 13.0 is kept per-commit because it is the newest shipping version
and the most likely to surface novel breakage from new CUDA runtime
behavior. CUDA 12.8 is mature and well-understood; regressions there
can tolerate the ~8-hour periodic detection window.

The 12.8 build job remains in trunk because cross-compile-linux-test
depends on its artifacts.

Changes:
- trunk.yml: remove CUDA 12.8 test job and no-ops build
- periodic.yml: add default (5 GPU shards) and distributed (3 multi-GPU
  shards) to existing CUDA 12.8 periodic entry

Estimated savings: ~1,270 GPU-hours/week.
See P2188981399 for the full CI workflow analysis.


ghstack-source-id: c9cb69b
Pull-Request: #175067
@pytorch-bot pytorch-bot bot added the topic: not user facing topic category label Feb 16, 2026
seemethere added a commit that referenced this pull request Feb 16, 2026
ghstack-source-id: bdf62ca
Pull-Request: #175067
@seemethere (Member, Author):

Supporting Data

Failure correlation (30-day analysis, Jan 15 - Feb 15 2026)

| Metric | Value |
|---|---|
| CUDA 12.8 vs 13.0 correlation (distributed tests) | ~0.90 |
| CUDA 12.8 vs 13.0 correlation (default tests) | ~0.85 |
| CUDA 12.8 unique regression catches | 0 out of 38 significant revert events |
| Total reverts analyzed | 97 (50 autorevert + 47 human) |
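The correlation figures above are plain Pearson correlations over per-commit pass/fail indicators. A self-contained sketch with made-up failure vectors (the real inputs came from 30 days of trunk job conclusions):

```python
# Pearson correlation between two 0/1 failure vectors, one entry per
# trunk commit. Data below is illustrative, not the real 30-day sample.

def failure_correlation(a, b):
    """Pearson correlation of two equal-length 0/1 sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / n
    var_a = sum((x - mean_a) ** 2 for x in a) / n
    var_b = sum((y - mean_b) ** 2 for y in b) / n
    return cov / (var_a * var_b) ** 0.5

# 1 = job failed on that commit, 0 = passed (illustrative)
cu128 = [0, 0, 1, 1, 0, 1, 0, 0, 1, 0]
cu130 = [0, 0, 1, 1, 0, 1, 0, 1, 1, 0]
print(round(failure_correlation(cu128, cu130), 2))  # high, ~0.82 here
```

Streams that fail in lockstep approach 1.0; the observed ~0.85-0.90 means a 12.8 failure almost always arrives together with a 13.0 failure.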

Per-commit compute removed from trunk

| Job | Shards | Runner | Avg Duration | Cost Weight |
|---|---|---|---|---|
| CUDA 12.8 default tests | 5 | g6.4xlarge (GPU) | ~125 min each | 5x |
| CUDA 12.8 distributed tests | 3 | g4dn.12xlarge (GPU) | ~205 min each | 10x |
| CUDA 12.8 no-ops build | 1 | CPU | ~5 min | 1x |

Why keep 13.0 per-commit instead of 12.8?

Nightly wheels ship 4 CUDA versions: cu126, cu128, cu129, cu130. CUDA 13.0 is the newest and most likely to surface novel breakage from new CUDA runtime behavior. CUDA 12.8 is mature -- the ~8-hour periodic detection window is acceptable for the rare 12.8-only regression.

@huydhn (Contributor) commented Feb 16, 2026

Probably need @malfet and @atalman to chime in here. I thought that we want to keep the main CUDA version in trunk, which is still 12.8 at this time

@malfet (Contributor) commented Feb 17, 2026

> Probably need @malfet and @atalman to chime in here. I thought that we want to keep the main CUDA version in trunk, which is still 12.8 at this time

If the CUDA 13.0 and CUDA 12.8 tests are running against the same hardware, I don't see a good reason to keep both tests around.

@seemethere (Member, Author):

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Feb 17, 2026
@pytorchmergebot (Collaborator):

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

seemethere added a commit that referenced this pull request Feb 17, 2026
ghstack-source-id: 54729dc
Pull-Request: #175067
@seemethere (Member, Author):

@pytorchbot merge -f 'This is just moving stuff around / removing dead benchmarks'

@pytorchmergebot (Collaborator):

The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job was waiting for more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.
For more information see pytorch-bot wiki.

@pytorchmergebot (Collaborator):

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as a last resort; instead, consider -i/--ignore-current to continue the merge while ignoring current failures. That allows currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@atalman (Contributor) commented Feb 19, 2026

@pytorchbot cherry-pick --onto release/2.11 -c critical

pytorchbot pushed a commit that referenced this pull request Feb 19, 2026

Pull Request resolved: #175067
Approved by: https://github.com/malfet
ghstack dependencies: #175066

(cherry picked from commit ef0353f)
@pytorchbot (Collaborator):

Cherry picking #175067

The cherry pick PR is at #175300; it is recommended to link a critical cherry-pick PR with an issue.

atalman pushed a commit that referenced this pull request Feb 19, 2026

[CI] Move CUDA 12.8 GPU tests from per-commit trunk to periodic (#175067)

Pull Request resolved: #175067
Approved by: https://github.com/malfet
ghstack dependencies: #175066

(cherry picked from commit ef0353f)

Co-authored-by: Eli Uriegas <eliuriegas@meta.com>
