
[CI] Move CUDA 12.8 GPU tests from per-commit trunk to periodic#175067

Closed
seemethere wants to merge 3 commits into gh/seemethere/128/base from gh/seemethere/128/head

Conversation

@seemethere (Member) commented Feb 16, 2026

Stack from ghstack (oldest at bottom):

Summary

Move CUDA 12.8 GPU tests from per-commit trunk CI to periodic (~3x/day on weekdays).

Both CUDA 12.8 and 13.0 are shipping wheel targets (nightly ships cu126, cu128, cu129, cu130), but their trunk CI test suites have 85-90% failure correlation -- they almost always fail together. Over a 30-day analysis window covering 97 reverts and 38 significant regression events, CUDA 12.8 never uniquely caught a regression that 13.0 missed.
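The "never uniquely caught" claim reduces to a simple per-commit comparison. A minimal sketch of that check, with made-up records standing in for the real 30-day job data:

```python
# Sketch of the "unique catch" check behind this PR: across reverted
# commits, did the CUDA 12.8 jobs ever fail where the CUDA 13.0 jobs
# passed? The records below are illustrative placeholders, not real data.

# (commit sha, cu128_failed, cu130_failed) for commits later reverted
revert_events = [
    ("abc123", True,  True),   # both streams flagged the regression
    ("def456", False, True),   # only 13.0 flagged it
    ("789aaa", True,  True),
    ("bbb000", False, False),  # caught by some other job entirely
]

unique_128_catches = [
    sha for sha, cu128_failed, cu130_failed in revert_events
    if cu128_failed and not cu130_failed
]
print(len(unique_128_catches))  # 0 for this sample, matching the finding
```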

CUDA 13.0 is kept per-commit because:

  • It is the newest shipping CUDA version
  • Most likely to surface novel breakage from new CUDA runtime behavior
  • Forward-looking CI should protect what's coming, not what's already stable

CUDA 12.8 is moved to periodic because:

  • It is mature and well-understood -- breakage is less likely and less urgent
  • The rare 12.8-only regression can tolerate the ~8-hour periodic detection window
  • The 12.8 build job remains in trunk because cross-compile-linux-test depends on its artifacts

Estimated savings: ~1,270 GPU-hours/week (~5,080 GPU-hours/month)

This is the #2 savings opportunity from a broader CI workflow analysis (P2188981399) covering 128 PR+trunk jobs over 30 days. Combined with #175066 (CycleGAN skip, ~310 GPU-hours/week), total savings from this stack: ~1,580 GPU-hours/week (~6,320 GPU-hours/month).
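The headline figure can be roughly reproduced from the shard counts, durations, and runner GPU counts given in the supporting data; the trunk commit rate used here (~40/week) is an assumption for illustration, not a number from the analysis:

```python
# Back-of-envelope for the ~1,270 GPU-hours/week estimate. Shard counts,
# durations, and runner GPU counts are from the PR's supporting data;
# the trunk commit rate is an ASSUMED figure for illustration only.

GPUS_PER_RUNNER = {"g6.4xlarge": 1, "g4dn.12xlarge": 4}

# (shards, avg minutes per shard, runner) for jobs leaving per-commit trunk
jobs = [
    (5, 125, "g6.4xlarge"),     # default test shards
    (3, 205, "g4dn.12xlarge"),  # distributed test shards (multi-GPU)
]

gpu_minutes = sum(s * mins * GPUS_PER_RUNNER[r] for s, mins, r in jobs)
gpu_hours_per_run = gpu_minutes / 60            # ~51 GPU-hours per commit

trunk_commits_per_week = 40                     # assumption
periodic_runs_per_week = 3 * 5                  # 3x/day, weekdays only

saved = gpu_hours_per_run * (trunk_commits_per_week - periodic_runs_per_week)
print(round(saved))                             # lands near ~1,270
```

Under these assumptions the net savings is the per-commit cost times the difference between trunk commits and the new periodic cadence.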

Changes

  • trunk.yml: remove CUDA 12.8 test job (5 default + 3 distributed + 1 pr_time_benchmarks + 1 libtorch shards) and no-ops build
  • periodic.yml: add default (5 GPU shards on g6.4xlarge) and distributed (3 multi-GPU shards on g4dn.12xlarge) to existing CUDA 12.8 periodic entry

Test Plan

  • CUDA 12.8 GPU tests continue to run in periodic (3x/day weekdays)
  • CUDA 13.0 per-commit coverage is unchanged
  • Cross-compile-linux-test continues to work (12.8 build job kept)

cc @pytorch/pytorch-dev-infra

@seemethere seemethere requested a review from a team as a code owner February 16, 2026 03:21
@pytorch-bot bot commented Feb 16, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/175067

Note: Links to docs will display an error until the docs builds have been completed.

⏳ No Failures, 33 Pending

As of commit ab2dcce with merge base 996c7d8:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

seemethere added a commit that referenced this pull request Feb 16, 2026
Both CUDA 12.8 and 13.0 are shipping wheel targets, but their trunk CI
test suites have 85-90% failure correlation -- they almost always fail
together. Over a 30-day analysis window, CUDA 12.8 never uniquely
caught a regression that 13.0 missed.

CUDA 13.0 is kept per-commit because it is the newest shipping version
and the most likely to surface novel breakage from new CUDA runtime
behavior. CUDA 12.8 is mature and well-understood; regressions there
can tolerate the ~8-hour periodic detection window.

The 12.8 build job remains in trunk because cross-compile-linux-test
depends on its artifacts.

Changes:
- trunk.yml: remove CUDA 12.8 test job and no-ops build
- periodic.yml: add default (5 GPU shards) and distributed (3 multi-GPU
  shards) to existing CUDA 12.8 periodic entry

Estimated savings: ~1,270 GPU-hours/week.
See P2188981399 for the full CI workflow analysis.


ghstack-source-id: c9cb69b
Pull-Request: #175067
@pytorch-bot pytorch-bot bot added the topic: not user facing topic category label Feb 16, 2026
seemethere added a commit that referenced this pull request Feb 16, 2026
ghstack-source-id: bdf62ca
Pull-Request: #175067
@seemethere (Member, Author):

Supporting Data

Failure correlation (30-day analysis, Jan 15 - Feb 15 2026)

| Metric | Value |
|---|---|
| CUDA 12.8 vs 13.0 correlation (distributed tests) | ~0.90 |
| CUDA 12.8 vs 13.0 correlation (default tests) | ~0.85 |
| CUDA 12.8 unique regression catches | 0 out of 38 significant revert events |
| Total reverts analyzed | 97 (50 autorevert + 47 human) |
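The correlation figures above are plain Pearson correlations over per-commit pass/fail indicators. A self-contained sketch with made-up failure vectors (the real inputs came from 30 days of trunk job conclusions):

```python
# Pearson correlation between two 0/1 failure vectors, one entry per
# trunk commit. Data below is illustrative, not the real 30-day sample.

def failure_correlation(a, b):
    """Pearson correlation of two equal-length 0/1 sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / n
    var_a = sum((x - mean_a) ** 2 for x in a) / n
    var_b = sum((y - mean_b) ** 2 for y in b) / n
    return cov / (var_a * var_b) ** 0.5

# 1 = job failed on that commit, 0 = passed (illustrative)
cu128 = [0, 0, 1, 1, 0, 1, 0, 0, 1, 0]
cu130 = [0, 0, 1, 1, 0, 1, 0, 1, 1, 0]
print(round(failure_correlation(cu128, cu130), 2))  # high, ~0.82 here
```

Streams that fail in lockstep approach 1.0; the observed ~0.85-0.90 means a 12.8 failure almost always arrives together with a 13.0 failure.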

Per-commit compute removed from trunk

| Job | Shards | Runner | Avg Duration | Cost Weight |
|---|---|---|---|---|
| CUDA 12.8 default tests | 5 | g6.4xlarge (GPU) | ~125 min each | 5x |
| CUDA 12.8 distributed tests | 3 | g4dn.12xlarge (GPU) | ~205 min each | 10x |
| CUDA 12.8 no-ops build | 1 | CPU | ~5 min | 1x |

Why keep 13.0 per-commit instead of 12.8?

Nightly wheels ship 4 CUDA versions: cu126, cu128, cu129, cu130. CUDA 13.0 is the newest and most likely to surface novel breakage from new CUDA runtime behavior. CUDA 12.8 is mature -- the ~8-hour periodic detection window is acceptable for the rare 12.8-only regression.

@huydhn (Contributor) commented Feb 16, 2026

Probably need @malfet and @atalman to chime in here. I thought that we want to keep the main CUDA version in trunk, which is still 12.8 at this time

@malfet (Contributor) commented Feb 17, 2026

> Probably need @malfet and @atalman to chime in here. I thought that we want to keep the main CUDA version in trunk, which is still 12.8 at this time

If the CUDA 13.0 and CUDA 12.8 tests are running against the same hardware, I don't see a good reason to keep both tests around.

@seemethere (Member, Author):

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Feb 17, 2026
@pytorchmergebot (Collaborator):

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

seemethere added a commit that referenced this pull request Feb 17, 2026
ghstack-source-id: 54729dc
Pull-Request: #175067
@seemethere (Member, Author):

@pytorchbot merge -f 'This is just moving stuff around / removing dead benchmarks'

@pytorchmergebot (Collaborator):

The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job was waiting for more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.
For more information see pytorch-bot wiki.

@pytorchmergebot (Collaborator):

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as a last resort; instead, consider -i/--ignore-current to continue the merge while ignoring current failures. That allows currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@atalman (Contributor) commented Feb 19, 2026

@pytorchbot cherry-pick --onto release/2.11 -c critical

pytorchbot pushed a commit that referenced this pull request Feb 19, 2026

Pull Request resolved: #175067
Approved by: https://github.com/malfet
ghstack dependencies: #175066

(cherry picked from commit ef0353f)
@pytorchbot (Collaborator):

Cherry picking #175067

The cherry pick PR is at #175300; it is recommended to link a critical cherry-pick PR with an issue.

atalman pushed a commit that referenced this pull request Feb 19, 2026

[CI] Move CUDA 12.8 GPU tests from per-commit trunk to periodic (#175067)

Pull Request resolved: #175067
Approved by: https://github.com/malfet
ghstack dependencies: #175066

(cherry picked from commit ef0353f)

Co-authored-by: Eli Uriegas <eliuriegas@meta.com>
