[windows][smoke test] Add an option to install cuda if required cuda/cudnn on windows AMI do not match#177273

Closed
atalman wants to merge 5 commits into pytorch:main from atalman:fix_cudnn_test

Conversation

@atalman
Contributor

@atalman atalman commented Mar 12, 2026

Followup fix for #167242
After #175547, the cuDNN version on the Windows AMI no longer matches the intended cuDNN version, so a follow-up fix is required. With this approach we can update cuDNN on Windows without updating the Windows AMI; the AMI update can be done later.

Fixes failure during test: https://github.com/pytorch/pytorch/actions/runs/22979336872/job/66724249070

RuntimeError: cuDNN version incompatibility: PyTorch was compiled  against (9, 19, 0) but found runtime version (9, 10, 2). PyTorch already comes bundled with cuDNN. One option to resolving this error is to ensure PyTorch can find the bundled cuDNN. 

Please note: The cuda/cudnn version will not be updated if the right version is already installed.
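
The skip-if-already-correct behaviour noted above can be sketched roughly as follows. This is an illustration in Python only, not the batch logic in cuda_install.bat; the function name `needs_cudnn_install` is hypothetical, and the version tuples are the ones from the error message above.

```python
from typing import Tuple

Version = Tuple[int, int, int]


def needs_cudnn_install(expected: Version, found: Version) -> bool:
    """Return True when the cuDNN found on the machine differs from the expected one."""
    return expected != found


# Versions taken from the RuntimeError quoted above.
compiled_against = (9, 19, 0)
runtime_on_ami = (9, 10, 2)

print(needs_cudnn_install(compiled_against, runtime_on_ami))    # mismatch: install
print(needs_cudnn_install(compiled_against, compiled_against))  # match: skip
```

An exact-equality check is assumed here because the error message compares full (major, minor, patch) tuples; a looser policy would need a different comparison.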

Test Plan: In CI via ciflow/binaries

@atalman atalman requested a review from a team as a code owner March 12, 2026 15:39
@pytorch-bot

pytorch-bot bot commented Mar 12, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/177273

Note: Links to docs will display an error until the docs builds have been completed.

❌ 3 New Failures, 36 Pending

As of commit 0904455 with merge base e05e600 (image):

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot

pytorch-bot bot commented Mar 12, 2026

This PR needs a release notes: label

If your changes are user facing and intended to be a part of release notes, please use a label starting with release notes:.

If not, please add the topic: not user facing label.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "topic: not user facing"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

@atalman atalman added the ciflow/binaries Trigger all binary build and upload jobs on the PR label Mar 12, 2026

pushd %SRC_DIR%\..

if not "%CUDA_VERSION%" == "cpu" if not "%CUDA_VERSION%" == "xpu" call internal\cuda_install.bat
Collaborator

shouldn't we install the cuda driver before the cuda toolkit?

Contributor Author

Done

Contributor

@malfet malfet left a comment

This is probably fine, but can you please ask Claude to provide a more comprehensive review of the repeated checks and of the validation that ensures the cuda/cudnn versions installed during the "smoke check" will not be out of sync with the cuda/cudnn versions we install on Linux, etc.


pushd %SRC_DIR%\..

if not "%CUDA_VERSION%" == "cpu" if not "%CUDA_VERSION%" == "xpu" call internal\cuda_install.bat
Contributor

First of all, you already have this check, don't you:

if "%CUDA_VERSION%" == "cpu" (

@malfet
Contributor

malfet commented Mar 12, 2026

@claude review this please

@claude

claude bot commented Mar 12, 2026

Claude finished @malfet's task in 2m 13s —— View job


Review of PR #177273

  • Read the changed file and gather context
  • Read related CUDA install scripts for duplication/sync analysis
  • Provide comprehensive review

Summary

This PR adds a call to cuda_install.bat at the beginning of smoke_test.bat (before driver_update.bat) to handle the case where the pre-installed CUDA/cuDNN on a Windows AMI doesn't match what PyTorch was compiled against.

Issues

1. Ordering: cuda_install.bat is called before driver_update.bat

As @albanD pointed out, the CUDA toolkit is being installed before the GPU driver. cuda_install.bat is now called at line 5, and driver_update.bat at line 8. Generally, the NVIDIA driver should be installed before or alongside the CUDA toolkit. However, since cuda_install.bat only installs toolkit + cuDNN libraries (not runtime components that need the driver), and the smoke tests that actually use the GPU come later, this ordering is likely fine in practice. The toolkit installer may emit warnings but should succeed. Still, swapping the order to install the driver first would be more conventional and safer.

2. Duplicated cuDNN version mappings in cuda_install.bat — the key concern

cuda_install.bat defines cuDNN version-to-CUDA mappings in two separate places:

  • Lines 36–58 (the cuda_download path, used when CUDA is not pre-installed):

    • CUDA 12.6 → cuDNN 9.10.2.21
    • CUDA 12.8 → cuDNN 9.19.0.56
    • CUDA 12.9 → cuDNN 9.17.1.4
    • CUDA 13.0 → cuDNN 9.19.0.56
  • Lines 156–171 (the check_cudnn path, used when CUDA is pre-installed but cuDNN may be wrong):

    • Same mappings, but duplicated as separate if blocks with both CUDNN_FOLDER and EXPECTED_CUDNN_VERSION.

These two copies can drift independently. If someone updates one, they could easily forget the other. The existing validate_cudnn_version_consistency() in generate_binary_build_matrix.py (line 224) only parses the first mapping (the :cudaXXX label blocks at lines 36-58) via _parse_windows_cudnn_versions(). It does not validate the check_cudnn section (lines 156-171). So if someone updates the first block but forgets the second, the validation in generate_binary_build_matrix.py would pass while check_cudnn would still use stale versions.

Recommendation: Refactor cuda_install.bat to set CUDNN_FOLDER and EXPECTED_CUDNN_VERSION in a single place (the :cudaXXX blocks), then use those same variables in both the fresh-install and the check-cudnn-only paths. This would eliminate the duplication entirely. The check_cudnn section could reuse the variables already set by the jump-table at the top of the file rather than redefining them.
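
A minimal sketch of the single-source idea, written in Python for illustration rather than as a patch to cuda_install.bat. The table name `CUDNN_BY_CUDA_VER` and the helper `cudnn_settings` are hypothetical; the versions are the ones listed in this review, and the folder naming pattern is assumed from the CUDNN_FOLDER values quoted later in this thread.

```python
# Hypothetical single source of truth: CUDA_VER -> full cuDNN version.
CUDNN_BY_CUDA_VER = {
    "126": "9.10.2.21",
    "128": "9.19.0.56",
    "129": "9.17.1.4",
    "130": "9.19.0.56",
}


def cudnn_settings(cuda_ver: str) -> dict:
    """Derive both variables from one version string so they cannot drift apart."""
    full = CUDNN_BY_CUDA_VER[cuda_ver]
    cuda_major = cuda_ver[:2]  # "126" -> "12", "130" -> "13"
    return {
        "CUDNN_FOLDER": f"cudnn-windows-x86_64-{full}_cuda{cuda_major}-archive",
        # Expected version is the first three components of the full version.
        "EXPECTED_CUDNN_VERSION": ".".join(full.split(".")[:3]),
    }


print(cudnn_settings("128"))
```

Because both values are computed from the same entry, updating one CUDA variant means touching a single line.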

3. The PR change itself is redundant with existing logic

As @malfet noted, cuda_install.bat already has the cpu/xpu guard at lines 3-10. The conditional in smoke_test.bat:

if not "%CUDA_VERSION%" == "cpu" if not "%CUDA_VERSION%" == "xpu" call internal\cuda_install.bat

duplicates the guard that's already inside cuda_install.bat itself. This isn't harmful, but it's worth noting that cuda_install.bat could simply be called unconditionally (it exits early for cpu/xpu).

4. Cross-platform cuDNN version sync

The cuDNN versions between Linux (install_cuda.sh) and Windows (cuda_install.bat) are currently in sync — both use the same versions per CUDA variant. The validate_cudnn_version_consistency() function in generate_binary_build_matrix.py ensures this stays true at matrix-generation time. However, as noted in point 2, this validation only covers the first mapping in cuda_install.bat, not the duplicated check_cudnn mapping. This is a gap.
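
One way the gap could be closed is to compare the two mappings directly. The sketch below is not the existing validate_cudnn_version_consistency() function; `find_drift` is a hypothetical helper, and the stale 12.8 entry is invented purely to show what a drift report might look like.

```python
def find_drift(primary: dict, secondary: dict) -> list:
    """List CUDA versions whose cuDNN version differs or is missing between two mappings."""
    problems = []
    for cuda_ver in sorted(set(primary) | set(secondary)):
        a, b = primary.get(cuda_ver), secondary.get(cuda_ver)
        if a != b:
            problems.append(f"CUDA {cuda_ver}: {a} != {b}")
    return problems


# First mapping (download path), values as listed in this review.
download_path = {
    "12.6": "9.10.2.21",
    "12.8": "9.19.0.56",
    "12.9": "9.17.1.4",
    "13.0": "9.19.0.56",
}
# Second mapping (check_cudnn path); one entry deliberately made stale for illustration.
check_cudnn_path = dict(download_path, **{"12.8": "9.10.2.21"})

for line in find_drift(download_path, check_cudnn_path):
    print(line)
```

An empty result means the two copies agree; any entry in the list names the CUDA variant that needs attention.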

Verdict

The change itself is a reasonable quick fix for the immediate smoke test failure, but it introduces a pattern where cuda_install.bat is called from a new location without addressing the underlying duplication issue. The check_cudnn block (lines 151-212) duplicates cuDNN version mappings that aren't covered by the existing cross-platform validation, which is a latent bug risk.


@atalman atalman requested review from albanD and malfet March 12, 2026 16:17
@atalman
Contributor Author

atalman commented Mar 12, 2026

The duplicated cuDNN version mappings in cuda_install.bat are not a concern. This allows us to update the CUDA and cuDNN versions for Linux and Windows independently:

Windows is specified here:

if %CUDA_VER% EQU 126 (
    set CUDNN_FOLDER=cudnn-windows-x86_64-9.10.2.21_cuda12-archive
    set EXPECTED_CUDNN_VERSION=9.10.2
)
if %CUDA_VER% EQU 128 (
    set CUDNN_FOLDER=cudnn-windows-x86_64-9.19.0.56_cuda12-archive
    set EXPECTED_CUDNN_VERSION=9.19.0
)
if %CUDA_VER% EQU 129 (
    set CUDNN_FOLDER=cudnn-windows-x86_64-9.17.1.4_cuda12-archive
    set EXPECTED_CUDNN_VERSION=9.17.1
)
if %CUDA_VER% EQU 130 (
    set CUDNN_FOLDER=cudnn-windows-x86_64-9.19.0.56_cuda13-archive
    set EXPECTED_CUDNN_VERSION=9.19.0
)

While Linux x86 and aarch64 are specified here:

PYTORCH_EXTRA_INSTALL_REQUIREMENTS = {
"12.6": (
"cuda-toolkit[nvrtc,cudart,cupti,cufft,curand,cusolver,cusparse,cublas,cufile,nvjitlink,nvtx]==12.6.3; platform_system == 'Linux' | " # noqa: B950
"cuda-bindings>=12.9.4,<13; platform_system == 'Linux' | "
"nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | "
"nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | "
"nvidia-nccl-cu12==2.29.3; platform_system == 'Linux' | "
"nvidia-nvshmem-cu12==3.4.5; platform_system == 'Linux'"
),
"12.8": (
"cuda-toolkit[nvrtc,cudart,cupti,cufft,curand,cusolver,cusparse,cublas,cufile,nvjitlink,nvtx]==12.8.1; platform_system == 'Linux' | " # noqa: B950
"cuda-bindings>=12.9.4,<13; platform_system == 'Linux' | "
"nvidia-cudnn-cu12==9.19.0.56; platform_system == 'Linux' | "
"nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | "
"nvidia-nccl-cu12==2.29.3; platform_system == 'Linux' | "
"nvidia-nvshmem-cu12==3.4.5; platform_system == 'Linux'"
),
"12.9": (
"cuda-toolkit[nvrtc,cudart,cupti,cufft,curand,cusolver,cusparse,cublas,cufile,nvjitlink,nvtx]==12.9.1; platform_system == 'Linux' | " # noqa: B950
"cuda-bindings>=12.9.4,<13; platform_system == 'Linux' | "
"nvidia-cudnn-cu12==9.17.1.4; platform_system == 'Linux' | "
"nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | "
"nvidia-nccl-cu12==2.29.3; platform_system == 'Linux' | "
"nvidia-nvshmem-cu12==3.4.5; platform_system == 'Linux'"
),
"13.0": (
"cuda-toolkit[nvrtc,cudart,cupti,cufft,curand,cusolver,cusparse,cublas,cufile,nvjitlink,nvtx]==13.0.2; platform_system == 'Linux' | " # noqa: B950
"cuda-bindings>=13.0.3,<14; platform_system == 'Linux' | "
"nvidia-cudnn-cu13==9.19.0.56; platform_system == 'Linux' | "
"nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' | "
"nvidia-nccl-cu13==2.29.3; platform_system == 'Linux' | "
"nvidia-nvshmem-cu13==3.4.5; platform_system == 'Linux'"

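For reference, the two snippets above currently pin the same cuDNN build for each CUDA version. A rough sketch of how they could be compared, assuming the text layouts quoted above; the helper names `windows_cudnn_folder_version` and `linux_cudnn_version` are hypothetical and only the CUDA 12.8 entries are excerpted.

```python
import re

# Excerpts copied from the two snippets quoted above (CUDA 12.8 only).
WINDOWS_BAT = """
if %CUDA_VER% EQU 128 (
set CUDNN_FOLDER=cudnn-windows-x86_64-9.19.0.56_cuda12-archive
set EXPECTED_CUDNN_VERSION=9.19.0
)
"""
LINUX_REQUIREMENTS = (
    "nvidia-cudnn-cu12==9.19.0.56; platform_system == 'Linux' | "
    "nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux'"
)


def windows_cudnn_folder_version(bat_text: str, cuda_ver: str) -> str:
    """Full cuDNN version from the CUDNN_FOLDER line of the block for cuda_ver."""
    pattern = (
        rf"if %CUDA_VER% EQU {cuda_ver} \(\s*"
        r"set CUDNN_FOLDER=cudnn-windows-x86_64-([\d.]+)_cuda\d+-archive"
    )
    match = re.search(pattern, bat_text)
    if match is None:
        raise ValueError(f"no CUDNN_FOLDER found for CUDA_VER {cuda_ver}")
    return match.group(1)


def linux_cudnn_version(requirements: str) -> str:
    """Full cuDNN version pinned by the nvidia-cudnn-cuXX requirement."""
    match = re.search(r"nvidia-cudnn-cu\d+==([\d.]+)", requirements)
    if match is None:
        raise ValueError("no nvidia-cudnn pin found")
    return match.group(1)


windows = windows_cudnn_folder_version(WINDOWS_BAT, "128")
linux = linux_cudnn_version(LINUX_REQUIREMENTS)
print(windows, linux, "in sync" if windows == linux else "OUT OF SYNC")
```

Since the two platforms are meant to be updatable independently, a check like this would more naturally report a difference than fail on it.
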
Collaborator

@albanD albanD left a comment

Sounds ok.

Contributor

@malfet malfet left a comment

Thank you for updating the description. As it unblocks 2.11, LGTM, but let's have a follow-up discussion on why this is necessary. I.e., if PyTorch was built with cudnn-X.Y.Z, then no matter which version is set up locally, it should prefer the one it is bundled with, unless the system setup is garbled.

Also, wouldn't it introduce significant flakiness, as every Windows run will effectively download cuDNN?

@atalman
Contributor Author

atalman commented Mar 13, 2026

@pytorchmergebot merge -f "All required tests are passing"

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here


@atalman
Contributor Author

atalman commented Mar 13, 2026

@pytorchbot cherry-pick --onto release/2.11 -c critical

pytorchbot pushed a commit that referenced this pull request Mar 13, 2026
…cudnn on windows AMI do not match (#177273)

Pull Request resolved: #177273
Approved by: https://github.com/malfet, https://github.com/albanD

(cherry picked from commit e55da9f)
@pytorchbot
Collaborator

Cherry picking #177273

The cherry pick PR is at #177369 and it is recommended to link a critical cherry pick PR with an issue. The following tracker issues are updated:

Details for Dev Infra team Raised by workflow job

atalman added a commit that referenced this pull request Mar 13, 2026
…cudnn on windows AMI do not match (#177369)

Pull Request resolved: #177273
Approved by: https://github.com/malfet, https://github.com/albanD

(cherry picked from commit e55da9f)

Co-authored-by: atalman <atalman@fb.com>
pytorchmergebot pushed a commit that referenced this pull request Mar 13, 2026
## Summary

Adds a `workflow_dispatch` workflow that the autorevert system can trigger when it detects an early failure pattern. Claude Opus 4.6 analyzes the suspect commit's diff, failed job logs, and PyTorch source code to determine whether the commit actually caused the CI failures.

Returns a structured JSON verdict as an artifact:
- **revert** — causal chain found, proceed to revert immediately
- **unsure** — inconclusive, continue with restart-to-confirm (default behavior unchanged)
- **not_related** — failures unrelated to the change, ignore this signal
- **garbage** — signal is unreliable (infra flake, driver crash), suppress for ~2 hours
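
A consumer of such a verdict might map each value to its follow-up action along these lines. This is a sketch only: the JSON key `verdict` and the helper `action_for` are assumed names, since the actual schema is not shown here, and the action strings paraphrase the list above.

```python
import json
from enum import Enum


class Verdict(Enum):
    REVERT = "revert"
    UNSURE = "unsure"
    NOT_RELATED = "not_related"
    GARBAGE = "garbage"


# Follow-up action per verdict, paraphrasing the list above.
ACTIONS = {
    Verdict.REVERT: "revert immediately",
    Verdict.UNSURE: "restart to confirm",
    Verdict.NOT_RELATED: "ignore signal",
    Verdict.GARBAGE: "suppress signal for about 2 hours",
}


def action_for(verdict_json: str) -> str:
    """Parse a verdict document and return the follow-up action.

    The "verdict" key is an assumed field name; unknown values raise
    ValueError through the Enum lookup.
    """
    payload = json.loads(verdict_json)
    return ACTIONS[Verdict(payload["verdict"])]


print(action_for('{"verdict": "garbage", "confidence": 0.95}'))
```

Routing through an Enum means an unexpected verdict string fails loudly instead of silently falling back to a default.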

Design doc: https://docs.google.com/document/d/1BA9B7cIIKiapI37fSFGDD7D0F-VwMyRKJW0PoS0KkbY/edit

## Evaluation Results (13/13 correct verdicts)

Prototyped and tested on [pytorch/ciforge](https://github.com/pytorch/ciforge). Results across diverse failure types:

### Round 1 (2026-03-12) — 4/4 correct

| Test Case | PR | Failure | Expected | Actual | Job |
|-----------|-----|---------|----------|--------|-----|
| Doc-only change | #177288 | pca_lowrank stride mismatch | not_related | **not_related @ 0.99** | [job](https://github.com/pytorch/ciforge/actions/runs/23016718498) |
| Dynamo einops fix | #177165 | detectron2 graph_breaks + test_is_nonzero_mps | not_related | **not_related @ 0.93** | [job](https://github.com/pytorch/ciforge/actions/runs/23016730498) |
| MPS cdouble guard | #176985 | test_is_nonzero_mps + pca_lowrank | revert | **revert @ 0.95** | [job](https://github.com/pytorch/ciforge/actions/runs/23016740133) |
| Lint missing import | #176613 | Lint / lintrunner-noclang-all | revert | **revert @ 0.95** | [job](https://github.com/pytorch/ciforge/actions/runs/23013529685) |

### Round 2 (2026-03-13, automated hourly loop) — 9/9 correct (1 cancelled)

| Timestamp | PR | Signal Key | Expected | Actual | Job |
|-----------|-----|-----------|----------|--------|-----|
| 03:12Z | #176613 | Lint / lintrunner-noclang-all | revert | **revert @ 0.98** | [job](https://github.com/pytorch/ciforge/actions/runs/23034497618) |
| 03:12Z | #176613 | fsdp/test_fully_shard_comm (test exec) | revert | **revert @ 0.98** | [job](https://github.com/pytorch/ciforge/actions/runs/23034499988) |
| 09:11Z | #177273 | test-timeout-270min (infra) | — | *cancelled* | [job](https://github.com/pytorch/ciforge/actions/runs/23043982417) |
| 10:12Z | #176019 | AllenaiLongformerBase fail_to_run (periodic) | garbage | **garbage @ 0.95** | [job](https://github.com/pytorch/ciforge/actions/runs/23046142800) |
| 10:12Z | #176019 | detectron2_fcos IMPROVED (periodic) | not_related | **not_related @ 0.95** | [job](https://github.com/pytorch/ciforge/actions/runs/23046144261) |
| 11:10Z | #176019 | functorch_dp_cifar10 fail_accuracy (periodic) | not_related | **not_related @ 0.93** | [job](https://github.com/pytorch/ciforge/actions/runs/23048173319) |
| 11:10Z | #176019 | basic_gnn_edgecnn IMPROVED (periodic) | not_related | **not_related @ 0.92** | [job](https://github.com/pytorch/ciforge/actions/runs/23048174698) |
| 15:09Z | #177096 | S3 PutObject IAM denied - ROCm gfx950 (infra) | garbage | **garbage @ 0.95** | [job](https://github.com/pytorch/ciforge/actions/runs/23057146500) |
| 16:09Z | #176019 | vit_base_patch16_siglip_256 fail_to_run (periodic) | not_related | **not_related @ 0.97** | [job](https://github.com/pytorch/ciforge/actions/runs/23059634364) |
| 16:09Z | #176019 | shufflenet_v2_x1_0 fail_accuracy (periodic) | not_related | **not_related @ 0.95** | [job](https://github.com/pytorch/ciforge/actions/runs/23059635765) |

### Summary by verdict type

| Verdict | Count | Correct | Avg Confidence |
|---------|-------|---------|----------------|
| revert | 4 | 4/4 | 0.97 |
| garbage | 2 | 2/2 | 0.95 |
| not_related | 7 | 7/7 | 0.94 |

## Test plan

- [x] Prototyped and tested on pytorch/ciforge with 13 real trunk failure cases
- [x] Verified structured JSON output matches schema
- [x] Verified verdict artifact uploads correctly
- [ ] Trigger via GitHub UI with `workflow_dispatch` on pytorch/pytorch to validate bedrock environment works
- [ ] Integrate dispatch call into autorevert lambda (follow-up)
Pull Request resolved: #177404
Approved by: https://github.com/wdvr
AaronWang04 pushed a commit to AaronWang04/pytorch that referenced this pull request Mar 24, 2026
EmanueleCoradin pushed a commit to EmanueleCoradin/pytorch that referenced this pull request Mar 30, 2026
Labels

ciflow/binaries Trigger all binary build and upload jobs on the PR Merged

5 participants