[windows][smoke test] Add an option to install cuda if required cuda/cudnn on windows AMI do not match#177273
[windows][smoke test] Add an option to install cuda if required cuda/cudnn on windows AMI do not match#177273atalman wants to merge 5 commits intopytorch:mainfrom
Conversation
…cudnn on windows AMI do not match
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/177273
Note: Links to docs will display an error until the docs builds have been completed. ❌ 3 New Failures, 36 PendingAs of commit 0904455 with merge base e05e600 ( NEW FAILURES - The following jobs have failed:
This comment was automatically generated by Dr. CI and updates every 15 minutes. |
This PR needs a
|
|
|
||
| pushd %SRC_DIR%\.. | ||
|
|
||
| if not "%CUDA_VERSION%" == "cpu" if not "%CUDA_VERSION%" == "xpu" call internal\cuda_install.bat |
There was a problem hiding this comment.
shouldn't we install the cuda driver before the cuda toolkit?
malfet
left a comment
There was a problem hiding this comment.
This is probably fine, but can you please ask claude to provide a more comphrehensive review of repeated checks and validation that ensures cuda/cudnn versions installed during the "smoke check" will not be out of sync with cuda/cudnn version we install on Linux, etc
|
|
||
| pushd %SRC_DIR%\.. | ||
|
|
||
| if not "%CUDA_VERSION%" == "cpu" if not "%CUDA_VERSION%" == "xpu" call internal\cuda_install.bat |
There was a problem hiding this comment.
First of all, you already have this check, don't you:
|
@claude review this please |
|
Claude finished @malfet's task in 2m 13s —— View job Review of PR #177273
SummaryThis PR adds a call to Issues1. Ordering: As @albanD pointed out, the CUDA toolkit is being installed before the GPU driver. 2. Duplicated cuDNN version mappings in
These two copies can drift independently. If someone updates one, they could easily forget the other. The existing Recommendation: Refactor 3. The PR change itself is redundant with existing logic As @malfet noted, if not "%CUDA_VERSION%" == "cpu" if not "%CUDA_VERSION%" == "xpu" call internal\cuda_install.batduplicates the guard that's already inside 4. Cross-platform cuDNN version sync The cuDNN versions between Linux ( VerdictThe change itself is a reasonable quick fix for the immediate smoke test failure, but it introduces a pattern where |
|
The Duplicated cuDNN version mappings in cuda_install.bat is not a concern. This allows us to update cuda and cudnn for Linux and Windows versions independently: Windows is specified here: pytorch/.ci/pytorch/windows/internal/cuda_install.bat Lines 156 to 171 in 440ad94 While Linux x86 and aarch64 is here: pytorch/.github/scripts/generate_binary_build_matrix.py Lines 51 to 82 in 440ad94 |
malfet
left a comment
There was a problem hiding this comment.
Thank you for updating a description, as it unblocks 2.11 LGTM, but let's have a followup discussion on why this is necessary. I.e. if PyTorch was build with cudnn-X.Y.Z, no matter which one you have setup locally it should prefer ones it's bundled with, unless system setup is garbled.
Also, woulnd't it introduce significant flakiness as every Windows run will effectively download CuDNN
|
@pytorchmergebot merge -f "All required tests are passing" |
Merge startedYour change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
This PR needs a
|
|
@pytorchbot cherry-pick --onto release/2.11 -c critical |
…cudnn on windows AMI do not match (#177273) Followup fix for #167242 After #175547 Windows AMI and intendent version of CUDNN do not match. Hence followup fixes required. With this approach we are flexible to update windows CUDNN without actually updating the Windows AMI. Windows AMI update can be done later. Fixes failure during test: https://github.com/pytorch/pytorch/actions/runs/22979336872/job/66724249070 ``` RuntimeError: cuDNN version incompatibility: PyTorch was compiled against (9, 19, 0) but found runtime version (9, 10, 2). PyTorch already comes bundled with cuDNN. One option to resolving this error is to ensure PyTorch can find the bundled cuDNN. ``` Please note: The cuda/cudnn version will not be updated if the right version is already installed. Test Plan: In CI via ciflow/binaries Pull Request resolved: #177273 Approved by: https://github.com/malfet, https://github.com/albanD (cherry picked from commit e55da9f)
Cherry picking #177273The cherry pick PR is at #177369 and it is recommended to link a critical cherry pick PR with an issue. The following tracker issues are updated: Details for Dev Infra teamRaised by workflow job |
…cudnn on windows AMI do not match (#177369) [windows][smoke test] Add an option to install cuda if required cuda/cudnn on windows AMI do not match (#177273) Followup fix for #167242 After #175547 Windows AMI and intendent version of CUDNN do not match. Hence followup fixes required. With this approach we are flexible to update windows CUDNN without actually updating the Windows AMI. Windows AMI update can be done later. Fixes failure during test: https://github.com/pytorch/pytorch/actions/runs/22979336872/job/66724249070 ``` RuntimeError: cuDNN version incompatibility: PyTorch was compiled against (9, 19, 0) but found runtime version (9, 10, 2). PyTorch already comes bundled with cuDNN. One option to resolving this error is to ensure PyTorch can find the bundled cuDNN. ``` Please note: The cuda/cudnn version will not be updated if the right version is already installed. Test Plan: In CI via ciflow/binaries Pull Request resolved: #177273 Approved by: https://github.com/malfet, https://github.com/albanD (cherry picked from commit e55da9f) Co-authored-by: atalman <atalman@fb.com>
## Summary Adds a `workflow_dispatch` workflow that the autorevert system can trigger when it detects an early failure pattern. Claude Opus 4.6 analyzes the suspect commit's diff, failed job logs, and PyTorch source code to determine whether the commit actually caused the CI failures. Returns a structured JSON verdict as an artifact: - **revert** — causal chain found, proceed to revert immediately - **unsure** — inconclusive, continue with restart-to-confirm (default behavior unchanged) - **not_related** — failures unrelated to the change, ignore this signal - **garbage** — signal is unreliable (infra flake, driver crash), suppress for ~2 hours Design doc: https://docs.google.com/document/d/1BA9B7cIIKiapI37fSFGDD7D0F-VwMyRKJW0PoS0KkbY/edit ## Evaluation Results (13/13 correct verdicts) Prototyped and tested on [pytorch/ciforge](https://github.com/pytorch/ciforge). Results across diverse failure types: ### Round 1 (2026-03-12) — 4/4 correct | Test Case | PR | Failure | Expected | Actual | Job | |-----------|-----|---------|----------|--------|-----| | Doc-only change | #177288 | pca_lowrank stride mismatch | not_related | **not_related @ 0.99** | [job](https://github.com/pytorch/ciforge/actions/runs/23016718498) | | Dynamo einops fix | #177165 | detectron2 graph_breaks + test_is_nonzero_mps | not_related | **not_related @ 0.93** | [job](https://github.com/pytorch/ciforge/actions/runs/23016730498) | | MPS cdouble guard | #176985 | test_is_nonzero_mps + pca_lowrank | revert | **revert @ 0.95** | [job](https://github.com/pytorch/ciforge/actions/runs/23016740133) | | Lint missing import | #176613 | Lint / lintrunner-noclang-all | revert | **revert @ 0.95** | [job](https://github.com/pytorch/ciforge/actions/runs/23013529685) | ### Round 2 (2026-03-13, automated hourly loop) — 9/9 correct (1 cancelled) | Timestamp | PR | Signal Key | Expected | Actual | Job | |-----------|-----|-----------|----------|--------|-----| | 03:12Z | #176613 | Lint / lintrunner-noclang-all | revert | **revert @ 0.98** | [job](https://github.com/pytorch/ciforge/actions/runs/23034497618) | | 03:12Z | #176613 | fsdp/test_fully_shard_comm (test exec) | revert | **revert @ 0.98** | [job](https://github.com/pytorch/ciforge/actions/runs/23034499988) | | 09:11Z | #177273 | test-timeout-270min (infra) | — | *cancelled* | [job](https://github.com/pytorch/ciforge/actions/runs/23043982417) | | 10:12Z | #176019 | AllenaiLongformerBase fail_to_run (periodic) | garbage | **garbage @ 0.95** | [job](https://github.com/pytorch/ciforge/actions/runs/23046142800) | | 10:12Z | #176019 | detectron2_fcos IMPROVED (periodic) | not_related | **not_related @ 0.95** | [job](https://github.com/pytorch/ciforge/actions/runs/23046144261) | | 11:10Z | #176019 | functorch_dp_cifar10 fail_accuracy (periodic) | not_related | **not_related @ 0.93** | [job](https://github.com/pytorch/ciforge/actions/runs/23048173319) | | 11:10Z | #176019 | basic_gnn_edgecnn IMPROVED (periodic) | not_related | **not_related @ 0.92** | [job](https://github.com/pytorch/ciforge/actions/runs/23048174698) | | 15:09Z | #177096 | S3 PutObject IAM denied - ROCm gfx950 (infra) | garbage | **garbage @ 0.95** | [job](https://github.com/pytorch/ciforge/actions/runs/23057146500) | | 16:09Z | #176019 | vit_base_patch16_siglip_256 fail_to_run (periodic) | not_related | **not_related @ 0.97** | [job](https://github.com/pytorch/ciforge/actions/runs/23059634364) | | 16:09Z | #176019 | shufflenet_v2_x1_0 fail_accuracy (periodic) | not_related | **not_related @ 0.95** | [job](https://github.com/pytorch/ciforge/actions/runs/23059635765) | ### Summary by verdict type | Verdict | Count | Correct | Avg Confidence | |---------|-------|---------|----------------| | revert | 4 | 4/4 | 0.97 | | garbage | 2 | 2/2 | 0.95 | | not_related | 7 | 7/7 | 0.94 | ## Test plan - [x] Prototyped and tested on pytorch/ciforge with 13 real trunk failure cases - [x] Verified structured JSON output matches schema - [x] Verified verdict artifact uploads correctly - [ ] Trigger via GitHub UI with `workflow_dispatch` on pytorch/pytorch to validate bedrock environment works - [ ] Integrate dispatch call into autorevert lambda (follow-up) Pull Request resolved: #177404 Approved by: https://github.com/wdvr
## Summary Adds a `workflow_dispatch` workflow that the autorevert system can trigger when it detects an early failure pattern. Claude Opus 4.6 analyzes the suspect commit's diff, failed job logs, and PyTorch source code to determine whether the commit actually caused the CI failures. Returns a structured JSON verdict as an artifact: - **revert** — causal chain found, proceed to revert immediately - **unsure** — inconclusive, continue with restart-to-confirm (default behavior unchanged) - **not_related** — failures unrelated to the change, ignore this signal - **garbage** — signal is unreliable (infra flake, driver crash), suppress for ~2 hours Design doc: https://docs.google.com/document/d/1BA9B7cIIKiapI37fSFGDD7D0F-VwMyRKJW0PoS0KkbY/edit ## Evaluation Results (13/13 correct verdicts) Prototyped and tested on [pytorch/ciforge](https://github.com/pytorch/ciforge). Results across diverse failure types: ### Round 1 (2026-03-12) — 4/4 correct | Test Case | PR | Failure | Expected | Actual | Job | |-----------|-----|---------|----------|--------|-----| | Doc-only change | pytorch#177288 | pca_lowrank stride mismatch | not_related | **not_related @ 0.99** | [job](https://github.com/pytorch/ciforge/actions/runs/23016718498) | | Dynamo einops fix | pytorch#177165 | detectron2 graph_breaks + test_is_nonzero_mps | not_related | **not_related @ 0.93** | [job](https://github.com/pytorch/ciforge/actions/runs/23016730498) | | MPS cdouble guard | pytorch#176985 | test_is_nonzero_mps + pca_lowrank | revert | **revert @ 0.95** | [job](https://github.com/pytorch/ciforge/actions/runs/23016740133) | | Lint missing import | pytorch#176613 | Lint / lintrunner-noclang-all | revert | **revert @ 0.95** | [job](https://github.com/pytorch/ciforge/actions/runs/23013529685) | ### Round 2 (2026-03-13, automated hourly loop) — 9/9 correct (1 cancelled) | Timestamp | PR | Signal Key | Expected | Actual | Job | |-----------|-----|-----------|----------|--------|-----| | 03:12Z | pytorch#176613 | Lint / lintrunner-noclang-all | revert | **revert @ 0.98** | [job](https://github.com/pytorch/ciforge/actions/runs/23034497618) | | 03:12Z | pytorch#176613 | fsdp/test_fully_shard_comm (test exec) | revert | **revert @ 0.98** | [job](https://github.com/pytorch/ciforge/actions/runs/23034499988) | | 09:11Z | pytorch#177273 | test-timeout-270min (infra) | — | *cancelled* | [job](https://github.com/pytorch/ciforge/actions/runs/23043982417) | | 10:12Z | pytorch#176019 | AllenaiLongformerBase fail_to_run (periodic) | garbage | **garbage @ 0.95** | [job](https://github.com/pytorch/ciforge/actions/runs/23046142800) | | 10:12Z | pytorch#176019 | detectron2_fcos IMPROVED (periodic) | not_related | **not_related @ 0.95** | [job](https://github.com/pytorch/ciforge/actions/runs/23046144261) | | 11:10Z | pytorch#176019 | functorch_dp_cifar10 fail_accuracy (periodic) | not_related | **not_related @ 0.93** | [job](https://github.com/pytorch/ciforge/actions/runs/23048173319) | | 11:10Z | pytorch#176019 | basic_gnn_edgecnn IMPROVED (periodic) | not_related | **not_related @ 0.92** | [job](https://github.com/pytorch/ciforge/actions/runs/23048174698) | | 15:09Z | pytorch#177096 | S3 PutObject IAM denied - ROCm gfx950 (infra) | garbage | **garbage @ 0.95** | [job](https://github.com/pytorch/ciforge/actions/runs/23057146500) | | 16:09Z | pytorch#176019 | vit_base_patch16_siglip_256 fail_to_run (periodic) | not_related | **not_related @ 0.97** | [job](https://github.com/pytorch/ciforge/actions/runs/23059634364) | | 16:09Z | pytorch#176019 | shufflenet_v2_x1_0 fail_accuracy (periodic) | not_related | **not_related @ 0.95** | [job](https://github.com/pytorch/ciforge/actions/runs/23059635765) | ### Summary by verdict type | Verdict | Count | Correct | Avg Confidence | |---------|-------|---------|----------------| | revert | 4 | 4/4 | 0.97 | | garbage | 2 | 2/2 | 0.95 | | not_related | 7 | 7/7 | 0.94 | ## Test plan - [x] Prototyped and tested on pytorch/ciforge with 13 real trunk failure cases - [x] Verified structured JSON output matches schema - [x] Verified verdict artifact uploads correctly - [ ] Trigger via GitHub UI with `workflow_dispatch` on pytorch/pytorch to validate bedrock environment works - [ ] Integrate dispatch call into autorevert lambda (follow-up) Pull Request resolved: pytorch#177404 Approved by: https://github.com/wdvr
## Summary Adds a `workflow_dispatch` workflow that the autorevert system can trigger when it detects an early failure pattern. Claude Opus 4.6 analyzes the suspect commit's diff, failed job logs, and PyTorch source code to determine whether the commit actually caused the CI failures. Returns a structured JSON verdict as an artifact: - **revert** — causal chain found, proceed to revert immediately - **unsure** — inconclusive, continue with restart-to-confirm (default behavior unchanged) - **not_related** — failures unrelated to the change, ignore this signal - **garbage** — signal is unreliable (infra flake, driver crash), suppress for ~2 hours Design doc: https://docs.google.com/document/d/1BA9B7cIIKiapI37fSFGDD7D0F-VwMyRKJW0PoS0KkbY/edit ## Evaluation Results (13/13 correct verdicts) Prototyped and tested on [pytorch/ciforge](https://github.com/pytorch/ciforge). Results across diverse failure types: ### Round 1 (2026-03-12) — 4/4 correct | Test Case | PR | Failure | Expected | Actual | Job | |-----------|-----|---------|----------|--------|-----| | Doc-only change | pytorch#177288 | pca_lowrank stride mismatch | not_related | **not_related @ 0.99** | [job](https://github.com/pytorch/ciforge/actions/runs/23016718498) | | Dynamo einops fix | pytorch#177165 | detectron2 graph_breaks + test_is_nonzero_mps | not_related | **not_related @ 0.93** | [job](https://github.com/pytorch/ciforge/actions/runs/23016730498) | | MPS cdouble guard | pytorch#176985 | test_is_nonzero_mps + pca_lowrank | revert | **revert @ 0.95** | [job](https://github.com/pytorch/ciforge/actions/runs/23016740133) | | Lint missing import | pytorch#176613 | Lint / lintrunner-noclang-all | revert | **revert @ 0.95** | [job](https://github.com/pytorch/ciforge/actions/runs/23013529685) | ### Round 2 (2026-03-13, automated hourly loop) — 9/9 correct (1 cancelled) | Timestamp | PR | Signal Key | Expected | Actual | Job | |-----------|-----|-----------|----------|--------|-----| | 03:12Z | pytorch#176613 | Lint / lintrunner-noclang-all | revert | **revert @ 0.98** | [job](https://github.com/pytorch/ciforge/actions/runs/23034497618) | | 03:12Z | pytorch#176613 | fsdp/test_fully_shard_comm (test exec) | revert | **revert @ 0.98** | [job](https://github.com/pytorch/ciforge/actions/runs/23034499988) | | 09:11Z | pytorch#177273 | test-timeout-270min (infra) | — | *cancelled* | [job](https://github.com/pytorch/ciforge/actions/runs/23043982417) | | 10:12Z | pytorch#176019 | AllenaiLongformerBase fail_to_run (periodic) | garbage | **garbage @ 0.95** | [job](https://github.com/pytorch/ciforge/actions/runs/23046142800) | | 10:12Z | pytorch#176019 | detectron2_fcos IMPROVED (periodic) | not_related | **not_related @ 0.95** | [job](https://github.com/pytorch/ciforge/actions/runs/23046144261) | | 11:10Z | pytorch#176019 | functorch_dp_cifar10 fail_accuracy (periodic) | not_related | **not_related @ 0.93** | [job](https://github.com/pytorch/ciforge/actions/runs/23048173319) | | 11:10Z | pytorch#176019 | basic_gnn_edgecnn IMPROVED (periodic) | not_related | **not_related @ 0.92** | [job](https://github.com/pytorch/ciforge/actions/runs/23048174698) | | 15:09Z | pytorch#177096 | S3 PutObject IAM denied - ROCm gfx950 (infra) | garbage | **garbage @ 0.95** | [job](https://github.com/pytorch/ciforge/actions/runs/23057146500) | | 16:09Z | pytorch#176019 | vit_base_patch16_siglip_256 fail_to_run (periodic) | not_related | **not_related @ 0.97** | [job](https://github.com/pytorch/ciforge/actions/runs/23059634364) | | 16:09Z | pytorch#176019 | shufflenet_v2_x1_0 fail_accuracy (periodic) | not_related | **not_related @ 0.95** | [job](https://github.com/pytorch/ciforge/actions/runs/23059635765) | ### Summary by verdict type | Verdict | Count | Correct | Avg Confidence | |---------|-------|---------|----------------| | revert | 4 | 4/4 | 0.97 | | garbage | 2 | 2/2 | 0.95 | | not_related | 7 | 7/7 | 0.94 | ## Test plan - [x] Prototyped and tested on pytorch/ciforge with 13 real trunk failure cases - [x] Verified structured JSON output matches schema - [x] Verified verdict artifact uploads correctly - [ ] Trigger via GitHub UI with `workflow_dispatch` on pytorch/pytorch to validate bedrock environment works - [ ] Integrate dispatch call into autorevert lambda (follow-up) Pull Request resolved: pytorch#177404 Approved by: https://github.com/wdvr
Followup fix for #167242
After #175547 Windows AMI and intendent version of CUDNN do not match. Hence followup fixes required. With this approach we are flexible to update windows CUDNN without actually updating the Windows AMI. Windows AMI update can be done later.
Fixes failure during test: https://github.com/pytorch/pytorch/actions/runs/22979336872/job/66724249070
Please note: The cuda/cudnn version will not be updated if the right version is already installed.
Test Plan: In CI via ciflow/binaries