Add Claude-powered autorevert AI advisor workflow #177404

Closed
izaitsevfb wants to merge 1 commit into main from claude-autorevert-advisor
@izaitsevfb

Summary

Adds a workflow_dispatch workflow that the autorevert system can trigger when it detects an early failure pattern. Claude Opus 4.6 analyzes the suspect commit's diff, failed job logs, and PyTorch source code to determine whether the commit actually caused the CI failures.

Returns a structured JSON verdict as an artifact:

  • revert — causal chain found, proceed to revert immediately
  • unsure — inconclusive, continue with restart-to-confirm (default behavior unchanged)
  • not_related — failures unrelated to the change, ignore this signal
  • garbage — signal is unreliable (infra flake, driver crash), suppress for ~2 hours
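
A minimal sketch of how a consumer might map the four verdicts above to follow-up actions. The JSON field names (`verdict`, `confidence`) and the action labels are assumptions made for illustration; the actual artifact schema is not shown in this PR.

```python
import json
from dataclasses import dataclass

# The four verdict values listed in the PR description.
VALID_VERDICTS = {"revert", "unsure", "not_related", "garbage"}


@dataclass(frozen=True)
class Verdict:
    verdict: str
    confidence: float


def parse_verdict(raw: str) -> Verdict:
    """Parse a verdict artifact; field names are hypothetical."""
    data = json.loads(raw)
    if not isinstance(data, dict) or data.get("verdict") not in VALID_VERDICTS:
        # Anything unrecognized is treated as "unsure" so that the
        # default restart-to-confirm behavior stays in effect.
        return Verdict("unsure", 0.0)
    return Verdict(data["verdict"], float(data.get("confidence", 0.0)))


def next_action(v: Verdict) -> str:
    """Map a verdict to an illustrative action label."""
    return {
        "revert": "revert_now",
        "unsure": "restart_to_confirm",
        "not_related": "ignore_signal",
        "garbage": "suppress_signal_for_2h",
    }[v.verdict]
```

Falling back to `unsure` on malformed input is a design choice of this sketch: it keeps the advisor strictly additive, since `unsure` leaves the existing behavior unchanged.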

Design doc: https://docs.google.com/document/d/1BA9B7cIIKiapI37fSFGDD7D0F-VwMyRKJW0PoS0KkbY/edit

Evaluation Results (13/13 correct verdicts)

Prototyped and tested on pytorch/ciforge. Results across diverse failure types:

Round 1 (2026-03-12) — 4/4 correct

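| Test Case | PR | Failure | Expected | Actual | Job |
|-----------|-----|---------|----------|--------|-----|
| Doc-only change | #177288 | pca_lowrank stride mismatch | not_related | not_related @ 0.99 | [job](https://github.com/pytorch/ciforge/actions/runs/23016718498) |
| Dynamo einops fix | #177165 | detectron2 graph_breaks + test_is_nonzero_mps | not_related | not_related @ 0.93 | [job](https://github.com/pytorch/ciforge/actions/runs/23016730498) |
| MPS cdouble guard | #176985 | test_is_nonzero_mps + pca_lowrank | revert | revert @ 0.95 | [job](https://github.com/pytorch/ciforge/actions/runs/23016740133) |
| Lint missing import | #176613 | Lint / lintrunner-noclang-all | revert | revert @ 0.95 | [job](https://github.com/pytorch/ciforge/actions/runs/23013529685) |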

Round 2 (2026-03-13, automated hourly loop) — 9/9 correct (1 cancelled)

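| Timestamp | PR | Signal Key | Expected | Actual | Job |
|-----------|-----|-----------|----------|--------|-----|
| 03:12Z | #176613 | Lint / lintrunner-noclang-all | revert | revert @ 0.98 | [job](https://github.com/pytorch/ciforge/actions/runs/23034497618) |
| 03:12Z | #176613 | fsdp/test_fully_shard_comm (test exec) | revert | revert @ 0.98 | [job](https://github.com/pytorch/ciforge/actions/runs/23034499988) |
| 09:11Z | #177273 | test-timeout-270min (infra) | - | cancelled | [job](https://github.com/pytorch/ciforge/actions/runs/23043982417) |
| 10:12Z | #176019 | AllenaiLongformerBase fail_to_run (periodic) | garbage | garbage @ 0.95 | [job](https://github.com/pytorch/ciforge/actions/runs/23046142800) |
| 10:12Z | #176019 | detectron2_fcos IMPROVED (periodic) | not_related | not_related @ 0.95 | [job](https://github.com/pytorch/ciforge/actions/runs/23046144261) |
| 11:10Z | #176019 | functorch_dp_cifar10 fail_accuracy (periodic) | not_related | not_related @ 0.93 | [job](https://github.com/pytorch/ciforge/actions/runs/23048173319) |
| 11:10Z | #176019 | basic_gnn_edgecnn IMPROVED (periodic) | not_related | not_related @ 0.92 | [job](https://github.com/pytorch/ciforge/actions/runs/23048174698) |
| 15:09Z | #177096 | S3 PutObject IAM denied - ROCm gfx950 (infra) | garbage | garbage @ 0.95 | [job](https://github.com/pytorch/ciforge/actions/runs/23057146500) |
| 16:09Z | #176019 | vit_base_patch16_siglip_256 fail_to_run (periodic) | not_related | not_related @ 0.97 | [job](https://github.com/pytorch/ciforge/actions/runs/23059634364) |
| 16:09Z | #176019 | shufflenet_v2_x1_0 fail_accuracy (periodic) | not_related | not_related @ 0.95 | [job](https://github.com/pytorch/ciforge/actions/runs/23059635765) |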

Summary by verdict type

| Verdict | Count | Correct | Avg Confidence |
|---------|-------|---------|----------------|
| revert | 4 | 4/4 | 0.97 |
| garbage | 2 | 2/2 | 0.95 |
| not_related | 7 | 7/7 | 0.94 |
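
The per-verdict counts can be re-derived from the Round 1 and Round 2 rows. The sketch below transcribes the (expected, actual) pairs from those tables, leaving out the cancelled run, and tallies them. The `tally` helper is illustrative and not part of the PR.

```python
from collections import Counter

# (expected, actual) pairs transcribed from the two result tables,
# in table order; the cancelled 09:11Z run is excluded.
RESULTS = [
    # Round 1
    ("not_related", "not_related"),
    ("not_related", "not_related"),
    ("revert", "revert"),
    ("revert", "revert"),
    # Round 2
    ("revert", "revert"),
    ("revert", "revert"),
    ("garbage", "garbage"),
    ("not_related", "not_related"),
    ("not_related", "not_related"),
    ("not_related", "not_related"),
    ("garbage", "garbage"),
    ("not_related", "not_related"),
    ("not_related", "not_related"),
]


def tally(results):
    """Return {verdict: (count, correct)} keyed by the actual verdict."""
    counts = Counter(actual for _, actual in results)
    correct = Counter(actual for expected, actual in results if expected == actual)
    return {verdict: (counts[verdict], correct[verdict]) for verdict in counts}


summary = tally(RESULTS)
```

Here `summary["revert"]` is `(4, 4)`, `summary["garbage"]` is `(2, 2)`, and `summary["not_related"]` is `(7, 7)`, which adds up to the 13 evaluated cases.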

Test plan

  • Prototyped and tested on pytorch/ciforge with 13 real trunk failure cases
  • Verified structured JSON output matches schema
  • Verified verdict artifact uploads correctly
  • Trigger via GitHub UI with workflow_dispatch on pytorch/pytorch to validate bedrock environment works
  • Integrate dispatch call into autorevert lambda (follow-up)

Adds a workflow_dispatch workflow that the autorevert system can trigger
when it detects an early failure pattern. Claude analyzes the suspect
commit diff, failed job logs, and PyTorch source code to determine
whether the commit caused the CI failures.

Returns a structured JSON verdict (revert/unsure/not_related/garbage)
as an artifact that autorevert can consume to make faster, smarter
revert decisions.
@izaitsevfb izaitsevfb requested a review from a team as a code owner March 13, 2026 18:24
@pytorch-bot

pytorch-bot bot commented Mar 13, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/177404

Note: Links to docs will display an error until the docs builds have been completed.

⏳ No Failures, 63 Pending

As of commit 4646b85 with merge base 01316b2:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added ci-no-td Do not run TD on this PR topic: not user facing topic category labels Mar 13, 2026
@izaitsevfb

@pytorchbot merge -f 'lint passed'

@pytorchmergebot

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

izaitsevfb added a commit to pytorch/test-infra that referenced this pull request Mar 19, 2026
## Summary

Adds shadow-mode AI advisor dispatch to the autorevert lambda. When a
clean failure partition is detected (2+ failures, 1+ success, no unknown
gap), the lambda dispatches the `claude-autorevert-advisor.yml` workflow
(pytorch/pytorch#177404)
for AI-powered failure analysis.
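
One way to read the "clean failure partition" condition is sketched below. It assumes a signal is represented as a list of per-commit statuses ordered newest first, with the hypothetical labels `"failure"`, `"success"`, and `"unknown"`. The lambda's real data model is not shown here, so treat the function as an interpretation rather than the actual check.

```python
from typing import Sequence


def is_clean_partition(statuses: Sequence[str]) -> bool:
    """Check for 2+ failures directly followed by 1+ success.

    `statuses` is assumed to be ordered newest commit first. Any
    "unknown" entry between the failing run and the first success
    breaks the partition, because the breaking commit can no longer
    be pinned down.
    """
    i = 0
    # Count the leading run of failures (newest commits).
    while i < len(statuses) and statuses[i] == "failure":
        i += 1
    failures = i
    # Count the successes that immediately follow, with no gap.
    successes = 0
    while i < len(statuses) and statuses[i] == "success":
        successes += 1
        i += 1
    return failures >= 2 and successes >= 1
```

For example, `["failure", "failure", "success"]` qualifies, while `["failure", "failure", "unknown", "success"]` does not, because the unknown entry sits between the failures and the last known-good commit.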

Shadow mode: verdicts are not consumed — dispatch is fire-and-forget for
accuracy data collection.
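
A fire-and-forget dispatch can be sketched with GitHub's "create a workflow dispatch event" REST endpoint. The helper name, the input keys (`suspect_commit`, `signal_key`), and the token handling are assumptions for illustration; the workflow's real inputs are not listed in this conversation.

```python
import json
import urllib.request


def build_dispatch_request(owner, repo, workflow_file, ref, inputs, token):
    """Build (but do not send) a workflow_dispatch request."""
    url = (
        f"https://api.github.com/repos/{owner}/{repo}"
        f"/actions/workflows/{workflow_file}/dispatches"
    )
    body = json.dumps({"ref": ref, "inputs": inputs}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


req = build_dispatch_request(
    owner="pytorch",
    repo="pytorch",
    workflow_file="claude-autorevert-advisor.yml",
    ref="main",
    # Hypothetical input names; the real workflow may use different ones.
    inputs={"suspect_commit": "<sha>", "signal_key": "<signal>"},
    token="<token>",
)
```

In shadow mode the caller would send the request (for instance with `urllib.request.urlopen(req)`) and move on without polling the resulting run, since the verdict is not consumed.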

## Changes

- `DispatchAdvisor` dataclass emitted from
`process_valid_autorevert_pattern()` (pure functional)
- `SignalActionProcessor.dispatch_advisors()` — shuffled dispatch with
per-signal dedup and cap (8 per workflow+commit)
- `AdvisorAction` enum (skip/log/run), CLI `--advisor-action`, env
`ADVISOR_ACTION`
- Logged to `misc.autorevert_events_v2` as `action='advisor'`
- Signal pattern JSON written to `/tmp/advisor-patterns/` for debugging
- HUD: "AI" badge on outcome cells + advisor dispatch summary table
- State JSON: optional `advisor_dispatches` key (forward/backward
compatible)
- Default: `AdvisorAction.RUN` (lambda dispatches advisors on deploy, no
gha-infra changes needed)
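
The shuffled dispatch with per-signal dedup and a cap of 8 per workflow+commit could look roughly like the sketch below. `Candidate` and `select_dispatches` are hypothetical names; this is not the code of `SignalActionProcessor.dispatch_advisors()`, only an illustration of the selection rule described in the list above.

```python
import random
from dataclasses import dataclass
from typing import Iterable, List, Optional

MAX_PER_WORKFLOW_COMMIT = 8  # cap stated in the change list


@dataclass(frozen=True)
class Candidate:
    workflow: str
    commit: str
    signal_key: str


def select_dispatches(
    candidates: Iterable[Candidate],
    already_dispatched: Iterable[Candidate] = (),
    rng: Optional[random.Random] = None,
) -> List[Candidate]:
    """Pick advisor dispatches: shuffle, dedup per signal, cap per group."""
    if rng is None:
        rng = random.Random()
    pool = list(candidates)
    # Shuffle so the cap does not always favor the same signals.
    rng.shuffle(pool)

    seen = set(already_dispatched)
    per_group = {}
    for prior in seen:
        key = (prior.workflow, prior.commit)
        per_group[key] = per_group.get(key, 0) + 1

    selected = []
    for cand in pool:
        if cand in seen:
            continue  # this (workflow, commit, signal) was already handled
        key = (cand.workflow, cand.commit)
        if per_group.get(key, 0) >= MAX_PER_WORKFLOW_COMMIT:
            continue  # cap reached for this workflow+commit
        seen.add(cand)
        per_group[key] = per_group.get(key, 0) + 1
        selected.append(cand)
    return selected
```

Counting prior dispatches toward the cap is an assumption of this sketch; it keeps the total bounded across repeated lambda invocations rather than per invocation.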

## Test plan

- [x] 122 tests passing (19 new covering DispatchAdvisor,
execute_advisor, dispatch_advisors, signal pattern JSON, cap/dedup)
- [ ] Deploy and monitor shadow-mode dispatches via ClickHouse + HUD
ZainRizvi added a commit that referenced this pull request Mar 24, 2026
PR #177404 added the claude-autorevert-advisor workflow but missed the
`allowed_bots` input on the `anthropics/claude-code-action@v1` step.
Without this, the action rejects runs triggered by
`pytorch-auto-revert[bot]`, which is the bot that dispatches this
workflow.

Add `allowed_bots: "pytorch-auto-revert[bot]"` so the action accepts
these bot-triggered runs.
AaronWang04 pushed a commit to AaronWang04/pytorch that referenced this pull request Mar 24, 2026
## Summary

Adds a `workflow_dispatch` workflow that the autorevert system can trigger when it detects an early failure pattern. Claude Opus 4.6 analyzes the suspect commit's diff, failed job logs, and PyTorch source code to determine whether the commit actually caused the CI failures.

Returns a structured JSON verdict as an artifact:
- **revert** — causal chain found, proceed to revert immediately
- **unsure** — inconclusive, continue with restart-to-confirm (default behavior unchanged)
- **not_related** — failures unrelated to the change, ignore this signal
- **garbage** — signal is unreliable (infra flake, driver crash), suppress for ~2 hours

Design doc: https://docs.google.com/document/d/1BA9B7cIIKiapI37fSFGDD7D0F-VwMyRKJW0PoS0KkbY/edit

## Evaluation Results (13/13 correct verdicts)

Prototyped and tested on [pytorch/ciforge](https://github.com/pytorch/ciforge). Results across diverse failure types:

### Round 1 (2026-03-12) — 4/4 correct

| Test Case | PR | Failure | Expected | Actual | Job |
|-----------|-----|---------|----------|--------|-----|
| Doc-only change | pytorch#177288 | pca_lowrank stride mismatch | not_related | **not_related @ 0.99** | [job](https://github.com/pytorch/ciforge/actions/runs/23016718498) |
| Dynamo einops fix | pytorch#177165 | detectron2 graph_breaks + test_is_nonzero_mps | not_related | **not_related @ 0.93** | [job](https://github.com/pytorch/ciforge/actions/runs/23016730498) |
| MPS cdouble guard | pytorch#176985 | test_is_nonzero_mps + pca_lowrank | revert | **revert @ 0.95** | [job](https://github.com/pytorch/ciforge/actions/runs/23016740133) |
| Lint missing import | pytorch#176613 | Lint / lintrunner-noclang-all | revert | **revert @ 0.95** | [job](https://github.com/pytorch/ciforge/actions/runs/23013529685) |

### Round 2 (2026-03-13, automated hourly loop) — 9/9 correct (1 cancelled)

| Timestamp | PR | Signal Key | Expected | Actual | Job |
|-----------|-----|-----------|----------|--------|-----|
| 03:12Z | pytorch#176613 | Lint / lintrunner-noclang-all | revert | **revert @ 0.98** | [job](https://github.com/pytorch/ciforge/actions/runs/23034497618) |
| 03:12Z | pytorch#176613 | fsdp/test_fully_shard_comm (test exec) | revert | **revert @ 0.98** | [job](https://github.com/pytorch/ciforge/actions/runs/23034499988) |
| 09:11Z | pytorch#177273 | test-timeout-270min (infra) | — | *cancelled* | [job](https://github.com/pytorch/ciforge/actions/runs/23043982417) |
| 10:12Z | pytorch#176019 | AllenaiLongformerBase fail_to_run (periodic) | garbage | **garbage @ 0.95** | [job](https://github.com/pytorch/ciforge/actions/runs/23046142800) |
| 10:12Z | pytorch#176019 | detectron2_fcos IMPROVED (periodic) | not_related | **not_related @ 0.95** | [job](https://github.com/pytorch/ciforge/actions/runs/23046144261) |
| 11:10Z | pytorch#176019 | functorch_dp_cifar10 fail_accuracy (periodic) | not_related | **not_related @ 0.93** | [job](https://github.com/pytorch/ciforge/actions/runs/23048173319) |
| 11:10Z | pytorch#176019 | basic_gnn_edgecnn IMPROVED (periodic) | not_related | **not_related @ 0.92** | [job](https://github.com/pytorch/ciforge/actions/runs/23048174698) |
| 15:09Z | pytorch#177096 | S3 PutObject IAM denied - ROCm gfx950 (infra) | garbage | **garbage @ 0.95** | [job](https://github.com/pytorch/ciforge/actions/runs/23057146500) |
| 16:09Z | pytorch#176019 | vit_base_patch16_siglip_256 fail_to_run (periodic) | not_related | **not_related @ 0.97** | [job](https://github.com/pytorch/ciforge/actions/runs/23059634364) |
| 16:09Z | pytorch#176019 | shufflenet_v2_x1_0 fail_accuracy (periodic) | not_related | **not_related @ 0.95** | [job](https://github.com/pytorch/ciforge/actions/runs/23059635765) |

### Summary by verdict type

| Verdict | Count | Correct | Avg Confidence |
|---------|-------|---------|----------------|
| revert | 4 | 4/4 | 0.97 |
| garbage | 2 | 2/2 | 0.95 |
| not_related | 7 | 7/7 | 0.94 |

## Test plan

- [x] Prototyped and tested on pytorch/ciforge with 13 real trunk failure cases
- [x] Verified structured JSON output matches schema
- [x] Verified verdict artifact uploads correctly
- [ ] Trigger via GitHub UI with `workflow_dispatch` on pytorch/pytorch to validate bedrock environment works
- [ ] Integrate dispatch call into autorevert lambda (follow-up)
Pull Request resolved: pytorch#177404
Approved by: https://github.com/wdvr
izaitsevfb added a commit to pytorch/test-infra that referenced this pull request Mar 24, 2026
## Summary

- Adds `'advisor' = 3` to the `action` Enum8 column in
`misc.autorevert_events_v2`

The autorevert AI advisor lambda (pytorch/pytorch PR #177404) writes
`action='advisor'` when logging dispatch events to ClickHouse. However,
the table's Enum8 only accepted `none`, `restart`, `revert` — causing
ClickHouse to silently store advisor rows as `action='none'`.

This broke `prior_advisor_exists()` and `advisor_count_for_commit()`,
which query `WHERE action = 'advisor'`: they always returned false/0,
so the lambda re-dispatched the advisor workflow for the same (commit,
signal) every ~5 minutes indefinitely.

Already applied live:
```sql
ALTER TABLE misc.autorevert_events_v2
  MODIFY COLUMN `action` Enum8('none' = 0, 'restart' = 1, 'revert' = 2, 'advisor' = 3)
```
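
The failure mode described above can be reproduced with a small in-memory model. The sketch assumes, as the commit message reports, that a label missing from the enum ends up stored as `'none'`. The `store` and `has_prior_advisor` helpers are toy stand-ins, not the lambda's ClickHouse code.

```python
ENUM_BEFORE = {"none": 0, "restart": 1, "revert": 2}
ENUM_AFTER = {**ENUM_BEFORE, "advisor": 3}


def store(rows, enum, action, commit, signal):
    """Append an event row; unknown labels fall back to 'none'.

    This models the behavior reported in the commit message, not
    ClickHouse itself.
    """
    stored = action if action in enum else "none"
    rows.append({"action": stored, "commit": commit, "signal": signal})


def has_prior_advisor(rows, commit, signal):
    """Toy equivalent of a `WHERE action = 'advisor'` dedup lookup."""
    return any(
        r["action"] == "advisor" and r["commit"] == commit and r["signal"] == signal
        for r in rows
    )


# Before the schema change: the row is written but never found again,
# so every lambda run decides to dispatch once more.
old_rows = []
store(old_rows, ENUM_BEFORE, "advisor", "abc123", "lint")
found_before = has_prior_advisor(old_rows, "abc123", "lint")

# After adding 'advisor' = 3: the lookup succeeds and dedup works.
new_rows = []
store(new_rows, ENUM_AFTER, "advisor", "abc123", "lint")
found_after = has_prior_advisor(new_rows, "abc123", "lint")
```

With the old enum `found_before` is `False`, which matches the repeated dispatches every ~5 minutes; with the extended enum `found_after` is `True`.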

## Test plan
- [x] Verify schema file matches expected Enum values
- [x] Apply ALTER TABLE on the live ClickHouse instance
- [ ] Confirm subsequent advisor dispatches create rows with
`action='advisor'`
- [ ] Confirm `prior_advisor_exists()` returns true after first
dispatch, preventing duplicates
Copilot AI pushed a commit that referenced this pull request Mar 27, 2026
Co-authored-by: Xia-Weiwen <12522207+Xia-Weiwen@users.noreply.github.com>
can-gaa-hou pushed a commit to cosdt/test-infra that referenced this pull request Mar 28, 2026
can-gaa-hou pushed a commit to cosdt/test-infra that referenced this pull request Mar 28, 2026
EmanueleCoradin pushed a commit to EmanueleCoradin/pytorch that referenced this pull request Mar 30, 2026
AaronWang04 pushed a commit to AaronWang04/pytorch that referenced this pull request Mar 31, 2026
xuhancn pushed a commit to xuhancn/pytorch that referenced this pull request Apr 2, 2026
nklshy-aws pushed a commit to nklshy-aws/pytorch that referenced this pull request Apr 7, 2026
Labels

ci-no-td Do not run TD on this PR Merged topic: not user facing topic category
