
[Bugfix][Dynamo] Fix einops 0.6.1 with backwards patch #177165

Closed
Lucaskabela wants to merge 2 commits into main from lucaskabela/einops_fix_061

@Lucaskabela
Contributor

@Lucaskabela Lucaskabela commented Mar 11, 2026

Summary

Backports the fix from einops 0.7.0 into einops <=0.6.1 via monkeypatching

Test

python test/dynamo/test_einops.py
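
For readers unfamiliar with the approach, a version-gated monkeypatch can be sketched roughly as below. This is only an illustration and not the code in this PR: `_broken_impl`, `_fixed_impl`, `parse_version`, `maybe_backport`, and the `SimpleNamespace` stand-in libraries are all hypothetical names, and the real patch targets einops internals that are not shown in this thread.

```python
import types


def _broken_impl(x):
    # Hypothetical stand-in for the code path that fails on old versions.
    raise TypeError("unhashable type")


def _fixed_impl(x):
    # Hypothetical stand-in for the behavior shipped in the newer release.
    return x * 2


def parse_version(version: str) -> tuple:
    """Turn '0.6.1' into (0, 6, 1); non-numeric parts are ignored."""
    return tuple(int(part) for part in version.split(".") if part.isdigit())


def maybe_backport(lib, last_broken: str = "0.6.1") -> bool:
    """Swap in the fixed implementation when lib is at or below last_broken."""
    if parse_version(lib.__version__) <= parse_version(last_broken):
        lib.impl = _fixed_impl
        return True
    return False


# Stand-in "libraries" with a version attribute, for demonstration only.
old_lib = types.SimpleNamespace(__version__="0.6.0", impl=_broken_impl)
new_lib = types.SimpleNamespace(__version__="0.7.0", impl=_fixed_impl)

patched_old = maybe_backport(old_lib)  # old version: patch applied
patched_new = maybe_backport(new_lib)  # new version: left untouched
```

Gating on `<=` rather than `==` matches the conclusion reached later in the thread, where 0.6.0 was found to fail the same way as 0.6.1.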

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @kadeng @chauhang @amjames @jataylo

@pytorch-bot

pytorch-bot bot commented Mar 11, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/177165

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 6 Pending, 3 Unrelated Failures

As of commit 5c10795 with merge base f249065 (image):

NEW FAILURE - The following job has failed:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

BROKEN TRUNK - The following job failed but was present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the ciflow/inductor, ciflow/torchtitan (Run TorchTitan integration tests), and module: dynamo labels Mar 11, 2026
@pytorch-bot

pytorch-bot bot commented Mar 11, 2026

This PR needs a release notes: label

If your changes are user facing and intended to be a part of release notes, please use a label starting with release notes:.

If not, please add the topic: not user facing label.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "topic: not user facing"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

@Lucaskabela Lucaskabela added the ciflow/dynamo Trigger jobs ran periodically on main for dynamo tests label Mar 11, 2026
@Lucaskabela Lucaskabela marked this pull request as ready for review March 11, 2026 20:44
@Lucaskabela Lucaskabela changed the title from "Fix einops 0.6.1 with patch" to "[Bugfix][Dynamo] Fix einops 0.6.1 with backwards patch" Mar 11, 2026
@mlazos
Contributor

mlazos commented Mar 11, 2026

Did you test this on different einops versions?

@Lucaskabela
Contributor Author

Yep, this is covered by the @parametrize("version", [einops_version_sanitized]) in the unit test

@Lucaskabela
Contributor Author

@pytorchbot label "topic: not user facing"

@pytorch-bot pytorch-bot bot added the topic: not user facing topic category label Mar 11, 2026
@Lucaskabela Lucaskabela linked an issue Mar 11, 2026 that may be closed by this pull request
@williamwen42
Member

Would like to get @guilhermeleobas's eyes on this too

@guilhermeleobas
Collaborator

@Lucaskabela It seems that only einops 0.6.1 is crashing? If so, can you restrict the fix to this version?

@Lucaskabela
Contributor Author

> @Lucaskabela It seems that only einops 0.6.1 is crashing? If so, can you restrict the fix to this version?

I backtested with einops==0.6.0 - if we restrict the patch to einops 0.6.1 only, 0.6.0 fails with the same SymInt error:

    lambda: unimplemented(
  File "/data/users/lucaskabela/pytorch/torch/_dynamo/exc.py", line 632, in unimplemented
    raise Unsupported(
torch._dynamo.exc.TorchRuntimeError: RuntimeError when making fake tensor call
  Explanation: Dynamo failed to run FX node with fake tensors: call_function <function repeat at 0x7f895e60b7f0>(*(FakeTensor(..., size=(s48, s86, s93)), 'a b c -> a b c 4'), **{}): got TypeError('unhashable type: non-nested SymInt')
  Hint: Your code may result in an error when running in eager. Please double check that your code doesn't contain a similar error when actually running eager/uncompiled. You can do this by removing the `torch.compile` call, or by using `torch.compiler.set_stance("force_eager")`. 

  Developer debug context: 

 For more details about this graph break, please visit: https://meta-pytorch.github.io/compile-graph-break-site/gb/gb4315.html

from user code:
   File "/data/users/lucaskabela/pytorch/test/dynamo/test_einops.py", line 59, in forward
    x_abcd = repeat(x_abc, suf("a b c -> a b c 4"))

So imo the correct thing to do is just backport this for all versions <=0.6.1
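
The `TypeError('unhashable type: non-nested SymInt')` in the traceback is what one would expect when a cache keyed on shape values receives a symbolic size. A minimal sketch of one way to tolerate that, assuming the failure comes from hashing the cache key, is shown below. It is not claimed to be the actual einops 0.7.0 change; `SymbolicDim`, `plan_cached`, `plan_uncached`, and `plan` are hypothetical names used only for illustration.

```python
from functools import lru_cache


class SymbolicDim:
    """Hypothetical stand-in for a symbolic size that cannot be hashed."""

    __hash__ = None  # makes instances unhashable, like the SymInt in the log

    def __init__(self, hint: int):
        self.hint = hint


def plan_uncached(shape: tuple) -> tuple:
    # Toy "plan": append a trailing axis of length 4, loosely mirroring
    # the 'a b c -> a b c 4' pattern from the traceback.
    return shape + (4,)


@lru_cache(maxsize=256)
def plan_cached(shape: tuple) -> tuple:
    return plan_uncached(shape)


def plan(shape: tuple) -> tuple:
    try:
        return plan_cached(shape)
    except TypeError:
        # The cache key held an unhashable (symbolic) value, so skip the cache.
        return plan_uncached(shape)


concrete = plan((2, 3, 5))
sym = SymbolicDim(7)
symbolic = plan((sym, 3, 5))
```

Concrete shapes still hit the cache, while shapes containing a symbolic entry fall through to the uncached path instead of raising.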

@guilhermeleobas
Collaborator

Can you update test_einops in test.sh to test for older versions as well?

@Lucaskabela Lucaskabela requested a review from a team as a code owner March 12, 2026 17:39
else:
    einops_version = "none"
einops_version_sanitized = einops_version.replace(".", "_")
HAS_EINOPS_PACK = HAS_EINOPS and hasattr(einops, "pack")
Contributor Author

pack/unpack are not defined in einops 0.5.0, so we have a sanity check here
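
A rough sketch of how such a flag could gate a test is given below. Only the `else` branch, `einops_version_sanitized`, and `HAS_EINOPS_PACK` appear in the snippet under review; the `try`/`except` import, the `PackTests` class, and its test body are assumptions added for illustration, and the real test file may be organized differently.

```python
import importlib
import unittest

try:
    # Assumed import guard; the reviewed snippet only shows the fallback.
    einops = importlib.import_module("einops")
    HAS_EINOPS = True
    einops_version = einops.__version__
except ImportError:
    einops = None
    HAS_EINOPS = False
    einops_version = "none"

# Dots are replaced so the version can be embedded in a test name.
einops_version_sanitized = einops_version.replace(".", "_")
HAS_EINOPS_PACK = HAS_EINOPS and hasattr(einops, "pack")


class PackTests(unittest.TestCase):
    """Hypothetical test case that is skipped when pack is unavailable."""

    @unittest.skipUnless(HAS_EINOPS_PACK, "einops.pack is not available")
    def test_pack_is_callable(self):
        self.assertTrue(callable(einops.pack))
```

With this shape, an environment without einops or without `pack` reports the test as skipped rather than failing at attribute lookup.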

@Lucaskabela
Contributor Author

@guilhermeleobas - updated with 0.5.0 on here so ready for another look!

@Lucaskabela Lucaskabela added the ciflow/trunk Trigger trunk jobs on your pull request label Mar 12, 2026
@Lucaskabela
Contributor Author

@pytorchmergebot merge -i

@pytorchmergebot
Collaborator

pytorchmergebot pushed a commit that referenced this pull request Mar 13, 2026
## Summary

Adds a `workflow_dispatch` workflow that the autorevert system can trigger when it detects an early failure pattern. Claude Opus 4.6 analyzes the suspect commit's diff, failed job logs, and PyTorch source code to determine whether the commit actually caused the CI failures.

Returns a structured JSON verdict as an artifact:
- **revert** — causal chain found, proceed to revert immediately
- **unsure** — inconclusive, continue with restart-to-confirm (default behavior unchanged)
- **not_related** — failures unrelated to the change, ignore this signal
- **garbage** — signal is unreliable (infra flake, driver crash), suppress for ~2 hours
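
As a rough illustration of how a consumer might act on the four verdicts above, consider the sketch below. The JSON field name `verdict`, the `Verdict` enum, the `ACTIONS` table, and `next_action` are assumptions for illustration only; the actual artifact schema is not shown here. Falling back to `unsure` on malformed input is also an assumption, chosen because `unsure` is described as leaving the default behavior unchanged.

```python
import json
from enum import Enum


class Verdict(Enum):
    REVERT = "revert"
    UNSURE = "unsure"
    NOT_RELATED = "not_related"
    GARBAGE = "garbage"


# Follow-up action per verdict, paraphrasing the list above.
ACTIONS = {
    Verdict.REVERT: "revert immediately",
    Verdict.UNSURE: "restart to confirm",
    Verdict.NOT_RELATED: "ignore signal",
    Verdict.GARBAGE: "suppress signal for ~2 hours",
}


def next_action(artifact_text: str) -> str:
    """Map a verdict artifact (JSON text) to a follow-up action."""
    payload = json.loads(artifact_text)
    try:
        verdict = Verdict(payload["verdict"])
    except (KeyError, ValueError):
        # Assumed fallback: a missing or unknown verdict is treated as unsure.
        verdict = Verdict.UNSURE
    return ACTIONS[verdict]
```

For example, `next_action('{"verdict": "garbage"}')` yields the suppression action, while an unrecognized verdict string falls back to restart-to-confirm.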

Design doc: https://docs.google.com/document/d/1BA9B7cIIKiapI37fSFGDD7D0F-VwMyRKJW0PoS0KkbY/edit

## Evaluation Results (13/13 correct verdicts)

Prototyped and tested on [pytorch/ciforge](https://github.com/pytorch/ciforge). Results across diverse failure types:

### Round 1 (2026-03-12) — 4/4 correct

| Test Case | PR | Failure | Expected | Actual | Job |
|-----------|-----|---------|----------|--------|-----|
| Doc-only change | #177288 | pca_lowrank stride mismatch | not_related | **not_related @ 0.99** | [job](https://github.com/pytorch/ciforge/actions/runs/23016718498) |
| Dynamo einops fix | #177165 | detectron2 graph_breaks + test_is_nonzero_mps | not_related | **not_related @ 0.93** | [job](https://github.com/pytorch/ciforge/actions/runs/23016730498) |
| MPS cdouble guard | #176985 | test_is_nonzero_mps + pca_lowrank | revert | **revert @ 0.95** | [job](https://github.com/pytorch/ciforge/actions/runs/23016740133) |
| Lint missing import | #176613 | Lint / lintrunner-noclang-all | revert | **revert @ 0.95** | [job](https://github.com/pytorch/ciforge/actions/runs/23013529685) |

### Round 2 (2026-03-13, automated hourly loop) — 9/9 correct (1 cancelled)

| Timestamp | PR | Signal Key | Expected | Actual | Job |
|-----------|-----|-----------|----------|--------|-----|
| 03:12Z | #176613 | Lint / lintrunner-noclang-all | revert | **revert @ 0.98** | [job](https://github.com/pytorch/ciforge/actions/runs/23034497618) |
| 03:12Z | #176613 | fsdp/test_fully_shard_comm (test exec) | revert | **revert @ 0.98** | [job](https://github.com/pytorch/ciforge/actions/runs/23034499988) |
| 09:11Z | #177273 | test-timeout-270min (infra) | — | *cancelled* | [job](https://github.com/pytorch/ciforge/actions/runs/23043982417) |
| 10:12Z | #176019 | AllenaiLongformerBase fail_to_run (periodic) | garbage | **garbage @ 0.95** | [job](https://github.com/pytorch/ciforge/actions/runs/23046142800) |
| 10:12Z | #176019 | detectron2_fcos IMPROVED (periodic) | not_related | **not_related @ 0.95** | [job](https://github.com/pytorch/ciforge/actions/runs/23046144261) |
| 11:10Z | #176019 | functorch_dp_cifar10 fail_accuracy (periodic) | not_related | **not_related @ 0.93** | [job](https://github.com/pytorch/ciforge/actions/runs/23048173319) |
| 11:10Z | #176019 | basic_gnn_edgecnn IMPROVED (periodic) | not_related | **not_related @ 0.92** | [job](https://github.com/pytorch/ciforge/actions/runs/23048174698) |
| 15:09Z | #177096 | S3 PutObject IAM denied - ROCm gfx950 (infra) | garbage | **garbage @ 0.95** | [job](https://github.com/pytorch/ciforge/actions/runs/23057146500) |
| 16:09Z | #176019 | vit_base_patch16_siglip_256 fail_to_run (periodic) | not_related | **not_related @ 0.97** | [job](https://github.com/pytorch/ciforge/actions/runs/23059634364) |
| 16:09Z | #176019 | shufflenet_v2_x1_0 fail_accuracy (periodic) | not_related | **not_related @ 0.95** | [job](https://github.com/pytorch/ciforge/actions/runs/23059635765) |

### Summary by verdict type

| Verdict | Count | Correct | Avg Confidence |
|---------|-------|---------|----------------|
| revert | 4 | 4/4 | 0.97 |
| garbage | 2 | 2/2 | 0.95 |
| not_related | 7 | 7/7 | 0.94 |

## Test plan

- [x] Prototyped and tested on pytorch/ciforge with 13 real trunk failure cases
- [x] Verified structured JSON output matches schema
- [x] Verified verdict artifact uploads correctly
- [ ] Trigger via GitHub UI with `workflow_dispatch` on pytorch/pytorch to validate bedrock environment works
- [ ] Integrate dispatch call into autorevert lambda (follow-up)
Pull Request resolved: #177404
Approved by: https://github.com/wdvr
AaronWang04 pushed a commit to AaronWang04/pytorch that referenced this pull request Mar 24, 2026
EmanueleCoradin pushed a commit to EmanueleCoradin/pytorch that referenced this pull request Mar 30, 2026
## Summary

Backports the fix from einops 0.7.0 into einops <=0.6.1 via monkeypatching

## Test
```bash
python test/dynamo/test_einops.py
```

Pull Request resolved: pytorch#177165
Approved by: https://github.com/mlazos, https://github.com/guilhermeleobas
EmanueleCoradin pushed a commit to EmanueleCoradin/pytorch that referenced this pull request Mar 30, 2026
@github-actions github-actions bot deleted the lucaskabela/einops_fix_061 branch April 12, 2026 02:26

Labels

ciflow/dynamo (Trigger jobs ran periodically on main for dynamo tests)
ciflow/inductor
ciflow/torchtitan (Run TorchTitan integration tests)
ciflow/trunk (Trigger trunk jobs on your pull request)
Merged
module: dynamo
topic: not user facing (topic category)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

einops 0.6.1 x torch.compile broken in pytorch nightlies

5 participants