[pt2 bug bash] Fix nn.functional.pad compile crash with deterministic mode + replication padding#177166
ydwu4 wants to merge 4 commits into gh/ydwu4/389/base from …
Conversation
Fixes #170079

## Context

`torch.compile(ReplicationPad1d(...), fullgraph=True)` crashes when `torch.use_deterministic_algorithms(True)` is set on CUDA. The error: Dynamo can't trace through `importlib.import_module`.

The deterministic code path exists because the native `replication_pad1d_backward` CUDA kernel uses `atomicAdd` (non-deterministic). `functional.py` calls `_replication_pad` — a Python decomposition using `_unsafe_index`, whose backward uses `index_put` (deterministic).

## Dynamo limitations encountered

Three separate Dynamo tracing barriers prevented calling `_replication_pad` directly:

### 1. `importlib.import_module` is marked as skipped

```python
@torch.compile(fullgraph=True)
def fn(x):
    import importlib
    return importlib.import_module("torch").sin(x)

fn(torch.randn(3))  # Unsupported: function marked as skipped
```

### 2. `elementwise_dtypes` returns a non-Tensor (from `@pw_cast_for_opmath`)

```python
@torch.compile(fullgraph=True)
def fn(x):
    from torch._prims_common import elementwise_dtypes, ELEMENTWISE_TYPE_PROMOTION_KIND
    dt, _ = elementwise_dtypes(x, type_promotion_kind=ELEMENTWISE_TYPE_PROMOTION_KIND.DEFAULT)
    return x.to(dt)

fn(torch.randn(3))  # Unsupported: torch.* op returned non-Tensor
```

### 3. `torch._check` with a closure lambda

```python
@torch.compile(fullgraph=True)
def fn(x):
    dim = x.dim()
    torch._check(dim in (2, 3), lambda: f"expected 2D or 3D, got {dim}D")
    return x + 1

fn(torch.randn(3, 3))  # Unsupported: Can't extract message from torch._check()
```

## Iteration log

| # | Approach | Who | Tests | Reviewer pushback | Why it failed |
|---|----------|-----|-------|-------------------|---------------|
| 1 | Replace `importlib` with `from...import` | Claude | bilinear/trilinear pass, replicate fails | "why do we need bilinear/trilinear tests?" — scoped fix to reported bug only | Hit limitation #2: `@pw_cast_for_opmath` |
| 2 | Skip decomposition under compile via `is_compiling()`, rely on AOTAutograd's `@register_decomposition` | Claude | forward-only `backend="eager"` passes | "can you verify at inductor level this is actually deterministic?" — inspect AOT graph | No backward decomposition registered; backward still uses the native `replication_pad1d_backward` (non-deterministic) |
| 3 | Unwrap `@pw_cast_for_opmath` via `__wrapped__` | Claude | N/A — fails immediately | N/A | Hit limitation #3: `torch._check()` closure |
| 4 | `@nonstrict_trace` — Dynamo skips the body, AOTAutograd traces through | Reviewer suggestion | `backend="aot_eager"`, forward + backward under `DeterministicGuard(True)` | N/A — fix is correct | N/A |

## Key insight

The fix isn't about making Dynamo trace the decomposition or skipping it entirely — it's about putting the boundary in the right place. Dynamo doesn't need to see inside; AOTAutograd does. `@nonstrict_trace` is exactly this boundary.

Each "obvious" fix had passing tests that weren't testing the right thing. Only when the reviewer pushed for backward determinism verification and AOT graph inspection did the weaknesses surface. The backward completing without error under `DeterministicGuard(True)` proves determinism — PyTorch explicitly raises `RuntimeError` if any non-deterministic CUDA kernel executes under this mode.

Authored with Claude.
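The `_unsafe_index` decomposition mentioned above replaces the dedicated padding kernel with a gather over edge-clamped indices; the backward of a gather is an `index_put`-style scatter, which PyTorch implements deterministically, unlike the `atomicAdd`-based native backward kernel. A minimal NumPy sketch of that idea (function name and shape handling are illustrative, not the actual decomposition):

```python
import numpy as np

def replication_pad1d(x, pad_left, pad_right):
    # Build gather indices that run past both edges, then clamp them
    # into range: out-of-bounds positions replicate the edge values.
    n = x.shape[-1]
    idx = np.arange(-pad_left, n + pad_right)
    idx = np.clip(idx, 0, n - 1)
    # A plain gather; its gradient is a scatter-add over `idx`,
    # which can be made deterministic (index_put), unlike atomicAdd.
    return x[..., idx]

x = np.array([1.0, 2.0, 3.0])
print(replication_pad1d(x, 2, 1))  # [1. 1. 1. 2. 3. 3.]
```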
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/177166. Note: links to docs will display an error until the docs builds have been completed.

✅ No failures as of commit 3fb1255 with merge base 8e7898a. This comment was automatically generated by Dr. CI and updates every 15 minutes.
```python
# nonstrict_trace makes Dynamo skip the function body
# (which contains Dynamo-untraceable code) while
# AOTAutograd still traces into it for the backward.
return torch._dynamo.decorators.nonstrict_trace(
```
I'm curious what's in the body that makes it non-Dynamo-traceable but AOTAutograd-traceable? Usually it's the other way around.
It's `elementwise_dtypes` returning a non-Tensor (from `@pw_cast_for_opmath`) and `torch._check` with a closure lambda; see the PR description for minimal repros! Potentially we could make them work, but that seems to need some additional effort. Can do them as follow-ups!
mlazos
left a comment
Seems fine, just curious if we can make the code dynamo traceable
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: Command … Details for Dev Infra team: raised by workflow job.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
… mode + replication padding (pytorch#177166) Fixes pytorch#170079 ## Context `torch.compile(ReplicationPad1d(...), fullgraph=True)` crashes when `torch.use_deterministic_algorithms(True)` is set on CUDA. The error: Dynamo can't trace through `importlib.import_module`. The deterministic code path exists because the native `replication_pad1d_backward` CUDA kernel uses `atomicAdd` (non-deterministic). `functional.py` calls `_replication_pad` — a Python decomposition using `_unsafe_index`, whose backward uses `index_put` (deterministic). ## Dynamo limitations encountered Three separate Dynamo tracing barriers prevented calling `_replication_pad` directly: ### 1. `importlib.import_module` is marked as skipped ```python @torch.compile(fullgraph=True) def fn(x): import importlib return importlib.import_module("torch").sin(x) fn(torch.randn(3)) # Unsupported: function marked as skipped ``` ### 2. `elementwise_dtypes` returns non-Tensor (from `@pw_cast_for_opmath`) ```python @torch.compile(fullgraph=True) def fn(x): from torch._prims_common import elementwise_dtypes, ELEMENTWISE_TYPE_PROMOTION_KIND dt, _ = elementwise_dtypes(x, type_promotion_kind=ELEMENTWISE_TYPE_PROMOTION_KIND.DEFAULT) return x.to(dt) fn(torch.randn(3)) # Unsupported: torch.* op returned non-Tensor ``` ### 3. `torch._check` with closure lambda ```python @torch.compile(fullgraph=True) def fn(x): dim = x.dim() torch._check(dim in (2, 3), lambda: f"expected 2D or 3D, got {dim}D") return x + 1 fn(torch.randn(3, 3)) # Unsupported: Can't extract message from torch._check() ``` ## Iteration log | # | Approach | Who | Tests | Reviewer pushback | Why it failed | |---|----------|-----|-------|-------------------|---------------| | 1 | Replace `importlib` with `from...import` | Claude | bilinear/trilinear pass, replicate fails | "why do we need bilinear/trilinear tests?" 
— scoped fix to reported bug only | Hit limitation pytorch#2: `@pw_cast_for_opmath` | | 2 | Skip decomposition under compile via `is_compiling()`, rely on AOTAutograd's `@register_decomposition` | Claude | forward-only `backend="eager"` passes | "can you verify at inductor level this is actually deterministic?" — inspect AOT graph | No backward decomposition registered; backward still uses native `replication_pad1d_backward` (non-deterministic) | | 3 | Unwrap `@pw_cast_for_opmath` via `__wrapped__` | Claude | N/A — fails immediately | N/A | Hit limitation pytorch#3: `torch._check()` closure | | 4 | `@nonstrict_trace` — Dynamo skips body, AOTAutograd traces through | Reviewer suggestion | `backend="aot_eager"`, forward + backward under `DeterministicGuard(True)` | N/A — fix is correct | N/A | ## Key insight The fix isn't about making Dynamo trace the decomposition or skipping it entirely — it's about putting the boundary in the right place. Dynamo doesn't need to see inside; AOTAutograd does. `@nonstrict_trace` is exactly this boundary. Each "obvious" fix had passing tests that weren't testing the right thing. Only when the reviewer pushed for backward determinism verification and AOT graph inspection did the weaknesses surface. The backward completing without error under `DeterministicGuard(True)` proves determinism — PyTorch explicitly raises `RuntimeError` if any non-deterministic CUDA kernel executes under this mode. Authored with Claude. Pull Request resolved: pytorch#177166 Approved by: https://github.com/mlazos, https://github.com/williamwen42
…entwise_dtypes during tracing"

Fix issue 2 discovered in #177166. `elementwise_dtypes` was registered as `TorchInGraphFunctionVariable` via `torch._higher_order_ops.out_dtype`, causing Dynamo to try putting it in the FX graph. Since it returns `(dtype, dtype)` rather than tensors, this failed with "torch.* op returned non-Tensor".

The fix adds a handler that evaluates `elementwise_dtypes` eagerly on fake tensor metadata during compilation and returns the result as constants.

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx kadeng chauhang amjames Lucaskabela jataylo azahed98

[ghstack-poisoned]
…ng (#177743)

Pull Request resolved: #177743
Approved by: https://github.com/anijain2305
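The shape of that fix can be illustrated with a toy, pure-Python sketch (hypothetical names, not the actual Dynamo handler): instead of emitting the metadata call into the graph, evaluate it eagerly at compile time and record its non-Tensor result as a constant.

```python
def handle_constant_metadata_call(graph, fn, *args, **kwargs):
    """Toy model of the handler: `fn` computes metadata (e.g. a
    (dtype, dtype) tuple) rather than tensors, so run it eagerly at
    compile time and bake its result into the graph as a constant
    instead of emitting a call node that would trip the
    "torch.* op returned non-Tensor" check."""
    result = fn(*args, **kwargs)        # evaluated now, on (fake) metadata
    graph.append(("constant", result))  # recorded as a constant, not a call
    return result

def toy_elementwise_dtypes(a, b):
    # Toy stand-in for torch._prims_common.elementwise_dtypes,
    # operating on dtype names instead of real tensors.
    order = ["int32", "int64", "float16", "float32"]
    promoted = max(a, b, key=order.index)
    return promoted, promoted

graph = []
dts = handle_constant_metadata_call(graph, toy_elementwise_dtypes, "int32", "float32")
```

Here `dts` comes back as a plain `("float32", "float32")` tuple and the graph records only a constant, which is the essence of "evaluate eagerly, return as constants".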
Stack from ghstack (oldest at bottom):
Fixes #170079
## Context

`torch.compile(ReplicationPad1d(...), fullgraph=True)` crashes when `torch.use_deterministic_algorithms(True)` is set on CUDA. The error: Dynamo can't trace through `importlib.import_module`.

The deterministic code path exists because the native `replication_pad1d_backward` CUDA kernel uses `atomicAdd` (non-deterministic). `functional.py` calls `_replication_pad` — a Python decomposition using `_unsafe_index`, whose backward uses `index_put` (deterministic).

## Dynamo limitations encountered
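The decomposition's strategy — replicating edge values by gathering with clamped indices rather than running a dedicated padding kernel — can be sketched in plain Python (a toy model, not the actual `_replication_pad` code):

```python
def replication_pad1d(x, pad_left, pad_right):
    """Toy 1D replication pad: gather with indices clamped into the
    valid range. This mirrors how the decomposition expresses padding
    as an index-based gather (via `_unsafe_index` in PyTorch), whose
    backward is an index_put-style scatter rather than the
    atomicAdd-based native kernel."""
    n = len(x)
    return [x[min(max(i - pad_left, 0), n - 1)]
            for i in range(n + pad_left + pad_right)]
```

For example, `replication_pad1d([1.0, 2.0, 3.0], 2, 1)` returns `[1.0, 1.0, 1.0, 2.0, 3.0, 3.0]`: every out-of-range index is clamped to the nearest edge element.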
Three separate Dynamo tracing barriers prevented calling `_replication_pad` directly:

### 1. `importlib.import_module` is marked as skipped

### 2. `elementwise_dtypes` returns non-Tensor (from `@pw_cast_for_opmath`)

### 3. `torch._check` with closure lambda

## Iteration log
| # | Approach | Who | Tests | Reviewer pushback | Why it failed |
|---|----------|-----|-------|-------------------|---------------|
| 1 | Replace `importlib` with `from...import` | Claude | bilinear/trilinear pass, replicate fails | "why do we need bilinear/trilinear tests?" — scoped fix to reported bug only | Hit limitation #2: `@pw_cast_for_opmath` |
| 2 | Skip decomposition under compile via `is_compiling()`, rely on AOTAutograd's `@register_decomposition` | Claude | forward-only `backend="eager"` passes | "can you verify at inductor level this is actually deterministic?" — inspect AOT graph | No backward decomposition registered; backward still uses native `replication_pad1d_backward` (non-deterministic) |
| 3 | Unwrap `@pw_cast_for_opmath` via `__wrapped__` | Claude | N/A — fails immediately | N/A | Hit limitation #3: `torch._check()` closure |
| 4 | `@nonstrict_trace` — Dynamo skips body, AOTAutograd traces through | Reviewer suggestion | `backend="aot_eager"`, forward + backward under `DeterministicGuard(True)` | N/A — fix is correct | N/A |

## Key insight
The fix isn't about making Dynamo trace the decomposition or skipping it entirely — it's about putting the boundary in the right place. Dynamo doesn't need to see inside; AOTAutograd does. `@nonstrict_trace` is exactly this boundary.

Each "obvious" fix had passing tests that weren't testing the right thing. Only when the reviewer pushed for backward determinism verification and AOT graph inspection did the weaknesses surface. The backward completing without error under `DeterministicGuard(True)` proves determinism — PyTorch explicitly raises `RuntimeError` if any non-deterministic CUDA kernel executes under this mode.
Authored with Claude.
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @kadeng @chauhang @amjames @Lucaskabela @jataylo