Add type annotations to verify_dynamo tool#2

Closed
bobrenjc93 wants to merge 1 commit into main from type-annotate-verify-dynamo

Conversation


@bobrenjc93 bobrenjc93 commented Mar 7, 2026

Summary

  • add explicit type aliases and protocols to tools/dynamo/verify_dynamo.py
  • annotate the previously untyped helper functions and local Dynamo sanity-check callables
  • keep runtime behavior unchanged while making the script easier to type-check and maintain
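The protocol-based annotations described above might look like the following minimal sketch. The names `SanityCheck` and `run_checks` are hypothetical illustrations, not taken from the actual diff to `tools/dynamo/verify_dynamo.py`:

```python
from typing import Protocol


class SanityCheck(Protocol):
    """Hypothetical protocol for a zero-argument Dynamo sanity-check callable."""

    def __call__(self) -> None: ...


def run_checks(checks: list[SanityCheck]) -> int:
    """Run each check and return the number of checks that raised."""
    failures = 0
    for check in checks:
        try:
            check()
        except Exception:
            failures += 1
    return failures


def ok() -> None:
    pass


def broken() -> None:
    raise RuntimeError("dynamo sanity check failed")


print(run_checks([ok, broken]))  # 1 failure out of 2 checks
```

A `Protocol` keeps the callables structurally typed, so local functions and lambdas satisfy the annotation without inheriting from a base class.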

Validation

  • python3 -m py_compile tools/dynamo/verify_dynamo.py
  • git diff --check

bobrenjc93 pushed a commit that referenced this pull request Mar 16, 2026
…nces between x86 vs aarch64 (pytorch#176085)

In the test:

```
python  test/cpp_extensions/test_libtorch_agnostic.py TestLibtorchAgnosticCUDA.test_std_cuda_check_error_show_cpp_stacktraces_True_cuda
```
 it raises an exception when calling `STD_CUDA_CHECK(cudaSetDevice(99999));`, which produces the expected `CUDA error: invalid device` message. However, the expected string for the C++ stack trace differs between `x86` and `aarch64`, perhaps due to these issues:
  - pytorch#119905
  - pytorch#134387

In the current setup, when getting a stack trace string:
- x86 contains `C++ CapturedTraceback:`
- aarch64 contains `Exception raised from` + `frame #`
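One way a test could accommodate both outputs is to select the expected marker strings by architecture. This is a hedged sketch of that idea, not the actual change made in pytorch#176085:

```python
# Sketch: pick the expected C++ stack-trace markers by machine architecture.
# The marker strings come from the observed x86 vs aarch64 outputs above;
# the function names here are illustrative, not from the PR.
def expected_markers(arch: str) -> list[str]:
    if arch in ("aarch64", "arm64"):
        return ["Exception raised from", "frame #"]
    return ["C++ CapturedTraceback:"]


def trace_matches(trace: str, arch: str) -> bool:
    """True if the stack-trace string contains every expected marker for arch."""
    return all(marker in trace for marker in expected_markers(arch))


aarch64_trace = "Exception raised from foo at bar.cu:23\nframe #0: c10::Error"
print(trace_matches(aarch64_trace, "aarch64"))  # True
print(trace_matches(aarch64_trace, "x86_64"))   # False
```

In a real test one would feed `platform.machine()` in as `arch` so the same assertion works on both CI architectures.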

An example of the full string from an aarch64 system:
```
AssertionError: 'C++ CapturedTraceback:' not found in 'CUDA error: invalid device ordinal\nGPU device may be out of range, do you have enough GPUs?\nCUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.\nFor debugging consider passing CUDA_LAUNCH_BLOCKING=1\nCompile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.\n\nException raised from test_std_cuda_check_error at /opt/pytorch/pytorch/test/cpp_extensions/libtorch_agn_2_10_extension/csrc/test_std_cuda_check.cu:23 (most recent call first):\nframe #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0xd4 (0xe471ebcd39f4 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)\nframe #1: <unknown function> + 0x43f998 (0xe471ebdcf998 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10_cuda.so)\nframe #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, unsigned int, bool) + 0x1bc (0xe471ebdcfc0c in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10_cuda.so)\nframe #3: torch_c10_cuda_check_msg + 0x1c (0xe471ef335c4c in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)\nframe #4: test_std_cuda_check_error() + 0x58 (0xe470cd396678 in /opt/pytorch/pytorch/test/cpp_extensions/libtorch_agn_2_10_extension/install/usr/local/lib/python3.12/dist-packages/libtorch_agn_2_10/_C.so)\nframe pytorch#5: c10::BoxedKernel::makeFromFunctor<StableIValueBoxedKernel>(std::unique_ptr<StableIValueBoxedKernel, std::default_delete<StableIValueBoxedKernel> >)::{lambda(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*)#1}::_FUN(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) + 0x16c (0xe47211cd419c in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)\nframe 
pytorch#6: <unknown function> + 0x61d34bc (0xe47211cf34bc in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)\nframe pytorch#7: <unknown function> + 0xe6c324 (0xe4721532c324 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so)\nframe pytorch#8: <unknown function> + 0xe6c7e0 (0xe4721532c7e0 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so)\nframe pytorch#9: <unknown function> + 0xd3907c (0xe472151f907c in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so)\nframe pytorch#10: <unknown function> + 0x5ccbf8 (0xe47214a8cbf8 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so)\nframe pytorch#11: /usr/bin/python() [0x504a34]\nframe pytorch#12: PyObject_Call + 0x6c (0x4c633c in /usr/bin/python)\nframe pytorch#13: _PyEval_EvalFrameDefault + 0x3ea0 (0x568564 in /usr/bin/python)\nframe pytorch#14: _PyObject_Call_Prepend + 0xc4 (0x4c5934 in /usr/bin/python)\nframe pytorch#15: /usr/bin/python() [0x52a070]\nframe pytorch#16: _PyObject_MakeTpCall + 0x78 (0x4c3e58 in /usr/bin/python)\nframe pytorch#17: _PyEval_EvalFrameDefault + 0x8a0 (0x564f64 in /usr/bin/python)\nframe pytorch#18: PyEval_EvalCode + 0x130 (0x5632b4 in /usr/bin/python)\nframe pytorch#19: PyRun_StringFlags + 0xe0 (0x59c330 in /usr/bin/python)\nframe pytorch#20: PyRun_SimpleStringFlags + 0x44 (0x67ebc4 in /usr/bin/python)\nframe pytorch#21: Py_RunMain + 0x390 (0x68b380 in /usr/bin/python)\nframe pytorch#22: Py_BytesMain + 0x28 (0x68ae88 in /usr/bin/python)\nframe pytorch#23: <unknown function> + 0x284c4 (0xe47216b084c4 in /lib/aarch64-linux-gnu/libc.so.6)\nframe pytorch#24: __libc_start_main + 0x98 (0xe47216b08598 in /lib/aarch64-linux-gnu/libc.so.6)\nframe pytorch#25: _start + 0x30 (0x5f6770 in /usr/bin/python)\n\n'

To execute this test, run the following from the base repo dir:
    python test/cpp_extensions/test_libtorch_agnostic.py TestLibtorchAgnosticCUDA.test_std_cuda_check_error_show_cpp_stacktraces_True_cuda
```

Pull Request resolved: pytorch#176085
Approved by: https://github.com/eqy
bobrenjc93 pushed a commit that referenced this pull request Mar 18, 2026
… mode + replication padding (pytorch#177166)

Fixes pytorch#170079

## Context

`torch.compile(ReplicationPad1d(...), fullgraph=True)` crashes when
`torch.use_deterministic_algorithms(True)` is set on CUDA. The error: Dynamo can't trace
through `importlib.import_module`.

The deterministic code path exists because the native `replication_pad1d_backward` CUDA
kernel uses `atomicAdd` (non-deterministic). `functional.py` calls `_replication_pad` — a
Python decomposition using `_unsafe_index`, whose backward uses `index_put` (deterministic).

## Dynamo limitations encountered

Three separate Dynamo tracing barriers prevented calling `_replication_pad` directly:

### 1. `importlib.import_module` is marked as skipped

```python
@torch.compile(fullgraph=True)
def fn(x):
    import importlib
    return importlib.import_module("torch").sin(x)
fn(torch.randn(3))  # Unsupported: function marked as skipped
```

### 2. `elementwise_dtypes` returns non-Tensor (from `@pw_cast_for_opmath`)

```python
@torch.compile(fullgraph=True)
def fn(x):
    from torch._prims_common import elementwise_dtypes, ELEMENTWISE_TYPE_PROMOTION_KIND
    dt, _ = elementwise_dtypes(x, type_promotion_kind=ELEMENTWISE_TYPE_PROMOTION_KIND.DEFAULT)
    return x.to(dt)
fn(torch.randn(3))  # Unsupported: torch.* op returned non-Tensor
```

### 3. `torch._check` with closure lambda

```python
@torch.compile(fullgraph=True)
def fn(x):
    dim = x.dim()
    torch._check(dim in (2, 3), lambda: f"expected 2D or 3D, got {dim}D")
    return x + 1
fn(torch.randn(3, 3))  # Unsupported: Can't extract message from torch._check()
```

## Iteration log

| # | Approach | Who | Tests | Reviewer pushback | Why it failed |
|---|----------|-----|-------|-------------------|---------------|
| 1 | Replace `importlib` with `from...import` | Claude | bilinear/trilinear pass, replicate fails | "why do we need bilinear/trilinear tests?" — scoped fix to reported bug only | Hit limitation #2: `@pw_cast_for_opmath` |
| 2 | Skip decomposition under compile via `is_compiling()`, rely on AOTAutograd's `@register_decomposition` | Claude | forward-only `backend="eager"` passes | "can you verify at inductor level this is actually deterministic?" — inspect AOT graph | No backward decomposition registered; backward still uses native `replication_pad1d_backward` (non-deterministic) |
| 3 | Unwrap `@pw_cast_for_opmath` via `__wrapped__` | Claude | N/A — fails immediately | N/A | Hit limitation #3: `torch._check()` closure |
| 4 | `@nonstrict_trace` — Dynamo skips body, AOTAutograd traces through | Reviewer suggestion | `backend="aot_eager"`, forward + backward under `DeterministicGuard(True)` | N/A — fix is correct | N/A |

## Key insight

The fix isn't about making Dynamo trace the decomposition or skipping it entirely — it's
about putting the boundary in the right place. Dynamo doesn't need to see inside; AOTAutograd
does. `@nonstrict_trace` is exactly this boundary.
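The boundary idea can be illustrated with a toy analogue. This is NOT the real Dynamo machinery or the `@nonstrict_trace` API; it is a self-contained sketch of the concept: the outer tracer records a marked function as a single opaque node instead of walking its body, while the body still executes for the inner stage to see:

```python
# Toy analogue of the @nonstrict_trace boundary (not real Dynamo code):
# the outer "tracer" skips the body of an opaque-marked function, but the
# body still runs, so an inner stage (AOTAutograd's role) can observe it.
def nonstrict(fn):
    """Mark fn as opaque to the outer tracer."""
    fn._opaque = True
    return fn


def outer_trace(fn, *args):
    """Stand-in for the outer tracer: one opaque node, or inline the call."""
    graph = []
    if getattr(fn, "_opaque", False):
        graph.append(f"opaque:{fn.__name__}")  # body not inspected here
    else:
        graph.append(f"inline:{fn.__name__}")
    result = fn(*args)  # body still executes either way
    return graph, result


@nonstrict
def replication_pad_decomp(x):
    # Imagine hard-to-trace helpers here (importlib, torch._check closures, ...)
    return x + 1


graph, out = outer_trace(replication_pad_decomp, 41)
print(graph, out)  # ['opaque:replication_pad_decomp'] 42
```

The point mirrored from the PR: the hard-to-trace constructs never need to be visible to the outer tracer, only to the stage that actually executes the body.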

Each "obvious" fix had passing tests that weren't testing the right thing. Only when the
reviewer pushed for backward determinism verification and AOT graph inspection did the
weaknesses surface. The backward completing without error under `DeterministicGuard(True)`
proves determinism — PyTorch explicitly raises `RuntimeError` if any non-deterministic CUDA
kernel executes under this mode.

Authored with Claude.

Pull Request resolved: pytorch#177166
Approved by: https://github.com/mlazos, https://github.com/williamwen42
@bobrenjc93 bobrenjc93 closed this Apr 5, 2026
