Problem Description
Calling torch._grouped_mm on a CUDA tensor crashes the Python process with a fatal access violation (0xC0000005) inside torch_hip.dll. The observed stack trace implicates JitDecompRegisterer in torch_cpu.dll dispatching into at::cuda::_grouped_mm, which then faults inside torch_hip.dll. _fused_adagrad_ also appears in the crashing stack, though whether it is a co-trigger or merely incidental is unclear.
Theory: JitDecompRegisterer's constructor dispatches _grouped_mm through the CUDA backend before any Python-level code runs, and the HIP kernel behind this op faults on Windows. This is inferred from the stack trace and has not been confirmed against the source.
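One way to narrow the theory down (a minimal diagnostic sketch; isolate.py is a hypothetical file name, not part of the repro): check whether import and plain CUDA allocation succeed on their own, which would place the fault in the op call itself rather than in registration at import time.
isolate.py:
import torch  # step 1: import alone — registration (incl. JitDecompRegisterer) runs here
A = torch.randn(4, 4, device="cuda")  # step 2: plain CUDA allocation
print("import and allocation OK")  # if this prints, the fault is in the op call
torch._grouped_mm(A, A)  # step 3: the faulting call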
The crash is reproducible with torch + ROCm SDK only — no other packages required.
Operating System
Windows 11 Pro for Workstations (10.0.26200)
CPU
AMD Ryzen 9 5950X
GPU
AMD Radeon RX 9070 XT (gfx1201)
ROCm Version
7.13.0a20260318 (nightly; torch.version.hip reports 7.2.0, which is the HIP runtime version and differs from the ROCm SDK version in the wheel filename)
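For reference, the HIP runtime version above can be read directly from the installed wheel:
uv run python -c "import torch; print(torch.version.hip)"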
ROCm Component
No response
Steps to Reproduce
All packages are sourced from the TheRock nightly index at rocm.nightlies.amd.com.
pyproject.toml:
[project]
name = "repro-grouped-mm"
version = "0.1.0"
requires-python = "==3.12.*"
dependencies = [
"torch",
"rocm",
"rocm-sdk-core",
"rocm-sdk-libraries-gfx120x-all",
"typing_extensions", "filelock", "jinja2", "networkx", "sympy", "fsspec",
]
[tool.uv.sources]
torch = { url = "https://rocm.nightlies.amd.com/v2/gfx120X-all/torch-2.10.0%2Brocm7.13.0a20260318-cp312-cp312-win_amd64.whl" }
rocm = { url = "https://rocm.nightlies.amd.com/v2/gfx120X-all/rocm-7.13.0a20260318.tar.gz" }
rocm-sdk-core = { url = "https://rocm.nightlies.amd.com/v2/gfx120X-all/rocm_sdk_core-7.13.0a20260318-py3-none-win_amd64.whl" }
rocm-sdk-libraries-gfx120x-all = { url = "https://rocm.nightlies.amd.com/v2/gfx120X-all/rocm_sdk_libraries_gfx120x_all-7.13.0a20260318-py3-none-win_amd64.whl" }
[[tool.uv.dependency-metadata]]
name = "torch"
version = "2.10.0+rocm7.13.0a20260318"
requires-dist = []
crash_test.py:
import torch
A = torch.randn(4, 4, device="cuda")
B = torch.randn(4, 4, device="cuda")
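# The next call kills the process with 0xC0000005 inside torch_hip.dll; no Python exception is raised.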
torch._grouped_mm(A, B)
uv sync
uv run python crash_test.py
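Because the access violation terminates the interpreter outright, the crash cannot be caught with try/except. To capture the native exit code instead, a small wrapper can be used (a sketch; check_crash.py is a hypothetical file name).
check_crash.py:
import subprocess
import sys

# Run the repro in a child process and inspect its native exit code.
# On Windows the NTSTATUS comes back as an unsigned 32-bit value,
# so 0xC0000005 appears as 3221225477.
result = subprocess.run([sys.executable, "crash_test.py"])
print(f"exit code: {result.returncode & 0xFFFFFFFF:#010x}")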
(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support
No response
Additional Information
Observed stack trace:
Exception Code: 0xC0000005
torch_hip.dll + 0x25BEB89, ?_grouped_mm@cuda@at@@...
torch_hip.dll + 0x274BA11, ?_fused_adagrad_@cuda@at@@...
torch_cpu.dll + 0x3BEE612, ??0JitDecompRegisterer@impl@autograd@torch@@...
torch_cpu.dll + 0x12B8192, ?call@_grouped_mm@_ops@at@@...
torch._C does not expose the op (hasattr(torch._C, '_grouped_mm') returns False); it is only reachable via torch._grouped_mm.
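This is easy to confirm from a REPL:
import torch
print(hasattr(torch._C, "_grouped_mm"))  # False
print(hasattr(torch, "_grouped_mm"))     # True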
Workaround: avoid calling torch._grouped_mm on a CUDA tensor on Windows.
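Until the kernel is fixed, callers can guard the op at the call site. A minimal sketch, assuming the simple 2D case from crash_test.py (safe_grouped_mm is a hypothetical helper, not a PyTorch API; real grouped inputs with offsets would need a per-group fallback loop instead):
import sys
import torch

def safe_grouped_mm(a, b):
    # Skip the faulting HIP kernel on Windows + ROCm builds and fall back
    # to a plain matmul; elsewhere, use the native op.
    if sys.platform == "win32" and torch.version.hip is not None:
        return a @ b
    return torch._grouped_mm(a, b)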