[Issue]: [Windows] torch._grouped_mm access violation (0xC0000005) in torch_hip.dll #4086

@mmis1000

Problem Description

Calling torch._grouped_mm on a tensor created with device="cuda" (the AMD GPU on a ROCm build) crashes the Python process with a fatal access violation (0xC0000005) inside torch_hip.dll. The observed stack trace implicates JitDecompRegisterer in torch_cpu.dll dispatching into at::cuda::_grouped_mm, which then faults inside torch_hip.dll. _fused_adagrad_ also appears in the same stack trace; whether it is a co-trigger or incidental is unclear.

Theory: JitDecompRegisterer's constructor dispatches _grouped_mm through the CUDA backend before any Python-level code runs, and the HIP kernel behind this op faults on Windows. This is inferred from the stack trace, not confirmed from source.

The crash is reproducible with torch + ROCm SDK only — no other packages required.

Operating System

Windows 11 Pro for Workstations (10.0.26200)

CPU

AMD Ryzen 9 5950X

GPU

AMD Radeon RX 9070 XT (gfx1201)

ROCm Version

7.13.0a20260318 (nightly; torch.version.hip reports 7.2.0 — HIP runtime version, differs from ROCm SDK version in wheel filename)
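To keep the two version numbers apart when comparing environments, a small helper can pull the ROCm SDK tag out of the torch version string. This is a minimal sketch: it assumes torch.__version__ carries the same "+rocm..." local tag as the wheel filename, and the helper name rocm_sdk_version is hypothetical.

```python
def rocm_sdk_version(torch_version: str):
    """Return the text after '+rocm' in a torch version string, or None.

    Assumption: the wheel's local version tag (e.g. '+rocm7.13.0a20260318')
    is also what torch.__version__ reports at runtime.
    """
    _, sep, local = torch_version.partition("+rocm")
    return local if sep else None


# Version string taken from the wheel filename in this report.
print(rocm_sdk_version("2.10.0+rocm7.13.0a20260318"))
```

In a live session, passing torch.__version__ to the helper and printing torch.version.hip next to the result shows the SDK tag and the HIP runtime version side by side.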

ROCm Component

No response

Steps to Reproduce

All packages sourced from the TheRock nightly index at rocm.nightlies.amd.com.

pyproject.toml:

[project]
name = "repro-grouped-mm"
version = "0.1.0"
requires-python = "==3.12.*"
dependencies = [
    "torch",
    "rocm",
    "rocm-sdk-core",
    "rocm-sdk-libraries-gfx120x-all",
    "typing_extensions", "filelock", "jinja2", "networkx", "sympy", "fsspec",
]

[tool.uv.sources]
torch                          = { url = "https://rocm.nightlies.amd.com/v2/gfx120X-all/torch-2.10.0%2Brocm7.13.0a20260318-cp312-cp312-win_amd64.whl" }
rocm                           = { url = "https://rocm.nightlies.amd.com/v2/gfx120X-all/rocm-7.13.0a20260318.tar.gz" }
rocm-sdk-core                  = { url = "https://rocm.nightlies.amd.com/v2/gfx120X-all/rocm_sdk_core-7.13.0a20260318-py3-none-win_amd64.whl" }
rocm-sdk-libraries-gfx120x-all = { url = "https://rocm.nightlies.amd.com/v2/gfx120X-all/rocm_sdk_libraries_gfx120x_all-7.13.0a20260318-py3-none-win_amd64.whl" }

[[tool.uv.dependency-metadata]]
name = "torch"
version = "2.10.0+rocm7.13.0a20260318"
requires-dist = []

crash_test.py:

import torch

A = torch.randn(4, 4, device="cuda")
B = torch.randn(4, 4, device="cuda")
torch._grouped_mm(A, B)

Then run:

uv sync
uv run python crash_test.py
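Because the access violation kills the interpreter outright, it may help to run each stage of the repro in a child process and inspect the exit code. The sketch below is one way to do that using only the standard library. The helper names (run_isolated, is_access_violation) and the stage split are illustrative assumptions, not part of the original report.

```python
import subprocess
import sys

ACCESS_VIOLATION = 0xC0000005  # Windows STATUS_ACCESS_VIOLATION


def is_access_violation(returncode: int) -> bool:
    """True if an exit code equals 0xC0000005, whether reported signed or unsigned."""
    return (returncode & 0xFFFFFFFF) == ACCESS_VIOLATION


def run_isolated(code: str, timeout: float = 120.0) -> int:
    """Run a Python snippet in a fresh interpreter and return its exit code."""
    result = subprocess.run([sys.executable, "-c", code], timeout=timeout)
    return result.returncode


# Hypothetical stage split: each stage adds one step of crash_test.py.
STAGES = {
    "import only": "import torch",
    "allocate tensors": (
        "import torch\n"
        "A = torch.randn(4, 4, device='cuda')\n"
        "B = torch.randn(4, 4, device='cuda')\n"
    ),
    "full repro": (
        "import torch\n"
        "A = torch.randn(4, 4, device='cuda')\n"
        "B = torch.randn(4, 4, device='cuda')\n"
        "torch._grouped_mm(A, B)\n"
    ),
}


def classify_stages() -> dict:
    """Return {stage name: True if that stage died with an access violation}."""
    return {name: is_access_violation(run_isolated(src)) for name, src in STAGES.items()}
```

Calling classify_stages() inside the uv environment would show which stage first produces 0xC0000005. If "import only" already crashes, that would support the theory that the fault happens before Python-level code runs. If only "full repro" crashes, the call itself is the trigger.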

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

Observed stack trace:

Exception Code: 0xC0000005
torch_hip.dll + 0x25BEB89, ?_grouped_mm@cuda@at@@...
torch_hip.dll + 0x274BA11, ?_fused_adagrad_@cuda@at@@...
torch_cpu.dll + 0x3BEE612, ??0JitDecompRegisterer@impl@autograd@torch@@...
torch_cpu.dll + 0x12B8192, ?call@_grouped_mm@_ops@at@@...

torch._C has no _grouped_mm attribute (hasattr(torch._C, '_grouped_mm') returns False). The op is only reachable via torch._grouped_mm.

Workaround: avoid calling torch._grouped_mm on a CUDA tensor on Windows.
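One way to apply the workaround in code is a guard that callers check before reaching torch._grouped_mm. The sketch below is deliberately broad. It flags every Windows + HIP build, although only the configuration in this report is known to crash. The function name should_avoid_grouped_mm is hypothetical.

```python
import sys


def should_avoid_grouped_mm(platform: str, hip_version) -> bool:
    """Heuristic guard: True on Windows when torch is a HIP (ROCm) build.

    Assumption: torch.version.hip is None on non-ROCm builds, so a
    non-None value is treated as "this is a ROCm build".
    """
    return platform == "win32" and hip_version is not None


# Example with the values from this report (HIP runtime 7.2.0 on Windows).
print(should_avoid_grouped_mm("win32", "7.2.0"))
```

In application code the call would be should_avoid_grouped_mm(sys.platform, torch.version.hip), with the caller taking a different code path when it returns True. Which fallback is appropriate depends on how the grouped matmul is being used, so none is suggested here.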

Labels

status: fix submitted
Indicates a fix has been submitted into the staging/develop branch of a repository.

Projects

Status

Done
