[ROCm][CI] Fix Windows access violation in MIOpen CTC loss dispatch #178284
mstankov-amd wants to merge 1 commit into pytorch:main
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/178284
Note: Links to docs will display an error until the docs builds have been completed. ✅ No Failures as of commit b698712 with merge base 99dee05. This comment was automatically generated by Dr. CI and updates every 15 minutes.
This PR needs a `release notes:` label.
@pytorchbot label "topic: not user facing"

Why doesn't the test suite catch this lol?

@pytorchbot merge

This PR needs to be approved by an authorized maintainer before merge.

@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team
We don't have ROCm Windows PyTorch builds in CI (yet). Only over on TheRock project. |
Fix MIOpen CTC loss access violation on Windows discrete GPUs

Problem

A failing unit test on Windows started showing up a couple of weeks ago, and a missing #include was added in #178284, but CI on TheRock kept failing. The fix was tested on gfx1151 (an APU), where the test passed, but CI showed failures on gfx1100.

test_CTCLoss_no_batch_dim (and any code path hitting miopen_ctc_loss) crashes with a fatal access violation on Windows systems with discrete AMD GPUs:

```
Windows fatal exception: access violation
Exception Code: 0xC0000005
#0 miopen::CTCLossDescriptor::GetCTCLossWorkspaceSize (MIOpen.dll+0x14fde4)
#1 miopenGetCTCLossWorkspaceSize (MIOpen.dll+0x150912)
#2 at::native::miopen_ctc_loss (torch_hip.dll)
```

Root Cause

miopenGetCTCLossWorkspaceSize and miopenCTCLoss read the labels, label_lengths, and input_lengths arrays on the host side to plan the computation and calculate workspace requirements. The existing code copies these arrays to GPU memory and passes device pointers:

```cpp
Tensor labels_gpu = targets_t.to(Device(at::kCUDA), at::kInt);
// ... hipMemcpy to GPU ...
MIOPEN_CHECK(miopenGetCTCLossWorkspaceSize(...,
    labels_gpu.data_ptr<int>(),        // device pointer
    label_lengths_gpu.data_ptr<int>(), // device pointer
    input_lengths_gpu.data_ptr<int>()  // device pointer
));
```

This works on:

- Linux: HSA (Heterogeneous System Architecture) maps GPU allocations into the process virtual address space, making device pointers host-readable
- Windows APUs: CPU and iGPU share system RAM, so device pointers point to host-accessible memory

This crashes on:

- Windows dGPUs: the GPU has dedicated VRAM across PCIe; device pointers are opaque handles that cannot be dereferenced from host code

Verification

Tested on gfx1201:

| Check | Result |
|---|---|
| hipDeviceAttributeIntegrated | 0 (discrete GPU) |
| hipDeviceAttributeCanUseHostPointerForRegisteredMem | 0 |
| hipDeviceAttributeManagedMemory | 0x7FFFFFFF (unsupported) |
| hipDeviceAttributeUnifiedAddressing | 0x7FFFFFFF (unsupported) |
| Host read of hipMalloc pointer via ctypes | Access violation |
| CTC loss with CPU pointers | Pass (forward + backward) |

Fix

Use host pointers, since that is what MIOpen expects.

Testing

Run all existing CTCLoss unit tests.

Pull Request resolved: #179264
Approved by: https://github.com/jeffdaily
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
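To make the direction of that fix concrete, here is a minimal sketch (not the exact upstream diff) that keeps the label and length arrays in host memory and passes host pointers. The handle/descriptor parameters and the MIOpen argument order are modeled on the snippet above and should be checked against the MIOpen headers; the helper name is purely illustrative.

```cpp
#include <vector>
#include <ATen/ATen.h>
#include <miopen/miopen.h>

// Sketch only: pass host-accessible pointers, which MIOpen dereferences on
// the host when planning the CTC loss and sizing its workspace.
size_t ctc_workspace_size_host_ptrs(miopenHandle_t handle,
                                    miopenTensorDescriptor_t probs_desc,
                                    miopenTensorDescriptor_t grads_desc,
                                    miopenCTCLossDescriptor_t ctc_desc,
                                    const at::Tensor& targets_t,
                                    at::IntArrayRef input_lengths,
                                    at::IntArrayRef target_lengths) {
  // Labels as host ints (the caller's targets tensor may live on the GPU).
  auto targets_cpu = targets_t.to(at::kCPU, at::kInt).contiguous();
  std::vector<int> labels_host(targets_cpu.data_ptr<int>(),
                               targets_cpu.data_ptr<int>() + targets_cpu.numel());
  std::vector<int> input_lengths_host(input_lengths.begin(), input_lengths.end());
  std::vector<int> label_lengths_host(target_lengths.begin(), target_lengths.end());

  size_t workspace_size = 0;
  MIOPEN_CHECK(miopenGetCTCLossWorkspaceSize(
      handle, probs_desc, grads_desc,
      labels_host.data(),          // host pointer
      label_lengths_host.data(),   // host pointer
      input_lengths_host.data(),   // host pointer
      MIOPEN_CTC_LOSS_ALGO_DETERMINISTIC, ctc_desc, &workspace_size));
  return workspace_size;
}
```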
…ch (#3161) Cherry pick of pytorch#178284 Fixes ROCm/TheRock#3987 Co-authored-by: Milica Stankovic <milica.stankovic@amd.com>
Summary
Add the missing #include <ATen/ops/_use_miopen_ctc_loss_native.h> to LossCTC_miopen.cpp. Without this include, the _use_miopen_ctc_loss and _use_miopen_ctc_loss_tensor functions are defined without DLL linkage attributes on Windows, causing an unresolved Import Address Table (IAT) entry that crashes with an access violation (0xC0000005) at torch_hip.dll base address when CTC loss is called with CUDA tensors.
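For reference, a minimal sketch of what the fixed implementation file looks like; the function body and exact parameter list are elided and unchanged from the existing definition.

```cpp
// LossCTC_miopen.cpp (sketch; other includes and the function bodies are elided)
#include <ATen/ops/_use_miopen_ctc_loss_native.h>  // brings in the TORCH_API declarations

namespace at {
namespace native {

// With the declaration visible above, this definition carries the same DLL
// linkage attribute as the declaration seen by the generated dispatch code,
// so caller and callee agree on how the symbol is referenced.
bool _use_miopen_ctc_loss(/* ... same parameters as the existing definition ... */) {
  // ... existing implementation unchanged ...
}

}  // namespace native
}  // namespace at
```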
Problem
On Windows ROCm builds, calling torch.nn.functional.ctc_loss with CUDA tensors crashes with a fatal access violation:
Windows fatal exception: access violation
Exception Code: 0xC0000005
torch_hip.dll + 0x0 byte(s)
The crash occurs in test_CTCLoss_critical_target_len and any other test that invokes ctc_loss with cudnn.flags(enabled=True) on CUDA tensors.
Root Cause
The issue is a Windows DLL linkage mismatch between the caller and the callee of at::native::_use_miopen_ctc_loss.
The caller (RegisterCUDA_0.cpp, auto-generated, compiled into torch_hip.dll):
The generated CUDA dispatch wrapper includes <ATen/ops/_use_miopen_ctc_loss_native.h>, which declares the function with TORCH_API. When building torch_hip.dll, TORCH_API expands to __declspec(dllimport). MSVC generates an indirect call through the Import Address Table: call [__imp_?_use_miopen_ctc_loss@native@at@@...].
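For illustration, this is roughly how a Windows export macro of this kind behaves; the real TORCH_API definition lives in PyTorch's export headers and uses different guard names, so BUILDING_OWNING_DLL below is a hypothetical placeholder.

```cpp
// Illustration only: a typical Windows export macro. In this scenario the code
// being compiled into torch_hip.dll is a consumer of the symbol, so the macro
// expands to dllimport and calls go through the Import Address Table (__imp_*).
#if defined(_WIN32)
  #if defined(BUILDING_OWNING_DLL)           // hypothetical guard for the DLL that owns the symbol
    #define TORCH_API __declspec(dllexport)  // the definition is exported from that DLL
  #else
    #define TORCH_API __declspec(dllimport)  // callers reference __imp_<mangled name>
  #endif
#else
  #define TORCH_API __attribute__((visibility("default")))
#endif
```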
The callee (LossCTC_miopen.cpp, compiled into torch_hip.dll):
The implementation file does NOT include <ATen/ops/_use_miopen_ctc_loss_native.h>. The functions are defined without any DLL linkage attribute — just plain bool _use_miopen_ctc_loss(...). The compiler does not generate an __imp_ thunk for these definitions.
The linker mismatch
When linking torch_hip.dll, the linker needs to resolve the __imp_?_use_miopen_ctc_loss@native@at@@... symbol (referenced by RegisterCUDA_0.cpp.obj). This is a different symbol from ?_use_miopen_ctc_loss@native@at@@... (provided by LossCTC_miopen.cpp.obj). Since no import library (.lib) exports this function, the __imp_ IAT entry remains unresolved at RVA=0. At runtime, the indirect call jumps to DLL_base + 0x0 (the PE header), which is not executable code, causing the access violation.
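A stripped-down illustration of the same mismatch with placeholder names (not the real PyTorch declarations), showing why the two object files end up referencing different symbols:

```cpp
// Consumer TU, analogous to the generated RegisterCUDA_0.cpp: the header it
// includes declares the function dllimport, so this object file references the
// __imp_-prefixed mangled symbol and calls indirectly through the IAT.
__declspec(dllimport) bool use_flag();
bool call_it() { return use_flag(); }

// Implementation TU, analogous to LossCTC_miopen.cpp before the fix: no
// dllimport/dllexport attribute, so this object file only provides the plain
// mangled symbol. Per the analysis above, the __imp_ entry the consumer wants
// is never satisfied, the IAT slot stays at 0, and the indirect call lands at
// the DLL base address at runtime.
bool use_flag() { return true; }
```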
Test plan

- test_CTCLoss_critical_target_len passes on Windows
- test_CTCLoss_cudnn_cuda no longer crashes on Windows
- Linux ROCm builds are unaffected
cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @jataylo @hongxiayang @naromero77amd @pragupta @jerrymannil @xinyazhang