[ROCm][CI] Fix Windows access violation in MIOpen CTC loss dispatch #178284
mstankov-amd wants to merge 1 commit into pytorch:main
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/178284
Note: Links to docs will display an error until the docs builds have been completed. ✅ No Failures as of commit b698712 with merge base 99dee05. This comment was automatically generated by Dr. CI and updates every 15 minutes.
This PR needs a `release notes:` label.
@pytorchbot label "topic: not user facing"

Why doesn't the test suite catch this lol?

@pytorchbot merge

This PR needs to be approved by an authorized maintainer before merge.

@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team
We don't have ROCm Windows PyTorch builds in CI (yet). Only over on TheRock project. |
Fix MIOpen CTC loss access violation on Windows discrete GPUs

Problem

A failing unit test on Windows started showing up a couple of weeks ago, and a missing #include was added in #178284, but CI on TheRock kept failing. The fix was tested on gfx1151 (an APU), where the test passed, but CI showed failures on gfx1100.

test_CTCLoss_no_batch_dim (and any code path hitting miopen_ctc_loss) crashes with a fatal access violation on Windows systems with discrete AMD GPUs:

```
Windows fatal exception: access violation
Exception Code: 0xC0000005
#0 miopen::CTCLossDescriptor::GetCTCLossWorkspaceSize (MIOpen.dll+0x14fde4)
#1 miopenGetCTCLossWorkspaceSize (MIOpen.dll+0x150912)
#2 at::native::miopen_ctc_loss (torch_hip.dll)
```

Root Cause

miopenGetCTCLossWorkspaceSize and miopenCTCLoss read the labels, label_lengths, and input_lengths arrays on the host side to plan the computation and calculate workspace requirements. The existing code copies these arrays to GPU memory and passes device pointers:

```cpp
Tensor labels_gpu = targets_t.to(Device(at::kCUDA), at::kInt);
// ... hipMemcpy to GPU ...
MIOPEN_CHECK(miopenGetCTCLossWorkspaceSize(...,
    labels_gpu.data_ptr<int>(),        // device pointer
    label_lengths_gpu.data_ptr<int>(), // device pointer
    input_lengths_gpu.data_ptr<int>()  // device pointer
));
```

This works on:

- Linux: HSA (Heterogeneous System Architecture) maps GPU allocations into the process virtual address space, making device pointers host-readable
- Windows APUs: CPU and iGPU share system RAM, so device pointers point to host-accessible memory

This crashes on:

- Windows dGPUs: the GPU has dedicated VRAM across PCIe; device pointers are opaque handles that cannot be dereferenced from host code

Verification

Tested on gfx1201:

| Check | Result |
|---|---|
| hipDeviceAttributeIntegrated | 0 (discrete GPU) |
| hipDeviceAttributeCanUseHostPointerForRegisteredMem | 0 |
| hipDeviceAttributeManagedMemory | 0x7FFFFFFF (unsupported) |
| hipDeviceAttributeUnifiedAddressing | 0x7FFFFFFF (unsupported) |
| Host read of hipMalloc pointer via ctypes | Access violation |
| CTC loss with CPU pointers | Pass (forward + backward) |

Fix

Use host pointers, since that is what MIOpen expects.

Testing

Run all existing CTCLoss unit tests.

Pull Request resolved: #179264
Approved by: https://github.com/jeffdaily
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
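To make the direction of that fix concrete, here is a minimal sketch (not the exact upstream diff) that keeps the label and length arrays in host memory and passes host pointers. The handle/descriptor parameters and the MIOpen argument order are modeled on the snippet above and should be checked against the MIOpen headers; the helper name is purely illustrative.

```cpp
#include <vector>
#include <ATen/ATen.h>
#include <miopen/miopen.h>

// Sketch only: pass host-accessible pointers, which MIOpen dereferences on
// the host when planning the CTC loss and sizing its workspace.
size_t ctc_workspace_size_host_ptrs(miopenHandle_t handle,
                                    miopenTensorDescriptor_t probs_desc,
                                    miopenTensorDescriptor_t grads_desc,
                                    miopenCTCLossDescriptor_t ctc_desc,
                                    const at::Tensor& targets_t,
                                    at::IntArrayRef input_lengths,
                                    at::IntArrayRef target_lengths) {
  // Labels as host ints (the caller's targets tensor may live on the GPU).
  auto targets_cpu = targets_t.to(at::kCPU, at::kInt).contiguous();
  std::vector<int> labels_host(targets_cpu.data_ptr<int>(),
                               targets_cpu.data_ptr<int>() + targets_cpu.numel());
  std::vector<int> input_lengths_host(input_lengths.begin(), input_lengths.end());
  std::vector<int> label_lengths_host(target_lengths.begin(), target_lengths.end());

  size_t workspace_size = 0;
  MIOPEN_CHECK(miopenGetCTCLossWorkspaceSize(
      handle, probs_desc, grads_desc,
      labels_host.data(),          // host pointer
      label_lengths_host.data(),   // host pointer
      input_lengths_host.data(),   // host pointer
      MIOPEN_CTC_LOSS_ALGO_DETERMINISTIC, ctc_desc, &workspace_size));
  return workspace_size;
}
```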
…ch (#3161) Cherry pick of pytorch#178284 Fixes ROCm/TheRock#3987 Co-authored-by: Milica Stankovic <milica.stankovic@amd.com>
Summary
Add the missing #include <ATen/ops/_use_miopen_ctc_loss_native.h> to LossCTC_miopen.cpp. Without this include, the _use_miopen_ctc_loss and _use_miopen_ctc_loss_tensor functions are defined without DLL linkage attributes on Windows, causing an unresolved Import Address Table (IAT) entry that crashes with an access violation (0xC0000005) at torch_hip.dll base address when CTC loss is called with CUDA tensors.
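For reference, a minimal sketch of what the fixed implementation file looks like; the function body and exact parameter list are elided and unchanged from the existing definition.

```cpp
// LossCTC_miopen.cpp (sketch; other includes and the function bodies are elided)
#include <ATen/ops/_use_miopen_ctc_loss_native.h>  // brings in the TORCH_API declarations

namespace at {
namespace native {

// With the declaration visible above, this definition carries the same DLL
// linkage attribute as the declaration seen by the generated dispatch code,
// so caller and callee agree on how the symbol is referenced.
bool _use_miopen_ctc_loss(/* ... same parameters as the existing definition ... */) {
  // ... existing implementation unchanged ...
}

}  // namespace native
}  // namespace at
```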
Problem
On Windows ROCm builds, calling torch.nn.functional.ctc_loss with CUDA tensors crashes with a fatal access violation:
Windows fatal exception: access violation
Exception Code: 0xC0000005
torch_hip.dll + 0x0 byte(s)
The crash occurs in test_CTCLoss_critical_target_len and any other test that invokes ctc_loss with cudnn.flags(enabled=True) on CUDA tensors.
Root Cause
The issue is a Windows DLL linkage mismatch between the caller and the callee of at::native::_use_miopen_ctc_loss.
The caller (RegisterCUDA_0.cpp, auto-generated, compiled into torch_hip.dll):
The generated CUDA dispatch wrapper includes <ATen/ops/_use_miopen_ctc_loss_native.h>, which declares the function with TORCH_API. When building torch_hip.dll, TORCH_API expands to __declspec(dllimport). MSVC generates an indirect call through the Import Address Table: call [__imp_?_use_miopen_ctc_loss@native@at@@...].
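For illustration, this is roughly how a Windows export macro of this kind behaves; the real TORCH_API definition lives in PyTorch's export headers and uses different guard names, so BUILDING_OWNING_DLL below is a hypothetical placeholder.

```cpp
// Illustration only: a typical Windows export macro. In this scenario the code
// being compiled into torch_hip.dll is a consumer of the symbol, so the macro
// expands to dllimport and calls go through the Import Address Table (__imp_*).
#if defined(_WIN32)
  #if defined(BUILDING_OWNING_DLL)           // hypothetical guard for the DLL that owns the symbol
    #define TORCH_API __declspec(dllexport)  // the definition is exported from that DLL
  #else
    #define TORCH_API __declspec(dllimport)  // callers reference __imp_<mangled name>
  #endif
#else
  #define TORCH_API __attribute__((visibility("default")))
#endif
```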
The callee (LossCTC_miopen.cpp, compiled into torch_hip.dll):
The implementation file does NOT include <ATen/ops/_use_miopen_ctc_loss_native.h>. The functions are defined without any DLL linkage attribute — just plain bool _use_miopen_ctc_loss(...). The compiler does not generate an __imp_ thunk for these definitions.
The linker mismatch
When linking torch_hip.dll, the linker needs to resolve the __imp_?_use_miopen_ctc_loss@native@at@@... symbol (referenced by RegisterCUDA_0.cpp.obj). This is a different symbol from ?_use_miopen_ctc_loss@native@at@@... (provided by LossCTC_miopen.cpp.obj). Since no import library (.lib) exports this function, the __imp_ IAT entry remains unresolved at RVA=0. At runtime, the indirect call jumps to DLL_base + 0x0 (the PE header), which is not executable code, causing the access violation.
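A stripped-down illustration of the same mismatch with placeholder names (not the real PyTorch declarations), showing why the two object files end up referencing different symbols:

```cpp
// Consumer TU, analogous to the generated RegisterCUDA_0.cpp: the header it
// includes declares the function dllimport, so this object file references the
// __imp_-prefixed mangled symbol and calls indirectly through the IAT.
__declspec(dllimport) bool use_flag();
bool call_it() { return use_flag(); }

// Implementation TU, analogous to LossCTC_miopen.cpp before the fix: no
// dllimport/dllexport attribute, so this object file only provides the plain
// mangled symbol. Per the analysis above, the __imp_ entry the consumer wants
// is never satisfied, the IAT slot stays at 0, and the indirect call lands at
// the DLL base address at runtime.
bool use_flag() { return true; }
```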
Test plan

- test_CTCLoss_critical_target_len passes on Windows
- test_CTCLoss_cudnn_cuda no longer crashes on Windows
- Linux ROCm builds are unaffected
cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @jataylo @hongxiayang @naromero77amd @pragupta @jerrymannil @xinyazhang