
[ROCm][CI] Fix Windows access violation in MIOpen CTC loss dispatch#178284

Closed
mstankov-amd wants to merge 1 commit into pytorch:main from mstankov-amd:fix_windows_access_violation_miopen_ctc

Conversation

@mstankov-amd
Contributor

@mstankov-amd mstankov-amd commented Mar 24, 2026

Summary

Add the missing #include <ATen/ops/_use_miopen_ctc_loss_native.h> to LossCTC_miopen.cpp. Without this include, the _use_miopen_ctc_loss and _use_miopen_ctc_loss_tensor functions are defined without DLL linkage attributes on Windows, causing an unresolved Import Address Table (IAT) entry that crashes with an access violation (0xC0000005) at the torch_hip.dll base address when CTC loss is called with CUDA tensors.
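
Concretely, the change is the single include line in LossCTC_miopen.cpp (its exact placement among the file's other includes is assumed here):

```cpp
// LossCTC_miopen.cpp -- the missing include named above. It pulls in the
// TORCH_API declarations so the definitions in this file carry the matching
// DLL linkage attribute on Windows.
#include <ATen/ops/_use_miopen_ctc_loss_native.h>
```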

Problem

On Windows ROCm builds, calling torch.nn.functional.ctc_loss with CUDA tensors crashes with a fatal access violation:

Windows fatal exception: access violation
Exception Code: 0xC0000005
torch_hip.dll + 0x0 byte(s)

The crash occurs in test_CTCLoss_critical_target_len and any other test that invokes ctc_loss with cudnn.flags(enabled=True) on CUDA tensors.

Root Cause

The issue is a Windows DLL linkage mismatch between the caller and the callee of at::native::_use_miopen_ctc_loss.

The caller (RegisterCUDA_0.cpp, auto-generated, compiled into torch_hip.dll):

The generated CUDA dispatch wrapper includes <ATen/ops/_use_miopen_ctc_loss_native.h>, which declares the function with TORCH_API. When building torch_hip.dll, TORCH_API expands to __declspec(dllimport). MSVC generates an indirect call through the Import Address Table: call [__imp_?_use_miopen_ctc_loss@native@at@@...].

The callee (LossCTC_miopen.cpp, compiled into torch_hip.dll):

The implementation file does NOT include <ATen/ops/_use_miopen_ctc_loss_native.h>. The functions are defined without any DLL linkage attribute — just plain bool _use_miopen_ctc_loss(...). The compiler does not generate an __imp_ thunk for these definitions.
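
The mismatch can be reproduced in miniature. A sketch with illustrative names (MY_API, my_func), not the actual ATen sources:

```cpp
// my_api.h -- shared declaration; MY_API plays the role of TORCH_API, which
// expands to dllimport when compiling translation units of torch_hip.dll.
#define MY_API __declspec(dllimport)
MY_API bool my_func();

// caller.cpp -- includes my_api.h, so MSVC emits an indirect call through the
// import thunk __imp_?my_func@@YA_NXZ (an IAT slot).
#include "my_api.h"
bool call_it() { return my_func(); }

// callee.cpp -- omits the header and defines the plain symbol ?my_func@@YA_NXZ
// with no linkage attribute. Nothing in any import library provides
// __imp_?my_func@@YA_NXZ, so the caller's IAT slot is left unresolved.
bool my_func() { return true; }
```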

The linker mismatch

When linking torch_hip.dll, the linker needs to resolve the __imp_?_use_miopen_ctc_loss@native@at@@... symbol (referenced by RegisterCUDA_0.cpp.obj). This is a different symbol from ?_use_miopen_ctc_loss@native@at@@... (provided by LossCTC_miopen.cpp.obj). Since no import library (.lib) exports this function, the __imp_ IAT entry remains unresolved at RVA=0. At runtime, the indirect call jumps to DLL_base + 0x0 (the PE header), which is not executable code, causing the access violation.

Test plan

  • test_CTCLoss_critical_target_len passes on Windows
  • test_CTCLoss_cudnn_cuda no longer crashes on Windows
  • Linux ROCm builds are unaffected

cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @jataylo @hongxiayang @naromero77amd @pragupta @jerrymannil @xinyazhang

@pytorch-bot

pytorch-bot Bot commented Mar 24, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/178284

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit b698712 with merge base 99dee05:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot Bot added the module: rocm label Mar 24, 2026
@pytorch-bot

pytorch-bot Bot commented Mar 24, 2026

This PR needs a release notes: label

If your changes are user facing and intended to be a part of release notes, please use a label starting with release notes:.

If not, please add the topic: not user facing label.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "topic: not user facing"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

@mstankov-amd
Contributor Author

@pytorchbot label "topic: not user facing"

@Skylion007
Collaborator

Why doesn't the test suite catch this lol?

@Skylion007
Collaborator

@pytorchbot merge

@pytorch-bot

pytorch-bot Bot commented Mar 24, 2026

This PR needs to be approved by an authorized maintainer before merge.

@Skylion007
Collaborator

@pytorchbot merge

@pytorch-bot pytorch-bot Bot added the ciflow/trunk label Mar 24, 2026
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here.

@jeffdaily
Collaborator

> Why doesn't the test suite catch this lol?

We don't have ROCm Windows PyTorch builds in CI (yet). Only over on TheRock project.

Copilot AI pushed a commit that referenced this pull request Mar 27, 2026
AaronWang04 pushed a commit to AaronWang04/pytorch that referenced this pull request Mar 31, 2026
xuhancn pushed a commit to xuhancn/pytorch that referenced this pull request Apr 2, 2026
nklshy-aws pushed a commit to nklshy-aws/pytorch that referenced this pull request Apr 7, 2026
tvukovic-amd pushed a commit to ROCm/pytorch that referenced this pull request Apr 16, 2026
pytorchmergebot pushed a commit that referenced this pull request Apr 25, 2026
## Fix MIOpen CTC loss access violation on Windows discrete GPUs

### Problem

A failing unit test on Windows started showing up a couple of weeks ago, and a missing `#include` was added in #178284, but CI on TheRock kept failing. The fix was tested on gfx1151 (an APU), where the test passed, but CI showed failures on gfx1100.

`test_CTCLoss_no_batch_dim` (and any code path hitting `miopen_ctc_loss`) crashes with a fatal access violation on Windows systems with discrete AMD GPUs:

```
Windows fatal exception: access violation
Exception Code: 0xC0000005
#0 miopen::CTCLossDescriptor::GetCTCLossWorkspaceSize (MIOpen.dll+0x14fde4)
#1 miopenGetCTCLossWorkspaceSize (MIOpen.dll+0x150912)
#2 at::native::miopen_ctc_loss (torch_hip.dll)
```

### Root Cause

`miopenGetCTCLossWorkspaceSize` and `miopenCTCLoss` read the `labels`, `label_lengths`, and `input_lengths` arrays **on the host side** to plan the computation and calculate workspace requirements. The existing code copies these arrays to GPU memory and passes device pointers:

```cpp
Tensor labels_gpu = targets_t.to(Device(at::kCUDA), at::kInt);
// ... hipMemcpy to GPU ...
MIOPEN_CHECK(miopenGetCTCLossWorkspaceSize(...,
    labels_gpu.data_ptr<int>(),          // device pointer
    label_lengths_gpu.data_ptr<int>(),   // device pointer
    input_lengths_gpu.data_ptr<int>()    // device pointer
));
```

This works on:

- **Linux** — HSA (Heterogeneous System Architecture) maps GPU allocations into the process virtual address space, making device pointers host-readable
- **Windows APUs** — CPU and iGPU share system RAM, so device pointers point to host-accessible memory

This crashes on:

- **Windows dGPUs** — the GPU has dedicated VRAM across PCIe; device pointers are opaque handles that cannot be dereferenced from host code

### Verification

Tested on gfx1201:

| Check | Result |
| --- | --- |
| `hipDeviceAttributeIntegrated` | `0` (discrete GPU) |
| `hipDeviceAttributeCanUseHostPointerForRegisteredMem` | `0` |
| `hipDeviceAttributeManagedMemory` | `0x7FFFFFFF` (unsupported) |
| `hipDeviceAttributeUnifiedAddressing` | `0x7FFFFFFF` (unsupported) |
| Host read of `hipMalloc` pointer via `ctypes` | Access violation |
| CTC loss with CPU pointers | Pass (forward + backward) |

### Fix

Pass host pointers, since that is what MIOpen expects for these arrays.
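
A sketch of the corrected call shape, mirroring the snippet in Root Cause (the `_cpu` variable names and the elided arguments are illustrative, not the exact patch):

```cpp
// Keep the integer arrays in host memory and hand MIOpen host pointers.
Tensor labels_cpu = targets_t.to(Device(at::kCPU), at::kInt).contiguous();

MIOPEN_CHECK(miopenGetCTCLossWorkspaceSize(...,
    labels_cpu.data_ptr<int>(),         // host pointer
    label_lengths_cpu.data_ptr<int>(),  // host pointer (assumed CPU tensor)
    input_lengths_cpu.data_ptr<int>()   // host pointer (assumed CPU tensor)
));
```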

### Testing

Run all existing CTCLoss unit tests.

Pull Request resolved: #179264
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
jeffdaily added a commit to ROCm/pytorch that referenced this pull request Apr 25, 2026
…3181)

jeffdaily added a commit to ROCm/pytorch that referenced this pull request Apr 25, 2026
…3180)

jeffdaily pushed a commit to ROCm/pytorch that referenced this pull request Apr 27, 2026
…ch (#3161)

Cherry pick of pytorch#178284

Fixes ROCm/TheRock#3987

Co-authored-by: Milica Stankovic <milica.stankovic@amd.com>