
[ROCm] Fix MIOpen CTC loss crash on Windows #179264

Closed

mstankov-amd wants to merge 4 commits into pytorch:main from mstankov-amd:fix-miopen-ctc-loss-dgpu-on-windows

Conversation

@mstankov-amd
Contributor

@mstankov-amd mstankov-amd commented Apr 3, 2026

Fix MIOpen CTC loss access violation on Windows discrete GPUs

Problem

A unit test on Windows started failing a couple of weeks ago. A missing #include was added in #178284, but CI on TheRock kept failing. That fix was tested on gfx1151 (APU), where the test passed, but CI showed failures on gfx1100.

test_CTCLoss_no_batch_dim (and any code path hitting miopen_ctc_loss) crashes with a fatal access violation on Windows systems with discrete AMD GPUs:

Windows fatal exception: access violation
Exception Code: 0xC0000005
#0 miopen::CTCLossDescriptor::GetCTCLossWorkspaceSize (MIOpen.dll+0x14fde4)
#1 miopenGetCTCLossWorkspaceSize (MIOpen.dll+0x150912)
#2 at::native::miopen_ctc_loss (torch_hip.dll)

Root Cause

miopenGetCTCLossWorkspaceSize and miopenCTCLoss read the labels, label_lengths, and input_lengths arrays on the host side to plan the computation and calculate workspace requirements. The existing code copies these arrays to GPU memory and passes device pointers:

Tensor labels_gpu = targets_t.to(Device(at::kCUDA), at::kInt);
// ... hipMemcpy to GPU ...
MIOPEN_CHECK(miopenGetCTCLossWorkspaceSize(...,
    labels_gpu.data_ptr<int>(),          // device pointer
    label_lengths_gpu.data_ptr<int>(),   // device pointer
    input_lengths_gpu.data_ptr<int>()    // device pointer
));

This works on:

  • Linux — HSA (Heterogeneous System Architecture) maps GPU allocations into the process virtual address space, making device pointers host-readable
  • Windows APUs — CPU and iGPU share system RAM, so device pointers point to host-accessible memory

This crashes on:

  • Windows dGPUs — GPU has dedicated VRAM across PCIe; device pointers are opaque handles that cannot be dereferenced from host code

Verification

Tested on gfx1201:

| Check | Result |
| --- | --- |
| `hipDeviceAttributeIntegrated` | `0` (discrete GPU) |
| `hipDeviceAttributeCanUseHostPointerForRegisteredMem` | `0` |
| `hipDeviceAttributeManagedMemory` | `0x7FFFFFFF` (unsupported) |
| `hipDeviceAttributeUnifiedAddressing` | `0x7FFFFFFF` (unsupported) |
| Host read of `hipMalloc` pointer via `ctypes` | Access violation |
| CTC loss with CPU pointers | Pass (forward + backward) |
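For reference, a standalone HIP probe (hypothetical sketch, not part of the PR) could query the same device attributes; the `0x7FFFFFFF` results above suggest unsupported attributes came back as `INT_MAX` on this system:

```cpp
// Hypothetical probe: query the HIP device attributes from the table above.
#include <hip/hip_runtime.h>
#include <cstdio>

int main() {
  struct Check { const char* name; hipDeviceAttribute_t attr; };
  const Check checks[] = {
      {"Integrated", hipDeviceAttributeIntegrated},
      {"CanUseHostPointerForRegisteredMem",
       hipDeviceAttributeCanUseHostPointerForRegisteredMem},
      {"ManagedMemory", hipDeviceAttributeManagedMemory},
      {"UnifiedAddressing", hipDeviceAttributeUnifiedAddressing},
  };
  for (const Check& c : checks) {
    int value = 0;
    hipError_t err = hipDeviceGetAttribute(&value, c.attr, /*deviceId=*/0);
    if (err == hipSuccess)
      printf("%s: %#x\n", c.name, value);
    else
      printf("%s: error (%s)\n", c.name, hipGetErrorString(err));
  }
  return 0;
}
```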

Fix

Use host pointers, since host memory is what MIOpen expects for these arrays.
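A minimal sketch of the corrected call (variable names mirror the snippet above; `target_lengths` and `input_lengths` are assumed to be host-side `std::vector<int>`):

```cpp
// Sketch: MIOpen reads these arrays on the host, so pass host pointers
// directly; no device copies of labels/lengths are needed.
Tensor labels_host = targets_t.to(at::kCPU, at::kInt).contiguous();
MIOPEN_CHECK(miopenGetCTCLossWorkspaceSize(...,
    labels_host.data_ptr<int>(),   // host pointer
    target_lengths.data(),         // host pointer
    input_lengths.data()           // host pointer
));
```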

Testing

Run all existing CTCLoss unit tests.

cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @jataylo @hongxiayang @naromero77amd @pragupta @jerrymannil @xinyazhang

@pytorch-bot

pytorch-bot Bot commented Apr 3, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/179264

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

✅ You can merge normally! (1 Unrelated Failure)

As of commit 8f725fe with merge base 2db14fe:

BROKEN TRUNK - The following job failed but was present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot Bot added the module: rocm AMD GPU support for Pytorch label Apr 3, 2026
@pytorch-bot

pytorch-bot Bot commented Apr 3, 2026

This PR needs a release notes: label

If your changes are user facing and intended to be a part of release notes, please use a label starting with release notes:.

If not, please add the topic: not user facing label.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "topic: not user facing"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

Collaborator

@jeffdaily jeffdaily left a comment


@mstankov-amd is this really a WIN32 vs non-WIN32 issue, or is it whether largeBar is supported? See #177023 for a recent PR that tests for the largeBar feature before reading a device pointer directly by the host.

@mikaylagawarecki mikaylagawarecki added the triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module label Apr 6, 2026
@mstankov-amd
Contributor Author

> @mstankov-amd is this really a WIN32 vs non-WIN32 issue, or is it whether largeBar is supported? See #177023 for a recent PR that tests for the largeBar feature before reading a device pointer directly by the host.

I spent yesterday analyzing whether this is a largeBar issue. First, I ran the tests on the same machine booted into Linux, and they pass without any exceptions. On Linux, the GPU memory is host-accessible regardless of largeBar, while on Windows it crashes.

The core issue is: MIOpen's miopenGetCTCLossWorkspaceSize and miopenCTCLoss dereference the labels/lengths pointers on the host side. If those pointers point to GPU VRAM that isn't host-accessible, we get an access violation.

So, this is a WIN32-specific issue.
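A minimal repro of that failure mode (hypothetical sketch, not code from the PR):

```cpp
// Sketch: hipMalloc returns a device pointer; on a Windows dGPU without
// host-visible VRAM, dereferencing it from the CPU faults.
#include <hip/hip_runtime.h>

int main() {
  int* d_ptr = nullptr;
  hipMalloc(&d_ptr, sizeof(int));
  int v = *d_ptr;  // access violation here on a Windows dGPU; may succeed on Linux/APU
  (void)v;
  hipFree(d_ptr);
  return 0;
}
```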

@jeffdaily
Collaborator

@mstankov-amd I really think this is a largeBAR issue. Instead of the WIN32 check, it should be a largeBAR check like the one I linked to in the other PR. Can you test that change?

@ikalinic
Contributor

I tested the isLargeBar approach on a Windows dGPU system (gfx1201, isLargeBar=false). Built via TheRock's build_prod_wheels.py against ROCm 7.12.0a.

Following the pattern from #177023, the #ifdef _WIN32 ... #else ... #endif block in aten/src/ATen/native/miopen/LossCTC_miopen.cpp (lines 210-242) can be replaced with a runtime isLargeBar check:

  // MIOpen reads these buffers from host memory unless large BAR makes
  // device allocations directly host-accessible.
  Tensor labels_host = targets_t;
  Tensor labels_device;
  Tensor label_lengths_device;
  Tensor input_lengths_device;
  int* labels_ptr = labels_host.data_ptr<int>();
  int* label_lengths_ptr = target_lengths.data();
  int* input_lengths_ptr = input_lengths.data();
#if defined(USE_ROCM) && (ROCM_VERSION >= 70200)
  if (at::cuda::getCurrentDeviceProperties()->isLargeBar) {
    labels_device = labels_host.to(Device(at::kCUDA), at::kInt);
    label_lengths_device = at::empty(
        {static_cast<int64_t>(target_lengths.size())},
        at::TensorOptions().dtype(at::kInt).device(at::kCUDA));
    input_lengths_device = at::empty(
        {static_cast<int64_t>(input_lengths.size())},
        at::TensorOptions().dtype(at::kInt).device(at::kCUDA));
    C10_CUDA_CHECK(hipMemcpy(
        label_lengths_device.data_ptr<int>(),
        target_lengths.data(),
        target_lengths.size() * sizeof(int),
        hipMemcpyHostToDevice));
    C10_CUDA_CHECK(hipMemcpy(
        input_lengths_device.data_ptr<int>(),
        input_lengths.data(),
        input_lengths.size() * sizeof(int),
        hipMemcpyHostToDevice));
    labels_ptr = labels_device.data_ptr<int>();
    label_lengths_ptr = label_lengths_device.data_ptr<int>();
    input_lengths_ptr = input_lengths_device.data_ptr<int>();
  }
#endif

This defaults to host pointers (safe on all platforms) and only switches to device pointers when isLargeBar is true. On the gfx1201 (isLargeBar=false) I confirmed that the host-pointer path is taken and that batched MIOpen CTC loss matches the native fallback.

@jeffdaily
Collaborator

@mstankov-amd based on @ikalinic's analysis, please update the PR to use the largeBAR check instead.

@pytorch-bot pytorch-bot Bot added the ciflow/rocm-mi300 Trigger "default" config CI on ROCm MI300 label Apr 22, 2026
@pytorch-bot

pytorch-bot Bot commented Apr 22, 2026

Workflows were awaiting approval. CI has now been triggered for the ciflow labels on this PR.

@mstankov-amd
Contributor Author

> @mstankov-amd based on @ikalinic's analysis, please update the PR to use the largeBAR check instead.

@jeffdaily The PR has been updated

@mstankov-amd mstankov-amd requested a review from jeffdaily April 22, 2026 13:54
@mstankov-amd mstankov-amd changed the title from [ROCm] Fix MIOpen CTC loss crash on Windows dGPU systems to [ROCm] Fix MIOpen CTC loss crash on Windows Apr 24, 2026
@jeffdaily jeffdaily dismissed their stale review April 24, 2026 18:54

I'm dismissing my own review.

@jeffdaily
Collaborator

@mstankov-amd I failed during my review of the original CTC loss integration PR #170749. Its comment "MIOpen requires labels and lengths on GPU" was wrong, and looking at the MIOpen sources with Claude confirmed it. My commit log 8f725fe summarizes why:

> MIOpen's miopenGetCTCLossWorkspaceSize and miopenCTCLoss dereference the
> labels, labelLengths, and inputLengths arrays on the host: they're
> subscripted directly in miopen/src/ctc.cpp and used as the source of
> hipMemcpyHostToDevice in miopen/src/ocl/ctcocl.cpp. Pass host pointers
> unconditionally. The previous largeBAR branch worked only because VRAM
> happened to be CPU-addressable there, and it added a redundant H2D copy
> in that case.

@jeffdaily jeffdaily added the topic: not user facing topic category label Apr 24, 2026
@jeffdaily
Collaborator

@pytorchbot merge

@pytorch-bot pytorch-bot Bot added the ciflow/trunk Trigger trunk jobs on your pull request label Apr 24, 2026
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot
Collaborator

Merge failed

Reason: 1 mandatory check(s) failed. The first few are:

Dig deeper by viewing the failures on hud

Details for Dev Infra team: raised by workflow job

Failing merge rule: Core Maintainers

@jeffdaily
Collaborator

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

jeffdaily added a commit to ROCm/pytorch that referenced this pull request Apr 25, 2026
…3181)


Pull Request resolved: pytorch#179264
Approved by: https://github.com/jeffdaily

Co-authored-by: Milica Stankovic <mstankov@amd.com>
jeffdaily added a commit to ROCm/pytorch that referenced this pull request Apr 25, 2026
…3180)


Pull Request resolved: pytorch#179264
Approved by: https://github.com/jeffdaily

Co-authored-by: Milica Stankovic <mstankov@amd.com>
