
[ROCm] Fix MIOpen CTC loss crash on Windows #179264

Closed

mstankov-amd wants to merge 4 commits into pytorch:main from mstankov-amd:fix-miopen-ctc-loss-dgpu-on-windows

Conversation

@mstankov-amd
Contributor

@mstankov-amd mstankov-amd commented Apr 3, 2026

Fix MIOpen CTC loss access violation on Windows discrete GPUs

Problem

A unit test on Windows started failing a couple of weeks ago. A missing #include was added in #178284, but CI on TheRock kept failing. That fix was tested on gfx1151 (APU), where the test passed, but CI showed failures on gfx1100.

test_CTCLoss_no_batch_dim (and any code path hitting miopen_ctc_loss) crashes with a fatal access violation on Windows systems with discrete AMD GPUs:

Windows fatal exception: access violation
Exception Code: 0xC0000005
#0 miopen::CTCLossDescriptor::GetCTCLossWorkspaceSize (MIOpen.dll+0x14fde4)
#1 miopenGetCTCLossWorkspaceSize (MIOpen.dll+0x150912)
#2 at::native::miopen_ctc_loss (torch_hip.dll)

Root Cause

miopenGetCTCLossWorkspaceSize and miopenCTCLoss read the labels, label_lengths, and input_lengths arrays on the host side to plan the computation and calculate workspace requirements. The existing code copies these arrays to GPU memory and passes device pointers:

Tensor labels_gpu = targets_t.to(Device(at::kCUDA), at::kInt);
// ... hipMemcpy to GPU ...
MIOPEN_CHECK(miopenGetCTCLossWorkspaceSize(...,
    labels_gpu.data_ptr<int>(),          // device pointer
    label_lengths_gpu.data_ptr<int>(),   // device pointer
    input_lengths_gpu.data_ptr<int>()    // device pointer
));

This works on:

  • Linux — HSA (Heterogeneous System Architecture) maps GPU allocations into the process virtual address space, making device pointers host-readable
  • Windows APUs — CPU and iGPU share system RAM, so device pointers point to host-accessible memory

This crashes on:

  • Windows dGPUs — GPU has dedicated VRAM across PCIe; device pointers are opaque handles that cannot be dereferenced from host code

Verification

Tested on gfx1201:

| Check | Result |
| --- | --- |
| `hipDeviceAttributeIntegrated` | `0` (discrete GPU) |
| `hipDeviceAttributeCanUseHostPointerForRegisteredMem` | `0` |
| `hipDeviceAttributeManagedMemory` | `0x7FFFFFFF` (unsupported) |
| `hipDeviceAttributeUnifiedAddressing` | `0x7FFFFFFF` (unsupported) |
| Host read of `hipMalloc` pointer via `ctypes` | Access violation |
| CTC loss with CPU pointers | Pass (forward + backward) |
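For reference, a standalone HIP probe (hypothetical sketch, not part of the PR) could query the same device attributes; the `0x7FFFFFFF` results above suggest unsupported attributes came back as `INT_MAX` on this system:

```cpp
// Hypothetical probe: query the HIP device attributes from the table above.
#include <hip/hip_runtime.h>
#include <cstdio>

int main() {
  struct Check { const char* name; hipDeviceAttribute_t attr; };
  const Check checks[] = {
      {"Integrated", hipDeviceAttributeIntegrated},
      {"CanUseHostPointerForRegisteredMem",
       hipDeviceAttributeCanUseHostPointerForRegisteredMem},
      {"ManagedMemory", hipDeviceAttributeManagedMemory},
      {"UnifiedAddressing", hipDeviceAttributeUnifiedAddressing},
  };
  for (const Check& c : checks) {
    int value = 0;
    hipError_t err = hipDeviceGetAttribute(&value, c.attr, /*deviceId=*/0);
    if (err == hipSuccess)
      printf("%s: %#x\n", c.name, value);
    else
      printf("%s: error (%s)\n", c.name, hipGetErrorString(err));
  }
  return 0;
}
```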

Fix

Use host pointers, since host memory is what MIOpen expects for these arrays.
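A minimal sketch of the corrected call (variable names mirror the snippet above; `target_lengths` and `input_lengths` are assumed to be host-side `std::vector<int>`):

```cpp
// Sketch: MIOpen reads these arrays on the host, so pass host pointers
// directly; no device copies of labels/lengths are needed.
Tensor labels_host = targets_t.to(at::kCPU, at::kInt).contiguous();
MIOPEN_CHECK(miopenGetCTCLossWorkspaceSize(...,
    labels_host.data_ptr<int>(),   // host pointer
    target_lengths.data(),         // host pointer
    input_lengths.data()           // host pointer
));
```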

Testing

Run all existing CTCLoss unit tests.

cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @jataylo @hongxiayang @naromero77amd @pragupta @jerrymannil @xinyazhang

@pytorch-bot

pytorch-bot Bot commented Apr 3, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/179264

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

✅ You can merge normally! (1 Unrelated Failure)

As of commit 8f725fe with merge base 2db14fe:

BROKEN TRUNK - The following job failed but was present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot Bot added the module: rocm AMD GPU support for Pytorch label Apr 3, 2026
@pytorch-bot

pytorch-bot Bot commented Apr 3, 2026

This PR needs a release notes: label

If your changes are user facing and intended to be a part of release notes, please use a label starting with release notes:.

If not, please add the topic: not user facing label.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "topic: not user facing"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

Collaborator

@jeffdaily jeffdaily left a comment


@mstankov-amd is this really a WIN32 vs non-WIN32 issue, or is it whether largeBar is supported? See #177023 for a recent PR that tests for the largeBar feature before reading a device pointer directly by the host.

@mikaylagawarecki mikaylagawarecki added the triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module label Apr 6, 2026
@mstankov-amd
Contributor Author

> @mstankov-amd is this really a WIN32 vs non-WIN32 issue, or is it whether largeBar is supported? See #177023 for a recent PR that tests for the largeBar feature before reading a device pointer directly by the host.

I spent yesterday analyzing whether this is a largeBar issue. First, I ran the tests on the same machine booted into Linux, and they pass without any exceptions. On Linux, the GPU memory is host-accessible regardless of largeBar, while on Windows it crashes.

The core issue is: MIOpen's miopenGetCTCLossWorkspaceSize and miopenCTCLoss dereference the labels/lengths pointers on the host side. If those pointers point to GPU VRAM that isn't host-accessible, we get an access violation.

So, this is a WIN32-specific issue.
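A minimal repro of that failure mode (hypothetical sketch, not code from the PR):

```cpp
// Sketch: hipMalloc returns a device pointer; on a Windows dGPU without
// host-visible VRAM, dereferencing it from the CPU faults.
#include <hip/hip_runtime.h>

int main() {
  int* d_ptr = nullptr;
  hipMalloc(&d_ptr, sizeof(int));
  int v = *d_ptr;  // access violation here on a Windows dGPU; may succeed on Linux/APU
  (void)v;
  hipFree(d_ptr);
  return 0;
}
```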

@jeffdaily
Collaborator

@mstankov-amd I really think this is a largeBAR issue. Instead of the WIN32 check, it should be a largeBAR check like the one I linked to in the other PR. Can you test that change?

@ikalinic
Contributor

I tested the isLargeBar approach on a Windows dGPU system (gfx1201, isLargeBar=false). Built via TheRock's build_prod_wheels.py against ROCm 7.12.0a.

Following the pattern from #177023, the #ifdef _WIN32 ... #else ... #endif block in aten/src/ATen/native/miopen/LossCTC_miopen.cpp (lines 210-242) can be replaced with a runtime isLargeBar check:

  // MIOpen reads these buffers from host memory unless large BAR makes
  // device allocations directly host-accessible.
  Tensor labels_host = targets_t;
  Tensor labels_device;
  Tensor label_lengths_device;
  Tensor input_lengths_device;
  int* labels_ptr = labels_host.data_ptr<int>();
  int* label_lengths_ptr = target_lengths.data();
  int* input_lengths_ptr = input_lengths.data();
#if defined(USE_ROCM) && (ROCM_VERSION >= 70200)
  if (at::cuda::getCurrentDeviceProperties()->isLargeBar) {
    labels_device = labels_host.to(Device(at::kCUDA), at::kInt);
    label_lengths_device = at::empty(
        {static_cast<int64_t>(target_lengths.size())},
        at::TensorOptions().dtype(at::kInt).device(at::kCUDA));
    input_lengths_device = at::empty(
        {static_cast<int64_t>(input_lengths.size())},
        at::TensorOptions().dtype(at::kInt).device(at::kCUDA));
    C10_CUDA_CHECK(hipMemcpy(
        label_lengths_device.data_ptr<int>(),
        target_lengths.data(),
        target_lengths.size() * sizeof(int),
        hipMemcpyHostToDevice));
    C10_CUDA_CHECK(hipMemcpy(
        input_lengths_device.data_ptr<int>(),
        input_lengths.data(),
        input_lengths.size() * sizeof(int),
        hipMemcpyHostToDevice));
    labels_ptr = labels_device.data_ptr<int>();
    label_lengths_ptr = label_lengths_device.data_ptr<int>();
    input_lengths_ptr = input_lengths_device.data_ptr<int>();
  }
#endif

This defaults to host pointers (safe on all platforms) and only switches to device pointers when isLargeBar is true. On the gfx1201 (isLargeBar=false) I confirmed that the host-pointer path is taken and that batched MIOpen CTC loss matches the native fallback.

@jeffdaily
Collaborator

@mstankov-amd based on @ikalinic's analysis, please update the PR to use the largeBAR check instead.

@pytorch-bot pytorch-bot Bot added the ciflow/rocm-mi300 Trigger "default" config CI on ROCm MI300 label Apr 22, 2026
@pytorch-bot

pytorch-bot Bot commented Apr 22, 2026

Workflows were awaiting approval. CI has now been triggered for the ciflow labels on this PR.

@mstankov-amd
Contributor Author

> @mstankov-amd based on @ikalinic's analysis, please update the PR to use the largeBAR check instead.

@jeffdaily The PR has been updated

@mstankov-amd mstankov-amd requested a review from jeffdaily April 22, 2026 13:54
@mstankov-amd mstankov-amd changed the title from [ROCm] Fix MIOpen CTC loss crash on Windows dGPU systems to [ROCm] Fix MIOpen CTC loss crash on Windows Apr 24, 2026
@jeffdaily jeffdaily dismissed their stale review April 24, 2026 18:54

I'm dismissing my own review.

@jeffdaily
Collaborator

@mstankov-amd I failed during my review of the original CTC loss integration PR #170749. Its comment "MIOpen requires labels and lengths on GPU" was wrong, and looking at the MIOpen sources with Claude confirmed it. My commit log 8f725fe summarizes why:

> MIOpen's miopenGetCTCLossWorkspaceSize and miopenCTCLoss dereference the
> labels, labelLengths, and inputLengths arrays on the host: they're
> subscripted directly in miopen/src/ctc.cpp and used as the source of
> hipMemcpyHostToDevice in miopen/src/ocl/ctcocl.cpp. Pass host pointers
> unconditionally. The previous largeBAR branch worked only because VRAM
> happened to be CPU-addressable there, and it added a redundant H2D copy
> in that case.

@jeffdaily jeffdaily added the topic: not user facing topic category label Apr 24, 2026
@jeffdaily
Collaborator

@pytorchbot merge

@pytorch-bot pytorch-bot Bot added the ciflow/trunk Trigger trunk jobs on your pull request label Apr 24, 2026
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot
Collaborator

Merge failed

Reason: 1 mandatory check(s) failed. The first few are:

Dig deeper by viewing the failures on hud

Details for Dev Infra team: raised by workflow job

Failing merge rule: Core Maintainers

@jeffdaily
Collaborator

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

jeffdaily added a commit to ROCm/pytorch that referenced this pull request Apr 25, 2026
…3181)


Pull Request resolved: pytorch#179264
Approved by: https://github.com/jeffdaily

Co-authored-by: Milica Stankovic <mstankov@amd.com>
jeffdaily added a commit to ROCm/pytorch that referenced this pull request Apr 25, 2026
…3180)


Pull Request resolved: pytorch#179264
Approved by: https://github.com/jeffdaily

Co-authored-by: Milica Stankovic <mstankov@amd.com>
