[ROCm] Fix MIOpen CTC loss crash on Windows #179264
mstankov-amd wants to merge 4 commits into pytorch:main
Conversation
As of commit 8f725fe with merge base 2db14fe, CI reports one unrelated broken-trunk failure; the PR can merge normally.
jeffdaily left a comment:
@mstankov-amd is this really a WIN32 vs non-WIN32 issue, or is it whether largeBar is supported? See #177023 for a recent PR that tests for the largeBar feature before dereferencing a device pointer from the host.
I spent yesterday analyzing whether this is a largeBar issue. First, I ran the tests on the same machine booted into Linux, and they pass without any exceptions: on Linux the GPU memory is host-accessible regardless of largeBar, while on Windows it crashes. The core issue is that MIOpen's miopenGetCTCLossWorkspaceSize and miopenCTCLoss dereference the labels/lengths pointers on the host side. If those pointers point to GPU VRAM that isn't host-accessible, we get an access violation. So this is a WIN32-specific issue. |
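To make the failure mode concrete, here is a small, purely illustrative Python model (toy code, not MIOpen): a pointer carries an address space, and a host-side read of a device pointer only succeeds when the memory is host-accessible (Linux, APUs, large-BAR systems) and faults otherwise (Windows dGPUs).

```python
from dataclasses import dataclass

class AccessViolation(Exception):
    """Stand-in for the 0xC0000005 crash seen on Windows dGPUs."""

@dataclass
class Ptr:
    space: str             # "host" or "device"
    host_accessible: bool  # True on Linux/APUs, False on Windows dGPUs
    data: list

def host_read(ptr: Ptr) -> list:
    """Model of MIOpen subscripting labels/lengths on the host."""
    if ptr.space == "device" and not ptr.host_accessible:
        raise AccessViolation("host dereference of non-host-accessible VRAM")
    return ptr.data

host_labels = Ptr("host", True, [1, 2, 3])
linux_vram = Ptr("device", True, [1, 2, 3])   # VRAM mapped into the process
win_dgpu = Ptr("device", False, [1, 2, 3])    # opaque handle across PCIe

assert host_read(host_labels) == [1, 2, 3]
assert host_read(linux_vram) == [1, 2, 3]
try:
    host_read(win_dgpu)
except AccessViolation:
    pass  # this is the crash being fixed
```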
|
@mstankov-amd I really think this is a largeBAR issue. Instead of the WIN32 check, it should be a largeBAR check like I linked to in the other PR. Can you test that change? |
|
I tested the isLargeBar approach on a Windows dGPU system (gfx1201, isLargeBar=false), built via TheRock's build_prod_wheels.py against ROCm 7.12.0a. Following the pattern from the linked PR, this defaults to host pointers (safe on all platforms) and only switches to device pointers when isLargeBar is true. On gfx1201 (isLargeBar=false) I confirmed the host-pointer path is taken and batched MIOpen CTC loss matches the native fallback. |
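The gating tested here can be modeled as a tiny pure function (the name is mine, not PyTorch's): default to host pointers, and only hand MIOpen device pointers when the device reports large BAR, i.e. when VRAM is CPU-addressable.

```python
def choose_pointer_space(is_large_bar: bool) -> str:
    """Pick which copy of labels/lengths to hand to MIOpen.

    Host pointers are safe everywhere because MIOpen reads these
    arrays on the CPU; device pointers are only safe when VRAM is
    CPU-addressable (large BAR).
    """
    return "device" if is_large_bar else "host"

assert choose_pointer_space(False) == "host"   # gfx1201 Windows dGPU path
assert choose_pointer_space(True) == "device"  # large-BAR systems
```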
|
@mstankov-amd based on @ikalinic's analysis, please update the PR to use the largeBAR check instead. |
|
|
@jeffdaily The PR has been updated |
MIOpen's miopenGetCTCLossWorkspaceSize and miopenCTCLoss dereference the labels, labelLengths, and inputLengths arrays on the host: they're subscripted directly in miopen/src/ctc.cpp and used as the source of hipMemcpyHostToDevice in miopen/src/ocl/ctcocl.cpp. Pass host pointers unconditionally. The previous largeBAR branch worked only because VRAM happened to be CPU-addressable there, and it added a redundant H2D copy in that case. Authored with Claude.
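The "redundant H2D copy" point can be counted out explicitly in a toy model (names are mine): MIOpen always performs its own host-to-device upload of these arrays internally (ctcocl.cpp), so PyTorch uploading them first only adds a second copy.

```python
def copies_performed(pass_device_pointers: bool) -> int:
    """Count host-to-device copies of the labels/lengths arrays.

    MIOpen copies the arrays to the device itself (hipMemcpyHostToDevice
    in miopen/src/ocl/ctcocl.cpp); passing device pointers means PyTorch
    also uploaded them beforehand, for two copies total.
    """
    miopen_internal_h2d = 1
    pytorch_upload = 1 if pass_device_pointers else 0
    return miopen_internal_h2d + pytorch_upload

assert copies_performed(False) == 1  # host pointers: only MIOpen's copy
assert copies_performed(True) == 2   # device pointers: one copy is redundant
```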
|
@mstankov-amd I missed this during my review of the original CTC loss integration PR #170749. Its comment "MIOpen requires labels and lengths on GPU" was wrong, and looking at the MIOpen sources with Claude confirmed it. My commit log 8f725fe summarizes why:
|
|
@pytorchbot merge |
Merge started. Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
Merge failed. Reason: 1 mandatory check(s) failed. Dig deeper by viewing the failures on hud. |
|
@pytorchbot merge |
Merge started. Your change will be merged once all checks pass (ETA 0-4 Hours). |
…3181)

## Fix MIOpen CTC loss access violation on Windows discrete GPUs

### Problem

A failing unit test on Windows started showing up a couple of weeks ago. A missing `#include` was added in pytorch#178284, but CI on TheRock kept failing: the fix was tested on gfx1151 (APU), where the test passed, while CI showed failures on gfx1100.

`test_CTCLoss_no_batch_dim` (and any code path hitting `miopen_ctc_loss`) crashes with a fatal access violation on Windows systems with discrete AMD GPUs:

```
Windows fatal exception: access violation
Exception Code: 0xC0000005
#0 miopen::CTCLossDescriptor::GetCTCLossWorkspaceSize (MIOpen.dll+0x14fde4)
#1 miopenGetCTCLossWorkspaceSize (MIOpen.dll+0x150912)
#2 at::native::miopen_ctc_loss (torch_hip.dll)
```

### Root Cause

`miopenGetCTCLossWorkspaceSize` and `miopenCTCLoss` read the `labels`, `label_lengths`, and `input_lengths` arrays **on the host side** to plan the computation and calculate workspace requirements. The existing code copies these arrays to GPU memory and passes device pointers:

```cpp
Tensor labels_gpu = targets_t.to(Device(at::kCUDA), at::kInt);
// ... hipMemcpy to GPU ...
MIOPEN_CHECK(miopenGetCTCLossWorkspaceSize(...,
    labels_gpu.data_ptr<int>(),        // device pointer
    label_lengths_gpu.data_ptr<int>(), // device pointer
    input_lengths_gpu.data_ptr<int>()  // device pointer
));
```

This works on:

- **Linux**: HSA (Heterogeneous System Architecture) maps GPU allocations into the process virtual address space, making device pointers host-readable
- **Windows APUs**: CPU and iGPU share system RAM, so device pointers point to host-accessible memory

This crashes on:

- **Windows dGPUs**: the GPU has dedicated VRAM across PCIe; device pointers are opaque handles that cannot be dereferenced from host code

### Verification

Tested on gfx1201:

| Check | Result |
|---|---|
| `hipDeviceAttributeIntegrated` | `0` (discrete GPU) |
| `hipDeviceAttributeCanUseHostPointerForRegisteredMem` | `0` |
| `hipDeviceAttributeManagedMemory` | `0x7FFFFFFF` (unsupported) |
| `hipDeviceAttributeUnifiedAddressing` | `0x7FFFFFFF` (unsupported) |
| Host read of `hipMalloc` pointer via `ctypes` | Access violation |
| CTC loss with CPU pointers | Pass (forward + backward) |

### Fix

Use host pointers, since that is what MIOpen expects.

### Testing

Run all existing CTCLoss unit tests.

Pull Request resolved: pytorch#179264
Approved by: https://github.com/jeffdaily
Co-authored-by: Milica Stankovic <mstankov@amd.com>
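The probes in the Verification table above can be approximated with a short ctypes script. This is a sketch under assumptions: the runtime library name (`amdhip64`) and the helper names are mine, and the host-read probe is intentionally dangerous, since on a Windows dGPU it reproduces the documented access violation.

```python
import ctypes
import ctypes.util

def host_dereferenceable(integrated: int, can_use_host_ptr: int) -> bool:
    """Interpret the first two attribute probes from the table: a discrete
    GPU (integrated == 0) that also cannot use host pointers for registered
    memory gives the CPU no safe way to dereference hipMalloc'd VRAM.
    Both attributes read 0 on the gfx1201 test system."""
    return bool(integrated or can_use_host_ptr)

def probe_host_read(lib_name: str = "amdhip64"):
    """Reproduce the 'host read of a hipMalloc pointer via ctypes' row.

    Returns the first byte on success, or None when no HIP runtime is
    found. WARNING: on a Windows dGPU this read is expected to
    access-violate; that crash is exactly what the table documents."""
    path = ctypes.util.find_library(lib_name)
    if path is None:
        return None  # no HIP runtime on this machine
    hip = ctypes.CDLL(path)
    hip.hipMalloc.argtypes = [ctypes.POINTER(ctypes.c_void_p), ctypes.c_size_t]
    ptr = ctypes.c_void_p()
    if hip.hipMalloc(ctypes.byref(ptr), ctypes.c_size_t(16)) != 0:
        return None  # allocation failed
    # Host-side dereference of a device pointer: the crux of the bug.
    return ctypes.cast(ptr, ctypes.POINTER(ctypes.c_byte))[0]

assert host_dereferenceable(0, 0) is False  # gfx1201: device pointers unsafe
assert host_dereferenceable(1, 0) is True   # APU: shared system RAM
```

`probe_host_read` is not called automatically; run it only on a machine where a crash is an acceptable outcome.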
cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @jataylo @hongxiayang @naromero77amd @pragupta @jerrymannil @xinyazhang